Yen vs Path Separator in Japan

Ben Bryant

2005-10-15 19:17:29 UTC

I think I understand most of this issue in terms of Michael Kaplan's
excellent postings in:
Whats up with the Korean (Unicode) sort?
http://blogs.msdn.com/michkap/archive/2004/12/14/284838.aspx
and When is a backslash not a backslash?
http://blogs.msdn.com/michkap/archive/2005/09/17/469941.aspx
and others on TheOldNewThing and Larry Osterman

I don't know how they use their keyboards in Japan, but I am assuming they
have a way of typing a yen sign and they use the same thing for a path
separator when typing a pathname into an edit box. Assuming it is a Unicode
edit box, how does the edit box (or really the Windows OS in generating the
keydown message) decide which code point to use for the Yen sign U+005c or
U+00a5? I think you would not have this question in code page 932 since it
would be 5c.
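
For reference, the code points at issue can be looked up by their formal Unicode names. The following is a minimal sketch using Python's standard `unicodedata` module; Python is used only for illustration here, and the helper name `describe` is made up for this example. The won sign is included because it comes up later in the thread.

```python
import unicodedata

# Code points that come up in this discussion.
CODE_POINTS = [0x005C, 0x00A5, 0xFFE5, 0x20A9]

def describe(cp):
    """Return 'U+XXXX NAME' for a code point."""
    return "U+%04X %s" % (cp, unicodedata.name(chr(cp)))

for cp in CODE_POINTS:
    print(describe(cp))
```

Note that U+005C is formally REVERSE SOLIDUS regardless of how a Japanese font draws it.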

Okay, now assuming it is U+005c in order to support Unicode pathnames, would
there be any way other than processing of the edit box text value to change
them to U+00a5? I could imagine a flag indicating "treat yens as U+00a5" in
this edit box, or "treat yens as U+005c" in this other edit box depending on
whether it is for regular text including currency discussions, or
Windows/DOS pathnames.

Thanks,
Ben
http://codesnipers.com/?q=blog/3

Michael (michka) Kaplan [MS]

2005-10-15 21:18:43 UTC

If you keep it all in Unicode, you can choose which one to use.

If you do not, then you will not be able to.

The key is to keep it in Unicode and be done with it. Any other attempt at a
solution will not happen....
--
MichKa [Microsoft]
NLS Collation/Locale/Keyboard Technical Lead
Globalization Infrastructure, Fonts, and Tools
Blog: http://blogs.msdn.com/michkap

This posting is provided "AS IS" with
no warranties, and confers no rights.

Ben Bryant

2005-10-15 21:31:06 UTC

Yes, the goal is to keep it all in Unicode, but that doesn't answer the
question of which one it comes through keydown as. Or does the user specify
which one it is via different keystrokes on the keyboard?

Michael (michka) Kaplan [MS]

2005-10-16 01:05:50 UTC

On the Japanese keyboard?

Depends on the IME, but I assume U+005c -- as I said in the post, path
separators are more important than money.

Norman Diamond

2005-10-17 00:59:11 UTC

Post by Ben Bryant
Yes, the goal is to keep it all in Unicode,

Even Windows itself doesn't always do that. Last I saw, to get controls in
forms to use Unicode instead of multibyte in their internal operations, some
sufficiently recent version of Office had to be installed and programmers
are not allowed to redistribute those versions of the controls to customers
who don't have sufficiently recent versions of Office.

Post by Ben Bryant
but that doesn't answer the question of which one it comes through keydown
as.

For a keydown method written by an ordinary programmer, if the compilation
environment was ANSI then I'd expect 0x5C (single byte), and if the
compilation environment was Unicode then, because path separators are more
important than money (how many path separators did you pay for your MSDN
subscription?) I'd also expect U+005c.

Post by Ben Bryant
Or does the user specify which one it is via different keystrokes on the
keyboard?

For historical reasons the standard keyboard has both a yen key (whose
shifted character is an or-bar) and a backslash key (whose shifted character
is an underscore). Obviously they have different scan codes in hardware.
But when translated to JIS-Romaji for single-byte input, they both yield the
yen sign, 0x5C. There is no single-byte backslash character in Japanese
standard character sets, but apparently some antique IBM character set had
one.
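
The single-byte behavior described above can be checked outside Windows. This is a small sketch using Python's built-in `cp932` codec, which is assumed here to approximate the Windows code page; the keyboard layer itself is not modeled, and the helper name is hypothetical. Byte 0x5C decodes to U+005C, and whether that is drawn as a yen sign or a backslash is left to the font.

```python
import unicodedata

def decode_cp932_byte(b):
    """Decode one single-byte cp932 value and return (code point, name)."""
    ch = bytes([b]).decode("cp932")
    return ord(ch), unicodedata.name(ch)

cp, name = decode_cp932_byte(0x5C)
print("0x5C -> U+%04X %s" % (cp, name))
```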

Michael (michka) Kaplan [MS]

2005-10-17 07:10:09 UTC

Norman, Norman, Norman....

Post by Norman Diamond
Even Windows itself doesn't always do that. Last I saw, to get controls
in forms to use Unicode instead of multibyte in their internal operations,
some sufficiently recent version of Office had to be installed and
programmers are not allowed to redistribute those versions of the controls
to customers who don't have sufficiently recent versions of Office.

That is VB and VBA. It is certainly not Windows, which supports Unicode just
fine.

Richard Lewis Haggard

2005-10-22 02:45:53 UTC

I would have thought that the difference was not in the character, but in
how it was used. If the yen key was typed into an edit that was destined to
be interpreted as a path, as in explorer, then it is a path delimiter. If it
was entered into an edit whose contents were to be used for money, then that
would be different. American keyboards on English systems do the same thing.
$ can mean USD or it might mean a hidden drive reference, depending upon
usage. c$ for example, is another way to refer to a drive from a network
perspective but it is hidden. That name doesn't appear by browsing.
--
Richard Lewis Haggard

Michael (michka) Kaplan [MS]

2005-10-22 17:56:22 UTC

This ignores the fact that for over two decades the yen and the won have
shown up as the path separator on Japanese and Korean systems.

Ben Bryant

2005-10-31 13:34:39 UTC

But there is no character code difference between the $ for USD and hidden
drive reference. There is a character code difference between the yen sign
for money and path separator. The point is that the programmer must "repair"
it if it refers to money otherwise it will be displayed as a Won in Korea
and a backslash everywhere else. Imagine having a document all about
currency stuff in Japan and sending it internationally where the yen signs
appear as Wons and backslashes, what a mess. I wrote about this at:
http://codesnipers.com/?q=node/128

"Richard Lewis Haggard" <HaggardAtWorldDotStdDotCom> wrote in message news:***@tk2msftngp13.phx.gbl...
American keyboards on English systems do the same thing.
$ can mean USD or it might mean a hidden drive reference, depending upon
usage.

Michael (michka) Kaplan [MS]

2005-10-31 15:24:00 UTC

OR, they could just leave it all as Unicode and have no problem whatsoever?

Ben Bryant

2005-10-31 22:25:26 UTC

Michael, you keep saying that and completely missing the point! As you said
earlier, when they type on the Japanese keyboard they will get the path separator
U+005c, not the yen sign U+00a5. EVEN in Unicode, even in their Unicode
documents about money. Don't you see? So stop saying they will have no
problem whatsoever!

It will appear as a yen sign while in Japan, so no problem there. When they
send that Unicode document to Korea or U.S. the yen signs will appear as Won
signs or backslashes. Therefore, a program needs to "repair" those path
separators to yen signs before the document is shared internationally. Gah!
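
A "repair" pass of the kind described here could look roughly like the following. This is only a sketch under a loudly stated assumption: a U+005C that is immediately followed by a digit, and that does not follow an ASCII letter, digit, underscore, colon, period, or another backslash, is treated as a price marker. The function name `repair_yen` and the rule itself are hypothetical, and the rule will misfire on input such as a root-relative path that begins with a backslash and a digit.

```python
import re

# Assumption: a reverse solidus that starts a number and does not follow an
# ASCII letter, digit, '_', ':', '.', or another backslash is a price marker.
_YEN_CANDIDATE = re.compile(r"(?<![A-Za-z0-9_:.\\])\\(?=\d)")

def repair_yen(text):
    """Replace currency-like U+005C with U+00A5 (heuristic, hypothetical)."""
    return _YEN_CANDIDATE.sub("\u00a5", text)

print(ascii(repair_yen("Total: \\1,500")))        # currency-like input
print(ascii(repair_yen("C:\\2005\\report.txt")))  # path-like input
```

The lookbehind is restricted to ASCII characters on purpose, so that a price written directly after Japanese text (which has no separating space) is still a candidate.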

Can someone else help explain this to Michael?

Michael (michka) Kaplan [MS]

2005-11-01 02:25:30 UTC

Um, I get that, Ben.

And when I said "use Unicode" I mean use the Unicode code point for the Yen
or the Won, and NEVER STOP USING UNICODE.

If you do this, then everything works. One conversion in a non-Unicode app
and you are stuck on a 932 or 949 machine.

It is only hard if you make it hard.

Ben Bryant

2005-11-01 12:19:55 UTC

The programmer (i.e. the program) needs to repair the yen sign from the
OnKeyDown or existing text based on knowledge or analysis of the meaning of
that text.

This is a problem that exists when using Unicode. This has nothing to do
with any non-Unicode encoding and when you keep talking about non-Unicode
and saying never stop using Unicode you distract from my point. Yes, it is
good advice, but it is a separate issue that misleads the reader into
thinking this problem is linked to the issue of conversion between character
sets.

It is only hard if you keep misleading programmers about it.

Thanks,
Ben

Michael (michka) Kaplan [MS]

2005-11-01 14:28:56 UTC

Huh?

I am talking about how if you have a valid YEN and it goes through 932 or
949 it becomes a reverse solidus. Thus, the non-Unicode is evil if you
manage to get the right characters in.

That is not a distraction, pointing out that both code pages can act as
poison to their respective currency symbols....
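
The round trip described here can be simulated. The sketch below does not call the Windows conversion APIs; it assumes, through a hand-written table, that conversion to code page 932 best-fits U+00A5 to byte 0x5C and that conversion to 949 does the same for U+20A9. The helper names and the table are illustrative only.

```python
# Assumed best-fit mappings (hand-written for illustration, not read from Windows).
ASSUMED_BEST_FIT = {
    "cp932": {"\u00a5": b"\x5c"},  # YEN SIGN -> 0x5C
    "cp949": {"\u20a9": b"\x5c"},  # WON SIGN -> 0x5C
}

def to_code_page(text, codec):
    """Encode text, applying the assumed best-fit table before the real codec."""
    table = ASSUMED_BEST_FIT[codec]
    return b"".join(table.get(ch) or ch.encode(codec) for ch in text)

def round_trip(text, codec):
    """Unicode -> code page -> Unicode."""
    return to_code_page(text, codec).decode(codec)

# After one trip through the code page, the currency sign is a reverse solidus.
print(round_trip("\u00a5100", "cp932") == "\\100")
print(round_trip("\u20a9100", "cp949") == "\\100")
```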

Ben Bryant

2005-11-01 16:23:53 UTC

Hey thanks for your response and for your mention on your blog. As I said,
avoiding non-Unicode is good advice, but it is still a different issue than
the one I am trying to call attention to.

1. going through 932, 949 will cause you to lose your yen sign.
2. in Unicode, you need to repair your path separators to yen signs where
applicable.

2 is independent of 1.

Thanks,
Ben

Mihai N.

2005-11-01 17:24:11 UTC

Post by Ben Bryant
1. going through 932, 949 will cause you to lose your yen sign.
2. in Unicode, you need to repair your path separators to yen signs
where applicable.
2 is independent of 1.

This is where I think the misunderstanding is.
If you have been Unicode from beginning to the end, there is nothing to
fix. You have something to fix (2) only if you went through 932, 949 (1).
So 2 is dependent on 1.

--
Mihai Nita [Microsoft MVP, Windows - SDK]
------------------------------------------
Replace _year_ with _ to get the real email

Ben Bryant

2005-11-01 18:57:03 UTC

No, this is exactly the point I am having so much trouble getting across. If
you are using Unicode from beginning to end, YOU WILL STILL HAVE THE
PROBLEM. OnKeyDown for the yen sign will produce U+005c IN UNICODE!! Your
program will need to decide that it should be U+00a5 based on knowledge of
text subject (money). 2 is independent of 1. This is not about character set
conversion folks, you all seem to have that stuck in your minds. This is
about a peculiarity of the Japanese and Korean locales IN UNICODE.

Ben Bryant

2005-11-01 19:14:41 UTC

Mihai, I think you answered my question on Michael's blog:
http://blogs.msdn.com/michkap/archive/2005/11/01/487665.aspx
They use the wide yen sign U+FFE5 from the keyboard. Until now I was going
off the beginning of this same thread where it was suggested that U+005c was
the only way of putting in a yen sign. Now as far as I'm concerned there is
unlikely to be any significant problem at all.
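
If the full-width yen sign is what arrives from the IME, a program that wants a single representation for currency could fold it to U+00A5. This is a sketch, assuming compatibility folding of just that one character is acceptable for the text in question; the helper name is hypothetical. U+005C is deliberately left alone because it may be a path separator.

```python
import unicodedata

def fold_fullwidth_yen(text):
    """Map U+FFE5 to its compatibility equivalent, leaving other characters as typed."""
    # NFKC of FULLWIDTH YEN SIGN is the ordinary YEN SIGN, U+00A5.
    folded = unicodedata.normalize("NFKC", "\uffe5")
    return text.replace("\uffe5", folded)

result = fold_fullwidth_yen("\uffe51,000 and C:\\temp")
print(ascii(result))
```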
Thanks,
Ben

Mihai N.

2005-11-01 19:23:42 UTC

Post by Ben Bryant
http://blogs.msdn.com/michkap/archive/2005/11/01/487665.aspx

That was after thinking a bit more :-)
I should have posted here too.

Post by Ben Bryant
They use the wide yen sign U+FFE5 from the keyboard. Until now I was
going off the beginning of this same thread where it was suggested
that U+005c was the only way of putting in a yen sign. Now as far as
I'm concerned there is unlikely to be any significant problem at all.

I know it is not a perfect solution, but I think it is an acceptable one.

Post by Ben Bryant
Thanks,
Ben

Welcome. And let's hope it works :-)

Norman Diamond

2005-11-02 00:45:47 UTC

Mr. Bryant, I agree with you that Microsoft's adjustments to Unicode cause
the problem that you're describing here in Unicode. Of course there's no
good solution because path separators are more important than money. (How
many path separators did you pay for your MSDN subscription?)

Now please be warned that the solution that you think you found is not a
solution. It is possible for a user to input a full-width yen sign from the
keyboard via the IME, but it is also possible for a user to input a plain
ordinary half-width yen sign. You cannot force users to abandon ordinary
habits. You cannot tell users that yen signs must always be input as
full-width -- you will not have customers for very long if you try that. I
don't know a solution and I think you were right when you said it's a hard
problem.

Mihai N.

2005-11-02 01:57:01 UTC

Post by Norman Diamond
Of course
there's no good solution because path separators are more important
than money. (How many path separators did you pay for your MSDN
subscription?)

For my mom money is more important than path separators :-)
(just a joke, really!)

Post by Norman Diamond
Now please be warned that the solution that you think you found is not
a solution. It is possible for a user to input a full-width yen sign
from the keyboard via the IME, but it is also possible for a user to
input a plain ordinary half-width yen sign. You cannot force users to
abandon ordinary habits. You cannot tell users that yen signs must
always be input as full-width -- you will not have customers for very
long if you try that. I don't know a solution and I think you were
right when you said it's a hard problem.

This discussion is kind of going in parallel in this newsgroup and on
MichKa's blog (http://blogs.msdn.com/michkap/archive/2005/11/01/487665.aspx)

I'm afraid that we all agree that this is not a real solution, and there
is no way a good solution can be devised.
We can add a smart parser to the mix, trying to figure out whether the string
is a path, but it is all only heuristics, not a real algorithm.
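
A "smart parser" of the kind mentioned here might start from something like the following. It is a heuristic sketch, not an algorithm: the two patterns (a drive-letter prefix and a UNC prefix) and the function name are assumptions chosen for illustration, and relative paths or prose that happens to contain backslashes will defeat it.

```python
import re

# Assumed patterns: "C:\..." drive paths and "\\server\share" UNC paths.
_DRIVE_PATH = re.compile(r"^[A-Za-z]:\\")
_UNC_PATH = re.compile(r"^\\\\[^\\]+\\[^\\]+")

def looks_like_windows_path(s):
    """Guess whether a string is a Windows path (heuristic only)."""
    return bool(_DRIVE_PATH.match(s) or _UNC_PATH.match(s))

print(looks_like_windows_path("C:\\Users\\ben"))
print(looks_like_windows_path("\\\\server\\c$"))
print(looks_like_windows_path("\\1,500"))
```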

Norman Diamond

2005-11-02 05:05:42 UTC

"Mihai N." <***@yahoo.com> wrote in message news:***@207.46.248.16...
[Norman Diamond replying to Ben Bryant:]

Post by Mihai N.

Post by Norman Diamond
Now please be warned that the solution that you think you found is not a
solution. It is possible for a user to input a full-width yen sign from
the keyboard via the IME, but it is also possible for a user to input a
plain ordinary half-width yen sign. You cannot force users to abandon
ordinary habits.

This discussion is kind of going in parallel in this newsgroup and on
MichKa's blog
(http://blogs.msdn.com/michkap/archive/2005/11/01/487665.aspx)

OK I looked. And to prove that I looked, here's the latest screenshot of
the ongoing fight between Microsoft's statements and Microsoft's software:
[screenshot]

Also from one of Michael Kaplan's earlier blog pages I took two screenshots
of parts of the same blog page, showing the ongoing three-way fight among
Microsoft's statements, Microsoft's software, and Microsoft's software.
When is a backslash not a backslash, indeed, or when can't it make up its
mind:
[screenshot]

[screenshot]

Post by Mihai N.
The main reason is that most Japanese text uses the wide versions of
Katakana (including the Windows UI starting with W95).
As a result, the users will use the wide Yen (U+FFE5) when they talk about
currency and the reverse solidus (U+005C) for file paths.

Users can use either the fullwidth yen symbol (U+FFE5) or halfwidth yen
symbol (U+whatever, JIS-Romaji 0x5C) when they talk about currency. Most
common are the Kanji for yen and the halfwidth yen symbol. As a rough
guess, the Kanji is more common in printed materials but the halfwidth
symbol (JIS-Romaji 0x5C) is more common in online shopping. The fullwidth
symbol is less common than either of those.

The Windows UI seems to be pretty much random as to using full-width or
half-width in katakana. Sometimes it uses both. For example if Windows 98
Service Pack 1 is installed onto an existing Windows 98 (first edition)
system then the Start menu ends up with two Accessories folders and two
Address Book entries. If Word 2000 is installed as an upgrade onto an
existing Office 97 suite then the Start menu ends up with four useless
Office shortcuts instead of the usual two. And these are immediately
visible in the Start menu as soon as the user does _anything_ after
installing and/or rebooting. There's no way anyone could miss them. These
are pretty strong evidence of Microsoft products not being tested before
release. And since Microsoft never even released fixes for them, maybe
Microsoft didn't even test them after release, maybe it's only customers who
noticed. Fortunately these cases don't cause loss of data, they just cause
confusion and laughter.

Also in some earlier page that was pointed back to, Mr. Kaplan continued to
assert (after censoring my part of the discussion some months ago) that
Microsoft's corruption to the Unicode sorting order doesn't penalize
Japanese or Korean users for having data that include their currency
symbols. The penalty continues. If anyone had any reason for sorting their
data by numerical values of the codepoints, Windows 95 and Windows 98 and
Unix and other systems do it (0x5C is 0x5C), but Windows NT4 through 2003
don't (0x5C is folded into a different codepoint). Microsoft actually tells
Japanese and Korean users that now Japanese and Korean users can sort their
data the same way Americans always did. Japanese and Korean users who use
Microsoft systems now can't sort their data the way Japanese and Korean
users always did (well, always in the case when they keyed on codepoints
rather than pronunciation or some other geeky criteria).

Michael (michka) Kaplan [MS]

2005-11-02 06:04:39 UTC

Post by Norman Diamond
These are pretty strong evidence of Microsoft products not being tested
before release. And since Microsoft never even released fixes for them,
maybe Microsoft didn't even test them after release, maybe it's only
customers who noticed.

Norman,

You are wrong. If you had any idea how much testing DOES happen, you might
even feel embarrassed enough for this entirely inaccurate anti-Microsoft
rant of a post. Though I am inclined to doubt it.

Post by Norman Diamond
Fortunately these cases don't cause loss of data, they just cause
confusion and laughter.

Kind of like the only way people might react to the posts you have been
making here like this last rant?

--
MichKa [Microsoft]
NLS Collation/Locale/Keyboard Technical Lead
Globalization Infrastructure, Fonts, and Tools
Blog: http://blogs.msdn.com/michkap

This posting is provided "AS IS" with
no warranties, and confers no rights.

Norman Diamond

2005-11-02 06:55:38 UTC

Post by Michael (michka) Kaplan [MS]

Post by Norman Diamond
These are pretty strong evidence of Microsoft products not being tested
before release. And since Microsoft never even released fixes for them,
maybe Microsoft didn't even test them after release, maybe it's only
customers who noticed.

Norman,
You are wrong. If you had any idea how much testing DOES happen, you might
even feel embarrassed enough for this entirely inaccurate anti-Microsoft
rant of a post. Though I am inclined to doubt it.

I doubt it too. The cases I mentioned here are, as mentioned, impossible to
miss. One of your colleagues admitted privately that Microsoft doesn't test
most language versions of its products as much as it tests English language
versions. On this matter I believe your colleague. I can believe your
insinuation of how much testing DOES happen because you slyly omit
mentioning which language version gets so much testing.

Post by Michael (michka) Kaplan [MS]

Post by Norman Diamond
Fortunately these cases don't cause loss of data, they just cause
confusion and laughter.

Kind of like the only way people might react to the posts you have been
making here like this last rant?

Fine. Shall we return to cases where Microsoft destroys the entire contents
of hard disk partitions? Those don't even depend on language version. In
one case Microsoft even asserted that there was no need to test or fix --
Microsoft wasn't dogfooding and external SCSI drives weren't commonly used
in one particular country. Maybe you should be confused and laugh about
that too?

Anyone else who wants to be confused and laugh about the facts contained in
rants, feel free.

Norman Diamond

2005-11-01 08:44:19 UTC

Post by Ben Bryant
But there is no character code difference between the $ for USD and hidden
drive reference.

True.

Post by Ben Bryant
There is a character code difference between the yen sign for money and
path separator.

False. The yen sign is the path separator.

Post by Ben Bryant
The point is that the programmer must "repair" it if it refers to money
otherwise it will be displayed as a Won in Korea and a backslash
everywhere else.

Yes that happens. For comparison, where ASCII and JIS-Romaji have curly
braces for { and }, some code pages have alphabetics. Yes it's a mess. The
C committee invented trigraphs to make it messier. The best we can do for
many petabytes of existing databases and documents is just to know what code
pages were used in writing them.

If Unicode had been invented before ASCII and JIS-Romaji and other code
pages, we could imagine that this problem might be avoided. But memory was
more expensive in those days, so even if Unicode had been invented first,
there would have been too many complaints from some foreign countries where
people didn't want to use 16 bits for each of the characters that they used.
So even in our dreams it's still a mess.

Ben Bryant

2005-11-01 12:25:51 UTC

Thanks for your reply!

Post by Norman Diamond
False. The yen sign is the path separator.

In Unicode there is a different code point although they both appear the
same in the Japanese locale. I am only talking about Unicode here. It is
possible for the program and the OS to operate in Unicode without any
conversion to the "ANSI" (double byte) system code page. The issue I am
describing does not need to involve non-Unicode charsets.

Norman Diamond

2005-11-02 00:53:26 UTC

"Ben Bryant" <***@firstobject.com> wrote in message news:o9J9f.7444$***@dukeread10...
[Norman Diamond:]

Post by Ben Bryant

Post by Norman Diamond
False. The yen sign is the path separator.

After posting that, I figured out that you seem to be more concerned with
Unicode than with code pages. In Unicode the yen sign isn't the path
separator, but Microsoft's conversions from code pages to Unicode create the
same problem. (And they have to do it that way.)
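
As an illustration of that point, the following Python sketch decodes two byte strings with Python's own cp932 codec. The sample byte strings are made up for the example, and the codec is used here only as a stand-in for a code page 932 to Unicode conversion. Both the price and the path come out with U+005C, so after conversion nothing in the data says which one meant money.

```python
# Bytes as a code page 932 application might have stored them (made-up samples).
price_bytes = b"\x5c1,000"   # a price, typed with the yen key
path_bytes = b"C:\x5ctemp"   # a path, typed with the same key

price = price_bytes.decode("cp932")
path = path_bytes.decode("cp932")


def code_points(text):
    """Return the code points of text as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(code_points(price))
print(code_points(path))
```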

Post by Ben Bryant
In Unicode there is a different code point although they both appear the
same in the Japanese locale. I am only talking about Unicode here.

Again a mixture of truth and falsity, sorry. Unicode has a half-width
reverse solidus character which cannot be displayed at all in an ordinary
Japanese font. Microsoft Word has an option, on by default, to display it
as a half-width yen sign. If the user turns the option off then Microsoft
has to substitute a different font for that character in order to display it
as a half-width reverse solidus. Internet Explorer seems to be a lot more
random in choosing what it will display for that codepoint if the page
encoding isn't Japanese. (If the page encoding is Japanese then there's no
doubt that the yen sign is a yen sign, but then we're not talking Unicode.)

Unicode has a half-width yen sign which is no problem at all. Unicode has
full-width versions of both the reverse solidus and yen sign, and both of
them are no problem at all.
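
For reference, the four code points involved can be listed with Python's standard unicodedata module. This is only a sketch for inspecting the characters, not something proposed in the thread. It also shows that NFKC normalization maps each full-width form to its half-width counterpart.

```python
import unicodedata

# The half-width and full-width reverse solidus and yen sign.
CANDIDATES = ["\u005c", "\u00a5", "\uff3c", "\uffe5"]

for ch in CANDIDATES:
    folded = unicodedata.normalize("NFKC", ch)
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        "width:", unicodedata.east_asian_width(ch),
        "NFKC:", f"U+{ord(folded):04X}",
    )
```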

Post by Ben Bryant
It is possible for the program and the OS to operate in Unicode without
any conversion to the "ANSI" (double byte) system code page.

Internally yes, but if you're going to display something to the user then
you still need to figure out how to display what you want. The problem is
every bit as difficult as you said it is.

Ben Bryant

2005-11-01 16:35:24 UTC

Post by Norman Diamond
The best we can do for many petabytes of existing databases and documents
is just to know what code pages were used in writing them.

Knowing the originating code page is important, but I believe the path
separator/yen/won problem is unique and differs from an ordinary
character set incompatibility. It is unique because of the role of the path
separator in the OS and the locale font work-around.

