Claus Brod
2006-06-22 22:19:15 UTC
I'm working on a Windows app which uses UTF-16 as its internal Unicode
encoding. We are now looking at the subject of supplementary characters
outside of the BMP and their encoding into surrogate pairs.
From what I've learnt so far from using the search engines and consulting
documentation from Microsoft, I'm getting mixed signals both about the
relevance of the characters outside of the BMP and the practicality of using
the Win32 API and/or the C runtime library to process surrogate pairs.
On the relevance of supplementary characters: As far as I could find out so
far, Windows does not ship with fonts which cover Unicode planes 1 and 2. So
by default, if a text contains a surrogate pair, the user will not be able to
display it correctly on a Windows system. The user can, of course, install
additional fonts which cover those planes. I am not aware of such a font from
Microsoft, but maybe I haven't been looking carefully enough.
So I'm asking myself: If an ordinary user cannot display those supplementary
characters, how relevant can they be in practice anyway?
(I'm guessing that the supplementary characters are probably most useful for
speakers of Chinese. If any native speakers of Chinese frequent this forum,
I'd love to hear your input on this question.)
Regarding practicality: wchar_t is a 16-bit type on Windows, i.e. a wchar_t
is not wide enough to hold a character outside of the BMP. Functions such as
iswspace() or iswalnum() accept a parameter of type wint_t, which is defined
as unsigned short - so the character type classification functions won't work
for supplementary characters, either. In one of Michael Kaplan's blog
entries, I found that he confirmed that not even the low-level Win32
CharNext() function properly handles surrogate pairs. Explicit surrogate
support seems to be the rare exception in the Win32 API.
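To make the problem concrete, here is a minimal sketch of what decoding
surrogate pairs by hand might look like in plain C. The helper names
(is_high_surrogate, is_low_surrogate, utf16_next) are hypothetical and are not
part of the Win32 API or the C runtime. The sketch uses uint16_t rather than
wchar_t so that it does not depend on the platform's wchar_t width, and it
assumes that mapping unpaired surrogates to U+FFFD is acceptable.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers; not part of Win32 or the C runtime. */
static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Decode the code point that starts at s[*i], where len is the number of
   16-bit code units in s and *i < len. Advances *i by one unit for a BMP
   character (or an unpaired surrogate) and by two units for a valid
   surrogate pair. Unpaired surrogates are returned as U+FFFD. */
static uint32_t utf16_next(const uint16_t *s, size_t len, size_t *i)
{
    uint16_t hi = s[(*i)++];

    if (is_high_surrogate(hi) && *i < len && is_low_surrogate(s[*i])) {
        uint16_t lo = s[(*i)++];
        /* 10 payload bits from each half, offset by 0x10000. */
        return 0x10000u + (((uint32_t)(hi - 0xD800) << 10)
                           | (uint32_t)(lo - 0xDC00));
    }
    if (is_high_surrogate(hi) || is_low_surrogate(hi))
        return 0xFFFDu; /* assumption: replace unpaired surrogates */

    return hi; /* ordinary BMP character */
}
```

A loop that calls utf16_next() until the index reaches the length visits whole
code points instead of 16-bit units. The resulting 32-bit values would still
need classification functions of their own, since they do not fit into the
16-bit wint_t described above.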
Again, I'd love to be proven wrong and to be educated about how to use the
Win32 API and/or the C runtime to properly handle surrogate pairs. So far, it
looks as if, should we really want to support them, we'd have to roll our own
implementations of iswXXX(), towlower(), towupper(), etc.
Thanks for any hints or pointers!
Claus