Surrogate support in C runtime library/Win32 API
Claus Brod
2006-06-22 22:19:15 UTC
I'm working on a Windows app which uses UTF-16 as its internal Unicode
encoding. We are now looking at the subject of supplementary characters
outside of the BMP and their encoding into surrogate pairs.

From what I've learnt so far from using the search engines and consulting
documentation from Microsoft, I'm getting mixed signals both about the
relevance of the characters outside of the BMP and the practicality of using
the Win32 API and/or the C runtime library to process surrogate pairs.

On the relevance of supplementary characters: As far as I could find out so
far, Windows does not ship with fonts which cover Unicode planes 1 and 2. So
by default, if a text contains a surrogate pair, the user will not be able to
display it correctly on a Windows system. The user can, of course, install
additional fonts which cover those planes. I am not aware of such a font from
Microsoft, but maybe I haven't been looking carefully enough.

So I'm asking myself: If an ordinary user cannot display those supplementary
characters, how relevant can they be in practice anyway?

(I'm guessing that the supplementary characters are probably most useful for
speakers of Chinese. If any native speakers of Chinese frequent this forum,
I'd love to hear your input on this question.)

Regarding practicality: wchar_t is a 16-bit type on Windows, i.e. a wchar_t
is not wide enough to hold a character outside of the BMP. Functions such as
iswspace() or iswalnum() accept a parameter of type wint_t, which is defined
as unsigned short - so the character type classification functions won't work
for supplementary characters, either. In one of Michael Kaplan's blog
entries, I found that he confirmed that not even the low-level Win32
CharNext() function properly handles surrogate pairs. Explicit surrogate
support seems to be the rare exception in the Win32 API.
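
To make the 16-bit limitation concrete, here is a minimal sketch of the UTF-16 surrogate arithmetic in portable C. The function names are hypothetical helpers invented for illustration, not part of the C runtime or the Win32 API, and uint16_t stands in for the 16-bit wchar_t on Windows.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; not CRT or Win32 functions. */

/* High (lead) surrogates occupy 0xD800..0xDBFF. */
static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

/* Low (trail) surrogates occupy 0xDC00..0xDFFF. */
static int is_low_surrogate(uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

/* Split a supplementary code point (U+10000..U+10FFFF) into a surrogate pair. */
static void encode_surrogate_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    cp -= 0x10000;                              /* 20 bits remain */
    *high = (uint16_t)(0xD800 + (cp >> 10));    /* upper 10 bits */
    *low  = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* lower 10 bits */
}

/* Combine a valid high/low pair back into one code point. */
static uint32_t decode_surrogate_pair(uint16_t high, uint16_t low)
{
    assert(is_high_surrogate(high) && is_low_surrogate(low));
    return 0x10000
         + ((uint32_t)(high - 0xD800) << 10)
         + (uint32_t)(low - 0xDC00);
}
```

For example, U+20000 (the first code point of plane 2) encodes as the pair 0xD840 0xDC00. A function that takes a single 16-bit argument only ever sees one half of such a pair, so it cannot classify the character.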

Again, I'd love to be proven wrong and to be educated about how to use the
Win32 API and/or the C runtime to properly handle surrogate pairs. So far, it
looks as if, should we really want to support them, we'd have to roll our own
implementation of iswXXX(), towlower(), towupper() etc.

Thanks for any hints or pointers!

Claus
Mihai N.
2006-06-23 02:42:21 UTC
Post by Claus Brod
As far as I could find out so
far, Windows does not ship with fonts which cover Unicode planes 1 and 2.
...
Post by Claus Brod
I am not aware of such a font from
Microsoft, though, but maybe I haven't been looking carefully enough.
Support for characters beyond the BMP is required for GB-18030, so here is
the MS add-on, which includes a free font:
http://www.microsoft.com/downloads/details.aspx?familyid=FC02E2E3-14BB-46C1-AFEE-3732D6249647&displaylang=en
Post by Claus Brod
So I'm asking myself: If an ordinary user cannot display those
supplementary characters, how relevant can they be in practice anyway?
For now, if you want to sell in China, you have to be GB-18030 compliant.
So you have to support surrogates.
Post by Claus Brod
Explicit surrogate support seems to be the rare exception in the Win32 API.
Most APIs are surrogate aware; those that are not are the exceptions
(and bugs should be filed).
This is a gradual process: XP was better than 2000, and Vista is better than
XP.
Post by Claus Brod
the C runtime to properly handle surrogate pairs.
...
Post by Claus Brod
So far, it
looks as if, should we really want to support them, we'd have to roll our own
implementation of iswXXX(), towlower(), towupper() etc.
Working at the level of individual characters is a poor fit for
internationalization and for handling Unicode strings properly. A character
can consist of several code points even without surrogates (see combining
characters), and for some languages upper/lower case conversion needs more
context, or might require one-to-many or many-to-one mappings.
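
As a rough illustration of why per-unit processing falls short, the following portable C sketch walks a UTF-16 buffer one code point at a time. The function names are hypothetical, and replacing unpaired surrogates with U+FFFD is an assumption made for this example, not a documented Win32 behavior.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read one code point starting at s[*i] and advance *i.
   Assumes *i < len. An unpaired surrogate is reported as U+FFFD. */
static uint32_t next_code_point(const uint16_t *s, size_t len, size_t *i)
{
    uint16_t u = s[(*i)++];

    /* A high surrogate followed by a low surrogate forms one code point. */
    if (u >= 0xD800 && u <= 0xDBFF && *i < len) {
        uint16_t v = s[*i];
        if (v >= 0xDC00 && v <= 0xDFFF) {
            (*i)++;
            return 0x10000
                 + ((uint32_t)(u - 0xD800) << 10)
                 + (uint32_t)(v - 0xDC00);
        }
    }

    /* Any surrogate reaching this point is unpaired. */
    if (u >= 0xD800 && u <= 0xDFFF)
        return 0xFFFD;

    return u;
}

/* Count code points (not code units) in a UTF-16 buffer of len units. */
static size_t count_code_points(const uint16_t *s, size_t len)
{
    size_t i = 0, n = 0;
    while (i < len) {
        (void)next_code_point(s, len, &i);
        n++;
    }
    return n;
}
```

Take the buffer 0x0065 0x0301 0xD840 0xDC00: it holds four code units and three code points, yet a reader sees only two characters, because U+0065 followed by the combining acute accent U+0301 displays as a single "é". Counting code points fixes the surrogate problem but not the combining-character one, which is why string-level APIs are the better tool.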

The C/C++ standard library is outdated in this respect, and wchar_t is
just a quick patch made without too much thought.

For string handling, see the Win32 NLS API (like GetStringType, GetStringTypeEx,
CharLower, CharUpper, CharUpperBuff, CharLowerBuff, CompareString, etc.)
--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email