Claus Brod
2006-04-18 20:15:02 UTC
Hi all,
if this subject has already been beaten to death umpteen times, I apologize.
I did my share of googling around to find previous answers, but maybe I
missed some obvious search terms; if so, sorry for this, and thanks for any
pointers.
Looking at the I18N environment for Win32 which Microsoft has laid out for
us C/C++ programmers, there are two obvious preferred encodings for
characters: UTF-8 and UTF-16.
UTF-8 is an obvious choice because it allows us to leave much of the original
codepage-based code basically unchanged - programmers can continue to use
code based on "char *" strings, and most of the code will probably just
continue to work without changes. Only a few places where data is fed into
the application or is exported to files/sockets/UI/web pages/whatever need to
know explicitly which encodings are expected on the other side, and then
apply the right conversion.
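For example, such a boundary conversion can go through the Win32 conversion
API; here's a minimal sketch (the function name is mine, error handling
omitted):

#include <windows.h>
#include <string>

// Sketch: convert a UTF-8 string to UTF-16 at an application
// boundary, using the Win32 conversion API.
std::wstring Utf8ToUtf16(const std::string &utf8)
{
    // First call computes the required size in wchar_ts,
    // including the terminating L'\0'.
    int len = ::MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, NULL, 0);
    std::wstring wide(len, L'\0');
    ::MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, &wide[0], len);
    wide.resize(len - 1);   // drop the terminating L'\0'
    return wide;
}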
However, none of the Microsoft libraries seem to be suited for UTF-8. The C
and C++ runtimes implicitly assume that "char *" strings are encoded according
to the current locale. Since UTF-8 cannot be specified in a locale, this
means - typically - that the C and C++ runtimes assume that all strings are
encoded in the current ANSI codepage - usually Windows-1252, which is close
to ISO8859-1 aka Latin1 - even though the string may actually be encoded in
UTF-8. As a result, many string functions will fail to work correctly, or
may even corrupt the string.
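A trivial example of the kind of breakage I mean - the byte-oriented CRT
functions happily miscount and split multi-byte UTF-8 sequences:

#include <cstdio>
#include <cstring>

int main()
{
    // "Gruesse" with u-umlaut and sharp-s, in UTF-8: those two
    // characters are two bytes each, so the string is 5 characters
    // but 7 bytes long. (Literal split so 'e' doesn't extend \x9F.)
    const char utf8[] = "Gr\xC3\xBC\xC3\x9F" "e";

    std::printf("%u\n", (unsigned)std::strlen(utf8));  // prints 7, not 5

    // Truncating at a fixed byte offset can cut a multi-byte
    // sequence in half, leaving invalid UTF-8 behind.
    char buf[4] = { 0 };
    std::strncpy(buf, utf8, 3);  // buf ends with a lone lead byte 0xC3
    return 0;
}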
std::string doesn't know about UTF-8 either, nor does CString (MFC). If you
assign a "char *" string to a CString in UNICODE mode, the CString
constructor will assume that the input string is encoded in the current
codepage, i.e. it will typically convert from Windows-1252 to UTF-16, even
if the input string actually is in UTF-8.
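That implicit conversion can at least be sidestepped by converting
explicitly; a sketch for a UNICODE build (again, the function name is mine
and error handling is omitted):

#include <atlstr.h>

// Sketch: build a CString (UNICODE build) from UTF-8 input,
// bypassing the implicit current-codepage conversion.
CString FromUtf8(const char *utf8)
{
    int len = ::MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
    CString result;
    ::MultiByteToWideChar(CP_UTF8, 0, utf8, -1, result.GetBuffer(len), len);
    result.ReleaseBuffer();   // recomputes the length from the L'\0'
    return result;
}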
So it seems that the whole setup more or less urges developers to move to
UTF-16, even though this typically requires more pervasive changes
throughout the code and will also increase the memory footprint.
In an ideal world (IMHO), the Microsoft C/C++ runtime would allow us to say
something like setlocale(LC_ALL, "german.UTF-8") and would then know how to handle
UTF-8 strings. However, this does not seem to be possible, as previous posts
and discussions have shown. Or am I missing something obvious? Have you been
in a similar situation, and how did you decide?
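(For what it's worth, the obvious experiment fails as described above -
setlocale() simply rejects a UTF-8 locale name and returns NULL:

#include <clocale>
#include <cstdio>

int main()
{
    // On the current Microsoft runtimes, this returns NULL and
    // leaves the locale unchanged - UTF-8 codepages are rejected.
    const char *loc = std::setlocale(LC_ALL, "german.UTF-8");
    std::printf("%s\n", loc ? loc : "(null)");
    return 0;
}
)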
Thanks!
Claus
http://www.clausbrod.de/Blog/BlogOnSoftware