Discussion:
Choosing the right encoding: UTF-8 vs. UTF-16
Claus Brod
2006-04-18 20:15:02 UTC
Permalink
Hi all,

if this subject has already been beaten to death umpteen times, I apologize.
I did my share of googling around to find previous answers, but maybe I
missed some obvious search terms; if so, sorry for this, and thanks for any
pointers.

Looking at the I18N environment for Win32 which Microsoft has laid out for
us C/C++ programmers, there are two obvious preferred encodings for
characters: UTF-8 and UTF-16.

UTF-8 is an obvious choice because it allows us to leave much of the original
codepage-based code basically unchanged - programmers can continue to use
code based on "char *" strings, and most of the code will probably just
continue to work without changes. Only a few places where data is fed into
the application or is exported to files/sockets/UI/web pages/whatever need to
know explicitly which encodings are expected on the other side, and then
to apply the right conversion.

However, none of the Microsoft libraries seem to be suited for UTF-8. The C
and C++ runtime implicitly assume that "char *" strings are encoded according
to the current locale. Since UTF-8 cannot be specified in a locale, this
means - typically - that the C and C++ runtimes assume that all strings are
encoded in ISO8859-1 aka Latin1 (or maybe codepage 1252 or whatever the
codepage number is), even though the string may actually be encoded according
to UTF-8. As a result, many string functions will fail to work correctly, or
may even corrupt the string.
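To make the mismatch concrete, here is a minimal sketch (purely illustrative,
not taken from any real codebase) of how the byte-oriented CRT sees a UTF-8
string, and of the explicit CP_UTF8 conversion you have to do by hand to get
at the actual characters:

#include <windows.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "Gruesse" with u-umlaut and sharp s, spelled out as UTF-8 bytes so
       the source file encoding doesn't matter: 5 characters, 7 bytes. */
    const char *utf8 = "Gr\xC3\xBC\xC3\x9F" "e";

    printf("strlen() counts %u bytes\n", (unsigned)strlen(utf8));

    /* Only an explicit conversion knows better: ask MultiByteToWideChar how
       many UTF-16 code units (including the terminating zero) the text needs. */
    int units = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
    printf("UTF-16 needs %d code units including the terminator\n", units);
    return 0;
}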

std::string doesn't know about UTF-8 either, nor does CString (MFC). If you
assign a "char *" string to a CString in non-UNICODE mode, the CString
constructor will assume that the input string is encoded according to the
current codepage, i.e. it will typically convert from Latin1 to UTF-16, even
if the input string actually is in UTF-8.

So it seems that the whole setup more or less urges developers to move to
UTF-16, even though this will typically require more formal changes
throughout the code, and will also increase the memory footprint.

In an ideal world (IMHO), the Microsoft C/C++ runtime would allow us to say
something like setlocale(LC_ALL, "german.UTF-8") and then know how to handle
UTF-8 strings. However, this does not seem to be possible, as previous posts
and discussions have shown. Or am I missing something obvious? Have you been
in a similar situation, and how did you decide?

Thanks!

Claus

http://www.clausbrod.de/Blog/BlogOnSoftware
David Wilkinson
2006-04-18 21:54:15 UTC
Permalink
Post by Claus Brod
Hi all,
if this subject has already been beaten to death umpteen times, I apologize.
I did my share of googling around to find previous answers, but maybe I
missed some obvious search terms; if so, sorry for this, and thanks for any
pointers.
Looking at the I18N environment for Win32 which Microsoft has laid out for
us C/C++ programmers, there are two obvious preferred encodings for
characters: UTF-8 and UTF-16.
UTF-8 is an obvious choice because it allows us to leave much of the original
codepage-based code basically unchanged - programmers can continue to use
code based on "char *" strings, and most of the code will probably just
continue to work without changes. Only a few places where data is fed into
the application or is exported to files/sockets/UI/web pages/whatever need to
know explicitly which encodings are expected on the other side, and then
to apply the right conversion.
If you had coded as Microsoft has advised for at least 10 years, you
would have used TCHAR, _T("") and all that, and conversion to "Unicode"
should have been very easy.
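For anyone who has not seen it, the TCHAR idiom looks roughly like this (a
minimal sketch of MSVC's tchar.h mappings; the same source builds as an ANSI
or as a UTF-16 program depending on whether _UNICODE/UNICODE are defined):

#include <windows.h>
#include <tchar.h>
#include <stdio.h>

/* With _UNICODE defined, TCHAR is wchar_t, _T("...") becomes L"...",
   _tcslen maps to wcslen and _tprintf to wprintf; without it, everything
   maps back to the plain char versions. */
int _tmain(int argc, _TCHAR *argv[])
{
    const TCHAR *greeting = _T("Hello, world");
    _tprintf(_T("%s is %u TCHARs long\n"),
             greeting, (unsigned)_tcslen(greeting));
    return 0;
}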
Post by Claus Brod
However, none of the Microsoft libraries seem to be suited for UTF-8. The C
and C++ runtime implicitly assume that "char *" strings are encoded according
to the current locale. Since UTF-8 cannot be specified in a locale, this
means - typically - that the C and C++ runtimes assume that all strings are
encoded in ISO8859-1 aka Latin1 (or maybe codepage 1252 or whatever the
codepage number is), even though the string may actually be encoded according
to UTF-8. As a result, many string functions will fail to work correctly, or
may even corrupt the string.
Examples?
Post by Claus Brod
std::string doesn't know about UTF-8 either, nor does CString (MFC). If you
assign a "char *" string to a CString in non-UNICODE mode, the CString
constructor will assume that the input string is encoded according to the
current codepage, i.e. it will typically convert from Latin1 to UTF-16, even
if the input string actually is in UTF-8.
I think you mean: If you assign a "char *" string to a CString in
UNICODE mode .... This indeed is one of the most dangerous features of
CString. IMHO these "conversion" constructors should be absent, and I
have removed them from my copies of the MFC headers (in VC6).
Post by Claus Brod
So it seems that the whole setup more or less urges developers to move to
UTF-16, even though this will typically require more formal changes
throughout the code, and will also increase the memory footprint.
More memory, yes, at least for Western languages. More changes, not
necessarily.
Post by Claus Brod
In an ideal world (IMHO), the Microsoft C/C++ runtime would allow us to say
something like setlocale(LC_ALL, "german.UTF-8") and then know how to handle
UTF-8 strings. However, this does not seem to be possible, as previous posts
and discussions have shown. Or am I missing something obvious? Have you been
in a similar situation, and how did you decide?
Personally, I have a back end which uses std::string and UTF-8, and I
convert to/from UTF-16 when I move strings from/to the back
end. Indeed my back-end was always 8-bit, and so I had to make no
changes there when I moved it from local code page to UTF-8.

For me, UTF-8 would have been a better choice for the operating system,
considering that UTF-16 now has surrogate pairs.

David Wilkinson
Claus Brod
2006-04-18 22:31:02 UTC
Permalink
David,

thanks for your prompt reply!
Post by David Wilkinson
If you had coded as Microsoft has advised for at least 10 years, you
would have used TCHAR, _T("") and all that, and conversion to "Unicode"
should have been very easy.
Definitely easier, that's true. However, we're talking about a codebase with
a Unix background, so no TCHARs or __T's.
Post by David Wilkinson
Post by Claus Brod
codepage number is), even though the string may actually be encoded according
to UTF-8. As a result, many string functions will fail to work correctly, or
may even corrupt the string.
Examples?
Off the top of my head: str*cmp() will not work as expected (except when
used strictly to test for equality). fopen(), when passed a UTF-8 filename,
will create funny filenames in the file system. When strings are printed to
the console, an automatic conversion into the console codepage takes place,
which assumes that the original codepage of the string was the current system
codepage. std::string internally uses C runtime string functions (by
default), so it is affected by the same fundamental issues.
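For the fopen() case, the workaround is to widen the name explicitly and call
_wfopen() instead - roughly like this minimal sketch (fixed-size buffer, no
error handling, and fopen_utf8 is just a made-up helper name):

#include <windows.h>
#include <stdio.h>

/* Open a file whose name is UTF-8, bypassing the ANSI code page that a
   plain fopen() would apply to the name. */
FILE *fopen_utf8(const char *nameUtf8, const wchar_t *mode)
{
    wchar_t wname[MAX_PATH];
    if (!MultiByteToWideChar(CP_UTF8, 0, nameUtf8, -1, wname, MAX_PATH))
        return NULL;
    return _wfopen(wname, mode);
}

Called as fopen_utf8(path, L"rb"), for example - but of course every such call
site has to be found first.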

BTW, yes indeed, I did mean CString in _UNICODE mode, thanks for pointing
this out.

Claus
Norman Diamond
2006-04-19 02:32:10 UTC
Permalink
Post by Claus Brod
UTF-8 is an obvious choice because it allows us to leave much of the original
codepage-based code basically unchanged - programmers can continue to use
code based on "char *" strings, and most of the code will probably just
continue to work without changes.
Huh? Do functions such as IsLeadByte work with UTF-8?
Post by Claus Brod
However, none of the Microsoft libraries seem to be suited for UTF-8. The
C and C++ runtime implicitly assume that "char *" strings are encoded
according to the current locale.
No kidding, that's what it's for. Consider also what happens with
filenames. In an NTFS partition filenames are recorded in UTF-16, but in
FAT and in file systems that were designed in pre-Unicode days the filenames
are recorded in whatever code page the computer was using when the files
were written. If you suddenly decide to interpret those same bytes as UTF-8,
of course you'll get garbage.
Post by Claus Brod
So it seems that the whole setup more or less urges developers to move to
UTF-16,
No kidding. Except that you still have to deal with code pages when
accessing files, e-mail, web sites, etc.
Mihai N.
2006-04-19 06:33:52 UTC
Permalink
Post by Claus Brod
UTF-8 is an obvious choice because it allows us to leave much of the original
codepage-based code basically unchanged - programmers can continue to use
code based on "char *" strings, and most of the code will probably just
continue to work without changes.
And it will continue to work as badly as before. Code replacing \ with something
else, inadvertently damaging the second half of a Japanese character, and so
on.
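For the record, the classic trap is that 0x5C ('\') is a perfectly legal trail
byte in Shift-JIS. A sketch of what the DBCS-aware scan has to look like (code
page 932 hard-coded and the function name made up, just for illustration):

#include <windows.h>

/* Replace '\' with '/' without corrupting Shift-JIS text: skip the trail
   byte of every double-byte character instead of inspecting it. */
void slashify_sjis(char *s)
{
    while (*s) {
        if (IsDBCSLeadByteEx(932, (BYTE)*s) && s[1]) {
            s += 2;              /* lead + trail byte form one character */
        } else {
            if (*s == '\\')
                *s = '/';
            ++s;
        }
    }
}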
Post by Claus Brod
Only a few places where data is fed into
the application or is exported to files/sockets/UI/web pages/whatever need
to know explicitly which encodings are expected on the other side,
and then to apply the right conversion.
Depends on the application. In general, the more processing is done on the
string, the more difficult it is to use utf-8. utf-8 is good to move things
around without breaking old applications. utf-16 is better for processing.
Post by Claus Brod
However, none of the Microsoft libraries seem to be suited for UTF-8. The C
and C++ runtime implicitly assume that "char *" strings are encoded
according to the current locale. Since UTF-8 cannot be specified in a
locale, this
ISO8859-1 aka Latin1 (or maybe codepage 1252 or whatever the
codepage number is)
ISO8859-1 is Latin1, 1252 is ISO8859-1 plus some extras.
Post by Claus Brod
std::string doesn't know about UTF-8 either
This is true for any OS, not only Windows.
Post by Claus Brod
So it seems that the whole setup more or less urges developers to move to
UTF-16, even though this will typically require more formal changes
throughout the code, and will also increase the memory footprint.
True. But the memory footprint is the same between utf-8 and utf-16 for
pretty much everything but English. And for Kanji, utf-8 is worse.
Post by Claus Brod
Have you been in a similar situation, and how did you decide?
Moved to utf-16.

If you think the Linux/Unix world did the right thing with utf-8, think
again. In most cases the kernel is agnostic, it just moves bytes around.
In fact, Linux is so flexible that you can shoot yourself in the foot.
Try this: go to a command prompt, do a LANG=ja-JP.Shift_JIS and create a
file with a Japanese name, then do LANG=ru.koi-8 and ls. Create a file with
a Russian name. Then LANG=en-US.UTF-8 and ls. You get junk.
In fact, at this stage you cannot do a proper backup unless you ignore
the code pages entirely, because Shift-JIS and KOI-8 don't make valid UTF-8
sequences.

Many non-MS libraries and programming languages selected utf-16.
Java (Sun), ICU (IBM), Xerces/Xalan (Apache),
the native Mac OS X API (Apple), Qt (Trolltech).

See here for some good points: http://www.unicode.org/notes/tn12/

Yes, I know it is painful to change all that code to utf-16, but
blaming it on MS is not a solution. It's really not their fault :-)
--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
Claus Brod
2006-04-19 08:59:02 UTC
Permalink
Post by Mihai N.
And it will continue to work as badly as before. Code replacing \ with something
else, inadvertently damaging the second half of a Japanese character, and so
on.
True. (The codebase in question has been tested under such conditions for
years.)
Post by Mihai N.
Post by Claus Brod
std::string doesn't know about UTF-8 either
This is true for any OS, not only Windows.
True again. It just fits into the overall picture.
Post by Mihai N.
If you think the Linux/Unix world did the right thing with utf-8, think
again. In most cases the kernel is agnostic, just moves bytes arround.
In fact, Linux is so flexible that you can shoot yourself in the foot.
Try this: go to command prompt, do a LANG=ja-JP.Shift_JIS and create a
file with Japanese name, then do LANG=ru.koi-8 and ls. Create a file with
Russian name. The LANG=en-US.UTF-8 and ls. You get junk.
I'm not arguing about Linux; I'm trying to find the best model for
characters and strings on Windows.

On second thought, since the underlying filesystems use UTF-16 on Windows
(at least the newer ones), the kind of situation which you describe should be
a non-issue on Windows. The app may set its locale to something funny, but
it's the responsibility of the runtime library and/or the OS to convert
strings in that locale to UTF-16 when the bits (including the filename) are
finally shoved to the filesystem. As long as the application isn't lying
about the encoding of its strings, that should work reasonably well.

So if the app sets its locale to something UTF8-ish, then the runtime
libraries would know that the strings are encoded in UTF-8, and could convert
accordingly when they need to call the OS.
Post by Mihai N.
Yes, I know it is painfull to change all that code to utf-16, but
blaming it on MS is not a solution. Is really not it's fault :-)
I'm not trying to blame them, just trying to understand my options.
(Although I do wish they'd support UTF-8 locales in their runtime libraries,
so that at least I'd _have_ options...)

Thanks for your help!

Claus
Mihai N.
2006-04-20 06:58:14 UTC
Permalink
Post by Claus Brod
On second thought, since the underlying filesystems use UTF-16 on Windows
(at least the newer ones), the kind of situation which you describe should
be a non-issue on Windows.
Correct. And not only the filesystem, but the kernel itself is utf-16
(well, it started as ucs2, and was "patched" to utf-16, but I only know
of a couple of APIs where it shows :-)
Post by Claus Brod
The app may set its locale to something funny, but
it's the responsibility of the runtime library and/or the OS to convert
strings in that locale to UTF-16 when the bits (including the filename) are
finally shoved to the filesystem. As long as the application isn't lying
about the encoding of its strings, that should work reasonably well.
So if the app sets its locale to something UTF8-ish, then the runtime
libraries would know that the strings are encoded in UTF-8, and could
convert accordingly when they need to call the OS.
On Windows it does not work quite like this.
There are just two sets of APIs: ANSI and Unicode.
All ANSI calls get converted to Unicode using the default system code page.
As the name says, that is a system-wide setting (and changing it requires a
reboot), so the application has no control over it.

This is the story, a bit more detailed:
http://www.mihai-nita.net/20050306b.shtml
Post by Claus Brod
I'm not trying to blame them, just trying to understand my options.
(Although I do wish they'd support UTF-8 locales in their runtime
libraries, so that at least I'd _have_ options...)
Unfortunately, you don't have the utf-8 option.
You can go with ANSI (and be unable to handle any
data outside the current code page) or you can go with utf-16.
--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
James Brown [MVP]
2006-04-19 09:17:30 UTC
Permalink
Post by Claus Brod
Hi all,
if this subject has already been beaten to death umpteen times, I apologize.
I did my share of googling around to find previous answers, but maybe I
missed some obvious search terms; if so, sorry for this, and thanks for any
pointers.
Looking at the I18N environment for Win32 which Microsoft has laid out for
us C/C++ programmers, there are two obvious preferred encodings for
characters: UTF-8 and UTF-16.
There is only one preferred encoding for Windows - UTF-16. This is the
dominant encoding format even outside of the Windows world. UTF-8 is used
extensively for the web, XML, things like that, but for actual platform
support (Windows, OS X, Java, C# etc.) UTF-16 is the norm.
Post by Claus Brod
UTF-8 is an obvious choice because it allows us to leave much of the original
codepage-based code basically unchanged - programmers can continue to use
code based on "char *" strings, and most of the code will probably just
continue to work without changes. Only a few places where data is fed into
the application or is exported to files/sockets/UI/web pages/whatever need to
know explicitly which encodings are expected on the other side, and then
to apply the right conversion.
Agreed - UTF-8 is a far nicer encoding to work with for the C/C++
programmer.
Post by Claus Brod
However, none of the Microsoft libraries seem to be suited for UTF-8. The C
and C++ runtime implicitly assume that "char *" strings are encoded according
to the current locale. Since UTF-8 cannot be specified in a locale, this
means - typically - that the C and C++ runtimes assume that all strings are
encoded in ISO8859-1 aka Latin1 (or maybe codepage 1252 or whatever the
codepage number is), even though the string may actually be encoded according
to UTF-8. As a result, many string functions will fail to work correctly, or
may even corrupt the string.
std::string doesn't know about UTF-8 either, nor does CString (MFC). If you
assign a "char *" string to a CString in non-UNICODE mode, the CString
constructor will assume that the input string is encoded according to the
current codepage, i.e. it will typically convert from Latin1 to UTF-16, even
if the input string actually is in UTF-8.
So it seems that the whole setup more or less urges developers to move to
UTF-16, even though this will typically require more formal changes
throughout the code, and will also increase the memory footprint.
Correct, this has been the case for over 10 years now.
Post by Claus Brod
In an ideal world (IMHO), the Microsoft C/C++ runtime would allow us to say
something like setlocale(LC_ALL, "german.UTF-8") and then know how to handle
UTF-8 strings. However, this does not seem to be possible, as previous posts
and discussions have shown. Or am I missing something obvious? Have you been
in a similar situation, and how did you decide?
Thanks!
Claus
In my opinion you should use the same encoding that your API/OS/libraries
are using. In the case of Windows the native encoding is UTF-16, so this is
what you should be using in your software. Many new Win32 APIs are UTF-16
only (i.e. they require wchar_t / WCHAR string types). So a Windows C
programmer should be using WCHAR exclusively. It is a mistake to think that
using UTF-8 internally, and converting to/from UTF-16 when you speak to the
OS, will result in a better program. It won't; it will only be more
difficult to design, code and maintain.

Things get more difficult when you are using cross-platform libraries which
are available under Linux/Unix and Windows. Sometimes these libraries require
UTF-8-encoded strings. There is no easy answer for cases such as this.

It is very unfortunate that UTF-8 didn't exist when the first Unicode
standard was released. At that time the proposed encoding was UCS-2, in
which all characters were to be encoded as 16-bit integers. Companies such as
Microsoft, Apple etc. all engineered their OSs to be built around this
standard. It was only two or three years later that it became obvious that
UCS-2 was totally inadequate as an encoding standard (it can only represent
65536 code points). The UTF-16 surrogate mechanism (read: "surrogate hack")
was introduced, and it is a variable-length format just like UTF-8.

Processing UTF-16 strings is just as tiresome as UTF-8, so there is no
longer any advantage to using UTF-16. The "1 WCHAR = 1 character" myth has
never been true - combining characters have always resulted in multi-unit
sequences (even with UCS-2!), so from a technical standpoint I honestly don't
see any advantage for UTF-16. It made programming in C/C++ (the dominant
language at the time) much more difficult (wchar_t and the L"" prefix are
horrible hacks to an otherwise clean language).
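To illustrate, this is roughly what surrogate-aware iteration over a UTF-16
string looks like (a minimal sketch assuming Windows' 16-bit wchar_t;
CountCodePoints is an illustrative name, and a lone surrogate is simply
counted as one):

#include <stdio.h>
#include <wchar.h>

/* Count Unicode code points in a zero-terminated UTF-16 string,
   pairing up high/low surrogates as they are encountered. */
size_t CountCodePoints(const wchar_t *s)
{
    size_t n = 0;
    while (*s) {
        wchar_t c = *s++;
        if (c >= 0xD800 && c <= 0xDBFF &&     /* high surrogate...         */
            *s >= 0xDC00 && *s <= 0xDFFF)     /* ...followed by a low one  */
            ++s;                              /* the pair is one character */
        ++n;
    }
    return n;
}

int main(void)
{
    /* U+1D11E MUSICAL SYMBOL G CLEF encodes as the surrogate pair D834 DD1E. */
    const wchar_t clef[] = { 0xD834, 0xDD1E, 0 };
    printf("%u WCHARs, %u code point(s)\n",
           (unsigned)(sizeof(clef) / sizeof(clef[0]) - 1),
           (unsigned)CountCodePoints(clef));
    return 0;
}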

The main thing to bear in mind is that very few applications actually need
to process Unicode strings themselves. You should be using OS-provided
libraries for text display anyway. Most of the time you just pass strings
directly to an API and let it do the work, so it doesn't make much difference
what encoding you use. Apart from the L"" and _T prefixes there should be
little change to your software.

James
--
Microsoft MVP - Windows SDK
www.catch22.net
Free Win32 Source and Tutorials
Michael (michka) Kaplan [MS]
2006-04-20 03:35:02 UTC
Permalink
Post by James Brown [MVP]
Agreed - UTF-8 is a far nicer encoding to work with for the C/C++
programmer.
Every time I see this claim I wonder how anyone in their right mind can
truly believe that a multibyte encoding where any Unicode code point can use
from 1-4 bytes is actually better than one where almost all Unicode code
points in modern usage have a fixed two-byte width and the rest fit within
four bytes in a single, consistent mechanism.

But then I realize that it is programmers who are thinking of keeping it all
in English (where UTF-8 is smaller and the code generally does not change at
all) who are saying it, and then I forgive them.
James Brown [MVP]
2006-04-20 05:44:15 UTC
Permalink
Post by Michael (michka) Kaplan [MS]
Post by James Brown [MVP]
Agreed - UTF-8 is a far nicer encoding to work with for the C/C++
programmer.
Every time I see this claim I wonder how anyone in their right mind can
truly believe that a multibyte encoding where any Unicode code point can
use from 1-4 bytes is actually better than one where almost all Unicode
code points in modern usage have a fixed two-byte width and the rest fit
within four bytes in a single, consistent mechanism.
But then I realize that it is programmers who are thinking of keeping it
all in English (where UTF-8 is smaller and the code generally does not
change at all) who are saying it, and then I forgive them.
Part of what I was saying was that UTF-8 is nicer, just because the char*
metaphor is consistent with what the majority of C programmers are used to.
But I will (respectfully) disagree with the notion that most of the time
UTF-16 can be treated as UCS-2... of course for simple string copying this
is true, but when it comes to actually trying to pull code points out of an
arbitrary UTF-16 string, you cannot just hope that you will only ever
encounter BMP code points (unless this was an up-front design decision
of course). The surrogate mechanism always needs to be considered, and this
makes UTF-16 a variable-length encoding just like UTF-8, albeit simpler to
deal with. In my experience the fact that one has to deal with surrogates at
all results in extra complexity in string handling.

cheers,
James
--
Microsoft MVP - Windows SDK
www.catch22.net
Free Win32 Source and Tutorials
Claus Brod
2006-04-20 07:20:03 UTC
Permalink
Post by James Brown [MVP]
The surrogate mechanism always needs to be considered and this
makes UTF-16 a multibyte encoding just like UTF-8, albeit simpler to deal
with.
Yup - both encodings require some degree of "multibyte-ish" handling, but at
least with UTF-8 you don't have to go through thousands of C++ files just to
change all occurrences of "char" to "TCHAR" etc.

On the other hand, older MBCS-style code often already knows about encodings
where a character has a first and a second half (such as SJIS). This is not
unlike the situation with UTF-16 where we can have either one or two words
per character. Hence, such code may even be slightly easier to adapt to
UTF-16 (after making all the formal TCHAR-style changes, of course) than to
UTF-8 where we also can have a third and fourth byte per character.

Claus
David Wilkinson
2006-04-20 10:08:46 UTC
Permalink
Post by Claus Brod
Post by James Brown [MVP]
The surrogate mechanism always needs to be considered and this
makes UTF-16 a multibyte encoding just like UTF-8, albeit simpler to deal
with.
Yup - both encodings require some degree of "multibyte-ish" handling, but at
least with UTF-8 you don't have to go through thousands of C++ files just to
change all occurrences of "char" to "TCHAR" etc.
On the other hand, older MBCS-style code often already knows about encodings
where a character has a first and a second half (such as SJIS). This is not
unlike the situation with UTF-16 where we can have either one or two words
per character. Hence, such code may even be slightly easier to adapt to
UTF-16 (after making all the formal TCHAR-style changes, of course) than to
UTF-8 where we also can have a third and fourth byte per character.
Claus
Claus:

If you have an existing application where the "business logic" is all
UTF-8, why not keep it and just convert to UTF-16 every time you move a
string from the business logic to the GUI? This is what I did, and it
works fine. I just wrote myself simple classes CT2U and CU2T (based on
the VC7 ATL conversion classes) that convert back and forth between
UTF-8 narrow strings and TCHAR strings. It works in both ANSI and
UNICODE builds.
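The classes themselves are not reproduced here, but the idea is simple enough
to sketch as two free functions built on the Win32 conversion APIs
(Utf8ToWide/WideToUtf8 are made-up names, not the actual CT2U/CU2T classes,
and error handling is omitted):

#include <windows.h>
#include <string>

// UTF-8 narrow string -> UTF-16 wide string.
std::wstring Utf8ToWide(const std::string &s)
{
    if (s.empty()) return std::wstring();
    int n = MultiByteToWideChar(CP_UTF8, 0, s.c_str(), (int)s.size(), NULL, 0);
    std::wstring w(n, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, s.c_str(), (int)s.size(), &w[0], n);
    return w;
}

// UTF-16 wide string -> UTF-8 narrow string.
std::string WideToUtf8(const std::wstring &w)
{
    if (w.empty()) return std::string();
    int n = WideCharToMultiByte(CP_UTF8, 0, w.c_str(), (int)w.size(),
                                NULL, 0, NULL, NULL);
    std::string s(n, '\0');
    WideCharToMultiByte(CP_UTF8, 0, w.c_str(), (int)w.size(),
                        &s[0], n, NULL, NULL);
    return s;
}

In a Unicode build the GUI side can then write
CString text = Utf8ToWide(narrowString).c_str(); and the conversion is at
least explicit.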

David Wilkinson
Claus Brod
2006-04-20 14:00:02 UTC
Permalink
Post by David Wilkinson
If you have an existing application where the "business logic" is all
UTF-8, why not keep it and just convert to UTF-16 every time you move a
string from the business logic to the GUI?
Good advice - however, I just wish I had UTF-8 code already; what I'm
starting with is MBCS-aware code, Unix-style.

Claus
David Wilkinson
2006-04-20 14:27:23 UTC
Permalink
Post by Claus Brod
Post by David Wilkinson
If you have an existing application where the "business logic" is all
UTF-8, why not keep it and just convert to UTF-16 every time you move a
string from the business logic to the GUI?
Good advice - however, I just wish I had UTF-8 code already; what I'm
starting with is MBCS-aware code, Unix-style.
Claus
Claus:

But aren't the parts that care about the encoding (mostly) the ones that
interact with the (unix) operating system? These will have to be changed
anyway.

In my case I was always in Windows, but I had a very strict separation
between the "business logic" classes and the GUI classes (or anything
that interacted with the operating system). For example, I would not let
the business logic open a file, but rather I would have a "GUI class"
open the file (using std::fstream) and pass the stream to the business
logic. Because my business logic does not do string manipulation, I did
not have to make ANY changes to the business logic when one day I simply
declared it to be using UTF-8. Also, my GUI classes always used TCHAR
and all that, so all I had to add was the conversions when moving
strings back and forth. In fact the only problem was that I missed a lot
of conversions because CString has conversion constructors which allow
(in Unicode build)

std::string narrowString = bizLogic->GetSomeString();
CString wideString = narrowString.c_str();  // compiles silently; the char*
                                            // is converted via the ANSI code page

David Wilkinson
Claus Brod
2006-05-01 06:46:01 UTC
Permalink
Post by David Wilkinson
But aren't the parts that care about the encoding (mostly) the ones that
interact with the (unix) operating system? These will have to be changed
anyway.
David,

yes and no. Most of the really "dangerous" string handling stuff is indeed
in the lower levels of the app which are designed to shield the rest of our
code from the OS and lower-level libraries. Still, innocent-looking APIs like
strcmp() are used everywhere - this alone is enough to break the app in
subtle ways when moving to UTF-8.

After a lot of soul-searching, we're now leaning towards UTF-16, despite the
higher initial effort of mechanically porting "char" to "TCHAR" etc. Thanks
to everybody for the lively and helpful discussion!

Claus
Ivo
2006-04-20 16:12:03 UTC
Permalink
Post by James Brown [MVP]
Part of what I was saying was that UTF-8 is nicer, just because the char*
metaphore is consistant with what the majority of C programmers are used to.
But I will (respectfully) disagree with the notion that most of the time
UTF-16 can be treated as UCS-2......of course for simple string-copying this
is true, but when it comes to actually trying to pull code-points out of an
arbitrary UTF-16 string it isn't possible to just hope that one won't
encounter BMP codepoints only (unless this was an up-front design decision
of course). The surrogate mechanism always needs to be considered and this
makes UTF-16 a multibyte encoding just like UTF-8, albeit simpler to deal
with. In my experience the fact that one has to deal with surrogates at all
results in extra complexity in string-handling.
It will be easier to spot UTF-8 code that's not handling multi-byte sequences properly than UTF-16 code that's not handling surrogate pairs. You can use some French, German or Cyrillic characters, which are much easier to recognize than some rarely used Chinese characters. I'm talking from a non-Chinese speaker's point of view, of course :) Also, you don't need a special font that supports the higher planes.

Ivo
Michael (michka) Kaplan [MS]
2006-04-21 14:57:49 UTC
Permalink
And you would need to do the same things for UTF-8, handling the 2-byte,
3-byte, and 4-byte combinations.

So it is the same problem, but harder.
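For what it's worth, the lead-byte classification alone looks roughly like
this (a minimal sketch; Utf8SeqLen is just an illustrative name, and
validation of the continuation bytes is omitted):

#include <stddef.h>

/* Length in bytes of the UTF-8 sequence that starts with lead byte b,
   or 0 for a byte that cannot start a sequence. */
size_t Utf8SeqLen(unsigned char b)
{
    if (b < 0x80) return 1;   /* 0xxxxxxx: ASCII                     */
    if (b < 0xC2) return 0;   /* continuation byte or overlong lead  */
    if (b < 0xE0) return 2;   /* 110xxxxx                            */
    if (b < 0xF0) return 3;   /* 1110xxxx                            */
    if (b < 0xF5) return 4;   /* 11110xxx, up to U+10FFFF            */
    return 0;                 /* 0xF5-0xFF never appear in UTF-8     */
}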
--
MichKa [Microsoft]
NLS Collation/Locale/Keyboard Technical Lead
Globalization Infrastructure, Fonts, and Tools
Blog: http://blogs.msdn.com/michkap

This posting is provided "AS IS" with
no warranties, and confers no rights.
Post by James Brown [MVP]
Part of what I was saying was that UTF-8 is nicer, just because the char*
metaphor is consistent with what the majority of C programmers are used to.
But I will (respectfully) disagree with the notion that most of the time
UTF-16 can be treated as UCS-2... of course for simple string copying this
is true, but when it comes to actually trying to pull code points out of an
arbitrary UTF-16 string, you cannot just hope that you will only ever
encounter BMP code points (unless this was an up-front design decision
of course). The surrogate mechanism always needs to be considered, and this
makes UTF-16 a variable-length encoding just like UTF-8, albeit simpler to
deal with. In my experience the fact that one has to deal with surrogates at
all results in extra complexity in string handling.
It will be easier to spot UTF-8 code that's not handling multi-byte
sequences properly than UTF-16 code that's not handling surrogate pairs. You
can use some French, German or Cyrillic characters, which are much easier to
recognize than some rarely used Chinese characters. I'm talking from a
non-Chinese speaker's point of view, of course :) Also, you don't need a
special font that supports the higher planes.

Ivo
Cristian Secara
2006-04-24 23:28:15 UTC
Permalink
[...] I wonder how anyone in their right mind can truly believe that a
multibyte encoding where any Unicode code point can use from 1-4 bytes is
actually better than one where almost all Unicode code points in modern
usage have a fixed 2-byted width and the rest fit within four bytes in a
single, consistent mechanism.
It depends on the application.
For example, every time I try to send an SMS message that includes accented
characters, I can't help blaming those who established the SMS technical
standard, because the presence of a single accented character forces the
whole message into a fixed two-byte width for all characters, limiting my
single message to 70 characters (instead of 160 characters).
True, this has nothing to do with Windows (not even with the PC), but I
consider it a good example of where UTF-8 has its advantages, sometimes ...

Cristi
Claus Brod
2007-02-24 16:43:00 UTC
Permalink
Hi all,

thanks again to everyone who participated in this discussion.

We ended up choosing UTF-16 as our preferred internal encoding, and though
we cannot really tell whether or not UTF-8 would have been easier overall, I
can at least say that we did not regret our choice.

The formal migration from "char" to "TCHAR" was a very significant effort in
our codebase; without using migration tools (which we wrote for this
purpose), we wouldn't have made it in time. But after the formal changes were
done, the port became quite simple (well, in most areas at least), because
the compiler helped us in finding spots in the code which required explicit
string conversions.

Thanks again for everybody's help!

Claus

http://www.clausbrod.de/Blog
