C++ - How well is Unicode supported in C++11?

I've read and heard that C++11 supports Unicode. A few questions on that:

How well does the C++ standard library support Unicode?
Does std::string do what it should?
How do I use it?
Where are potential problems?

How well does the C++ standard library support Unicode?

Terribly.

A quick scan through the library facilities that might provide Unicode support gives me this list:

Strings library
Localization library
Input/output library
Regular expressions library

I think all but the first one provide terrible support. I'll get back to it in more detail after a quick detour through your other questions.

Does std::string do what it should?

Yes. According to the C++ standard, this is what std::string and its siblings should do:

The class template basic_string describes objects that can store a sequence consisting of a varying number of arbitrary char-like objects with the first element of the sequence at position zero.

Well, std::string does that just fine. Does that provide any Unicode-specific functionality? No.

Should it? Probably not. std::string is fine as a sequence of char objects. That's useful; the only annoyance is that it is a very low-level view of text and standard C++ doesn't provide a higher-level one.

How do I use it?

Use it as a sequence of char objects; pretending it is something else is bound to end in pain.
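As a small illustration of what "a sequence of char objects" means in practice, here is a minimal sketch. The helper names `banana_utf8` and `count_code_points` are made up for this example, and the counting helper assumes its input is valid UTF-8; it is not a substitute for a real Unicode library.

```cpp
#include <cstddef>
#include <string>

// U+1F34C BANANA encoded as UTF-8, spelled out byte by byte so the example
// does not depend on the source file encoding or on u8 literal types.
std::string banana_utf8() {
    return "\xF0\x9F\x8D\x8C";
}

// Counts code points by skipping UTF-8 continuation bytes (10xxxxxx).
// Assumption: the input is valid UTF-8; malformed input is not detected.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

Here `banana_utf8().size()` is 4, because `size()` counts code units, while `count_code_points(banana_utf8())` is 1. Anything above the code-unit level is something you have to build or borrow yourself.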

Where are potential problems?

All over the place? Let's see...

Strings library

The strings library provides us basic_string, which is merely a sequence of what the standard calls "char-like objects". I call them code units. If you want a high-level view of text, this is not what you are looking for. This is a view of text suitable for serialization/deserialization/storage.

It also provides some tools from the C library that can be used to bridge the gap between the narrow world and the Unicode world: c16rtomb/mbrtoc16 and c32rtomb/mbrtoc32.

Localization library

The localization library still believes that one of those "char-like objects" equals one "character". This is of course silly, and makes it impossible to get lots of things working properly beyond some small subset of Unicode like ASCII.

Consider, for example, what the standard calls "convenience interfaces" in the <locale> header:

    template <class charT> bool isspace (charT c, const locale& loc);
    template <class charT> bool isprint (charT c, const locale& loc);
    template <class charT> bool iscntrl (charT c, const locale& loc);
    // ...
    template <class charT> charT toupper(charT c, const locale& loc);
    template <class charT> charT tolower(charT c, const locale& loc);
    // ...

How do you expect any of these functions to properly categorize, say, U+1F34C ʙᴀɴᴀɴᴀ, as in u8"🍌" or u8"\U0001F34C"? There's no way it will ever work, because those functions take only one code unit as input.

This could work with an appropriate locale if you used char32_t only: U'\U0001F34C' is a single code unit in UTF-32.

However, that still means you only get the simple casing transformations with toupper and tolower, which, for example, are not good enough for some German locales: "ß" uppercases to "SS" but toupper can only return one ~~character~~ code unit.
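To make the one-code-unit limitation concrete, here is a minimal sketch that feeds the UTF-8 code units of a string to the <locale> overload of std::isprint, one at a time, in the classic "C" locale. The helper name `printable_units` is made up for this example.

```cpp
#include <cstddef>
#include <locale>
#include <string>

// Counts how many code units of `text` the per-code-unit std::isprint
// overload from <locale> classifies as printable in the classic "C" locale.
std::size_t printable_units(const std::string& text) {
    std::size_t n = 0;
    for (char unit : text) {
        if (std::isprint(unit, std::locale::classic())) ++n;
    }
    return n;
}
```

For "abc" this counts 3. For the four UTF-8 code units of U+1F34C ("\xF0\x9F\x8D\x8C") it counts 0: the function only ever sees isolated bytes, so it has no way of knowing that together they form one perfectly printable banana.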

Next up, wstring_convert/wbuffer_convert and the standard code conversion facets.

wstring_convert is used to convert between strings in one given encoding into strings in another given encoding. There are two string types involved in this transformation, which the standard calls a byte string and a wide string. Since these terms are really misleading, I prefer to use "serialized" and "deserialized", respectively, instead^†.

The encodings to convert between are decided by a codecvt (a code conversion facet) passed as a template type argument to wstring_convert.
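A minimal sketch of what that looks like, using codecvt_utf8<char32_t> so that the serialized side is UTF-8 and the deserialized side is UTF-32. The alias `Utf8ToUtf32` and the two function names are made up for this example. Note that wstring_convert and the <codecvt> facets were deprecated in C++17, so a newer compiler may warn about this code.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Serialized form: UTF-8 in a std::string.
// Deserialized form: UTF-32 in a std::u32string.
using Utf8ToUtf32 = std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t>;

// Deserializes UTF-8 bytes; throws std::range_error if the conversion fails.
std::u32string deserialize_utf8(const std::string& bytes) {
    Utf8ToUtf32 converter;
    return converter.from_bytes(bytes);
}

// Serializes UTF-32 text back into UTF-8 bytes.
std::string serialize_utf8(const std::u32string& text) {
    Utf8ToUtf32 converter;
    return converter.to_bytes(text);
}
```

With this, `deserialize_utf8("\xF0\x9F\x8D\x8C")` yields a one-element string holding U+1F34C, and `serialize_utf8` turns it back into the same four bytes. The second template argument matters: it defaults to wchar_t, which does not match a char32_t facet.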

wbuffer_convert performs a similar function but as a ~~wide~~ deserialized stream buffer that wraps a ~~byte~~ serialized stream buffer. Any I/O is performed through the underlying ~~byte~~ serialized stream buffer with conversions to and from the encodings given by the codecvt argument. Writing serializes into that buffer, and then writes from it, and reading reads into the buffer and then deserializes from it.
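Here is a minimal sketch of the writing direction, assuming a std::stringbuf as the underlying serialized buffer and codecvt_utf8<wchar_t> as the facet. The function name `write_as_utf8` is made up for this example, and the same C++17 deprecation caveat applies to wbuffer_convert.

```cpp
#include <codecvt>
#include <locale>
#include <ostream>
#include <sstream>
#include <string>

// Writes `text` through a deserialized (wchar_t) stream buffer that wraps a
// serialized (char) buffer, and returns the UTF-8 bytes that came out.
std::string write_as_utf8(const std::wstring& text) {
    std::stringbuf bytes;  // the underlying serialized buffer
    std::wbuffer_convert<std::codecvt_utf8<wchar_t>> converting(&bytes);
    std::wostream out(&converting);
    out << text;
    out.flush();  // push any pending converted output into `bytes`
    return bytes.str();
}
```

Writing L"\u00E9" this way should leave the two bytes "\xC3\xA9" in the underlying buffer. The flush is there because the converting buffer holds on to pending output until it is synced.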

The standard provides some codecvt class templates for use with these facilities: codecvt_utf8, codecvt_utf16, codecvt_utf8_utf16, and some codecvt specializations. Together these standard facets provide all the following conversions. (Note: in the following list, the encoding on the left is always the serialized string/streambuf, and the encoding on the right is always the deserialized string/streambuf; the standard allows conversions in both directions).

UTF-8 ↔ UCS-2 with codecvt_utf8<char16_t>, and codecvt_utf8<wchar_t> where sizeof(wchar_t) == 2;
UTF-8 ↔ UTF-32 with codecvt_utf8<char32_t>, codecvt<char32_t, char, mbstate_t>, and codecvt_utf8<wchar_t> where sizeof(wchar_t) == 4;
UTF-16 ↔ UCS-2 with codecvt_utf16<char16_t>, and codecvt_utf16<wchar_t> where sizeof(wchar_t) == 2;
UTF-16 ↔ UTF-32 with codecvt_utf16<char32_t>, and codecvt_utf16<wchar_t> where sizeof(wchar_t) == 4;
UTF-8 ↔ UTF-16 with codecvt_utf8_utf16<char16_t>, codecvt<char16_t, char, mbstate_t>, and codecvt_utf8_utf16<wchar_t> where sizeof(wchar_t) == 2;
narrow ↔ wide with codecvt<wchar_t, char, mbstate_t>;
no-op with codecvt<char, char, mbstate_t>.

Several of these are useful, but there is a lot of awkward stuff here.

First off: holy high surrogate! That naming scheme is messy.

Then, there's a lot of UCS-2 support. UCS-2 is an encoding from Unicode 1.0 that was superseded in 1996 because it only supports the basic multilingual plane. Why the committee thought it desirable to focus on an encoding that was superseded over 20 years ago, I don't know^‡. It's not like support for more encodings is bad or anything, but UCS-2 shows up too often here.

I would say that char16_t is obviously meant for storing UTF-16 code units. However, this is one part of the standard that thinks otherwise. codecvt_utf8<char16_t> has nothing to do with UTF-16. For example, wstring_convert<codecvt_utf8<char16_t>, char16_t>().to_bytes(u"\U0001F34C") will compile fine, but will fail unconditionally: the input will be treated as the UCS-2 string u"\xD83C\xDF4C", which cannot be converted to UTF-8 because UTF-8 cannot encode any value in the range 0xD800-0xDFFF.
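A minimal sketch of that failure, next to the facet that does handle UTF-16. The function names are made up for this example. The sketch assumes the failed conversion is reported as std::range_error, which is what wstring_convert does when it is constructed without an error string.

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Tries to serialize a char16_t string to UTF-8 through codecvt_utf8<char16_t>,
// which treats its input as UCS-2. Returns false if the conversion fails.
bool ucs2_facet_can_serialize(const std::u16string& text) {
    std::wstring_convert<std::codecvt_utf8<char16_t>, char16_t> converter;
    try {
        converter.to_bytes(text);
        return true;
    } catch (const std::range_error&) {
        return false;  // e.g. the input contained a surrogate
    }
}

// The same serialization through codecvt_utf8_utf16<char16_t>, which does
// understand surrogate pairs.
std::string utf16_to_utf8(const std::u16string& text) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
    return converter.to_bytes(text);
}
```

`ucs2_facet_can_serialize(u"\U0001F34C")` comes back false, while `utf16_to_utf8(u"\U0001F34C")` produces the expected four bytes "\xF0\x9F\x8D\x8C". Same element type, very different idea of what it contains.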

Still on the UCS-2 front, there is no way to read from a UTF-16 byte stream into a UTF-16 string with these facets. If you have a sequence of UTF-16 bytes you can't deserialize it into a string of char16_t, for example. This is surprising, because it is more or less an identity conversion. Even more surprising, though, is the fact that there is support for deserializing from a UTF-16 stream into a UCS-2 string with codecvt_utf16<char16_t>, which is actually a lossy conversion.

The UTF-16-as-bytes support is quite good, though: it supports detecting endianness from a BOM, or selecting it explicitly in code. It also supports producing output with and without a BOM.
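A minimal sketch of the "selecting it explicitly in code" option, deserializing little-endian UTF-16 bytes into UTF-32. The function name `deserialize_utf16le` is made up for this example; the endianness is chosen through the facet's third template argument, and the input is assumed to carry no BOM.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Deserializes little-endian UTF-16 bytes into UTF-32. The endianness is
// selected explicitly through the facet's mode template argument.
// Assumption: the input has no BOM; throws std::range_error on failure.
std::u32string deserialize_utf16le(const std::string& bytes) {
    std::wstring_convert<
        std::codecvt_utf16<char32_t, 0x10FFFF, std::little_endian>,
        char32_t> converter;
    return converter.from_bytes(bytes);
}
```

U+1F34C is the surrogate pair D83C DF4C in UTF-16, so its little-endian bytes are "\x3C\xD8\x4C\xDF", and passing those in should give back a single char32_t with the value 0x1F34C.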

There are some more interesting conversion possibilities absent. There is no way to deserialize from a UTF-16 byte stream or string into a UTF-8 string, since UTF-8 is never supported as the deserialized form.

And here the narrow/wide world is completely separate from the UTF/UCS world. There are no conversions between the old-style narrow/wide encodings and any Unicode encodings.

Input/output library

The I/O library can be used to read and write text in Unicode encodings using the wstring_convert and wbuffer_convert facilities described above. I don't think there's much else that would need to be supported by this part of the standard library.

Regular expressions library

I have expounded upon problems with C++ regexes and Unicode on Stack Overflow before. I will not repeat all those points here, but merely state that C++ regexes don't have level 1 Unicode support, which is the bare minimum to make them usable without resorting to using UTF-32 everywhere.
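One symptom is easy to show. In this minimal sketch (the helper name `is_single_regex_element` is made up for the example), the pattern "." is matched against the whole input, so the result tells you whether std::regex sees exactly one element in it.

```cpp
#include <regex>
#include <string>

// Returns true if the whole of `text` matches the pattern ".", i.e. if the
// regex engine sees exactly one element in it.
bool is_single_regex_element(const std::string& text) {
    static const std::regex any_one(".");
    return std::regex_match(text, any_one);
}
```

`is_single_regex_element("e")` is true, but `is_single_regex_element("\xC3\xA9")` is false, even though those two bytes are the UTF-8 encoding of the single code point U+00E9 (é). The regex works on code units, not code points.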

That's it?

Yes, that's it. That's the existing functionality. There's lots of Unicode functionality that is nowhere to be seen, like normalization or text segmentation algorithms.

U+1F4A9. Is there any way to get some better Unicode support in C++?

The usual suspects: ICU and Boost.Locale.

^† A byte string is, unsurprisingly, a string of bytes, i.e., char objects. However, unlike a wide string literal, which is always an array of wchar_t objects, a "wide string" in this context is not necessarily a string of wchar_t objects. In fact, the standard never explicitly defines what a "wide string" means, so we're left to guess the meaning from usage. Since the standard terminology is sloppy and confusing, I use my own, in the name of clarity.

Encodings like UTF-16 can be stored as sequences of char16_t, which then have no endianness; or they can be stored as sequences of bytes, which have endianness (each consecutive pair of bytes can represent a different char16_t value depending on endianness). The standard supports both of these forms. A sequence of char16_t is more useful for internal manipulation in the program. A sequence of bytes is the way to exchange such strings with the external world. The terms I'll use instead of "byte" and "wide" are thus "serialized" and "deserialized".

^‡ If you are about to say "but Windows!" hold your 🐎🐎. All versions of Windows since Windows 2000 use UTF-16.
