C++ - How well is Unicode supported in C++11?


I've read and heard that C++11 supports Unicode. A few questions on that:

  • How well does the C++ standard library support Unicode?
  • Does std::string do what it should?
  • How do I use it?
  • Where are the potential problems?

How well does the C++ standard library support Unicode?

Terribly.

A quick scan through the library facilities that might provide Unicode support gives me this list:

  • Strings library
  • Localization library
  • Input/output library
  • Regular expressions library

I think all but the first one provide terrible support. I'll get back to it in more detail after a quick detour through your other questions.

Does std::string do what it should?

Yes. According to the C++ standard, this is what std::string and its siblings should do:

The class template basic_string describes objects that can store a sequence consisting of a varying number of arbitrary char-like objects with the first element of the sequence at position zero.

Well, std::string does that just fine. Does that provide any Unicode-specific functionality? No.

Should it? Probably not. std::string is fine as a sequence of char objects. That's useful; the only annoyance is that it is a very low-level view of text and standard C++ doesn't provide a higher-level one.

How do I use it?

Use it as a sequence of char objects; pretending it is something else is bound to end in pain.
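To make "a sequence of char objects" concrete, here is a minimal sketch. The helper names (cafe_utf8, first_units) are made up for illustration, and the UTF-8 bytes are spelled out by hand so the result does not depend on the source file's encoding.

```cpp
#include <cstddef>
#include <string>

// "café" encoded as UTF-8: 'c', 'a', 'f', then the two bytes of U+00E9.
// Hypothetical helper, not part of any library.
inline std::string cafe_utf8() {
    return "caf\xC3\xA9";
}

// Keeps the first n code units (bytes here). std::string knows nothing
// about code points, so this can cut a multi-byte sequence in half.
inline std::string first_units(const std::string& text, std::size_t n) {
    return text.substr(0, n);
}
```

cafe_utf8().size() is 5, not 4, and first_units(cafe_utf8(), 4) ends in a lone lead byte 0xC3, which is not valid UTF-8 on its own. That is the kind of pain that comes from treating code units as characters.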

Where are the potential problems?

All over the place? Let's see...

Strings library

The strings library provides us basic_string, which is merely a sequence of what the standard calls "char-like objects". I call them code units. If you want a high-level view of text, this is not what you are looking for. This is a view of text suitable for serialization/deserialization/storage.

It also provides some tools from the C library that can be used to bridge the gap between the narrow world and the Unicode world: c16rtomb/mbrtoc16 and c32rtomb/mbrtoc32.
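A rough sketch of how c32rtomb might be used follows. The wrapper name to_multibyte is hypothetical, and the result depends on the current C locale, which the sketch deliberately does not set. It also assumes a standard library that ships the <cuchar> header.

```cpp
#include <climits>
#include <cstddef>
#include <cuchar>
#include <cwchar>
#include <string>

// Converts one UTF-32 code unit to the multibyte encoding of the current
// C locale. Returns an empty string when the locale cannot represent it.
// Hypothetical helper for illustration only.
inline std::string to_multibyte(char32_t code_point) {
    std::mbstate_t state{};        // initial conversion state
    char buffer[MB_LEN_MAX];       // large enough for any single character
    const std::size_t written = std::c32rtomb(buffer, code_point, &state);
    if (written == static_cast<std::size_t>(-1)) {
        return std::string();      // encoding error in this locale
    }
    return std::string(buffer, written);
}
```

In the default "C" locale, to_multibyte(U'A') yields "A". Getting the UTF-8 bytes for something like U+1F34C requires first selecting a UTF-8 locale with std::setlocale, and whether such a locale is installed is up to the system.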

Localization library

The localization library still believes that one of those "char-like objects" equals one "character". This is of course silly, and makes it impossible to get lots of things working properly beyond some small subset of Unicode like ASCII.

Consider, for example, what the standard calls "convenience interfaces" in the <locale> header:

    template <class charT> bool isspace (charT c, const locale& loc);
    template <class charT> bool isprint (charT c, const locale& loc);
    template <class charT> bool iscntrl (charT c, const locale& loc);
    // ...
    template <class charT> charT toupper(charT c, const locale& loc);
    template <class charT> charT tolower(charT c, const locale& loc);
    // ...

How do you expect any of these functions to properly categorize, say, U+1F34C ʙᴀɴᴀɴᴀ, as in u8"🍌" or u8"\U0001F34C"? There's no way it will ever work, because those functions take only one code unit as input.

This could work with an appropriate locale if you used char32_t only: U'\U0001F34C' is a single code unit in UTF-32.
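The code unit counts behind that argument can be checked with a small sketch. The struct and function names are invented for illustration, and the UTF-8 bytes are written out explicitly rather than relying on a u8 literal.

```cpp
#include <cstddef>
#include <string>

// Number of code units U+1F34C occupies in each encoding form.
// Hypothetical type for illustration only.
struct CodeUnitCounts {
    std::size_t utf8;
    std::size_t utf16;
    std::size_t utf32;
};

inline CodeUnitCounts banana_code_units() {
    const std::string utf8 = "\xF0\x9F\x8D\x8C";   // four UTF-8 bytes
    const std::u16string utf16 = u"\U0001F34C";    // a surrogate pair
    const std::u32string utf32 = U"\U0001F34C";    // one code unit
    return CodeUnitCounts{utf8.size(), utf16.size(), utf32.size()};
}
```

The counts come out as 4, 2 and 1, so a function that receives a single charT can only ever see the whole code point in the UTF-32 case.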

However, that still means you only get the simple casing transformations with toupper and tolower, which, for example, are not good enough for some German locales: "ß" uppercases to "SS", but toupper can only return one character code unit.
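The one-in, one-out shape of the interface can be seen in a short sketch. upper_each is a hypothetical helper, and it uses the classic locale by default only to stay independent of whatever locales happen to be installed.

```cpp
#include <locale>
#include <string>
#include <type_traits>

// The convenience interface returns exactly one charT per input charT.
static_assert(
    std::is_same<decltype(std::toupper(L'a', std::locale())), wchar_t>::value,
    "toupper maps one code unit to one code unit");

// Uppercases every code unit independently. Hypothetical helper.
inline std::wstring upper_each(const std::wstring& text,
                               const std::locale& loc = std::locale::classic()) {
    std::wstring result;
    result.reserve(text.size());
    for (wchar_t unit : text) {
        result.push_back(std::toupper(unit, loc));
    }
    return result;
}
```

Whatever locale is passed in, the output always has as many code units as the input, so L"stra\u00DFe" can never grow into the seven-letter "STRASSE".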

Next up, wstring_convert/wbuffer_convert and the standard code conversion facets.

wstring_convert is used to convert between strings in one given encoding and strings in another given encoding. There are two string types involved in this transformation, which the standard calls a byte string and a wide string. Since these terms are really misleading, I prefer to use "serialized" and "deserialized", respectively, instead.

The encodings to convert between are decided by a codecvt (a code conversion facet) passed as a template type argument to wstring_convert.
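As a sketch of what that looks like in practice, the following pairs wstring_convert with the codecvt_utf8<char32_t> facet from <codecvt> to go between serialized UTF-8 and deserialized UTF-32. The alias and function names are illustrative. Note that these facilities were deprecated in C++17, so newer compilers may warn.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// UTF-8 bytes on the serialized side, UTF-32 code units on the
// deserialized side. Alias name chosen for this example.
using Utf8Utf32 = std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t>;

// Deserialized (UTF-32) to serialized (UTF-8).
inline std::string to_utf8(const std::u32string& text) {
    return Utf8Utf32().to_bytes(text);
}

// Serialized (UTF-8) to deserialized (UTF-32).
inline std::u32string from_utf8(const std::string& bytes) {
    return Utf8Utf32().from_bytes(bytes);
}
```

to_utf8(U"\U0001F34C") produces the four bytes F0 9F 8D 8C, and from_utf8 turns them back into a single char32_t. Both calls throw std::range_error on input the facet cannot convert.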

wbuffer_convert performs a similar function, but as a wide deserialized stream buffer that wraps a byte serialized stream buffer. Any I/O is performed through the underlying byte serialized stream buffer, with conversions to and from the encodings given by the codecvt argument. Writing serializes into that buffer and then writes from it; reading reads into the buffer and then deserializes from it.
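A minimal sketch of the writing direction, assuming a std::stringbuf as the byte (serialized) buffer and codecvt_utf8<wchar_t> as the facet. The function name is made up, and like wstring_convert this facility is deprecated since C++17.

```cpp
#include <codecvt>
#include <locale>
#include <ostream>
#include <sstream>
#include <string>

// Writes wide text through a wbuffer_convert and returns the bytes that
// reached the underlying buffer. Hypothetical helper for illustration.
inline std::string write_as_utf8(const std::wstring& text) {
    std::stringbuf bytes;                                           // serialized side
    std::wbuffer_convert<std::codecvt_utf8<wchar_t>> wide(&bytes);  // deserialized side
    std::wostream out(&wide);
    out << text;
    out.flush();  // push buffered wide characters through the facet
    return bytes.str();
}
```

With this, write_as_utf8(L"\u00E9") yields the two bytes C3 A9. In real code the stringbuf would typically be replaced by the streambuf of a file or socket stream.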

The standard provides some codecvt class templates for use with these facilities: codecvt_utf8, codecvt_utf16, codecvt_utf8_utf16, and some codecvt specializations. Together these standard facets provide all the following conversions. (Note: in the following list, the encoding on the left is always the serialized string/streambuf, and the encoding on the right is always the deserialized string/streambuf; the standard allows conversions in both directions.)

  • UTF-8 ↔ UCS-2 with codecvt_utf8<char16_t>, and codecvt_utf8<wchar_t> where sizeof(wchar_t) == 2;
  • UTF-8 ↔ UTF-32 with codecvt_utf8<char32_t>, codecvt<char32_t, char, mbstate_t>, and codecvt_utf8<wchar_t> where sizeof(wchar_t) == 4;
  • UTF-16 ↔ UCS-2 with codecvt_utf16<char16_t>, and codecvt_utf16<wchar_t> where sizeof(wchar_t) == 2;
  • UTF-16 ↔ UTF-32 with codecvt_utf16<char32_t>, and codecvt_utf16<wchar_t> where sizeof(wchar_t) == 4;
  • UTF-8 ↔ UTF-16 with codecvt_utf8_utf16<char16_t>, codecvt<char16_t, char, mbstate_t>, and codecvt_utf8_utf16<wchar_t> where sizeof(wchar_t) == 2;
  • narrow ↔ wide with codecvt<wchar_t, char, mbstate_t>;
  • no-op with codecvt<char, char, mbstate_t>.

Several of these are useful, but there is a lot of awkward stuff here.

First off, holy high surrogate! That naming scheme is messy.

Then, there's a lot of UCS-2 support. UCS-2 is an encoding from Unicode 1.0 that was superseded in 1996 because it only supports the Basic Multilingual Plane. Why the committee thought it desirable to focus on an encoding that was superseded over 20 years ago, I don't know. It's not like support for more encodings is bad or anything, but UCS-2 shows up too often here.

I would say that char16_t is obviously meant for storing UTF-16 code units. However, this is one part of the standard that thinks otherwise. codecvt_utf8<char16_t> has nothing to do with UTF-16. For example, wstring_convert<codecvt_utf8<char16_t>, char16_t>().to_bytes(u"\U0001F34C") will compile fine, but will fail unconditionally: the input will be treated as the UCS-2 string u"\xD83C\xDF4C", which cannot be converted to UTF-8 because UTF-8 cannot encode any value in the range 0xD800-0xDFFF.
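The difference between the two char16_t facets can be sketched as follows. The function names are hypothetical, and the claim that the UCS-2 facet rejects surrogates reflects how common standard library implementations behave, so treat it as an assumption about your toolchain.

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// codecvt_utf8<char16_t> treats its input as UCS-2, so a surrogate pair
// is an error. Returns false when to_bytes throws. Hypothetical helper.
inline bool ucs2_facet_accepts(const std::u16string& text) {
    std::wstring_convert<std::codecvt_utf8<char16_t>, char16_t> convert;
    try {
        convert.to_bytes(text);
        return true;
    } catch (const std::range_error&) {
        return false;
    }
}

// codecvt_utf8_utf16<char16_t> treats its input as UTF-16 and combines
// surrogate pairs into one code point. Hypothetical helper.
inline std::string utf16_to_utf8(const std::u16string& text) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    return convert.to_bytes(text);
}
```

With the same u"\U0001F34C" input, ucs2_facet_accepts reports failure while utf16_to_utf8 returns the four expected UTF-8 bytes. The facet you almost always want for char16_t is therefore codecvt_utf8_utf16, despite the shorter name of the other one.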

Still on the UCS-2 front, there is no way to read from a UTF-16 byte stream into a UTF-16 string with these facets. If you have a sequence of UTF-16 bytes, you can't deserialize it into a string of char16_t, for example. This is surprising, because it is more or less an identity conversion. Even more surprising, though, is the fact that there is support for deserializing from a UTF-16 stream into a UCS-2 string with codecvt_utf16<char16_t>, which is actually a lossy conversion.

The UTF-16-as-bytes support is quite good, though: it supports detecting endianness from a BOM, or selecting it explicitly in code. It also supports producing output with and without a BOM.
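Selecting the byte order explicitly might look like the sketch below, which serializes UTF-32 text to UTF-16 bytes with codecvt_utf16<char32_t>. The function names are illustrative. The facet's third template argument is a std::codecvt_mode value: big-endian is the default, std::little_endian switches the order, and std::generate_header (not exercised here) requests a BOM.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// UTF-32 to UTF-16 bytes, big-endian (the facet's default byte order).
inline std::string to_utf16_be(const std::u32string& text) {
    std::wstring_convert<std::codecvt_utf16<char32_t>, char32_t> convert;
    return convert.to_bytes(text);
}

// UTF-32 to UTF-16 bytes, little-endian, selected via the mode argument.
inline std::string to_utf16_le(const std::u32string& text) {
    std::wstring_convert<
        std::codecvt_utf16<char32_t, 0x10FFFF, std::little_endian>,
        char32_t> convert;
    return convert.to_bytes(text);
}
```

For U"A" the big-endian variant gives the bytes 00 41 and the little-endian one gives 41 00. Some older standard library releases had bugs in this area, so it is worth checking the output on the toolchain you actually ship with.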

There are some more interesting conversion possibilities absent. There is no way to deserialize from a UTF-16 byte stream or string into a UTF-8 string, since UTF-8 is never supported as the deserialized form.

And here the narrow/wide world is completely separate from the UTF/UCS world. There are no conversions between the old-style narrow/wide encodings and any Unicode encodings.

Input/output library

The I/O library can be used to read and write text in Unicode encodings using the wstring_convert and wbuffer_convert facilities described above. I don't think there's much else that would need to be supported by this part of the standard library.

Regular expressions library

I have expounded upon problems with C++ regexes and Unicode on Stack Overflow before. I will not repeat all those points here, but merely state that C++ regexes don't have level 1 Unicode support, which is the bare minimum to make them usable without resorting to using UTF-32 everywhere.
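One symptom is easy to reproduce with a sketch: with std::regex over char, the pattern "." matches one code unit, not one code point. The helper name is made up for illustration.

```cpp
#include <regex>
#include <string>

// True if the whole input matches ".", i.e. consists of exactly one char
// as far as std::regex is concerned. Hypothetical helper.
inline bool is_single_regex_char(const std::string& text) {
    static const std::regex any_one(".");
    return std::regex_match(text, any_one);
}
```

is_single_regex_char("e") is true, but is_single_regex_char("\xC3\xA9") is false, because "é" in UTF-8 is two code units and the regex engine sees two separate characters.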

That's it?

Yes, that's it. That's the existing functionality. There's lots of Unicode functionality that is nowhere to be seen, like normalization or text segmentation algorithms.

U+1F4A9. Is there any way to get some better Unicode support in C++?

The usual suspects: ICU and Boost.Locale.


A byte string is, unsurprisingly, a string of bytes, i.e., char objects. However, unlike a wide string literal, which is always an array of wchar_t objects, a "wide string" in this context is not necessarily a string of wchar_t objects. In fact, the standard never explicitly defines what a "wide string" means, so we're left to guess the meaning from usage. Since the standard terminology is sloppy and confusing, I use my own, in the name of clarity.

Encodings like UTF-16 can be stored as sequences of char16_t, which then have no endianness; or they can be stored as sequences of bytes, which have endianness (each consecutive pair of bytes can represent a different char16_t value depending on endianness). The standard supports both of these forms. A sequence of char16_t is more useful for internal manipulation in the program. A sequence of bytes is the way to exchange such strings with the external world. The terms I'll use instead of "byte" and "wide" are thus "serialized" and "deserialized".

If you are about to say "but Windows!", hold your 🐎🐎. All versions of Windows since Windows 2000 use UTF-16.

