Programming With Unicode
Victor Stinner
Computers & Technology
Free
Description

Unicode is a nightmare for many developers (and users), for various and sometimes good reasons.

In the 1980s, few people read documents in languages other than their mother tongue and English. A computer supported only a small number of languages, and users configured their region to support the languages of neighboring countries. Memory and disks were expensive, so applications were written to use byte strings with 8-bit encodings: one byte per character was a good compromise.

Today, with the Internet and globalization, we all read and exchange documents from around the world (even if we don’t understand everything). The problem is that documents rarely indicate their encoding, and displaying a document with the wrong encoding leads to a well-known problem: mojibake.
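
As an illustration of mojibake, the following minimal Python sketch encodes a short string as UTF-8 and then decodes the bytes with the wrong encoding (ISO 8859-1). The sample word and the variable names are chosen only for this example.

```python
# A short text containing one non-ASCII character (U+00E9).
text = "caf\u00e9"

# Encoding to UTF-8 turns U+00E9 into the two bytes 0xC3 0xA9.
data = text.encode("utf-8")

# Decoding those bytes with the wrong encoding does not fail:
# ISO 8859-1 maps every byte to a character, so each byte of the
# UTF-8 sequence becomes its own (wrong) character.
wrong = data.decode("iso-8859-1")

# Decoding with the encoding that was actually used gives the text back.
right = data.decode("utf-8")

print(repr(wrong))
print(repr(right))
```

No error is raised in the wrong case, which is why mojibake often goes unnoticed until a human reads the result.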

It is difficult to obtain, or worse, to guess, the encoding of a document. Except for the encodings of the UTF family (which come from the Unicode standard), there is no reliable detection algorithm. We have to rely on statistics to guess the most probable encoding, which is what most web browsers do.
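
The checks that are reliable can be sketched in a few lines of Python. The function below is a hypothetical helper written for this example (`guess_encoding` and the `BOMS` table are not from the book): it looks for a byte order mark, then tries strict ASCII and strict UTF-8 decoding, and gives up with `None` where a statistical guess would be needed.

```python
import codecs
from typing import Optional

# BOM table for this sketch. UTF-32 is tested before UTF-16 because the
# UTF-32-LE BOM (FF FE 00 00) starts with the UTF-16-LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data: bytes) -> Optional[str]:
    """Return a codec name if a reliable check succeeds, else None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        # The strict UTF-8 decoder rejects most text in 8-bit encodings.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Neither ASCII nor UTF-8: only a statistical guess remains.
        return None
```

A `None` result is the point where a real program would fall back to a statistical detector or ask the user.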

Unicode support varies a lot across operating systems, programming languages and libraries. In general, the support is basic or non-existent. Each operating system handles Unicode differently. For example, Windows stores filenames as Unicode, whereas UNIX and BSD operating systems store them as bytes.
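
One consequence of byte filenames is that a name may not be valid in the expected encoding at all. As a sketch of one way a language can cope, Python 3 decodes such names on UNIX with the `surrogateescape` error handler, which keeps undecodable bytes recoverable. The example below applies the handler by hand to a made-up filename, so it does not depend on the platform or the locale.

```python
# A filename as raw bytes: "café.txt" stored in ISO 8859-1.
# The byte 0xE9 on its own is not valid UTF-8.
raw_name = b"caf\xe9.txt"

# surrogateescape maps each undecodable byte 0xNN to the lone
# surrogate U+DCNN instead of raising an error.
as_text = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly,
# so the file can still be opened under its real name.
round_trip = as_text.encode("utf-8", errors="surrogateescape")

print(repr(as_text))
print(round_trip == raw_name)
```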

Mixing documents stored as bytes is possible, even if they use different encodings, but it leads to mojibake. Because libraries and programs also tend to ignore encoding and decoding warnings or errors, writing a single character with a diacritic (or any other non-ASCII character) is sometimes enough to trigger an error.
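
Both effects can be reproduced in a short Python sketch. The variable names are illustrative; the error handlers shown (`replace`, `backslashreplace`) are standard Python codec error handlers, used here as examples of how a program can choose to react instead of ignoring the problem.

```python
text = "\u00e9"  # a single character with a diacritic

# Byte strings concatenate without complaint, whatever their encodings.
mixed = text.encode("iso-8859-1") + text.encode("utf-8")

# Decoding the mix as ISO 8859-1 "works" but produces mojibake for the
# UTF-8 part; decoding it as strict UTF-8 fails on the first byte.
as_latin1 = mixed.decode("iso-8859-1")
try:
    mixed.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# One non-ASCII character is enough to make a strict ASCII encoder fail.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Explicit error handlers make the behavior a deliberate choice.
replaced = text.encode("ascii", errors="replace")
escaped = text.encode("ascii", errors="backslashreplace")
```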

Full Unicode support is complex because the Unicode charset is bigger than any other charset. For example, ISO 8859-1 contains 256 code points including 191 characters, whereas Unicode version 6.0 contains 248,966 assigned code points. The Unicode standard is more than just a charset: it also explains how to display characters (e.g. left-to-right for English and right-to-left for Persian), how to normalize a character string (e.g. precomposed characters versus the decomposed form), etc.
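
To make the normalization point concrete, here is a minimal Python sketch using the standard `unicodedata` module. It compares the precomposed and decomposed forms of the same character; the variable names are chosen for this example only.

```python
import unicodedata

# The same visible character written in two ways:
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The two strings render alike but differ as code point sequences.
same_without_normalization = (precomposed == decomposed)

# NFC composes, NFD decomposes; after normalizing both sides to the
# same form, the comparison succeeds.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

# ISO 8859-1 maps each of the 256 byte values to the code point with
# the same number, U+0000 through U+00FF.
latin1_table = bytes(range(256)).decode("iso-8859-1")
```

Comparing or searching strings without normalizing them first is a common source of subtle mismatches.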

This book explains how to sympathize with Unicode, and how you should modify your programs to avoid most, or all, issues related to encodings and Unicode.

Language: English
ISBN: Unknown
Programming with Unicode
About this book
License
Thanks to
Notations
Unicode nightmare
Definitions
Character
Glyph
Code point
Character set (charset)
Character string
Byte string
UTF-8 encoded strings and UTF-16 character strings
Encoding
Encode a character string
Decode a byte string
Mojibake
Unicode: an Universal Character Set (UCS)
Unicode
Unicode Character Set
Categories
Statistics
Normalization
Charsets and encodings
Encodings
Popularity
Encodings performances
Examples
Handle undecodable bytes and unencodable characters
Undecodable byte sequences
Unencodable characters
Error handlers
Replace unencodable characters by a similar glyph
Escape the character
Other charsets and encodings
Historical charsets and encodings
ASCII
ISO 8859 family
ISO 8859-1
cp1252
ISO 8859-15
CJK: asian encodings
Chinese encodings
Japanese encodings
ISO 2022
Extended Unix Code (EUC)
Cyrillic
Unicode encodings
UTF-8
UCS-2, UCS-4, UTF-16 and UTF-32
UTF-7
Byte order marks (BOM)
UTF-16 surrogate pairs
How to guess the encoding of a document?
Is ASCII?
Check for BOM markers
Is UTF-8?
Libraries
Good practices
Rules
Unicode support levels
Test the Unicode support of a program
Get the encoding of your inputs
Switch from byte strings to character strings
Operating systems
Windows
Code pages
Encode and decode functions
Windows API: ANSI and wide versions
Windows string types
Filenames
Windows console
File mode
Mac OS X
Locales
Locale categories
The C locale
Locale encoding
Locale functions
Filesystems (filenames)
CD-ROM and DVD
Microsoft: FAT and NTFS filesystems
Apple: HFS and HFS+ filesystems
Others
Programming languages
C language
Byte API (char)
Byte string API (char*)
Character API (wchar_t)
Character string API (wchar_t*)
printf functions family
C++
Python
Python 2
Python 3
Differences between Python 2 and Python 3
Codecs
String methods
Filesystem
Windows
Modules
PHP
Perl
Java
Go and D
Libraries
Qt library
Character and string classes
Codec
Filesystem
The glib library
Character strings
Codec functions
Filename functions
iconv library
ICU libraries
libunistring
Unicode issues
Security vulnerabilities
Special characters
Non-strict UTF-8 decoder: overlong byte sequences and surrogates
Check byte strings before decoding them to character strings
See also
The book hasn't received reviews yet.