Programming With Unicode
Free
By Victor Stinner
Book Description

Unicode is a nightmare for many developers (and users), for various and sometimes good reasons.

In the 1980s, few people read documents in languages other than their mother tongue and English. A computer supported only a small number of languages, and users configured their region to support the languages of neighboring countries. Memory and disk space were expensive, so applications were written to use byte strings with 8-bit encodings: one byte per character was a good compromise.

Today, with the Internet and globalization, we all read and exchange documents from all around the world (even if we don’t understand everything). The problem is that documents rarely indicate their language or encoding, and displaying a document with the wrong encoding leads to a well-known problem: mojibake.
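
As a small illustration of mojibake, the following Python sketch encodes a short text to UTF-8 and then decodes the bytes with ISO 8859-1. The sample text and variable names are chosen here for illustration only.

```python
# Encode a short text as UTF-8, then decode the bytes with two encodings.
text = "café"
data = text.encode("utf-8")          # é becomes the two bytes 0xC3 0xA9

wrong = data.decode("iso-8859-1")    # wrong encoding: one character per byte
right = data.decode("utf-8")         # correct encoding

print(wrong)   # café  (mojibake)
print(right)   # café
```

No error is raised in the wrong case, because every byte is a valid ISO 8859-1 character; the text is silently corrupted.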

It is difficult to obtain, or worse, to guess, the encoding of a document. Except for encodings of the UTF family (defined by the Unicode standard), there is no reliable detection algorithm. We have to rely on statistics to guess the most probable encoding, which is what most web browsers do.
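
A minimal Python sketch of the reliable part of this guess: look for a byte order mark, then test for ASCII, then for valid UTF-8. The function name `guess_encoding` and the ISO 8859-1 fallback are assumptions made for this example; a real detector would use statistics at that last step instead of a fixed fallback.

```python
import codecs

# Byte order marks, UTF-32 first: the UTF-32 LE BOM starts with the
# same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data: bytes, fallback: str = "iso-8859-1") -> str:
    """Return a Python codec name that is likely to decode `data`."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    for name in ("ascii", "utf-8"):
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            pass
    # Not ASCII and not valid UTF-8: this is where statistics would be needed.
    return fallback


print(guess_encoding(b"hello"))
print(guess_encoding("café".encode("utf-8")))
print(guess_encoding("café".encode("iso-8859-1")))
```

The fallback is only a guess: any byte sequence decodes without error as ISO 8859-1, so a successful decode proves nothing about the real encoding.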

Unicode support by operating systems, programming languages and libraries varies a lot. In general, the support is basic or non-existent. Each operating system manages Unicode differently. For example, Windows stores filenames as Unicode, whereas UNIX and BSD operating systems use bytes.

Mixing documents stored as bytes is possible, even if they use different encodings, but it leads to mojibake. Because libraries and programs also often ignore encoding and decoding warnings or errors, writing a single character with a diacritic (or any other non-ASCII character) is sometimes enough to trigger an error.
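
The Python sketch below shows both situations under simple assumptions: two byte strings in different encodings are concatenated, and a single non-ASCII character is encoded to ASCII. The sample strings are invented for this example.

```python
# Mixing byte strings that use different encodings "works" at the byte level.
mixed = "é".encode("utf-8") + "é".encode("iso-8859-1")   # b'\xc3\xa9\xe9'

try:
    mixed.decode("utf-8")
    mixed_is_valid_utf8 = True
except UnicodeDecodeError:
    mixed_is_valid_utf8 = False       # the trailing 0xE9 is not valid UTF-8

as_latin1 = mixed.decode("iso-8859-1")  # no error, but mojibake

# A single non-ASCII character is enough to make a strict encoder fail.
text = "naïve"
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Error handlers trade the exception for data loss or escaping.
replaced = text.encode("ascii", errors="replace")
escaped = text.encode("ascii", errors="backslashreplace")

print(mixed_is_valid_utf8, as_latin1, ascii_failed, replaced, escaped)
```

Neither error handler recovers the original character in ASCII: `replace` loses it, while `backslashreplace` keeps the information in escaped form.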

Full Unicode support is complex because the Unicode charset is bigger than any other charset. For example, ISO 8859-1 contains 256 code points including 191 characters, whereas Unicode version 6.0 contains 248,966 assigned code points. The Unicode standard is also more than just a charset: it explains how to display characters (e.g. left-to-right for English and right-to-left for Persian), how to normalize a character string (e.g. precomposed characters versus the decomposed form), etc.
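
To make the normalization point concrete, here is a short Python sketch using the standard `unicodedata` module. It compares the precomposed and decomposed forms of "é"; the variable names are chosen for this example.

```python
import unicodedata

precomposed = "\u00e9"    # é as a single code point
decomposed = "e\u0301"    # e followed by COMBINING ACUTE ACCENT

# Both render as the same glyph but are different character strings.
same_without_normalization = (precomposed == decomposed)

nfc = unicodedata.normalize("NFC", decomposed)    # compose
nfd = unicodedata.normalize("NFD", precomposed)   # decompose

print(same_without_normalization)   # False
print(nfc == precomposed)           # True
print(nfd == decomposed)            # True
print(len(nfc), len(nfd))           # 1 2
```

Comparing or searching text reliably therefore usually requires normalizing both sides to the same form first.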

This book explains how to sympathize with Unicode, and how you should modify your program to avoid most, or all, issues related to encodings and Unicode.

Table of Contents
  • Programming with Unicode
  • About this book
    • License
    • Thanks to
    • Notations
  • Unicode nightmare
  • Definitions
    • Character
    • Glyph
    • Code point
    • Character set (charset)
    • Character string
    • Byte string
    • UTF-8 encoded strings and UTF-16 character strings
    • Encoding
    • Encode a character string
    • Decode a byte string
    • Mojibake
    • Unicode: a Universal Character Set (UCS)
  • Unicode
    • Unicode Character Set
    • Categories
    • Statistics
    • Normalization
  • Charsets and encodings
    • Encodings
    • Popularity
    • Encoding performance
    • Examples
    • Handle undecodable bytes and unencodable characters
      • Undecodable byte sequences
      • Unencodable characters
      • Error handlers
      • Replace unencodable characters by a similar glyph
      • Escape the character
    • Other charsets and encodings
  • Historical charsets and encodings
    • ASCII
    • ISO 8859 family
      • ISO 8859-1
      • cp1252
      • ISO 8859-15
    • CJK: Asian encodings
      • Chinese encodings
      • Japanese encodings
      • ISO 2022
      • Extended Unix Code (EUC)
    • Cyrillic
  • Unicode encodings
    • UTF-8
    • UCS-2, UCS-4, UTF-16 and UTF-32
    • UTF-7
    • Byte order marks (BOM)
    • UTF-16 surrogate pairs
  • How to guess the encoding of a document?
    • Is ASCII?
    • Check for BOM markers
    • Is UTF-8?
    • Libraries
  • Good practices
    • Rules
    • Unicode support levels
    • Test the Unicode support of a program
    • Get the encoding of your inputs
    • Switch from byte strings to character strings
  • Operating systems
    • Windows
      • Code pages
      • Encode and decode functions
      • Windows API: ANSI and wide versions
      • Windows string types
      • Filenames
      • Windows console
      • File mode
    • Mac OS X
    • Locales
      • Locale categories
      • The C locale
      • Locale encoding
      • Locale functions
    • Filesystems (filenames)
      • CD-ROM and DVD
      • Microsoft: FAT and NTFS filesystems
      • Apple: HFS and HFS+ filesystems
      • Others
  • Programming languages
    • C language
      • Byte API (char)
      • Byte string API (char*)
      • Character API (wchar_t)
      • Character string API (wchar_t*)
      • printf functions family
    • C++
    • Python
      • Python 2
      • Python 3
      • Differences between Python 2 and Python 3
      • Codecs
      • String methods
      • Filesystem
      • Windows
      • Modules
    • PHP
    • Perl
    • Java
    • Go and D
  • Libraries
    • Qt library
      • Character and string classes
      • Codec
      • Filesystem
    • The glib library
      • Character strings
      • Codec functions
      • Filename functions
    • iconv library
    • ICU libraries
    • libunistring
  • Unicode issues
    • Security vulnerabilities
      • Special characters
      • Non-strict UTF-8 decoder: overlong byte sequences and surrogates
      • Check byte strings before decoding them to character strings
  • See also