Sunday, March 16, 2014

Cramming on Unicode


I floated the idea of having Unicode as a theme for OST during OSCON this year, with OST a subdivision of ORM.  That got me cramming on Safari, plus I've been looking over Holden's shoulder as he blasts a set of IPython Notebooks out to Amazon for review, some of which focus on Python 3.x's bytes, bytearray, and str type objects.  That's my focus here.

To recap:  I have a somewhat roller-coaster-like curriculum that gives both an encouraging and a grim look at humans and their history.

The story of Unicode, its development, is more or less a story of collaboration against the odds, laying a kind of Tower of Babel foundation, but without the intent to build toward a pinnacle, with one language winning out.  On the contrary, there's still room for entirely new languages.  This was farsighted planning and so an encouraging story.

The negative dip into grim times is the rounding up of peoples into extermination camps, working them to death in poor conditions, with "keeping tabs" done using "Hollerith machines" by IBM, the beginnings of our vast databases, both SQL and NoSQL.  Using computers to hunt down and destroy entire ethnicities, to commit genocide, is one of those dark patterns, as it keeps happening in history, and engineering has served to amplify and intensify the pattern's efficiency and viciousness.

Back to the Unicode story, UTF-8 is what saved its bacon, as ASCII users were not about to bloat their files with little payback.  But then we should remember patient names and the ability of Unicode to represent a patient's name in a native language on the monitors, perhaps with a romanized phonetic reading ("romaji") for the nurses and doctors.  Unicode lets you display fluency by quoting multiple languages in the same document.
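To make the "no bloat" point concrete, here is a small sketch that encodes the same ASCII-only text several ways and compares byte counts.  The sample string is just an assumption for illustration; the little-endian codec names are chosen so that no byte order mark gets counted.

```python
# Sample text is arbitrary; any ASCII-only string would behave the same way.
text = "Cramming on Unicode"

# Count the bytes each encoding needs for the same characters.
sizes = {enc: len(text.encode(enc))
         for enc in ("ascii", "utf-8", "utf-16-le", "utf-32-le")}

for enc, size in sizes.items():
    print(f"{enc:>10}: {size} bytes")

# For pure ASCII text, the UTF-8 bytes are identical to the ASCII bytes.
same_bytes = text.encode("ascii") == text.encode("utf-8")
```

The fixed-width encodings double or quadruple the file, while UTF-8 leaves an ASCII file byte-for-byte unchanged.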

In UTF-8, the boundary between ASCII proper and the encompassing Latin-1 is at code point 128.  With the first bit of the byte now occupied, two bytes will be needed (at minimum) from now on, and the leading byte will show 110, 1110, 11110, 111110, 1111110 indicating up to "six cars total" (including the "engine" or leading byte) in the original design.  Since RFC 3629 the standard stops at four cars, so 11110 is the longest leading pattern in use today.
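Reading the length of a train off its engine can be sketched in a few lines.  The function name `train_length` is made up for this post, and the sketch assumes the modern four-byte limit; it only inspects the prefix bits and does not reject every invalid leading byte.

```python
def train_length(lead: int) -> int:
    """Return the number of bytes in a UTF-8 sequence, judged by its leading byte.

    Hypothetical helper for illustration: checks the prefix bits only,
    assuming the four-byte cap of modern UTF-8.
    """
    if lead < 0b10000000:        # 0xxxxxxx: plain ASCII, a single byte
        return 1
    if lead >> 5 == 0b110:       # 110xxxxx: engine plus one car
        return 2
    if lead >> 4 == 0b1110:      # 1110xxxx: engine plus two cars
        return 3
    if lead >> 3 == 0b11110:     # 11110xxx: engine plus three cars
        return 4
    # 10xxxxxx is a continuation byte (a car, not an engine);
    # longer prefixes belong to the retired five- and six-byte forms.
    raise ValueError("not a leading byte in modern UTF-8")


# Ask Python to encode a few characters and inspect each engine.
for ch in ("A", "\u00e9", "\u1830", "\U0001F600"):
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}: {train_length(encoded[0])} byte(s), actual {len(encoded)}")
```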

For example, a train of three bytes would go:  1110 0001 + 1010 0000 + 1011 0000 where I'm using + to separate the bit patterns.  Payload bits would be the xs in 1110 xxxx + 10xx xxxx + 10xx xxxx i.e. there's room for 16 payload bits for a total of 2**16 or 65536 code points, all within reach of this three byte encoding, with more bytes waiting in the wings.
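Python will print such a train directly.  In this sketch I'm assuming the character behind the bit patterns above is code point 6192 (hex 1830), and I check where the three-byte range ends by encoding the code points on either side of the 2**16 boundary.

```python
# Assumed example character: code point 0x1830 (decimal 6192).
ch = "\u1830"
train = ch.encode("utf-8")

# Show each byte as eight bits, using + as the separator.
patterns = [format(b, "08b") for b in train]
print(" + ".join(patterns))   # 11100001 + 10100000 + 10110000

# 4 payload bits in the engine, 6 in each of the two cars.
payload_bits = 4 + 6 + 6

# The last code point that fits in three bytes, and the first that needs four.
last_three = len("\uffff".encode("utf-8"))
first_four = len("\U00010000".encode("utf-8"))
print(payload_bits, last_three, first_four)
```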

What is 0001 10 0000 11 0000 as a decimal number?  Unicode is at heart just a numbering of a huge inventory of characters, with fonts providing the glyphs.  Turns out it's 6192 (U+1830), which happens to be the Mongolian letter sa.
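The arithmetic can be checked by masking off the marker bits and shifting the payload into place.  This is a sketch using only the standard library; the variable names are my own.

```python
import unicodedata

# The three-byte train from above: engine, then two cars.
train = bytes([0b11100001, 0b10100000, 0b10110000])

# Keep the low 4 bits of the engine and the low 6 bits of each car,
# then line them up into one 16-bit number.
code_point = (
    ((train[0] & 0b00001111) << 12)
    | ((train[1] & 0b00111111) << 6)
    | (train[2] & 0b00111111)
)

print(code_point)                          # 6192
print(hex(code_point))                     # 0x1830
print(unicodedata.name(chr(code_point)))   # MONGOLIAN LETTER SA

# Python's own decoder agrees with the hand calculation.
decoded = train.decode("utf-8")
```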