Charsets

Character-Sets -- by Eugene Reimer 2009-June

Once upon a time computers used ASCII characters. Each ASCII character-code was a 7-bit number, so these characters could fit comfortably into the 8-bit bytes that soon became the norm. Unfortunately using a 7-bit code meant a repertoire of at most 128 symbols, and since some of the 128 possible values were used for control functions such as Newline, the number of graphic symbols was actually restricted to 95. Incidentally, if you're using a PC with the US-keyboard as almost everyone in the English-speaking parts of the world is, then you'll notice that the characters you can type with simple keystrokes are exactly those 95 ASCII characters. However if you sometimes write in a language other than English then you'll probably have a different keyboard, one that makes it easy to enter the sorts of characters your other language(s) need, whether those be accented Roman letters, Cyrillic letters, Greek letters, etc.

And that brings us to the need for a larger repertoire of symbols than was provided by ASCII, and this is where the story soon forks into far too many different directions. First came ISO-8859-1, an 8-bit extension of ASCII, also known as Western-European and as Latin1. (Actually a whole family of such 8-bit extensions of ASCII was created, such as ISO-8859-2 for Central-European languages and ISO-8859-5 for the Cyrillic alphabet, but I'm going to be Western-European-centric from here on.)

Shortly after the International Standards people created their 8-bit alphabet for Western-European languages, Microsoft made its own, a slight variation of ISO-8859-1. The Microsoft variant had all the same characters plus a few more. Microsoft added things like left- and right-quote symbols, en- and em-dash, and a few others. This was possible because the ISO people had decided to reserve 32 of the additional 128 values (going from 7 to 8 bits gave us another 128 values) for additional control functions, despite the fact that what the world wanted, and wanted badly, was more symbols, not more control functions! (Almost all of the original 33 control functions had fallen into disuse, because whenever more than the basics were needed then hundreds or thousands were needed, and the solution was the use of escape sequences, where a single ASCII value known as ESC is used to begin an arbitrarily long string of control information; according to the W3C only 3 of the 33 ASCII-Control-Characters are ever used!) In short, the ISO got it wrong.

These things were happening around the time that the world-wide-web (http) was in its infancy. The standards specified that ISO-8859-1 was the character-set of the www. Microsoft products ignored this, calmly producing webpages in cp1252 (their version of Western-European), sometimes even pretending that iso8859-1 was another name for cp1252. What ensued was years of confusion, where Microsoft-software-produced webpages were typically misrendered by non-Microsoft browsers, where those of us using standards-compliant software were forever losing characters that were Microsoft-only values when saving a file from an MS-user. All rather sad, all so easily preventable.

[Incidentally email, which predates the world-wide-web by several decades, is still ASCII-only. And that's why when trying to send an 8-megabyte image, you're told you've exceeded the 10-megabyte limit. That seeming nonsense becomes comprehensible once you realize that non-text attachments are undergoing an encoding that uses only 6-bits in each 8-bit byte, in other words such attachments get bigger by an 8:6 ratio while being transmitted over an email channel.]
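
The arithmetic behind that limit can be sketched in shell. This assumes the attachment is encoded with Base64 (the usual MIME choice, which turns every 3 bytes into 4 ASCII characters) and ignores the line breaks and headers that the mail system adds on top; the function name base64_size is made up for this illustration.

```shell
# Size in bytes of N raw bytes after Base64 encoding: 4 output characters
# for every 3 input bytes, with the last group padded to a full 4.
# Line breaks inserted by the mailer are not counted.
base64_size() {
    echo $(( ($1 + 2) / 3 * 4 ))
}

base64_size 8000000     # an "8-megabyte" image; prints 10666668
```

So the 8-megabyte image travels as roughly 10.7 megabytes, which is how it manages to exceed a 10-megabyte limit.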

Then came Unicode, an ambitious undertaking to unify all the character-sets (alphabets) being used on this planet. Sadly their initial publications, while strong on abstract concepts, were weak on practical details. They seemed to think one could wave a magic wand and all one's old text-files would be magically converted to using two bytes per character. Not only is there a problem determining which files are text and which aren't, but some text-files exist on floppies, CDs, etc. Most people just ignored Unicode at that time, although I do recall shuddering when looking at the Postscript produced by one of my Netscape browsers, where a string like "hi" suddenly became "\000\150\000\151" in a misguided attempt to make things better.

Unicode had two major obstacles to overcome before it could gain acceptance. One was the simple fact that an ASCII text-file had to remain a text-file. The other was that all the character-processing software in existence had to be modified before being able to deal with these new characters. This was going to take time, and most of us in the English-speaking community were content to sit back and wait, while those using other languages thus having a more pressing need for Unicode suffered with buggy incomplete software.

Before the Unicode approach was altogether ripe, the Europeans created a new currency, which led to a new symbol. ISO-8859-1 could not cope with a new symbol, not without giving up that silliness about 32 of the hi-bit characters being reserved for those unwanted future control characters, and this they were unwilling to do. Heaven forbid, down that path lay accusations of having caved in to Microsoft. And so, behaving more like power-tripping bureaucrats than like people with a lick of sense, they gave us ISO-8859-15, an incompatible replacement for 8859-1, where a bunch of symbols had been dropped and new ones including the Euro-symbol had been added. Microsoft meanwhile had no such problem -- they had left a few unused values in their cp1252 encoding scheme, so one of those now became the Euro-symbol, and cp1252-users lived happily on, although probably not for ever after:-)

Switching to ISO-8859-15 was so repugnant that many people switched to Unicode instead. By this time we had the UTF-8 flavour of Unicode, in which at least the ASCII-textfile-remains-a-valid-textfile requirement was met. (Sadly UTF-8 had been too long in coming, and in the meantime the requirements, for many of us, had changed: now what was needed was that iso8859-1-textfiles remain valid textfiles, and this utf8 did not provide.)

Meanwhile I had been using mostly pure ASCII, although occasionally, as when writing in non-English Germanic languages, I'd used 8859-1. On webpages, by the way, it was easy to use a much larger repertoire of symbols even while using nothing but pure ASCII in one's files, since HTML provides entities where, for example, one can type "&auml;" into one's HTML file and that will be rendered as an umlauted-letter-A (ä).

But having created some (non-ASCII) 8859-1 files before switching to Unicode, that meant switching would not be easy, and I still haven't. When I recently realized that I had to switch to something other than 8859-1, then after getting over being ashamed of myself for having left these things in such an unsatisfactory state for so long, I considered the alternatives. Going to UTF-8 was going to be tough and seemed like overkill; going to 8859-15 was repugnant because it would be tough and yet woefully incomplete. The answer was obvious: cp1252. It is everything that 8859-15 ought to have been but wasn't, a pain-free way for people with 8859-1 files to get additional symbols like the Euro, em-dash, etc, AND have their existing files remain valid files.

How to use the Microsoft cp1252 Character-set on Linux

If you are in a similar situation, having been slow to switch to utf8, having existing files in iso8859-1, and being a user of Linux, then I've got the answer you've been seeking: how to use the Microsoft cp1252 Character-set on Linux.

Character-sets, by the way, are also known as Charsets, Character-encodings or simply Encodings. Both IBM and Microsoft used to call them Code-Pages, a term that sounds so deeply mysterious that it was automatically avoided by all but the bravest of bit-twiddling geeks:-) And so you see that "cp1252" is short for Windows-Code-Page-number-1252, and whatever name you know it by, it is the answer to our iso-8859 blues. (One other bit of mystical-sounding jargon you'll come across, BTW, is "i18n" which was somebody's cute way to abbreviate "internationalization", a long word that begins with "i", ends with "n", and has 18 letters in between.)

The people who put together Linux distributions do not make it easy to use this Character-set that comes from Microsoft. Some argue that including a locale with a rarely needed character-set would needlessly bloat their distribution. That is bull, by the way, since if avoiding bloat due to useless junk were an objective they never would've included iso8859-15, something so entirely without merit as to have negative value. It is easy to understand why Penguins are a tad suspicious about so-called innovation from Microsoft, but in this case if one looks with open eyes one sees that Code-Page-1252 is a good thing. I'm still trying to understand why it was so hard to find information on using cp1252 on Linux. Well, I began by looking in all the wrong places, then the googling was tougher than expected due to false-hits, everything involving character-sets having way too many synonyms, and my not even having a clear idea of what I was seeking. Once I understood that what I wanted was to find "en_US.CP1252" then the finding became if not easy at least possible. Not that my Linux has such a thing, not that any Linux has such a thing, not that anyone on the whole world-wide-web provides such a thing, but the www does provide the instructions for making such, although none that I saw will actually work as written, so here they are again in a form that actually works on my Linux system:

	DIR=/usr/share/i18n			##will differ across Linux distributions
	DST=/usr/lib/locale			##may differ across Linux distributions
	cp $DIR/charmaps/CP1252.gz /tmp;  cd /tmp;  gzip -d CP1252.gz		##because localedef can't handle gzipped charmaps
	sudo localedef -f /tmp/CP1252 -i $DIR/locales/en_US  $DST/en_US.CP1252	##construct and install the locale en_US.CP1252

In case you're having trouble getting those steps to work, here are the files needed for that en_US.CP1252 directory: en_US.CP1252.zip.
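
A quick way to check whether the new locale is visible to the system is sketched below. It assumes a glibc-style "locale" command whose "-a" option lists installed locales by name; the helper name locale_installed is invented for this example, and the exact spelling your system reports may differ.

```shell
# Succeeds if the named locale appears (ignoring case) in "locale -a".
locale_installed() {
    locale -a 2>/dev/null | grep -qixF -- "$1"
}

# Report on the locale built above; if present, print its character map.
if locale_installed en_US.CP1252; then
    LC_ALL=en_US.CP1252 locale charmap
else
    echo "en_US.CP1252 is not installed"
fi
```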

Once you have created or installed that locale called en_US.CP1252, then you can proceed to use it in your LANG, LC_COLLATE, LC_CTYPE, etc settings. Those are best defined in /etc/sysconfig/i18n incidentally, since that way they'll be in effect at boot-time, and at xterm/konsole-startup (important), as well as after the execution of /etc/profile. Of course your distribution may use different file- and directory-names. Here's what I'm currently using for those, in /etc/sysconfig/i18n:

	LANG=en_CA				##ER: Canadian-English for the unspecified things
	LC_COLLATE=C				##ER: traditional-Unix ordering on Sort
	LC_CTYPE=en_US.CP1252			##ER: want anything.CP1252; there wasn't such until I made one
	LC_TIME=en_DK				##ER: Danish-English gives me YYYY-MM-DD dates and 24-hr time
	SYSFONT=LatArCyrHeb-08			##ER: 2005nov13: solution to fontsize changes during boot (when LANG=en_US.UTF-8)

How to switch to utf8 - for those with files in different 8-bit encodings

Shortly after writing the above I asked myself why I was being such an old stick-in-the-mud about going to utf8. Just because the extra cost of handling multi-byte characters is mind-boggling to an old-fart computer-programmer from the days when the biggest mainframe had less processing power than a typical kitchen appliance has these days, not to mention having spent a year of my life carefully crafting an 8080-assembler-language program for a "smart" terminal that did a lot of interesting stuff but had to fit into 20KB of ROM (that "DTX" was wonderful, also very nearly obsolete the day it was finished since PCs had just appeared that would soon make such mainframe-connected "terminals" obsolete)...

The main difficulty with switching to utf8 is that, having left it so long, it isn't easy to figure out which of my files need to undergo which conversion. Most of my text-files are in pure-ASCII; however some are in cp1252, many in the iso8859-1 subset of cp1252, a few are already in utf8, a few in various non-western encodings, and horror-of-horrors I even have some in iso8859-15. Once having taken the plunge and reconfigured my xterm/konsole and text-editor to work in utf8, then I'll much prefer having all textfiles in utf8, and in the interest of keeping things simple, I intend to configure (via .htaccess) the websites I look after to serve all webpages as utf8. Which means I need to identify files that are text and contain non-ASCII non-UTF-8 characters; furthermore I need to be able to deduce which encoding each such file is using in order to be able to correctly convert it to utf8. Doing this in general is impossible, and that's my excuse for having left this job for so long. However one can say a few things with a fair degree of confidence, for example a file with hi-bit characters that passes the "valid-utf8" test is probably in utf8. In my case, the only files in non-western encodings are "language-pack" files in my multi-lingual software, and have the encoding in the filename. And most (hopefully all) of my files in the abominable iso8859-15 encoding are HTML and contain '<META...CONTENT="text/html; charset=iso-8859-15">' making them easily identified.
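
The "valid-utf8" test mentioned above can be approximated with iconv, as in this minimal sketch. It assumes an iconv that exits with a non-zero status on invalid input (glibc's does); the function name classify_textfile and its three labels are made up for the illustration, and it makes no attempt to decide whether a file is text at all.

```shell
# Classify a file as "ascii", "utf8", or "other-8bit" (some 8-bit encoding
# such as cp1252 that still has to be identified by other means).
classify_textfile() {
    # Delete every 7-bit byte; anything left over is a hi-bit byte.
    if [ "$(LC_ALL=C tr -d '\000-\177' < "$1" | wc -c)" -eq 0 ]; then
        echo ascii
    # iconv fails on the first byte sequence that is not valid UTF-8.
    elif iconv -f UTF-8 -t UTF-8 < "$1" > /dev/null 2>&1; then
        echo utf8
    else
        echo other-8bit
    fi
}
```

A result of "utf8" only means "probably utf8": a short cp1252 file can pass the test by accident.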

So that leaves only cp1252 files to be identified and converted. Sounds easy, and is except for the danger of some such files being mis-diagnosed as being UTF-8. [Digression: this reminds me of an interesting puzzle I once set myself to unravel: an eminent botanist had sent me a large document consisting of botanical names that was clearly in utf8 and yet most of the non-ASCII characters in it seemed highly improbable. By looking at a name I knew, where a character that ought to be an "e-acute" had become a "SINGLE LOW-9 QUOTATION MARK", I could deduce what had gone wrong with this document: it had been in IBM codepage-850, but had undergone a cp1252-to-utf8 conversion. In other words it merely needed a utf8-to-cp1252 conversion to become perfectly sensible cp850 text. The deduction will be easy to follow if you look at the binary values for the characters involved; impossible otherwise.]
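
That digression can be replayed with iconv. This is only a sketch: it assumes your iconv knows the names CP850 and CP1252, the function names are invented here, and the final step to utf8 is an addition for display purposes (the document itself only needed the utf8-to-cp1252 step).

```shell
# What went wrong: cp850 bytes were treated as cp1252 on the way to utf8.
mangle_cp850_as_cp1252() {
    iconv -f CP1252 -t UTF-8
}

# The repair: undo that step to get the cp850 bytes back, then convert
# them to utf8 using the charset they were really in.
repair_cp850_via_cp1252() {
    iconv -f UTF-8 -t CP1252 | iconv -f CP850 -t UTF-8
}

# "caf" followed by byte 0x82 (octal 202), which is e-acute in cp850
# but SINGLE LOW-9 QUOTATION MARK (U+201A) in cp1252.
printf 'caf\202\n' | mangle_cp850_as_cp1252                             # caf‚
printf 'caf\202\n' | mangle_cp850_as_cp1252 | repair_cp850_via_cp1252   # café
```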

My first thought was to use the Linux file-command to identify the cp1252/iso8859-1 files that need conversion. However it has several deficiencies, some of which are serious:
• some of my small textfiles of English prose are being misdiagnosed as being FORTRAN programs;
• some types of text-files are not diagnosed as "text", including FORTRAN-program, RTF, and SVG files;
• although HTML files are diagnosed as "text", many of them are given no info on the encoding they're in (only for those containing a DOCTYPE is encoding-info given?);
• I also have English prose being misdiagnosed as Pascal program, Javascript program as C++ program, Postscript fragment as C program, but none of those are serious since they're still "text" and still get the encoding-info; however ordinary prose textfiles being misdiagnosed as MPEG-4 LOAS or as BOA archive data is more serious, and more surprising, although less frequent;
• some minor points that will affect very few people: files created in early versions of MS-DOS have a trailing Ctrl-Z (hex 1A) which results in such files being classified as "data" rather than "text"; similarly an ASCII file containing a VT-control-char (hex 0B) is misclassified as "data";
• the file-command will sometimes assure you that a file is "ASCII text" even though it isn't, on a large file where the first part is ASCII;
• a file in the "big5" Traditional-Chinese encoding can be misdiagnosed as "ISO-8859";
• a file in iso8859-1 or in cp1252 can be misdiagnosed as "UTF-8".
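
The last point is not entirely the file-command's fault, since some byte sequences are legitimate in more than one charset. A small sketch with iconv shows one such pair of bytes; the helper name read_as is made up for this example.

```shell
# Decode standard input as the given charset and emit utf8.
read_as() {
    iconv -f "$1" -t UTF-8
}

# The two bytes C3 A9 (octal 303 251) are valid in both charsets.
printf '\303\251\n' | read_as UTF-8        # é
printf '\303\251\n' | read_as ISO-8859-1   # Ã©
```

Both readings are valid, so only knowledge of what the text ought to say can settle which one was intended.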

My tools to help in identifying which files need which conversion:
find-anomalous-textfiles -- gets around file-command deficiencies;
find-unprintable-meaning-nonASCII -- display non-ascii characters, in hex using less, and/or in a browser;
diff-by-charset -- show how a file differs when interpreted in two different charsets;
charsetdetective -- the solution to distinguishing between hard-to-distinguish charsets;
cvt-textfiles-to-utf8-charset -- does mass-conversion of files to utf8, with backups.

To switch to utf8 your /etc/sysconfig/i18n file specifies:

	LC_CTYPE=en_US.UTF-8

Note that some things will work badly if you get into a state where your xterm/konsole is set for a different encoding than specified by your LC_CTYPE. By setting LC_CTYPE in /etc/sysconfig/i18n you'll avoid that.

If you want keyboard shortcuts for entering accented or other non-ASCII characters see xmodmap and ~/.Xmodmap. These are specified in such a way that they're immune to LC_CTYPE changes. When making my .Xmodmap I finally found a use for those "window" and "menu" keys:-)

Problems with switching to utf8

One problem: spurious message from bash saying "cannot execute binary file" on some shell-scripts that contain non-ASCII utf8 character(s) and lack a shebang ("#!") first line. Since adding a shebang line is a solution, and is probably good practice, notwithstanding the portability problem inherent in using a hard-coded fully-qualified pathname, this is only a minor irritant.

A more serious problem: despite being ashamed of myself for having left this so long, it turns out I didn't wait long enough. In late-2009, GNU Coreutils still does not support Unicode (multibyte) characters! Only after having converted all my textfiles did I learn this, and now I don't know what to do. One common idiom in my text-processing shell-scripts is using tr to translate newline ('\n') to some other character, with sed removing and/or inserting some of those characters, then translating them back to newline. My usual choice for "other character" is n-tilde. This idiom is completely broken when working in utf8. The problem: tr gives me a Latin1 n-tilde, sed expects a utf8 n-tilde. Having one program (sed, which is a separate GNU package) support utf8 while the other (tr, from Coreutils) does not seems to make things even more broken than if neither did.

Will I feel forced to reconvert all my textfiles back to cp1252?

Or can I find another choice for "other character" in that idiom? Using the broken tr on utf8 files is dangerous, as it will lead to mangled characters, however one can safely use it to operate on ASCII-characters. Some possibly suitable candidates for "other character" are \v and \x01.
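
As a sketch of that second option, here is the idiom with \001 standing in for newline, applied to a made-up task (joining lines that end in a backslash with the line that follows). The function name and the task are illustrative assumptions; the point is only that tr never touches anything but single-byte ASCII values, so utf8 text passes through unharmed.

```shell
# The stand-in character: ASCII control value 1, unlikely to occur in text.
SEP=$(printf '\001')

# Join backslash-continued lines: make newlines visible to sed as \001,
# delete each backslash-plus-\001 pair, then turn the rest back into newlines.
join_continuations_tr() {
    tr '\n' '\001' | sed "s/\\\\${SEP}//g" | tr '\001' '\n'
}

printf 'ñ one \\\ntwo\nthree\n' | join_continuations_tr
```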

Or can I find a way to do with sed what I've been doing with tr? Sed has its own translate command but it works on lines so cannot be used to join lines. The sed "N" command can be but gets so mind-boggling that I had always rejected it in favour of using tr. Here are the tr and sed cmdlines for comparison:

	tr '\n' 'ñ'
	sed ':a $!{N;ba}; s/\n/ñ/g'

When considering the entire idiom the sed approach fares less badly since we needn't, and never really wanted to, replace NL-characters:

	tr '\n' 'ñ' |sed ... |tr 'ñ' '\n'
	sed ':a $!{N;ba}; ...'

The "sed ..." part will now operate on '\n' rather than on 'ñ', and thus that part actually improves in comprehensibility. Note that any '\n' introduced with sed commands receives the same quirky treatment as the ones made visible with the N-command; ie: it is not treated as a line-ender (not until sed is forced to reread the file). This oft-times annoying quirk is actually good news when using this idiom, as it ensures that the entire file keeps on behaving like a single line.
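
A concrete instance of the sed-only form, for a made-up task (joining lines that end in a backslash with the following line), might look like the sketch below. The function name is invented, and the script is split across several -e options purely as a portability precaution; that split is an assumption of this example, not a requirement of the idiom.

```shell
# Join backslash-continued lines by slurping the whole input into sed's
# pattern space; the embedded newlines are then matched directly as \n.
join_continuations_sed() {
    sed -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' -e 's/\\\n//g'
}

printf 'ñ one \\\ntwo\nthree\n' | join_continuations_sed
```

No stand-in character is needed at all, so the question of which charset it belongs to never arises.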

Other uses of tr can also be eliminated by switching to sed-based replacements that are often improvements rather than equivalents; for example, here are two cmdlines for lowercasing:

	tr A-Z a-z
	sed 's/.*/\L&/'

Incidentally I'm still reluctant to recommend converting to UTF-8, because the support for it remains patchy and incomplete. If the character-set of cp1252 suffices then it may well be a better choice. However if tr turns out to be the only important program that lacks utf8-support, then I'll be happy with having switched to utf8 and with the sed-based workarounds for tr.

One other problem I encountered is so bizarre I can hardly believe it: the so-called "Monospace" font which is konsole's default is NOT Monospaced when some unusual characters are present! I've made no attempt to comprehend what's happening; since I have long longed for a more programmer-friendly font, I tried switching to Bitstream-Vera-Sans-Mono which solves the non-monospaced problem, has more easily distinguishable shapes for the near lookalike characters, and seems an all-round improvement. However I'm still shaking my head: how could anyone so completely miss the most fundamental property of Monospaced (aka Fixed-pitch, aka Typewriter) fonts?

Problems with mixed-charset data

Sometimes one is forced to work with a file that contains strings in several encodings. For example, my Apache log-records contain strings in utf8, in koi8-r, in at least one of the Chinese encodings -- seemingly in whatever the visitor was using. There is no simple solution for getting something sensible out of such data; however, judicious use of LC_CTYPE=C can solve many such problems. By running a character-oriented command with LC_CTYPE=C in effect one can work with bytes rather than characters.

My version of wc complains in annoying detail about every single case of "invalid character", but can be kept quiet by using LC_CTYPE=C which also means a byte-count result rather than a character-count. (When wc is used purely for the line-count as I often do, the complaining seems especially inappropriate.) My uniq is broken wrt mixed-charset data: as used in weblog-search-strings-report it fails to combine identical koi8 strings that are admittedly invalid in my default LC_CTYPE (utf8); turns out that specifying LC_CTYPE=C gets around this quirk. And whereas wc complains too much, uniq complains not at all, making its behaviour come as a bigger surprise.
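
A minimal sketch of those byte-oriented workarounds follows. The function names are invented, and it uses LC_ALL=C rather than LC_CTYPE=C, which is an assumption of this example: LC_ALL overrides LC_CTYPE if it happens to be set in the environment, and it also puts sort into plain byte order.

```shell
# Count distinct lines, comparing raw bytes so that strings which are
# invalid in the current locale's charset are still merged correctly.
count_distinct_lines() {
    LC_ALL=C sort | LC_ALL=C uniq -c
}

# Line count that stays quiet about "invalid characters".
count_lines() {
    LC_ALL=C wc -l
}

# Two identical koi8-r strings (octal 320 322 311) plus one ASCII line.
printf '\320\322\311\nabc\n\320\322\311\n' | count_distinct_lines
```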

Life was simple in the good old days when everything was ASCII, but on first hearing that ominous word "codepage" we knew life would never be simple again:-) Of course the good old days of my youth weren't really that idyllic since some computers used EBCDIC, others ASCII, and some still used BCDIC the 6-bit charset that EBCDIC was based on. Furthermore BCDIC and EBCDIC each came in several variants. And when I was working at a CDC-shop (Chalk River Nuclear Laboratories) in 1974, a new release of CDC's operating-system went from one BCDIC-variant to a slightly different one, and this seemingly small change, going from a 63-valued to a 64-valued charset, led to much confusion and consternation. Mind you, that change was as mind-boggling as needing to modify all C-programs in Unix to permit character-strings to contain "NUL" characters.

Send your questions, suggestions, corrections to ereimer@shaw.ca.

ereimer.net
