2009-08-29: ER: made v1.01 based on libcharguess by Stephane Corbe, which in turn was based on 2003 source from Mozilla. It installs lib/libcharguess include/charguess.h bin/charsetdetective man/man1/charsetdetective.1.

2009-08-30: ER: made v2.00 from current Mozilla source, from http://mxr.mozilla.org/seamonkey/source/extensions/universalchardet/src/base/ renaming nsUniversalDetector.h to universalchardet.h, and making minor modifications, flagged with "//ER:" comments, to universalchardet.h and nsUniversalDetector.cpp. This version omits the "charguess" interface by Corbe since I prefer the Mozilla interface. It installs lib/libuniversalchardet include/universalchardet.h bin/charsetdetective man/man1/charsetdetective.1.

The MOZFILES.txt file shows the date of each Mozilla file as included in this version. You may want to compare it to http://mxr.mozilla.org/seamonkey/source/extensions/universalchardet/src/base/ to detect new versions.

Testing & Comparing v2.00 to v1.01: I used a set of 49 "language-pack" files in a wide variety of encodings, where the encodings are known (28 languages, many of them in two encodings). v1.01 got the wrong answer on 9 of them. UTF-8 was misdiagnosed as EUC-JP, EUC-KR, or gb18030, and Hebrew in windows-1255 was misdiagnosed as x-mac-cyrillic; those misdiagnoses accounted for 6 of the 9 errors. All 6 of those are fixed by v2.00! However, v2.00 introduced one new error: Dutch in latin1 is misdiagnosed as windows-1255. It has gone from not recognizing Hebrew to seeing it everywhere. The remaining failures are unchanged: it still does not recognize Czech or Slovak in cp1250, or Hungarian in 8859-2.

This Mozilla software is the best I've found for language and encoding detection, and while it has improved a good deal since 2003, there's room for further improvement.

Further testing on v2.00: I ran further tests on thousands of files, the vast majority being in compatible subsets of cp1252 (i.e. ASCII, 8859-1, or cp1252) or in UTF-8, and concentrated on the roughly one hundred files diagnosed as something other than those. I encountered only one clear-cut ERROR: a file in Plautdietsch in cp1252 was misdiagnosed as cp1255 (Hebrew).

The detector's most blatant errors involve Nether-Germanic languages (Dutch and Plautdietsch) being misdiagnosed as Hebrew. Just because those languages resemble Yiddish does not make them close to Hebrew :-) Aha, my little joke may actually explain what has gone wrong: perhaps Mozilla's supposedly "Hebrew" samples have been contaminated with Yiddish?

The GENEWEB source (my version is from 2004-Dec14) has several "language pack" files that are perverse, being multi-lingual and multi-charset! In an exception to the when-in-doubt-assume-Hebrew rule, these are deemed to be in cp1251 (Cyrillic), and some parts of them actually are. The GENEWEB files and the Plautdietsch file are cases where "unknown" would have been the right answer, but that is not the answer that was given. Incidentally, I have hundreds of files in Plautdietsch, and only one was misdiagnosed.

Note: it does give "unknown" for too many files. In my testing, all but one of the "unknown" files turned out to be safely treated as "windows-1252" (mind you, the vast majority of my files are safe that way). The one exception was the Slovak file in cp1250 mentioned earlier under "language pack" testing.

Other results were mildly disappointing, although helpful when charsetdetective is used in conjunction with simple-minded tests. Several files in a messed-up mixture of encodings, part cp1252 and part UTF-8, are pronounced to be UTF-8 despite being provably non-UTF-8. However, since my other methods regarded them as windows-1252, this differing opinion was helpful in finding such anomalies. Here too "unknown" would appear to be the correct, albeit less useful, answer.
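
The cross-check between the detector's opinion and simple-minded tests can be sketched as follows. This is only an illustration in Python, not something shipped with this package. The function names and the "mixed?" label are made up for the example, and the guess strings ("utf-8", "unknown") are assumptions about what a detector might report, not charsetdetective's documented output. The "unknown" fallback to windows-1252 reflects my own file collection and will not suit everyone.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Simple-minded test: does the whole buffer decode as strict UTF-8?"""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def cross_check(data: bytes, detector_guess: str) -> str:
    """Combine a detector's guess (hypothetical label) with two byte-level tests.

    Returns a charset label, or "mixed?" when the detector says UTF-8
    but the bytes are provably not valid UTF-8.
    """
    if data.isascii():
        # Pure ASCII is valid in every encoding discussed here.
        return "ascii"
    valid = is_valid_utf8(data)
    guess = detector_guess.lower()
    if guess in ("utf-8", "utf8"):
        # A UTF-8 verdict on invalid bytes flags a file worth inspecting by hand.
        return "utf-8" if valid else "mixed?"
    if guess == "unknown":
        # Assumption: most "unknown" files are safe to treat as windows-1252.
        return "utf-8" if valid else "windows-1252"
    # Otherwise pass the detector's answer through unchanged.
    return detector_guess
```

A file that comes back as "mixed?" is exactly the kind of anomaly described above: the detector and the byte-level test disagree, so the file deserves a closer look.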

On another set of files that are doubly messed up, charsetdetective does less well. These files are a mixture of plain cp850 and cp850 that has been mis-converted to UTF-8 (by being run through a cp1252-to-utf8 conversion). Some of them are pronounced to be UTF-8, others cp1252, and some are diagnosed as cp1252 despite being trivially provable to be something else. For these, "cp850" would be a better answer. Asking it to do something useful with such a mess is asking a lot. However, it turns out that valid cp850 files are also commonly misdiagnosed as being cp1252, or as "unknown".

Beware: if you have old files in the IBM/MS-DOS codepages 437, 850, 858, etc., you're advised to identify them by some other method.
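
For the doubly messed-up files, a rough line-by-line repair might look like the sketch below. It is Python written for illustration only; `repair_line` is a hypothetical helper, not part of this package. It assumes you already know, by some other method, that the file is cp850 with some lines mistakenly run through a cp1252-to-utf8 conversion. Applied to genuine UTF-8 text in a Western European language, it would corrupt it. Note also that Python's cp1252 codec leaves five bytes undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D), so lines whose original cp850 text used those bytes (for example "ü" at 0x81) are not undone by this sketch and fall through as-is.

```python
def repair_line(raw: bytes) -> str:
    """Decode one line that is either raw cp850, or cp850 that was
    mistakenly treated as cp1252 and converted to UTF-8."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the bytes are untouched cp850.
        return raw.decode("cp850")
    try:
        # Undo the mistaken conversion: recover the original bytes via cp1252,
        # then read those bytes as what they really were, cp850.
        return text.encode("cp1252").decode("cp850")
    except UnicodeEncodeError:
        # Contains characters that cp1252 cannot encode: leave the text alone.
        return text


def repair_file(data: bytes) -> str:
    """Apply repair_line to every line, keeping the line endings."""
    return "".join(repair_line(line) for line in data.splitlines(keepends=True))
```

For example, "café" in cp850 is the bytes `caf\x82`. Read as cp1252, byte 0x82 is a low quotation mark, which the bad conversion stores as the UTF-8 bytes `\xe2\x80\x9a`. Both forms come back as "café" from `repair_line`.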