One I can't explain
While trying to clean up some uncategorized pages, I hit upon the page [ [User:Оракин/Оракин] ] - "[ [User:Оракин] ]"
I moved [ [Оракин] ] (and got rid of the redirect) to [ [User:Оракин/Оракин] ] -- so far so good, adding "Category: Users" to the root page, as it had no category either, and was not showing up in the Category:Users list.
Then I stuck the category: Dwarf Player Characters on that page... ok also.
But then when I went to look at the category Category: Dwarf Player Characters, I find [ [User:Оракин/Оракин] ] not under OH as I would have expected, but apparently under zero -- but after "Z".
I can't tell if that is a zero or an oh... they both look identical on my screen -- like an uppercase oh. Only the sort order gives a different impression.
I also note that in the Category:Users list, that entry is the very last one...
Is this just an artifact of that first letter actually being a Cyrillic letter (guessing from the rest of his caption), or of whatever character set is being used?
Is there any way to "view" the ascii of that string on the WIKI? One assumes it is a standard UTF-8 or some such encoding.
I can easily insert such characters on the WIKI, but do not know how to view them "in their native ascii" -- assuming that it is possible, of course.
BTW, Google translates the caption as Russian: 'Page Minstrel Aglarond Orakina Laytbringera Tumunzahar Commonwealth of Khazad'
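One way to see what is actually in such a string, outside the wiki, is to paste it into a short script. The following is only a sketch in Python 3; the helper name `describe` is made up for illustration, and it assumes the pasted text reaches the script intact as Unicode.

```python
import unicodedata

def describe(text):
    """Return (character, code point, Unicode name, UTF-8 bytes in hex) per character."""
    rows = []
    for ch in text:
        code_point = f"U+{ord(ch):04X}"
        name = unicodedata.name(ch, "<unnamed>")
        utf8_hex = " ".join(f"{b:02X}" for b in ch.encode("utf-8"))
        rows.append((ch, code_point, name, utf8_hex))
    return rows

# The user name from the page title, followed by a plain Latin "O" for comparison.
for row in describe("Оракин" + "O"):
    print(*row)
```

The first row reports CYRILLIC CAPITAL LETTER O at U+041E, while the final Latin O is U+004F, so the two letters can look identical on screen and still be different characters.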
- This user's name and character name use Unicode characters from the Cyrillic character set. They will likely be encoded using UTF-8. You have to be careful working with such text to avoid trashing it by passing it through components that are not 8-bit clean and UTF-8 aware. You don't want to ever get the UTF-8 converted to an 8-bit encoding, as will sometimes happen if you copy and paste through editors or tools on your local system. They should be preserved properly if you copy/paste from one browser window to another; at least they are for me on a Win7 box running Firefox.
- Unicode sort sequences are an interesting and complex subject. Sort sequences range from simple binary sorts on the 16-bit (or 32-bit) code points, to more intelligent orderings that attempt to honour the sort order used by a language or regional variant. Most of us are familiar with the sort order for ASCII, where the character encoding is arranged so that a binary sort and customary dictionary sort for letters are identical. But when we add the common European accented characters to the mix, the binary code point sort order no longer corresponds to the dictionary sort order, and we must also care about the language and region of the dictionary. On the wiki's preference page, there is a drop-down box where we can select a language and region. The default is "en - English", with regional variants like "en-CA - Canadian English". These choices will (potentially) specify different sort orders for handling accented characters, perhaps because Canadian English tends to include more accented characters from French language place names and other words. Characters which are not normally part of these languages or regional variations will probably be sorted according to their binary code point values. This is what will place the Cyrillic characters toward the end of the sort order.
- We don't want to ever convert non-ASCII Unicode characters to one of the 8-bit code page encodings. The multitude of 8-bit encodings were an early hack at supporting internationalization from an era when bytes were expensive. A couple of the early hardware platforms I worked on each had their own 6-bit character encodings that would pack 6 6-bit characters into their 36-bit words. The biggest problem with those 8-bit encodings was that you could not mix and match, say, Greek, Arabic, and Cyrillic in the same text document. With Unicode, and UTF-8, that is easily done.
- When it comes to viewing uncommon Unicode characters, you need to install a font that has glyphs for the characters you want to view. If you have at least one font that has a large selection of code points, your browser should be able to use it to fill in the missing areas from your preferred font. I always make sure to install and test internationalization support on my computers. For a quick test, translate some text using Google translate into Greek, Arabic, and Russian. If the translations are visible in the appropriate scripts, you have the right fonts installed in your browser. But you will have to test out non-browser tools, such as editors, to make sure they will work. For instance, copy/paste the next four lines into your favourite editor, save, close, and re-open the file to test this out.
- The quick brown fox jumps over the lazy dog.
- Η γρήγορη καφέ αλεπού πηδάει πάνω από το μεσημέρι.
- الثعلب البني السريع يقفز فوق الكلب الكسول.
- Быстрая коричневая лиса прыгает над ленивой собакой.
- Oh, and just for fun, try translating Google's translations back to English.
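To make the two points above concrete (binary code point ordering, and what an 8-bit code page does to Cyrillic), here is a rough sketch in Python 3. The entry names other than Оракин are invented for illustration, `code_point_sort` and `to_8bit` are hypothetical helpers, and cp1252 merely stands in for "some Western 8-bit code page". Nothing here claims to be how the wiki itself sorts its categories.

```python
# Hypothetical category entries; only "Оракин" starts with Cyrillic О (U+041E).
entries = ["Zed", "0racle", "Оракин", "Oakenshield"]

def code_point_sort(items):
    """Sort by raw code point values, which is how Python compares str objects."""
    return sorted(items)

def to_8bit(text, encoding="cp1252"):
    """Force text through an 8-bit code page, replacing whatever it cannot represent."""
    return text.encode(encoding, errors="replace").decode(encoding)

# Digit zero (U+0030) sorts first, Latin O (U+004F) before Z (U+005A),
# and Cyrillic О (U+041E) lands after everything in the ASCII range.
print(code_point_sort(entries))

# Every Cyrillic letter becomes "?" because cp1252 has no slot for it.
print(to_8bit("Оракин"))
```

Under a plain code point sort the Cyrillic name ends up after "Z", which matches what the category page showed, and the round trip through the 8-bit code page is exactly the kind of damage to avoid when copying such names through local tools.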
Thanks, Pretty much what I expected. One nice thing about OSX is that, by default, it does support Internalization big-time... some 30 languages and their fonts are standard; and all are part of Apple's native apps (Safari and TextEdit). Only "the Terminal" (and hence Emacs) is pure ASCII by default.
Love on-the fly Spell checkers: Internalization = internationalization
Hmmm, I wonder what Google does with that old joke about the auto-translation of the hotline -- "The Spirit is willing, but the flesh is weak" round-trip translated into "The vodka is good, but the meat is rotten".