Thursday, March 31, 2005

Unicode: now, later, or beyond?

I've known about Unicode and internationalization issues for many years, but I still haven't made up my mind about them. After reading this interesting post on the topic, the question returns to haunt me: Is it finally time to bite the bullet and adopt full-bore 16-bit Unicode, or can we continue to defer the issue? Now a deeper question occurs to me: Is there something significantly beyond Unicode that we should be considering?

Personally, I think the idea of strings built from fixed-width characters is an archaic artifact of our computational "early years" and it's time to move on. Take a look at HTML, XML, or SGML. They have the concept of a "character entity", where an extended character code or even a name for a character can be written out in plain text. I don't want to suggest that we use XML as our new character representation format, but it's at least worth considering, and it's already there as a high-level external representation format.
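To make the "character entity" idea concrete, here is a minimal sketch in Python, using the standard library's html module. The helper name to_numeric_entities is hypothetical and only for illustration; it is not a proposal for an actual string format.

```python
import html

# Three ways HTML can spell the same character (e with acute accent, U+00E9):
# by name, by decimal code, and by hexadecimal code.
samples = ["&eacute;", "&#233;", "&#xE9;"]
decoded = [html.unescape(s) for s in samples]


def to_numeric_entities(text):
    """Hypothetical helper: keep ASCII as-is and write every other
    character as a hexadecimal numeric character reference."""
    return "".join(
        ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text
    )


encoded = to_numeric_entities("café")
print(decoded)
print(encoded)
```

The point is only that the external text stays plain 7-bit ASCII while still naming any character, at the cost of a variable number of bytes per character.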

Oddly, people are still concerned about the storage space and performance of 16-bit characters. Geez, get over it already.

Think about it: 8-bit character codes, they fit in a "byte"... how quaint and useless for computing in the 21st Century.

-- Jack Krupansky

2 Comments:

At 6:53 PM MDT, Joe Beda said...

Unicode has already moved beyond 16 bits per character. The original UCS-2 encoding (which Windows uses all over the place) has been extended into UTF-16 so that it can represent code points beyond the 16-bit range. This means that there are special 16-bit Unicode code units, used in pairs, that together encode a single code point above U+FFFF. These are generally called Unicode surrogates.
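As a rough sketch of how a surrogate pair is formed, the following Python splits a code point above U+FFFF into its two 16-bit code units. The function name to_surrogate_pair is made up for this example; the arithmetic follows the UTF-16 definition, and the last lines cross-check it against Python's built-in UTF-16 encoder.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go in the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F600)  # U+1F600, an emoji outside the 16-bit range

# Cross-check against Python's own big-endian UTF-16 encoder.
encoded = "\U0001F600".encode("utf-16-be")
builtin_pair = (
    int.from_bytes(encoded[0:2], "big"),
    int.from_bytes(encoded[2:4], "big"),
)
print(hex(high), hex(low), builtin_pair == (high, low))
```

Code that assumes one 16-bit unit per character will see two "characters" here, which is the practical consequence of the move from UCS-2 to UTF-16.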

A quick Google search turned up this page:
http://www.jbrowse.com/text/unicode.shtml

 
At 8:31 PM MST, Anonymous said...

What If This Could All Happen Automatically,
with a simple push of a button.....

 
