[TYPO3-dev] RFC: Unicode with preg_replace

Jigal van Hemert jigal at xs4all.nl
Tue Mar 23 23:01:47 CET 2010


David Bruchmann wrote:
> But I read many documents about charsets and know that a solution
> like that may display some characters wrong even if it's readable.

Unicode is an attempt to standardize a huge set of characters from all
kinds of (natural) languages.
The unicode.org website explains that there are some characters which
are shared between a few languages, but *look* slightly different in
each language. Unicode does not attempt to encode glyphs (graphical
representations of characters), only the characters themselves. With the
right font for each language, these characters will be displayed correctly.
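
To illustrate the character/glyph distinction, here is a minimal Python
sketch (Python is only my choice for illustration, and describe() is a
hypothetical helper name). It prints the code point and the official
Unicode name of a few characters; nothing in that data says how the
character should be drawn, which is left to the font.

```python
import unicodedata

def describe(ch):
    """Return the code point (U+XXXX) and the Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# A plain ASCII letter, a Latin-1 letter and a CJK ideograph.
for ch in ["A", "\u00e9", "\u4e2d"]:
    print(describe(ch))
```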

> Furthermore it's not important what I think about any solution
> because I neither know speaking nor writing any asian languages. More
> important is to know that utf-8 isn't accepted by all people and
> until there is perhaps sometime a really global charset we've to live
> with different ones.

Unicode is a work in progress towards a global character set. The fact
that some people do not accept it is not really relevant.

> By the way: Just for displaying some african languages you have to 
> download extra fonts where charset and font is nearly the same
> because fonts for those languages are rare.

Fonts are collections of glyphs mapped to a certain character set. For
some African languages there may or may not be Unicode fonts available.

> I haven't verified how characters are defined in those charsets but
> it shows again that utf-8 can't fit all requirements.

To make it a bit more complicated: utf-8 (and utf-16 and utf-32) are
merely encodings of the Unicode character set. In utf-8 the lower ASCII
set is encoded by the same byte codes as in ASCII. The high-ASCII
characters (Latin-1) keep the same code points in Unicode as in Latin-1,
but in utf-8 each of them is encoded by two bytes, so their byte codes
differ. CJK characters are encoded by multiple bytes (usually three).
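
A quick way to check how a given character is actually stored is to encode
it and look at the bytes. The following is a small Python sketch, not
anything TYPO3-specific; encoded_hex() is a hypothetical helper name and
the three sample characters are just my picks.

```python
def encoded_hex(ch, encoding):
    """Return the bytes of ch in the given encoding as hex, or None if
    the character cannot be represented in that encoding."""
    try:
        return " ".join(f"{b:02x}" for b in ch.encode(encoding))
    except UnicodeEncodeError:
        return None

# "A" (ASCII), e with acute accent (Latin-1 range), a CJK ideograph.
samples = ["A", "\u00e9", "\u4e2d"]
encodings = ["ascii", "latin-1", "utf-8", "utf-16-be", "utf-32-be"]

for ch in samples:
    for enc in encodings:
        print(f"U+{ord(ch):04X} {enc:10} {encoded_hex(ch, enc)}")
```

For U+00E9 this shows the single byte e9 in Latin-1 and the two bytes
c3 a9 in utf-8; "A" is the byte 41 in ASCII, Latin-1 and utf-8 alike.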

Some users of CJK characters may prefer character sets and encodings where
their favourite characters are encoded by fewer bytes (legacy CJK encodings
mostly use two bytes per character, where utf-8 uses three). This may
contribute to the fact that some people may not accept Unicode or the
utf-8 encoding.
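
The size difference is easy to measure. As a rough sketch (byte_lengths()
is a hypothetical helper, and the list of encodings is only my selection
of codecs that ship with Python), this counts the bytes a short CJK
string needs in utf-8 and in a few legacy encodings:

```python
def byte_lengths(text, encodings=("utf-8", "utf-16-le", "shift_jis", "gbk", "big5")):
    """Map each encoding name to the number of bytes text needs in it,
    or to None if the text cannot be represented in that encoding."""
    result = {}
    for enc in encodings:
        try:
            result[enc] = len(text.encode(enc))
        except UnicodeEncodeError:
            result[enc] = None
    return result

# Two ideographs (U+65E5 U+672C) that exist in all of the encodings above.
for enc, size in byte_lengths("\u65e5\u672c").items():
    print(f"{enc:10} {size} bytes")
```

For these two characters utf-8 needs 6 bytes, while the legacy encodings
(and utf-16) need 4.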

HTH
-- 
Jigal van Hemert.



