[TYPO3-50-general] UTF-16

Robert Lemke robert at typo3.org
Thu Nov 9 00:19:28 CET 2006


Hi Martin,

Martin Kutschker schrieb:

> Some of my editors (Windows) write BOMs, which older PHP versions 
> didn't like. Probably PHP6 will treat them correctly.

I would expect that at least ...
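
For reference, the UTF-8 BOM is just the three bytes EF BB BF in front of 
the opening tag. Here is a minimal sketch of detecting and stripping it, 
written in Python rather than PHP only to keep it short. The function name 
and the sample file content are made up for illustration:

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical content of a PHP file saved by an editor that writes a BOM.
with_bom = codecs.BOM_UTF8 + b"<?php echo 'hello'; ?>"
clean = strip_utf8_bom(with_bom)
```

Running something like this over the files before deployment would be one 
way to work around editors that insist on writing the BOM.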

> IIRC UTF-16 uses at least two bytes for each char, which raises the 
> problem of endianness. Is there any preference for this?

No, not that I know of. As far as I could see, all the examples given 
for PHP6 were Little Endian, but we might ask which endianness makes 
most sense.

The Unicode support of PHP6 is based on the International Components for 
Unicode (ICU), which is an IBM project [2]. On their website they write:

   "A Unicode string is currently represented as UTF-16. The endianess of
    UTF-16 is platform dependent. You can guarantee the endianess of
    UTF-16 by using a converter. UTF-16 strings can be converted to other
    Unicode forms by using a converter or with the UTF conversion
    macros."

Whatever it means that it's platform dependent ...
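
My best guess is that "platform dependent" refers to the byte order of 
the machine the code runs on. The following sketch shows the idea in 
Python (not PHP6 or ICU itself, so take it only as an illustration of 
the concept): the explicit "-le" and "-be" codecs fix the byte order, 
while the plain "utf-16" codec writes a BOM followed by bytes in the 
native order of the platform.

```python
import codecs
import sys

text = "A"  # U+0041

# Explicit byte order: the same character, two different byte sequences.
little = text.encode("utf-16-le")
big = text.encode("utf-16-be")

# The plain "utf-16" codec prepends a BOM and uses the native byte order.
native = text.encode("utf-16")

if sys.byteorder == "little":
    expected = codecs.BOM_UTF16_LE + little
else:
    expected = codecs.BOM_UTF16_BE + big
```

So data written without an explicit byte order or a BOM can only be read 
back reliably on a platform with the same endianness.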

>> Use PHP6?
> 
> Given that TYPO3 5 is to me still "only" a vision, it makes sense. Zend 
> will probably deliver a stable PHP6 faster than the TYPO3 community 
> can rewrite TYPO3.

Absolutely. Given the fact that they plan the release for next spring or 
so, PHP6 will be stable enough when TYPO3 5.0 comes out.

>> Use UTF-16 for the PHP files or UTF-8?
> 
> Using UTF-16 will make all files twice as big as necessary. 
> Roughly 99.99% of all characters of a PHP file are in the ASCII range.
> 
> So I guess that converting from UTF-8 to UTF-16 takes only a minimal 
> amount of time in relation to parsing the PHP code itself. My opinion 
> is: no, don't use it for PHP files; it's a waste of space and perhaps 
> a problem with editors.

After discussing it for a while with Karsten we came to the same 
conclusion, but not because of waste of size (I don't think that that is 
really a problem). The point is that it will most likely cause trouble 
with editors (I wouldn't know for example how to tell vi or Midnight 
Commander to use UTF-16 when I have to repair a script on the live server).

So, let's use UTF-8 for the .php files and UTF-16 everywhere else.
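
To put rough numbers on the size question, here is a small sketch 
(again in Python, purely for illustration) that compares the encoded 
length of the same text in UTF-8 and UTF-16. The PHP snippet and the 
German sample string are invented examples, not taken from TYPO3:

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, both without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Hypothetical pure-ASCII PHP source.
php_source = "<?php\nfunction hello($name) {\n    return 'Hello ' . $name;\n}\n"

# "Größe und Übersicht", written with escapes so that the source
# encoding of this example does not matter.
german_text = "Gr\u00f6\u00dfe und \u00dcbersicht"

php_utf8, php_utf16 = encoded_sizes(php_source)
de_utf8, de_utf16 = encoded_sizes(german_text)
```

For pure ASCII the UTF-16 version is exactly twice as long. For Western 
European text it is somewhat less than twice, because every umlaut 
already takes two bytes in UTF-8.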

> 
> The same question comes up when we talk about Western European sites. 
> Do I really want to store UTF-16 in my DB? Maybe TYPO3 doesn't need to 
> handle this. At least on MySQL I can have different charsets for client 
> and server. So MySQL could transparently deliver UTF-16 but store UTF-8.

As we were told, these conversions can really hit performance, so why 
not avoid them? I don't see any drawback in storing the data as UTF-16 
in the database or XML files. Or is space really an issue?

robert

[1] http://en.wikipedia.org/wiki/International_Components_for_Unicode
[2] http://www-306.ibm.com/software/globalization/icu/index.jsp
[3] http://en.wikipedia.org/wiki/Byte_Order_Mark


