[TYPO3-dev] Suggestions for robust UTF-8 support

David Förster david.foerster at andrena.de
Thu Mar 20 13:32:44 CET 2008


Hi to all,

In the past weeks we have ported a web site to Typo3, set up to use UTF-8 
everywhere. While doing so I encountered some problems with the handling of 
multibyte character sets in Typo3 (bugs 7869 and 7882).

In case you're not familiar with UTF-8*: It's a character set that can store 
virtually any character, including Chinese and other non-Latin ones. To 
accomplish that, it uses more than one byte for every character outside plain 
ASCII. Once an application supports it, it can be used with strings (content 
in Typo3's case) in any language. 

The main challenge of supporting UTF-8 is that you can no longer rely on the 
length of a string being equal to the number of bytes it occupies. By default 
PHP has no UTF-8 support at all, and its string functions (like strlen) will 
return incorrect results for UTF-8 strings. Fortunately there's the mbstring 
extension, which provides drop-in replacements for the string functions (e.g. 
mb_strlen). It also provides the mbstring.func_overload setting, which simply 
replaces the default string functions with the mb_* ones. That way, a web 
application like Typo3 can in principle be UTF-8-enabled without any further 
modification.
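
To make the difference concrete, here is a minimal sketch. It assumes that 
mbstring is loaded and that mbstring.func_overload is off, so strlen is still 
the byte-oriented original. The sample string is only an illustration.

```php
<?php
// "Förster" written with explicit UTF-8 bytes, so the result does not
// depend on the encoding this file is saved in. "ö" is 0xC3 0xB6.
$name = "F\xC3\xB6rster";

// Assumption: mbstring.func_overload is off, so strlen() counts bytes.
$bytes = strlen($name);

// mb_strlen() counts characters when told the string is UTF-8.
$chars = mb_strlen($name, 'UTF-8');

echo "strlen: $bytes, mb_strlen: $chars\n"; // prints "strlen: 8, mb_strlen: 7"
```

With func_overload enabled for string functions, the plain strlen call would 
report the character count instead, which is exactly what code relying on byte 
counts does not expect.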

However, there are currently some problems with that very setting in Typo3 
(the two bugs mentioned above). One of them is the export/import feature 
relying on string length == byte count. Being aware of UTF-8, you should fix 
this by introducing a byte_count function and using it in the rare cases where 
the byte count of a string is needed instead of its length. (An example of 
this function is attached to one of the reports.)
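
For illustration, a minimal sketch of what such a function could look like. 
This is not the version attached to the bug report, just one possible 
approach. It assumes that mbstring's '8bit' encoding, which treats every byte 
as one character, is available whenever mbstring is loaded.

```php
<?php
/**
 * Hypothetical helper: returns the number of bytes in $string,
 * whether or not strlen() has been overloaded by mbstring.
 */
function byte_count($string)
{
    if (function_exists('mb_strlen')) {
        // In the '8bit' encoding every byte counts as one character,
        // so the "length" reported here is the byte count.
        return mb_strlen($string, '8bit');
    }
    // Without mbstring, strlen() cannot be overloaded and counts bytes.
    return strlen($string);
}

$payload = "F\xC3\xB6rster"; // "Förster" as UTF-8 bytes
echo byte_count($payload), "\n"; // prints 8
```

Code such as the export/import feature would then call byte_count wherever it 
writes or checks a length in bytes, and keep strlen for character counts.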

For robust UTF-8 support in Typo3 I suggest:

- developers getting familiar with UTF-8
- fixing Typo3 to work with mbstring.func_overload enabled (distinguishing 
between string length and byte count)
- recommending this option in the UTF-8 documentation (wiki)

I'm willing to help with that and send patches, but there has been very 
little response to the bug reports so far.

Regards,
David


* UTF-8 is just one popular example of a multibyte character set. My 
suggestions apply to multibyte character sets in general.



