[Typo3-dev] character set handling in Typo 3.6: storage and processing

Mon Oct 6 10:03:38 CEST 2003

Hi!

I have tried to put together my thoughts and ideas on this matter, and I propose the following two new config options:

SYS['storageCharset']
- the charset/encoding to be used for ALL STORED data

SYS['internalCharset']
- the charset used to process data after input and before storing

* Why remove BE['forceCharset']?

It is not about the back end but about the whole system, and it implicitly affects the front end as well. Data from the database may or may not have to be converted before it is sent to the client (i.e. the browser).
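
To illustrate the "may or may not have to be converted" step, here is a minimal sketch in Python (not Typo3 code). The function name for_client and its parameters are hypothetical; the only assumption is that the stored bytes are in the storage charset and the page is delivered in some output charset.

```python
import codecs

def for_client(stored: bytes, storage_charset: str, output_charset: str) -> bytes:
    """Return the stored bytes in the charset the client expects.

    Hypothetical helper: converts only when the two charsets differ.
    """
    # codecs.lookup() normalizes aliases, so "UTF-8" and "utf8" compare equal.
    same = codecs.lookup(storage_charset).name == codecs.lookup(output_charset).name
    if same:
        return stored  # no conversion needed
    # Decode from the storage charset, then encode for the client.
    # Raises UnicodeEncodeError if the output charset lacks a character.
    return stored.decode(storage_charset).encode(output_charset)

stored = "é".encode("utf-8")
latin1 = for_client(stored, "utf-8", "iso-8859-1")
```

If the output charset cannot represent a stored character, the conversion fails, which is one more reason to choose the storage charset with all site languages in mind.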

* GFX['TTFLocaleConv'] must go too?

Yes, it only works with the PHP recode extension. If we need a character set for GFX, I suggest introducing GFX['displayCharset'].

* So what is SYS['storageCharset'] about?

SYS['storageCharset'] makes it clear that all data stored in the db (in template files as well? I am unsure about this) will be stored in this particular character set.

For sites with a single language, or with a group of languages that share a character set, "traditional" character sets (e.g. Polish: iso-8859-2, Chinese: gb2312) are ok. Most character sets also include the characters of 7-bit ASCII, so a Chinese/English site may be run with gb2312.

If you have languages with different "traditional" character sets (e.g. French iso-8859-1 and Hungarian iso-8859-2), you must use an encoding of Unicode. For European languages UTF-8 will be the best choice, as it is fairly compact and may also be used for HTML pages. If you are using Asian languages, UCS-2 or UTF-16 may be better choices *).
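
The French/Hungarian case can be checked with a few lines of Python. This is only a sketch; the helper name usable_charsets and the sample words are my own choice for illustration.

```python
def usable_charsets(text, candidates=("iso-8859-1", "iso-8859-2")):
    """Return the candidate charsets that can encode every character of text."""
    usable = []
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue  # at least one character is missing from this charset
        usable.append(name)
    return usable

french = "très"      # contains U+00E8, which is not part of iso-8859-2
hungarian = "főnök"  # contains U+0151, which is not part of iso-8859-1

only_latin1 = usable_charsets(french)
only_latin2 = usable_charsets(hungarian)
neither = usable_charsets(french + " " + hungarian)
with_unicode = usable_charsets(
    french + " " + hungarian,
    candidates=("iso-8859-1", "iso-8859-2", "utf-8"),
)
```

Each word fits its own "traditional" charset, but the combined text fits neither of them; only the Unicode encoding remains.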

The value of SYS['storageCharset'] will be used to store ALL data, whether entered in the BE or the FE. It will be used for import, etc. as well.

* Why do we need SYS['internalCharset']?

We must be able to do string processing on the data. We might be able to convert between a number of charsets (encodings), but we might not be able to operate on all of them.
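
A small Python sketch of why processing needs a charset the code really understands: byte-wise functions silently give wrong results on multi-byte data. The function names are hypothetical, and Python's bytes type merely stands in here for charset-unaware string functions.

```python
word = "árvíz"              # Hungarian, 5 characters
raw = word.encode("utf-8")  # 7 bytes: á and í take 2 bytes each

def upper_bytewise(data: bytes) -> bytes:
    """Charset-unaware upper-casing: only ASCII letters are changed."""
    return data.upper()

def upper_charset_aware(data: bytes, charset: str) -> bytes:
    """Decode first, operate on characters, then encode again."""
    return data.decode(charset).upper().encode(charset)

byte_length = len(raw)                  # counts bytes, not characters
char_length = len(raw.decode("utf-8"))  # counts characters

wrong = upper_bytewise(raw).decode("utf-8")
right = upper_charset_aware(raw, "utf-8").decode("utf-8")
```

The byte-wise variant leaves the accented letters in lower case and reports a length of 7 for a 5-character word; the charset-aware variant gets both right.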

I suggest supporting UTF-8 (besides the single-byte charsets already supported by PHP). Support for other Unicode encodings may be added later.

* What must be done to implement these configs?

All code around INSERTs/UPDATEs and SELECTs, especially library code, must be reviewed. The scripts must ensure that the correct conversions have been applied and that the data has been fitted, i.e. truncated **) to the database column length without damaging the last multi-byte character. I have written code for t3lib_cs that implements this functionality.
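
The truncation step can be sketched as follows. This is a Python illustration for UTF-8 only and not the t3lib_cs code mentioned above; the name truncate_to_bytes is hypothetical, and the column length is assumed to be given in bytes.

```python
def truncate_to_bytes(text: str, max_bytes: int, charset: str = "utf-8") -> bytes:
    """Encode text and cut it to max_bytes without splitting a character."""
    encoded = text.encode(charset)
    if len(encoded) <= max_bytes:
        return encoded
    cut = encoded[:max_bytes]
    # The cut may end in the middle of a multi-byte sequence. Decoding with
    # errors="ignore" drops that incomplete tail; everything before it is
    # valid because it came from a proper encode() call.
    return cut.decode(charset, errors="ignore").encode(charset)

# "Größe" is 7 bytes in UTF-8: G r ö(2 bytes) ß(2 bytes) e
three = truncate_to_bytes("Größe", 3)  # the cut falls inside "ö"
four = truncate_to_bytes("Größe", 4)   # the cut falls right after "ö"
```

With a limit of 3 bytes the half-cut "ö" is dropped completely, so only 2 bytes are stored instead of 3 bytes ending in a broken character.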

Perhaps the conversion can be semi-automated by library calls. Ditto the string truncation if INSERT/UPDATE statements are auto-generated by convenience functions.

Valued Typo3 developers, please think about it. It takes more to implement a multi-encoding, multi-byte application than just to "add UTF-8" somewhere. I have not yet addressed all design options, issues and complexities. That said, I suggest that we design these features rather than add them in an ad-hoc manner.

Kind regards to all of you,
Masi

###

*)

These encodings use 2 bytes per character (rare characters are expressed as 4-byte "surrogate pairs" in UTF-16), which is more than the single byte used for US-ASCII. But even accented characters (and Cyrillic, Greek, Hebrew, Arabic and other characters) will use 2 bytes in UTF-8. For Chinese characters 3 bytes will be used. Some characters (outside the 16-bit range, not available in UCS-2) even take up 4 bytes when encoded in UTF-8.
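
The byte counts can be verified with a short Python sketch. The helper name encoded_sizes and the sample characters are my own choice; Python has no separate UCS-2 codec, so only UTF-8 and UTF-16 are compared.

```python
def encoded_sizes(char: str) -> tuple:
    """Return (bytes in UTF-8, bytes in UTF-16) for one character."""
    # "utf-16-be" is used so that no byte order mark is counted.
    return (len(char.encode("utf-8")), len(char.encode("utf-16-be")))

samples = ["A", "é", "я", "中", "\U0001D11E"]  # last one lies outside the 16-bit range
sizes = {char: encoded_sizes(char) for char in samples}
```

For "A" this gives (1, 2), for "é" and "я" (2, 2), for "中" (3, 2), and for U+1D11E (4, 4).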

**) 

This must be done anyway, since not all databases do a silent truncation like MySQL. I have understood that supporting other databases is a long-term goal.