[TYPO3-v4] Database utf-8 conversion and detection

Jigal van Hemert jigal at xs4all.nl
Fri May 13 14:45:37 CEST 2011


Hi,

In 4.5 we made UTF-8 the default character set and also set up the 
database connection using UTF-8 by default.

I already made a (standalone) script [1] to convert UTF-8 encoded data 
which is stored in latin-1 (or other charset) tables. The problem so 
far was detecting such situations.
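
To illustrate the situation, here is a minimal Python sketch. It is not 
taken from the script [1]; the function name repair_mojibake is 
hypothetical, and it assumes the column charset behaves like strict 
ISO-8859-1, whereas MySQL's latin1 is closer to cp1252.

```python
def repair_mojibake(misread: str) -> str:
    """Undo one round of 'UTF-8 bytes interpreted as latin-1'.

    Assumption: every character in `misread` came from a single byte,
    so encoding back to latin-1 recovers the original byte string.
    """
    return misread.encode("latin-1").decode("utf-8")


original = "Grüße"
stored_bytes = original.encode("utf-8")   # bytes the client actually sent
misread = stored_bytes.decode("latin-1")  # what a latin-1 column claims to hold
repaired = repair_mojibake(misread)       # back to the intended text
```

Here misread is two characters longer than original, because each of the 
two non-ASCII characters was stored as two bytes.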

I think I have found a way to detect fields which potentially contain 
incorrectly encoded data [2].
It uses the fact that regular expressions in MySQL are not UTF-8 
aware: they match byte by byte. The regular expression tests for the 
byte ranges which occur in multi-byte UTF-8 sequences [3]; only columns 
without a utf-8 charset (and collation) are tested. The test script 
outputs the name of each table and, where the regexp matches, the name 
of the column (plus an exclamation mark).
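
The idea of the byte-wise test can be sketched in Python as follows. 
This is an illustration under assumptions, not the regular expression 
from the test script [2]: the pattern is the usual set of well-formed 
multi-byte UTF-8 ranges, and the names looks_like_utf8 and 
suspect_columns are hypothetical.

```python
import re

# Byte-level pattern for well-formed UTF-8 sequences of 2 to 4 bytes.
# Plain ASCII never matches, so only multi-byte sequences are flagged.
UTF8_MULTIBYTE = re.compile(
    rb"[\xC2-\xDF][\x80-\xBF]"                # 2-byte sequences
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"           # 3-byte, no overlong forms
    rb"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"    # 3-byte, general case
    rb"|\xED[\x80-\x9F][\x80-\xBF]"           # 3-byte, no surrogates
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"        # 4-byte, no overlong forms
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"            # 4-byte, general case
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2}"        # 4-byte, up to U+10FFFF
)


def looks_like_utf8(raw: bytes) -> bool:
    """Return True if the raw column bytes contain a UTF-8 multi-byte sequence."""
    return UTF8_MULTIBYTE.search(raw) is not None


def suspect_columns(row: dict) -> list:
    """Return the names of columns whose raw bytes look like UTF-8.

    `row` maps column names to the raw bytes read from a non-utf-8 column.
    """
    return [name for name, raw in row.items() if looks_like_utf8(raw)]
```

A value such as "café" stored as real latin-1 (a single byte 0xE9 
followed by ASCII) does not match, while the same word stored as UTF-8 
bytes (0xC3 0xA9) does.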

The test is not 100% foolproof, simply because someone could have put 
a latin-1 character sequence in the database which happens to form a 
valid utf-8 string. I think the install tool could include a test like 
this and, if it finds potential problems, offer the user the option to 
perform the conversion [1].

What do you think?

[1] http://www.xs4all.nl/~dcbjht/typo3/db_utf8_fix.zip
[2] http://www.xs4all.nl/~dcbjht/typo3/db_utf8_test.zip
[3] http://en.wikipedia.org/wiki/UTF-8

-- 
Kind regards / met vriendelijke groet,

Jigal van Hemert.

