[TYPO3-v4] Database utf-8 conversion and detection
Jigal van Hemert
jigal at xs4all.nl
Fri May 13 14:45:37 CEST 2011
Hi,
In 4.5 we made UTF-8 the default character set and also set up the
database connection using UTF-8 by default.
I already made a (standalone) script [1] to convert UTF-8 encoded data
that is stored in tables declared as latin-1 (or another charset). The
problem so far was how to detect such situations.
I think I found a way to detect fields that potentially contain
incorrectly encoded data [2].
It uses the fact that regular expressions in MySQL are not UTF-8
compliant: they match byte by byte. The regular expression tests for the
byte ranges which occur in multi-byte UTF-8 sequences [3]; only columns
without a utf-8 charset (and collation) are tested. The test script
outputs the name of each table and, where the regexp matches, the name of
the matching column (followed by an exclamation mark).
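As a rough illustration of the byte-range idea (a Python sketch, not the
SQL used in the test script [2]; the pattern is simplified and the name
looks_like_utf8 is made up for this example):

```python
import re

# Simplified byte-level pattern for multi-byte UTF-8 sequences: a lead byte
# followed by the right number of continuation bytes (0x80-0xBF).
# Assumption: the pattern in the actual test script may differ.
UTF8_MULTIBYTE = re.compile(
    rb"[\xC2-\xDF][\x80-\xBF]"        # 2-byte sequence
    rb"|[\xE0-\xEF][\x80-\xBF]{2}"    # 3-byte sequence
    rb"|[\xF0-\xF4][\x80-\xBF]{3}"    # 4-byte sequence
)


def looks_like_utf8(raw: bytes) -> bool:
    """Return True if raw contains at least one multi-byte UTF-8 sequence."""
    return UTF8_MULTIBYTE.search(raw) is not None


if __name__ == "__main__":
    print(looks_like_utf8("caf\u00e9".encode("utf-8")))    # UTF-8 bytes
    print(looks_like_utf8("caf\u00e9".encode("latin-1")))  # latin-1 bytes
```

Matching on raw bytes is the point here: a latin-1 accented letter is a
single byte above 0x7F, while the same letter in UTF-8 is a lead byte
plus one or more continuation bytes.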
The test is not 100% foolproof, simply because it could happen that
someone put a latin-1 character sequence in the database which happens
to form a valid utf-8 string. I think the install tool could include a
test like this and if it finds potential problems offer the user the
option to perform the conversion [1].
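To make the ambiguity concrete, here is a small Python sketch (not part
of either script; the helper name reinterpret_as_utf8 is hypothetical).
It shows a legitimate latin-1 string whose bytes are also valid UTF-8,
which is why a match can only be treated as a potential problem:

```python
from typing import Optional


def reinterpret_as_utf8(stored: bytes) -> Optional[str]:
    """Decode stored bytes as UTF-8; return None if they are not valid UTF-8."""
    try:
        return stored.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Genuine latin-1 text: capital A with tilde followed by the copyright sign.
latin1_text = "\u00c3\u00a9"
stored = latin1_text.encode("latin-1")  # the two bytes 0xC3 0xA9

# The same two bytes are also the UTF-8 encoding of e with acute accent,
# so a byte-level test cannot tell the two cases apart.
ambiguous = reinterpret_as_utf8(stored)

# Ordinary latin-1 text with a single accented byte is not valid UTF-8.
not_utf8 = reinterpret_as_utf8("caf\u00e9".encode("latin-1"))
```

Because both readings are valid, leaving the final decision to the user
seems the safer design.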
What do you think?
[1] http://www.xs4all.nl/~dcbjht/typo3/db_utf8_fix.zip
[2] http://www.xs4all.nl/~dcbjht/typo3/db_utf8_test.zip
[3] http://en.wikipedia.org/wiki/UTF-8
--
Kind regards / met vriendelijke groet,
Jigal van Hemert.