[TYPO3-core] RFC #7942: Enable UTF-8 by default

Michael Stucki michael at typo3.org
Fri Nov 12 23:21:51 CET 2010


Hi Jigal,

> The script I use simply reads the field type from the database and uses
> that information to decide which 'binary' field it must use for the
> conversion.

This sounds interesting. So far I have always considered it the job of
the Install Tool to change the field types back from binary to text after
the conversion. But storing the original field type and restoring it
afterwards should obviously work well, and out of the box.

Can you send me your script, so I can take a look at it, please?
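
To illustrate the idea, here is a minimal sketch of how such a type
lookup could work. It is an assumption, not Jigal's actual script: the
function name, the table and column in the usage note, and the target
charset are hypothetical. It maps each MySQL text type to its binary
counterpart and builds the two ALTER TABLE statements for the round trip.
Column attributes such as NOT NULL or DEFAULT are ignored and would have
to be restated in a real conversion.

```python
import re

# MySQL text types and their binary counterparts.
BINARY_EQUIVALENT = {
    "char": "binary",
    "varchar": "varbinary",
    "tinytext": "tinyblob",
    "text": "blob",
    "mediumtext": "mediumblob",
    "longtext": "longblob",
}


def conversion_statements(table, column, column_type, charset="utf8"):
    """Build the 'to binary' and 'back to text' statements for one column.

    column_type is the type as reported by the database, e.g. 'varchar(255)'.
    """
    match = re.fullmatch(r"(\w+)(\(\d+\))?", column_type.strip().lower())
    if match is None or match.group(1) not in BINARY_EQUIVALENT:
        raise ValueError(f"not a convertible text type: {column_type!r}")
    base, length = match.group(1), match.group(2) or ""
    binary_type = BINARY_EQUIVALENT[base] + length
    return [
        # Step 1: drop the charset information, keep the raw bytes.
        f"ALTER TABLE `{table}` MODIFY `{column}` {binary_type};",
        # Step 2: restore the original type, now labelled with the new charset.
        f"ALTER TABLE `{table}` MODIFY `{column}` {base}{length} CHARACTER SET {charset};",
    ]
```

For example, conversion_statements("tt_content", "header", "varchar(255)")
would yield one statement using varbinary(255) and one restoring
varchar(255) with the new character set.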

>> I'm now in favour of creating a simple mysqldump and replacing the
>> CHARACTER SET statement from table definitions.
> 
> I think that this will complicate things quite a bit. It's easy to mess
> things up with creating a dump and editing the file is way more 'hacky'
> than a few plain queries.

What I like about it is that recode and iconv tell me if there is an
unrepresentable character in the converted content. If you imported
such content into MySQL, it would silently cut the content at that
point without printing any errors...
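
The kind of check meant here can be sketched in Python, whose codecs are
strict by default and raise an error instead of truncating. This is only
an illustration of the behaviour, not a replacement for recode or iconv;
the function name check_conversion is hypothetical.

```python
def check_conversion(data, source, target):
    """Convert bytes from one charset to another, reporting failures.

    Returns (converted_bytes, None) on success or (None, message) when the
    input is invalid in the source charset or a character cannot be
    represented in the target charset.
    """
    try:
        text = data.decode(source)
    except UnicodeDecodeError as exc:
        return None, f"invalid {source} input at byte offset {exc.start}"
    try:
        return text.encode(target), None
    except UnicodeEncodeError as exc:
        bad = text[exc.start]
        return None, f"character {bad!r} at index {exc.start} is not representable in {target}"
```

A caller would abort the conversion, or log the record for manual review,
whenever the second element of the result is not None.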

> The only problem in both cases is that it's hard to be sure that records
> actually contain incorrectly encoded data. Can we only expect UTF-8 data
> in other columns, or do we have to take other multi-byte fields into
> account?

That's an interesting question, and I believe the answer is no: we
cannot expect only UTF-8 data.

TYPO3 users who don't have forceCharset set (no matter to what value) get
a backend whose charset is defined to match their language (see
t3lib_cs->charSetArray). So, for example, a user with a Thai backend
would have cp874 as the backend charset, so all content they write is
encoded in cp874 and sent to the database as such.

Now Debian, for example, uses latin1 as the default for new MySQL
databases, and I guess it does so even in Asia, where latin1 is pretty
useless. So imagine that the user has stored cp874 content in a latin1
database, and we now need to convert it to UTF-8...
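
A small Python sketch of this scenario shows why a plain latin1-to-UTF-8
conversion is not enough. Python's latin-1 codec stands in for the
database charset here, which is an assumption (MySQL's latin1 is actually
based on cp1252); both function names are hypothetical.

```python
def misread_as_latin1(stored):
    """What a naive latin1 -> UTF-8 conversion sees in the stored bytes."""
    return stored.decode("latin-1")


def repair(misread, real_charset):
    """Recover the text by going back to the raw bytes and decoding them
    with the charset the backend actually used."""
    return misread.encode("latin-1").decode(real_charset)


original = "\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35"  # a Thai greeting
stored = original.encode("cp874")       # bytes as sent by a cp874 backend
garbled = misread_as_latin1(stored)     # accented Latin letters, no Thai
restored = repair(garbled, "cp874")     # only works if cp874 is known
```

The repair step only works when the real charset is known for each
record, which is exactly the information the database does not have.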

I can only hope that everyone outside of latin1 areas is using
forceCharset = utf-8. Otherwise I count them among the 10% that we
probably cannot convert without manual intervention...


Btw: The problem may get even worse if forceCharset is not set and some
users get a latin1 backend while others get one with cp874... :-o
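
For such mixed content, a per-record check could at least separate the
safe cases from the doubtful ones. The following is a rough heuristic
sketch, not a reliable detector: it can tell pure ASCII and valid UTF-8
apart from everything else, but it cannot say which legacy charset
(latin1, cp874, ...) the remaining records use. The function name
classify is hypothetical.

```python
def classify(data):
    """Sort a record's raw bytes into 'ascii', 'utf-8' or 'legacy'."""
    if data.isascii():
        return "ascii"      # identical in all charsets discussed here
    try:
        data.decode("utf-8")
        return "utf-8"      # valid UTF-8, most likely already converted
    except UnicodeDecodeError:
        return "legacy"     # some single-byte charset, needs manual review
```

Records classified as 'legacy' would be the candidates for the manual
step mentioned above.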

- michael
-- 
Use a newsreader! Check out
http://typo3.org/community/mailing-lists/use-a-news-reader/

