[TYPO3-core] RFC #7942: Enable UTF-8 by default

Jigal van Hemert jigal at xs4all.nl
Fri Nov 12 20:19:26 CET 2010


Hi Michael,

On 11-11-2010 23:01, Michael Stucki wrote:
>> Converting is rather simple when you first convert the columns to a
>> binary type which is comparable with the original (VARCHAR ->  VARBINARY,
>> TEXT ->  BLOB, etc.) and then convert them to the original type with the
>> utf8 charset defined for that column.
>
> Right, I can confirm this works (used it for a while more or less
> without problems). However, it's clearly a hack and depends on the
> Install Tool for converting fields back to the right types (VARBINARY =>
> VARCHAR, etc.)

Yes, it works. The script has rescued the content of many a site in the 
past.
I don't consider it "hacky", as it uses clearly documented features of 
MySQL. If UTF-8-encoded data is stored in -- let's say -- latin1 columns, 
we need to make MySQL believe that the stored bytes are in fact to be 
interpreted as UTF-8 instead of Latin1.
The two queries do exactly that: no hacks, no undocumented features, no 
dirty side effects, just plain valid SQL.
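As a sketch of the technique (table and column names here are purely illustrative, not taken from an actual TYPO3 schema), the two queries for a TEXT column could look like this:

```sql
-- Step 1: convert the column to the comparable binary type (TEXT -> BLOB),
-- so MySQL drops the (wrong) latin1 interpretation but keeps the raw bytes.
ALTER TABLE tt_content MODIFY bodytext BLOB;

-- Step 2: convert back to the original type, this time declaring the bytes
-- to be UTF-8; MySQL now reinterprets them instead of recoding them.
ALTER TABLE tt_content MODIFY bodytext TEXT CHARACTER SET utf8;
```

For a VARCHAR column the intermediate type would be VARBINARY of the same length, and so on for the other string types.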

The script I use simply reads the field type from the database and uses 
that information to decide which 'binary' type it must use for the 
conversion.
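Looking up those field types can be done with a query against information_schema (a sketch; the database name is a placeholder):

```sql
-- List the character columns that are still declared as latin1, together
-- with their data types, so the matching binary type (VARBINARY, BLOB, ...)
-- can be chosen for each one.
SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'typo3_db'        -- placeholder database name
  AND CHARACTER_SET_NAME = 'latin1';
```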

> I'm now in favour of creating a simple mysqldump and replacing the
> CHARACTER SET statement from table definitions.

I think this will complicate things quite a bit. It's easy to mess 
things up when creating a dump, and editing the file is far more 'hacky' 
than a few plain queries.

The only problem in either case is that it's hard to be sure whether 
records actually contain incorrectly encoded data. Can we expect only 
UTF-8 data in such columns, or do we have to take other multi-byte 
encodings into account?

-- 
Kind regards / met vriendelijke groet,

Jigal van Hemert
skype:jigal.van.hemert
msn: jigal at xs4all.nl
http://twitter.com/jigalvh
