[TYPO3-dev] problem:: report & analysis: 4.1.2 is coming with codepage inconsistency

Martin Bless m.bless at gmx.de
Sat Aug 18 17:48:09 CEST 2007


Hi Ernesto,

Ernesto Baschny wrote on Fri, 17 Aug 2007 20:47:36 +0200:

>What do you mean with "MySql-dump will contain UTF-8 errors"? 

Yes, I can see that the term "error" may be misleading and needs
clarification. I should have used another word, but I didn't know
better at the time.

>MySQL version are you using? 

TYPO3-4.1.2, MySQL-4.1.22, PHP-4.4.1, phpMyAdmin-2.6.4-pl3.

> What is the exact "error" you get and when?

The answer is twofold: it's (a) scrambled content and (b) the database
dump isn't legal UTF-8.

(b) is a MySQL issue and off topic here. It's a decision the MySQL
developers made. A database may of course contain any data. But a dump
file claiming to be UTF-8 should describe the data in legal UTF-8
form. At least that's what I think. But with MySQL this isn't reality.
This would need an escaping mechanism, of course. But, as I said, I
learned it's different with MySQL and not relevant here.
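
To make "legal UTF-8" concrete, here is a minimal Python sketch of how one could test a dump for it. The function name and the sample strings are made up for illustration; nothing here is TYPO3 or MySQL code.

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


# The same Danish text, once encoded as UTF-8 and once as latin-1.
sample_utf8 = "frit tilgængeligt".encode("utf-8")
sample_latin1 = "frit tilgængeligt".encode("latin-1")

print(first_invalid_utf8_offset(sample_utf8))    # None: the bytes are legal UTF-8
print(first_invalid_utf8_offset(sample_latin1))  # offset of the lone latin-1 byte
```

For a real dump one would read the file in binary mode and pass its bytes to the function.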

>I think it is not illegal to have latin-1 bytes in UTF-8 tables (or the
>other way around), its just a matter what you do with that data in your
>application.

Yes, it's always about semantics. Now concerning (a) and TYPO3: You
don't have to be Danish to know that something must be wrong if you
find text like "lang.dk = Dette website er dynamisk genereret af TYPO3
CMS - frit tilg�ngeligt" in table static_template. Unfortunately I
erroneously took this as an indicator telling me that something is
wrong with the setup of my installation. Being misled while searching
for errors was the real problem for me.

> So I wonder where did you hit this problem?

Glad you asked! It seems to me that nowadays installations should use
UTF-8 wherever possible. Just a "simple" UTF-8 installation of TYPO3
is all I was aiming for. Here is what I experienced.

First try: I created a new database, set a UTF-8 collation, and set
$TYPO3_CONF_VARS['BE']['forceCharset'] = 'utf-8'; Everything looked
fine until I found out that I was badly wrong. Database and tables
were UTF-8, but the data was not stored as native UTF-8: it had
undergone a second conversion to UTF-8 (it was double-encoded).
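
What such a second conversion does to the bytes can be sketched in a few lines of Python. This is only an illustration and assumes the connection treated the UTF-8 bytes as ISO-8859-1; MySQL's own "latin1" differs from that for a few byte values.

```python
original = "ÄÖÜ"

# Correct storage: each umlaut becomes two bytes.
correct = original.encode("utf-8")

# Double encoding: the UTF-8 bytes are mistaken for latin-1 characters
# and converted to UTF-8 a second time.
double = correct.decode("latin-1").encode("utf-8")

print(len(correct), len(double))  # 6 12

# Reversing the extra step recovers the intended bytes.
repaired = double.decode("utf-8").encode("latin-1")
print(repaired == correct)  # True
```

Read back through a UTF-8 connection, the double-encoded value shows up as two odd characters per umlaut, which is the kind of scrambled content described above.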

Second try: I found out that I had to set 'setDBinit'. So I started
again and used
$TYPO3_CONF_VARS['SYS']['setDBinit'] = 'SET NAMES utf8'.chr(10).'SET CHARACTER SET utf8'.chr(10).'';
and, in another try,
$TYPO3_CONF_VARS['SYS']['setDBinit'] = 'SET NAMES utf8;'.chr(10).'SET CHARACTER SET utf8;'.chr(10).'SET SESSION character_set_server=utf8;';
Again, at first everything looked fine. Data in tt_content was native
UTF-8. But, for instance, umlauts like 'ÄÖÜ' in constants of setup
fields of templates (sys_template) couldn't be saved. I kept receiving
this error:

Errors: 102: These fields are not properly updated 
     in the database: (constants) Probably value mismatch 
     with fieldtype.

At that point I was rather desperate (I hope you understand) until
Karsten Dambekalns in [TYPO3-English] advised me to ["Remove the set
character set, leave only the set names, try again. Worked for me,
multiple times."]  I followed his advice and it works! These
two lines seem to do the trick:
$TYPO3_CONF_VARS['BE']['forceCharset'] = 'utf-8';
$TYPO3_CONF_VARS['SYS']['setDBinit'] = 'SET NAMES utf8;'.chr(10) ;

It's great that it works, but I'm very unhappy that I don't really
understand why. What I currently do is more "trial and error",
praying for some kind of computer voodoo. And this is the answer to
your question: I hit the problem while trying to find out what's
happening and to gain more understanding.

Besides, it's really difficult to find relevant information on how and
why to set 'setDBinit' on the net.

The open questions for me are, given a concrete hosting situation:
- What measures can I take to find out what settings in localconf.php
I should use?
- If it works: Can I trust it will continue working?

Now back to our tables:

>The mentioned static-tables (from typo3/cms/) are almost all obsolete,
>in special those that contain 8-bit codes (translation for several
>old-school extensions).

Only two tables are non-ASCII. The one in
typo3_src-4.1.2\typo3\sysext\tsconfig_help is UTF-8 and semantically
ok. The one in typo3_src-4.1.2\typo3\sysext\cms is "mostly" latin-1
with six additional Unicode 'lost character' indicators. I had a look
at the 3.8.1 version. Same situation there.

>But you are right that in general no charset conversion is made when
>reading in ext_tables_static+adt.sql. This is not really possible,
>because we will have to know (and record) the charset for every INSERT
>statement in that file as we could have different charsets in a single
>installation (e.g. each language has its own default charset). Another
>choice would be to force this file to be UTF-8 encoded by default, but
>then we will have to know the language for each single INSERT statement
>to be able to do a conversion to the chosen charset for that specific
>installation. Both I consider non-trivial.

Hmm, "on each single INSERT statement"? I don't know the TYPO3 code
well enough to judge this. To me the options seem to be: (1) Use
ASCII *.sql files only. If possible, that's the way to go. (2) If ASCII
isn't sufficient, we need to agree on the meaning of the bytes in the
*.sql files, since it will make a difference how they are imported.
Probably UTF-8 is the right choice here. (3) *.sql files could carry
an "encoding marker", let's say '-- encoding: latin-1' for example.
This is the way the Python folks mark their files.
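
To show how option (3) might work, here is a rough Python sketch. The marker syntax, the function name and the sample data are purely hypothetical; TYPO3 has no such mechanism as far as I know.

```python
import re

# Hypothetical marker on the first line of the file: "-- encoding: <name>"
MARKER = re.compile(rb"^--\s*encoding:\s*([A-Za-z0-9_.-]+)")


def decode_sql(raw: bytes, default: str = "utf-8") -> str:
    """Decode the bytes of an *.sql file according to its encoding marker."""
    first_line = raw.split(b"\n", 1)[0]
    match = MARKER.match(first_line)
    encoding = match.group(1).decode("ascii") if match else default
    return raw.decode(encoding)


# Made-up sample: a latin-1 file that declares its own encoding.
dump = b"-- encoding: latin-1\nINSERT INTO t VALUES ('tilg\xe6ngeligt');\n"
print(decode_sql(dump))
```

An importer could then convert the decoded text to whatever charset the installation uses, one file at a time rather than per INSERT statement.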

>Best and easiest is to use UTF-8 in ext_tables_static+adt.sql and have
>people use forceCharset to UTF-8 if they want to use them.

Personally I would go for (1) and (2) and agree fully with this
statement.

>  And instead of fixing the provided sysext/cms/ext_tables_static+adt.sql file, we
>should just drop those static templates "for good". :)

I won't miss them ;-)

>Cheers,
>Ernesto

Yes, have a nice day

Martin



