[TYPO3-dev] UTF8 problem when parsing XML data...

Jigal van Hemert jigal.van.hemert at eurorscg.nl
Thu Jul 13 13:01:38 CEST 2006


> Jigal van Hemert wrote:
> > I think the problem lies in the original XML. The "special-chars" as
> > mentioned are not encoded using the encoding that was indicated in the
> > XML-header, but represented by (numerical) entities.
>
> And they are always Unicode characters (not utf-8 encoded, but plain
> Unicode value!). I checked character from original message, it is
> correct Unicode symbol.

What is a "plain Unicode value"? Unicode is a collection of characters divided into groups ("planes" IIRC). These characters can be represented in a number of ways. In the example there was an 'entity' &#233;. This is a numeric character reference: a numerical representation of a character, not an encoded character. 233 (U+00E9) is the 'e with acute accent', and it sits at the same position in Unicode and in Latin-1 (ISO-8859-1).
If the XML header says the document encoding is utf-8 then you don't have to use these entities, since utf-8 encoding can be used to encode the entire Unicode character set. 
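
To illustrate the point about entities versus directly encoded characters, here is a minimal sketch in Python (not PHP, purely for illustration). The two documents and the element name `word` are made up for the example; the sketch only shows that a conforming parser should yield the same character either way.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents, both declaring utf-8 in the XML header.
# The first uses a numeric entity, the second the utf-8 bytes C3 A9.
with_entity = b'<?xml version="1.0" encoding="utf-8"?><word>caf&#233;</word>'
with_utf8 = b'<?xml version="1.0" encoding="utf-8"?><word>caf\xc3\xa9</word>'

text_from_entity = ET.fromstring(with_entity).text
text_from_utf8 = ET.fromstring(with_utf8).text

# Both forms should parse to the same text, ending in character 233.
print(text_from_entity == text_from_utf8)
print(ord(text_from_entity[-1]))
```

So the entity itself is harmless; what matters is what happens to the character after parsing.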

Character 233 is represented in utf-8 encoding by the two bytes C3 A9, which are displayed as Ã© if the document is interpreted as encoded in ISO-8859-1.

Therefore, during the chain of processes two steps must have occurred:
- entities are converted to utf-8 encoded characters
- the document is interpreted as being ISO-8859-1 encoded
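
The two steps above can be reproduced in a few lines. This is only a sketch in Python, assuming the entity is resolved first and the result is stored as utf-8; the sample string is hypothetical and `html.unescape` merely stands in for whatever component resolves the entities in the real chain.

```python
import html

source = "caf&#233;"  # hypothetical input containing the numeric entity

# Step 1: the entity is converted to a character and encoded as utf-8.
decoded = html.unescape(source)
utf8_bytes = decoded.encode("utf-8")

# Step 2: the same bytes are interpreted as ISO-8859-1.
misread = utf8_bytes.decode("iso-8859-1")

print(utf8_bytes.hex())  # the last two bytes are c3 a9
print(misread)           # the accented e now shows as two characters
```

If only one of the two steps had happened, the result would be either the untouched entity or a correct character, not the two-byte garbage.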

It is very unlikely that the browser is responsible for both. Since the default internal encoding for PHP 4.x is ISO-8859-1, I suspect that PHP handled the document in its internal encoding after the entities had been converted to encoded characters.
An XML-transformation usually converts entities to characters (encoded in the target encoding).

If the transformed document was stored in a database and the character set of the database client and/or the server was not set correctly, it could be the case that utf-8 encoded data is returned as if it were ISO-8859-1 encoded. This would explain the two-byte representation in the resulting text.
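
One way to test this suspicion on a returned value is to reverse the misinterpretation and see whether valid utf-8 comes out. The helper below is a hypothetical diagnostic sketched in Python, not a fix; the proper fix would still be to correct the setting at whichever step is wrong.

```python
def repair_mojibake(text):
    """Undo a utf-8-read-as-ISO-8859-1 mix-up, if that is what happened.

    Returns the text unchanged when the round trip is not possible,
    which suggests the text was not garbled in this particular way.
    """
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


garbled = "caf\u00c3\u00a9"  # what the misread value looks like
print(repair_mojibake(garbled))     # round trip succeeds
print(repair_mojibake("caf\u00e9")) # already correct, left alone
```

If the round trip succeeds on data coming out of the database, that points to the client/server character set rather than to the XML itself.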

So the only thing you can do is check and double-check every step in the processing of the document.

Regards, Jigal.



