[TYPO3-core] How to treat UTF BOM?

Mon Dec 26 21:27:48 CET 2011

Hi,

On 26-12-2011 13:49, Steffen Gebert wrote:
> As such things can cause serious headache, I'm trying to find a solution
> for it (aka. skip the Byte Order Mark somewhere during processing).
> As I think it's insufficient to remove it while processing
> INCLUDE_TYPOSCRIPT, I tend to remove it directly in t3lib_div::getUrl().

getUrl() isn't the right place, simply because the BOM can be valuable
information about the contents. It shouldn't be skipped there because you
would then lose that information.
getUrl() needs to retrieve the file contents without knowing the kind of
content and without processing it.

> * Only removing UTF-8 BOM (EF BB BF) as in the proposed patch is not
> enough. There are different variants of UTF-16 (big/little endian) and
> also UTF-1, -7, -32 and other charsets. Maybe mb_string could help.
> Otherwise search for each of the known BOMs? (see wikipedia for some
> (all?) possible).

UTF-xx aren't character sets but encodings. UTF is short for UCS 
Transformation Format (UCS = Universal Character Set).
The character set is in all cases Unicode, the encoding is different for 
UTF-7, -8, -16, -32, etc.
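
To make the distinction concrete, here is a small sketch (in Python, purely
for illustration; it is not TYPO3 code and the variable names are made up).
It takes one Unicode code point and shows that only the byte sequence changes
between encodings:

```python
# One Unicode code point, several encodings of it.
char = "\u20ac"  # EURO SIGN, code point U+20AC

encodings = ["utf-8", "utf-16-be", "utf-16-le", "utf-32-be"]
encoded = {name: char.encode(name) for name in encodings}

for name, raw in encoded.items():
    # Print the encoding name followed by the bytes in hex.
    print(name, raw.hex())
```

The character is the same in every row; the big/little endian variants of
UTF-16 differ only in the order of the two bytes.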

A solution would be to convert the contents to the internal character
set of TYPO3 before using it. We could add a method to detect the BOM
and use existing functions to convert the contents; mb_convert_encoding()
and recode_string() can properly handle "endianness" (that seems to be the
term for it in the man pages). I'm not sure if iconv() can do the trick.
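
A rough sketch of the "detect the BOM, then convert" idea, again in Python
for illustration only. The names detect_bom() and to_internal() are
hypothetical, UTF-8 as the internal encoding is an assumption, and only the
UTF-8/-16/-32 BOMs are covered (not UTF-7 or UTF-1):

```python
import codecs

# Check the 4-byte UTF-32 BOMs before the 2-byte UTF-16 ones:
# the UTF-32 LE BOM (FF FE 00 00) starts with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def detect_bom(raw):
    """Return (encoding, bom_length), or (None, 0) if no known BOM is found."""
    for bom, encoding in BOMS:
        if raw.startswith(bom):
            return encoding, len(bom)
    return None, 0


def to_internal(raw, default="utf-8"):
    """Strip a leading BOM and decode the rest; fall back to `default`."""
    encoding, skip = detect_bom(raw)
    return raw[skip:].decode(encoding or default)
```

With this split the raw bytes stay untouched in the reading step, and the
BOM is only consumed at the point where the contents are actually converted.
Note that FF FE 00 00 is ambiguous in principle (it could also be UTF-16 LE
followed by a NUL character); the sketch simply prefers UTF-32.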

> * Do you think that it is correct to place it in getUrl()?
> IMHO people want the file contents and not take care of such meta
> information. I don't have a clue ATM, what the result of a UTF-16
> without BOM would be.

It's not meta information. It determines the encoding of the contents.
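
As for UTF-16 without a BOM: the reader has to guess the byte order, and a
wrong guess silently produces different characters. A tiny illustration
(Python, with made-up variable names):

```python
# "AB" encoded as UTF-16 big endian, without a BOM.
raw = b"\x00A\x00B"

as_be = raw.decode("utf-16-be")  # correct guess
as_le = raw.decode("utf-16-le")  # wrong guess: bytes of each unit swapped

print(repr(as_be))
print([hex(ord(c)) for c in as_le])
```

Both calls succeed without an error, which is exactly why throwing the BOM
away early is risky.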

-- 
Kind regards / met vriendelijke groet,

Jigal van Hemert.