[TYPO3-dev] RFC: Unicode with preg_replace

Martin Kutschker masi-no at spam-typo3.org
Tue Mar 23 16:46:05 CET 2010


Martin Kutschker wrote:
> 
> Using 'u' on utf-9 strings is a good thing.

Typo: utf-8, of course.


> PREG should ignore the flag. The result would be that . will not correctly find a single letter (as
> it will match on bytes). Character classes will also be affected, eg [aeiouäöü] will not make sense.
> Simple string matching OTOH eg /bär/ will also work without 'u'.

Without 'u' this could work:

a) replace . with [\x01-\x7F]|[\xC2-\xDF].|[\xE0-\xEF]..
b) replace character classes with choices, e.g. [aáeê] with a|á|e|ê

Notes:

a) matches only valid single-, two- and three-byte sequences
a) to be 100% correct use [\x80-\xBF] instead of . (dot)
b) could be optimized, e.g. [ae]|á|ê
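
To illustrate a) and b), here is a minimal sketch. It is written in Python with byte patterns rather than PHP, on the assumption that matching on bytes there behaves like PCRE without 'u'. The names CHAR and expand_class are made up for this example. The continuation bytes are spelled [\x80-\xBF] as suggested in the second note, and four-byte sequences are not covered.

```python
import re

# Byte-level stand-in for "." as in a): one ASCII byte, or a two-byte
# or three-byte UTF-8 sequence. Continuation bytes are written as
# [\x80-\xBF] instead of "." (see the second note on a).
CHAR = rb"(?:[\x01-\x7F]|[\xC2-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2})"


def expand_class(chars):
    """Rewrite the members of a character class as a byte-level
    alternation, as in b). Hypothetical helper for illustration."""
    alternatives = (re.escape(c.encode("utf-8")) for c in chars)
    return b"(?:" + b"|".join(alternatives) + b")"


text = "bär".encode("utf-8")  # four bytes: b, 0xC3, 0xA4, r

# A plain "." matches single bytes, so "ä" is split into two matches.
naive = re.findall(rb".", text)

# The replacement pattern yields one match per character.
chars = re.findall(CHAR, text)

# [aáeê] rewritten as a|á|e|ê
vowel = expand_class("aáeê")
hits = re.findall(vowel, "cafê á".encode("utf-8"))

print(len(naive), len(chars))
print([h.decode("utf-8") for h in hits])
```

With "bär" the plain dot finds four byte matches, while the replacement pattern finds three characters. The expanded class finds a, ê and á in "cafê á".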

This was a fun exercise, but is it worth coding an automatic conversion? No.

Masi



