[TYPO3-dev] RFC: Unicode with preg_replace
Martin Kutschker
masi-no at spam-typo3.org
Tue Mar 23 16:46:05 CET 2010
Martin Kutschker wrote:
>
> Using 'u' on utf-9 strings is a good thing.
Typo: utf-8, of course.
> PREG should ignore the flag. The result would be that . will not correctly find a single letter (as
> it will match on bytes). Character classes will also be affected, e.g. [aeiouäöü] will not make sense.
> Simple string matching OTOH, e.g. /bär/, will also work without 'u'.
Without 'u' this could work:
a) replace . with [\x01-\x7F]|[\xC2-\xDF].|[\xE0-\xEF]..
b) replace character classes with choices, e.g. [aáeê] with a|á|e|ê
Notes:
a) matches only valid single-, two- and three-byte sequences
a) to be 100% correct use [\x80-\xBF] instead of . (dot)
b) could be optimized, e.g. [ae]|á|ê
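
A rough sketch of the two substitutions and their notes. The thread is about PHP's preg functions; this uses Python's re module on byte strings as a stand-in for matching without 'u', so treat it as an illustration only. The names UTF8_CHAR and class_to_alternation are made up for this example. It uses the stricter [\x80-\xBF] from note a) and, like the pattern above, does not cover four-byte sequences.

```python
import re

# Byte-level replacement for "." that matches one UTF-8 encoded character.
# Continuation bytes are restricted to [\x80-\xBF] as suggested in note a).
# Assumption: only 1-, 2- and 3-byte sequences are handled, as in the mail.
UTF8_CHAR = rb"(?:[\x01-\x7F]|[\xC2-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2})"


def class_to_alternation(chars: str) -> bytes:
    """Rewrite the members of a character class as a byte-level alternation.

    ASCII members stay together in one class and every multi-byte member
    becomes its own alternative (the optimisation from note b).
    """
    ascii_part = "".join(c for c in chars if ord(c) < 0x80)
    multi_byte = [c for c in chars if ord(c) >= 0x80]
    alternatives = []
    if ascii_part:
        alternatives.append(b"[" + re.escape(ascii_part).encode("ascii") + b"]")
    alternatives += [re.escape(c.encode("utf-8")) for c in multi_byte]
    return b"(?:" + b"|".join(alternatives) + b")"


text = "bär".encode("utf-8")

# A plain "." on bytes splits the two-byte "ä" into two matches.
print(len(re.findall(rb".", text)))       # 4
# The alternation sees three characters.
print(len(re.findall(UTF8_CHAR, text)))   # 3

# [aáeê] rewritten as [ae]|á|ê on the byte level.
vowels = class_to_alternation("aáeê")
print(re.sub(vowels, b"_", "aáeêz".encode("utf-8")))  # b'____z'
```

The byte-mode "." count of 4 versus 3 is the effect described in the quoted text: without 'u' the dot matches bytes, not letters.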
This was a fun exercise, but is it worth coding an automatic conversion? No.
Masi