[TYPO3-dev] RFC: Unicode with preg_replace
Martin Kutschker
masi-no at spam-typo3.org
Tue Mar 23 12:35:03 CET 2010
Dmitry Dulepov wrote:
> Hi!
>
> There are several bugs for indexed search that are related to Unicode
> and preg_replace functions. All of them are about corruption of indexed
> content because preg_replace does not care about multibyte characters
> unless (1) text is in utf-8 and (2) there is a 'u' modifier.
>
> I think of adding character set conversion code and the modifier to the
> indexed search.
Isn't the data that indexed search handles already in utf-8? I recall that the extension already
does a conversion (when necessary).
Using 'u' on utf-8 strings is a good thing.
> What kind of options do we have:
> - ignore it. This is a rare and unusual case, specific to custom PCRE
> library compilation. We just add a new Unicode PCRE requirement to the
> INSTALL.txt
> - make a check that 'u' is supported. I am not sure how to do that yet and
> I think it is an unnecessary overhead for 99.99% of installations
Perhaps this could be done in the "upgrade" code or some other hook for the EM.
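A minimal sketch of such a check, assuming that a PCRE build without UTF-8 support makes preg_match() emit a compile warning and return FALSE when it sees 'u'. The function name is made up for illustration and is not part of TYPO3:

```php
<?php
// Hypothetical helper, not an existing TYPO3 API.
function pcreSupportsUtf8() {
    // Cache the answer so the probe runs at most once per request.
    static $supported = null;
    if ($supported === null) {
        // '@' hides the warning from a PCRE build that rejects 'u';
        // in that case preg_match() does not return a match count of 1.
        $supported = (@preg_match('/./u', 'a') === 1);
    }
    return $supported;
}
```

Run once from an upgrade or install hook, the cost would be a single trivial match.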
> - mask the error using @. Results are unclear (could be empty text?)
PREG should ignore the flag. The result would be that . will not correctly match a single letter (as
it will match on bytes). Character classes will also be affected, e.g. [aeiouäöü] will not make sense.
Simple string matching, OTOH, e.g. /bär/, will also work without 'u'.
These are the implications I can think of right now, but there may of course be others.
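To make the byte-vs-character difference concrete, here is a small sketch that runs the same patterns with and without 'u' on a PCRE that does support UTF-8. The variable names are only for illustration:

```php
<?php
// Subjects and patterns are spelled out as UTF-8 bytes so the result does not
// depend on the encoding of this source file.
$baer   = "b\xC3\xA4r";                       // "bär": 3 characters, 4 bytes
$vowels = "[aeiou\xC3\xA4\xC3\xB6\xC3\xBC]";  // [aeiouäöü]

// '.' matches one byte without 'u' and one character with it.
$dotBytes = preg_match_all('/./', $baer, $m);   // 4
$dotChars = preg_match_all('/./u', $baer, $m);  // 3

// Without 'u' the class is a set of bytes that includes 0xC3, the lead byte
// shared by ä, ö, ü and also ß (0xC3 0x9F), so "Maß" loses half of its ß.
$broken = preg_replace("/$vowels/", '', "Ma\xC3\x9F");   // "M\x9F", invalid UTF-8
$intact = preg_replace("/$vowels/u", '', "Ma\xC3\x9F");  // "M\xC3\x9F", i.e. "Mß"

// A literal pattern is the same byte sequence either way, so it still matches.
$literal = preg_match("/$baer/", "Ein $baer");  // 1
```

The $broken case is the kind of corruption an index would end up storing.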
> I think that ignoring the problem in the code but mentioning it in the
> INSTALL.txt is the best solution: cheap, works on most servers, "you
> have been warned", etc.
Seems OK to me. Why on earth would you want to compile PCRE without utf-8 support nowadays?
Masi