[TYPO3-dev] RFC: Unicode with preg_replace

Martin Kutschker masi-no at spam-typo3.org
Tue Mar 23 12:35:03 CET 2010


Dmitry Dulepov schrieb:
> Hi!
> 
> There are several bugs for indexed search that are related to Unicode
> and preg_replace functions. All of them are about corruption of indexed
> content because preg_replace does not care about multibyte characters
> unless (1) text is in utf-8 and (2) there is a 'u' modifier.
> 
> I think of adding character set conversion code and the modifier to the
> indexed search.

Isn't the data that indexed search processes already in utf-8? I recall that the extension already
does a conversion (when necessary).

Using 'u' on utf-8 strings is a good thing.
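
To illustrate why the conversion has to happen before 'u' is added, here is a minimal sketch (the
sample bytes and variable names are made up for this mail, they are not taken from indexed search).
With 'u', preg_replace() returns NULL for a subject that is not valid utf-8 instead of a partly
processed string:

```php
<?php
// "bär" in ISO-8859-1: the single byte 0xE4 is not valid utf-8 on its own.
$latin1 = "b\xE4r";
// The same word in utf-8: "ä" is the two bytes 0xC3 0xA4.
$utf8 = "b\xC3\xA4r";

// With 'u' and an invalid subject, preg_replace() returns NULL.
$bad = preg_replace('/\s+/u', ' ', $latin1);
// Read the error code right away, the next preg_* call resets it.
$badError = preg_last_error();

// With 'u' and a valid subject, the call behaves as expected.
$good = preg_replace('/\s+/u', ' ', $utf8);
```

So any caller that adds 'u' should probably check for NULL (or preg_last_error()) before storing the
result.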


> What kind of options do we have:
> - ignore it. This is a rare and unusual case, specific to custom PCRE
> library compilation. We just add a new Unicode PCRE requirement to the
> INSTALL.txt
> - make a check that 'u' is supported. I am not sure how to make yet and
> I think it is an unnecessary overhead for 99.99% of installations

Perhaps this could be done in the "upgrade" code or some other hook for the EM.
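
A one-time check could look roughly like this. It is only a sketch: the function name is
hypothetical and not an existing TYPO3 API, and where to call it (upgrade code, EM hook) is left
open. The test string is a single two-byte utf-8 character, so the check fails both when the flag is
rejected and when it is silently ignored:

```php
<?php
// Hypothetical helper, not part of TYPO3: TRUE if this PCRE build honours 'u'.
function pcreSupportsUtf8Modifier() {
    // "\xC3\xA4" is "ä" in utf-8: one character, two bytes.
    // - 'u' honoured: '.' matches the whole character, preg_match() returns 1.
    // - 'u' ignored:  '.' matches one byte, so ^.$ cannot match, result is 0.
    // - 'u' rejected: preg_match() returns FALSE; '@' hides any warning.
    return @preg_match('/^.$/u', "\xC3\xA4") === 1;
}
```

The result could be stored once, so the check would not add overhead to normal requests.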

> - mask the error using @. Results are unclear (could be empty text?)

PREG should ignore the flag. The result would be that . will not reliably match a single letter (as
it will match on bytes). Character classes will also be affected, e.g. [aeiouäöü] will not make
sense. Simple string matching OTOH, e.g. /bär/, will also work without 'u'.

These are the implications I can think of right now, but there may of course be others.
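
The three cases above can be tried out with a few lines. This is a sketch with made-up sample
strings; the bytes are written as escapes so that the encoding of the script file does not matter:

```php
<?php
// "bär" and "bör" as explicit utf-8 bytes.
$baer = "b\xC3\xA4r";
$boer = "b\xC3\xB6r";

// '.' matches bytes without 'u' and characters with it.
$dotBytes = preg_match_all('/./', $baer, $m1);   // 4: b, 0xC3, 0xA4, r
$dotChars = preg_match_all('/./u', $baer, $m2);  // 3: b, ä, r

// Without 'u' the class [ä] is really the two bytes [0xC3 0xA4], so it
// strips the lead byte of "ö" and leaves a broken sequence behind.
$classBytes = preg_replace("/[\xC3\xA4]/", '', $boer);   // "b\xB6r"
$classChars = preg_replace("/[\xC3\xA4]/u", '', $boer);  // "bör", untouched

// A plain literal is the same byte sequence either way.
$literal = preg_match("/b\xC3\xA4r/", "ein " . $baer);   // 1
```

The character class case is the one that would corrupt indexed content: the output is no longer
valid utf-8.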

> I think that ignoring the problem in the code but mentioning it in the
> INSTALL.txt is the best solution: cheap, works on most servers, "you
> have been warned", etc.

Seems OK to me. Why on earth would you want to compile PREG without utf-8 support nowadays?

Masi



