[TYPO3-dev] RFC: Unicode with preg_replace

Dmitry Dulepov dmitry.dulepov at gmail.com
Tue Mar 23 10:09:07 CET 2010


Hi!

There are several indexed search bugs that are related to Unicode and
the preg_replace function. All of them are about corruption of indexed
content: preg_replace works on single bytes and ignores multibyte
characters unless (1) the text is in UTF-8 and (2) the pattern has the
'u' modifier.
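
To illustrate the kind of corruption meant here, a minimal sketch (the
sample string and pattern are made up for illustration, they are not
taken from the indexed search code):

```php
<?php
// "ä" in UTF-8 is two bytes (0xC3 0xA4), followed by an ASCII "b".
$text = "\xC3\xA4b";

// Without 'u', '.' matches one byte: only 0xC3 is replaced and the
// orphaned 0xA4 stays behind, so the result is no longer valid UTF-8.
$bytewise = preg_replace('/^./', '_', $text);

// With 'u', '.' matches the whole code point.
$charwise = preg_replace('/^./u', '_', $text);

var_dump(bin2hex($bytewise)); // string(6) "5fa462"
var_dump($charwise);          // string(2) "_b"
```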

I am thinking of adding character set conversion code and the modifier
to the indexed search. However, there is a catch: in some rare cases
PCRE is compiled without Unicode support, and using the 'u' modifier
will then lead to a PHP warning at runtime.

What options do we have?
- ignore it. This is a rare and unusual case, specific to custom PCRE
library compilation. We just add a new Unicode PCRE requirement to
INSTALL.txt
- check that 'u' is supported. I am not sure how to do that yet and
I think it is unnecessary overhead for 99.99% of installations
- mask the error using @. Results are unclear (could be empty text?)
- do nothing and keep the corrupted texts
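
For the second and third options, a rough sketch of how a check could
look. Both function names are hypothetical and nothing like them exists
in TYPO3; the sketch only relies on preg_match() returning FALSE when a
pattern fails to compile and preg_replace() returning NULL on error:

```php
<?php
/**
 * Hypothetical helper: reports whether this PCRE build accepts the 'u'
 * modifier. The result is cached in a static variable, so the probe
 * runs only once per request.
 */
function pcreSupportsUtf8()
{
    static $supported = null;
    if ($supported === null) {
        // @ hides the warning raised when the pattern cannot be compiled;
        // in that case preg_match() returns FALSE instead of 1.
        $supported = (@preg_match('/^.$/u', "\xC3\xA4") === 1);
    }
    return $supported;
}

/**
 * Hypothetical wrapper: appends 'u' only when it is supported.
 * $pattern must be passed without trailing modifiers, e.g. '/\s+/'.
 */
function safePregReplace($pattern, $replacement, $subject)
{
    $modifier = pcreSupportsUtf8() ? 'u' : '';
    $result = preg_replace($pattern . $modifier, $replacement, $subject);
    // preg_replace() returns NULL on error (for example invalid UTF-8 in
    // $subject when 'u' is used), so fall back to the original text
    // instead of handing an empty value to the indexer.
    return $result === null ? $subject : $result;
}

var_dump(safePregReplace('/\s+/', ' ', "a  \t b")); // string(3) "a b"
```

With the static cache the cost of the check is one extra preg_match()
call per request, which may make the overhead concern less of an issue.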

I think that ignoring the problem in the code but mentioning it in the
INSTALL.txt is the best solution: cheap, works on most servers, "you
have been warned", etc.

I am looking for comments on this matter.

(Masi, if you are reading this, I especially want your opinion because
you are a known character set & encoding expert in TYPO3)

-- 
Dmitry Dulepov
TYPO3 expert / TYPO3 security team member Read more @
http://dmitry-dulepov.com/



