[TYPO3-dev] RFC: Unicode with preg_replace

David Bruchmann typo3-dev at bruchmann-web.de
Tue Mar 23 14:43:48 CET 2010


From:       Dmitry Dulepov <dmitry.dulepov at gmail.com>
Sent:       Tuesday, 23 March 2010 10:09:07

Hi Dmitry,

>
> There are several bugs for indexed search that are related to Unicode
> and preg_replace functions. All of them are about corruption of indexed
> content because preg_replace does not care about multibyte characters
> unless (1) text is in utf-8 and (2) there is a 'u' modifier.

The corresponding charset settings should be read from localconf.php
and the TypoScript setup.

>
> I think of adding character set conversion code and the modifier to the
> indexed search. However there is a catch. In some rare cases PCRE can be
> compiled without Unicode support. This will lead to a PHP warning at
> runtime.

Implementing your own conversion is fine, but *if* it is needed
elsewhere, even in extensions, there should be a core class that
handles these things.
For preg_replace, a method could be written that takes care of the
charset settings.
As far as I remember, some classes or functions for charset handling
already exist; perhaps they could just be extended a bit, e.g. with
PREG functions that take care of charsets.
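
A minimal sketch of such a method, only to illustrate the idea. The
function name charsetAwarePregReplace() and its parameters are
hypothetical and not an existing TYPO3 core API. It assumes the iconv
extension is available and that the caller passes a pattern which is
written in UTF-8 and already carries the 'u' modifier:

```php
<?php
/**
 * Hypothetical helper (not part of the TYPO3 core): runs preg_replace()
 * on a UTF-8 copy of $subject and converts the result back to $charset.
 *
 * @param string $pattern     PCRE pattern in UTF-8, including the 'u' modifier
 * @param string $replacement Replacement string in UTF-8
 * @param string $subject     Text encoded in $charset
 * @param string $charset     Charset of $subject, e.g. taken from configuration
 * @return string|null        Result in $charset, or null on any failure
 */
function charsetAwarePregReplace($pattern, $replacement, $subject, $charset = 'utf-8') {
    $isUtf8 = (strtolower($charset) === 'utf-8');

    // Step 1: make sure the subject is UTF-8, as required by the 'u' modifier.
    $utf8Subject = $isUtf8 ? $subject : iconv($charset, 'UTF-8', $subject);
    if ($utf8Subject === false) {
        return null;
    }

    // Step 2: run the regular expression; preg_replace() returns null on error.
    $result = preg_replace($pattern, $replacement, $utf8Subject);
    if ($result === null) {
        return null;
    }

    // Step 3: convert back to the charset the caller works with.
    if ($isUtf8) {
        return $result;
    }
    $converted = iconv('UTF-8', $charset, $result);
    return ($converted === false) ? null : $converted;
}

// Usage: collapse everything that is not a letter or digit in a Latin-1 string.
$latin1 = "Gr\xFC\xDFe, Welt!"; // "Grüße, Welt!" in ISO-8859-1
$clean = charsetAwarePregReplace('/[^\p{L}\p{N}]+/u', ' ', $latin1, 'iso-8859-1');
```

In a real implementation the charset would come from the configured
settings rather than from a parameter, and the existing charset classes
could replace the direct iconv() calls.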

For indexed_search this proposal may mean more work; on the other
hand, the extension itself stays lighter and the functions mentioned
above are usable by all extensions.

>
> What kind of options do we have:
> - ignore it. This is a rare and unusual case, specific to custom PCRE
> library compilation. We just add a new Unicode PCRE requirement to the
> INSTALL.txt
> - make a check that 'u' is supported. I am not sure how to make yet and
> I think it is an unnecessary overhead for 99.99% of installations

I don't understand this point, but Masi wrote something about UTF-8
strings. I wouldn't change the results to mark them:
1) *If* it is required (I don't think so), the tables should get an
extra column to mark the charset.
2) The configuration settings are always read anyway (I think), so the
index charset is clear without further marks. After changing the
charset, the tables can be truncated or converted.
3) With my proposal, different charsets can be used depending on the
charset configured in the TypoScript setup (normally used by different
domains in one installation). localconf.php, however, by default has
only one setting (but this could be extended). These things may be
possible with other solutions too.
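
For reference, the check that 'u' is supported (from the quoted option
above) could look roughly like this. It is only a sketch: the function
name pcreSupportsUtf8() is hypothetical, and the result is cached in a
static variable so the probe runs at most once per request:

```php
<?php
/**
 * Hypothetical helper: reports whether the PCRE library used by PHP
 * accepts the 'u' (UTF-8) modifier.
 *
 * @return bool
 */
function pcreSupportsUtf8() {
    static $supported = null;
    if ($supported === null) {
        // "\xC3\xA4" is the two-byte UTF-8 encoding of "ä".
        // With UTF-8 support, '.' matches it as one character, so the
        // result is 1. Without UTF-8 support the pattern does not compile:
        // preg_match() raises a warning (hidden by @) and returns false.
        $supported = (@preg_match('/^.$/u', "\xC3\xA4") === 1);
    }
    return $supported;
}

// Usage: only add the modifier when it is safe to do so.
$modifier = pcreSupportsUtf8() ? 'u' : '';
$text = preg_replace('/\s+/' . $modifier, ' ', "a \t b");
```

Regarding the option of masking the error with @: preg_replace()
returns null when an error occurs, so a masked failure would have to be
detected by checking the return value for null.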

> - mask the error using @. Results are unclear (could be empty text?)
> - do not do anything and keep corrupted texts
>
> I think that ignoring the problem in the code but mentioning it in the
> INSTALL.txt is the best solution: cheap, works on most servers, "you
> have been warned", etc.
>

A system that runs without faults is always the best solution ;-)
I think a general core document about server setup should be written.
Some documents exist, but they are not as easy to find as the regular
documentation, and because they are not bundled in one document they
are hard to maintain. I propose something like doc_core_serversetup.

Best Regards
David




