[TYPO3-dev] RFC: Unicode with preg_replace
David Bruchmann
typo3-dev at bruchmann-web.de
Tue Mar 23 14:43:48 CET 2010
From: Dmitry Dulepov <dmitry.dulepov at gmail.com>
Sent: Tuesday, 23 March 2010 10:09:07
Hi Dmitry,
>
> There are several bugs for indexed search that are related to Unicode
> and preg_replace functions. All of them are about corruption of indexed
> content because preg_replace does not care about multibyte characters
> unless (1) text is in utf-8 and (2) there is a 'u' modifier.
The corresponding charset settings should be read from localconf.php
and the TypoScript setup.
>
> I think of adding character set conversion code and the modifier to the
> indexed search. However there is a catch. In some rare cases PCRE can be
> compiled without Unicode support. This will lead to a PHP warning at
> runtime.
Implementing our own conversion is nice, but *if* it is required
somewhere, even in extensions, there should be a core class that
manages those things.
For preg_replace, a method could be written that takes care of the
charset settings.
As far as I remember, some classes or functions for charset handling
already exist; perhaps they could just be extended a bit, e.g. with
PREG functions that take care of charsets.
For indexed_search this proposal may take more work; on the other hand,
the extension itself stays lighter and the new functions are usable by
all extensions.
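To make the idea concrete, here is a rough sketch of what such a
charset-aware wrapper might look like. The function name and signature
are hypothetical (this is not an existing core API), and it assumes the
mbstring extension is available for the conversion; the caller is
expected to pass a pattern that already carries the 'u' modifier.

```php
<?php
/**
 * Hypothetical helper, not an existing TYPO3 core function.
 * Runs preg_replace() on UTF-8 data and converts back afterwards.
 *
 * @param string $pattern     PCRE pattern, expected to include the 'u' modifier
 * @param string $replacement Replacement string
 * @param string $subject     Text encoded in $charset
 * @param string $charset     Charset of $subject (assumed to come from configuration)
 * @return string             Result encoded in $charset
 */
function charsetAwarePregReplace($pattern, $replacement, $subject, $charset = 'utf-8')
{
    $charset = strtolower($charset);
    if ($charset !== 'utf-8') {
        // Convert to UTF-8 so that the 'u' modifier sees valid input.
        $subject = mb_convert_encoding($subject, 'UTF-8', $charset);
    }
    $result = preg_replace($pattern, $replacement, $subject);
    if ($result === null) {
        // preg_replace() returns NULL on error; keep the unmodified text.
        $result = $subject;
    }
    if ($charset !== 'utf-8') {
        // Convert back to the charset the caller works with.
        $result = mb_convert_encoding($result, $charset, 'UTF-8');
    }
    return $result;
}
```

In a real implementation the charset would presumably be taken from
localconf.php or the TypoScript setup instead of being passed in.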
>
> What kind of options do we have:
> - ignore it. This is a rare and unusual case, specific to custom PCRE
> library compilation. We just add a new Unicode PCRE requirement to the
> INSTALL.txt
> - make a check that 'u' is supported. I am not sure how to make yet and
> I think it is an unnecessary overhead for 99.99% of installations
I don't understand this point, but Masi wrote something about utf-8
strings. I wouldn't change the results to mark them:
1) *If* it is required (I don't think so), then the tables should get
an extra column to mark the charset.
2) The configuration settings are always read anyway (I think), so the
index charset is clear without further marks. After changing the
charset, the tables can be truncated or converted.
3) With my proposal, different charsets can be used depending on the
charset configured in the TypoScript setup (normally used by different
domains in one installation). localconf.php, however, holds only one
setting by default (but could be extended). These things may be
possible with other solutions too.
> - mask the error using @. Results are unclear (could be empty text?)
> - do not do anything and keep corrupted texts
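For reference, the "check that 'u' is supported" option could be
sketched roughly as below. The function name is hypothetical; the check
relies on preg_match() returning FALSE (with a warning) when the
pattern cannot be compiled, and it caches the result so the test runs
only once per request.

```php
<?php
/**
 * Hypothetical helper: tells whether this PCRE build accepts the 'u' modifier.
 *
 * @return boolean TRUE if Unicode patterns can be compiled
 */
function pcreSupportsUnicode()
{
    static $supported = null;
    if ($supported === null) {
        // Without UTF-8 support the pattern fails to compile: preg_match()
        // raises a warning (suppressed with @) and returns FALSE instead of 1.
        $supported = (@preg_match('/./u', 'a') === 1);
    }
    return $supported;
}
```

On the "mask the error using @" option: preg_replace() returns NULL
when an error occurs, so a masked call would hand NULL (an empty string
once cast) to the indexer unless the return value is checked.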
>
> I think that ignoring the problem in the code but mentioning it in the
> INSTALL.txt is the best solution: cheap, works on most servers, "you
> have been warned", etc.
>
Having a running system without faults is always the best solution ;-)
I think a general core document about server setup should be written.
Some documents exist, but they aren't as easy to find as the normal
documentation, and because they aren't bundled in one document they are
hard to maintain. I propose something like doc_core_serversetup.
Best Regards
David