[Typo3-dev] letters, word split and indexed_search, et.al.

Martin T. Kutschker Martin-no5pam-Kutschker at blackbox.n0spam.net
Mon Aug 9 15:19:39 CEST 2004


Kasper Skårhøj wrote:
> I think it is bloat but nevertheless a useful library. Certainly we need
> something like it for indexed search. I'm just VERY afraid that it will
> be ENORMOUSLY slow for indexing pages and hence it might be a fair
> requirement to say that this feature is only available for people having
> some native PHP stuff that does it on utf-8 anyway.

Any C implementation should be faster, but I'm not sure if regexps are 
the fastest way to do it (because the regexp syntax has to be parsed as 
well).

For the search you wouldn't need my proposed word splitting function. I
don't know how it's done in indexed_search right now, but a possibly
useful function would be this:

array get_word($charset,$string,$word,$start)

Returns an array containing the *next* word together with its start and
end positions within the given string.
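
As a rough illustration of what such a function could look like, here is a
minimal sketch. It assumes UTF-8 input and a PCRE build with Unicode
property support, and it uses a regexp purely for brevity, so it says
nothing about speed. The name get_next_word and its reduced parameter list
are hypothetical; the $charset and $word parameters from the signature
above are left out, and offsets are byte offsets.

```php
<?php
// Hypothetical sketch, not the indexed_search implementation.
// Assumption: $string is valid UTF-8 and $start lies on a character boundary.
function get_next_word(string $string, int $start = 0): ?array
{
    // \p{L} matches any letter, \p{N} any number; /u treats the subject as UTF-8.
    $pattern = '/[\p{L}\p{N}]+/u';
    if (preg_match($pattern, $string, $m, PREG_OFFSET_CAPTURE, $start) !== 1) {
        return null; // no further word (or invalid input)
    }
    [$word, $begin] = $m[0];
    // Return the word plus its start and end as byte offsets.
    return [$word, $begin, $begin + strlen($word)];
}

// Usage: walk through all words of a string.
$text = 'Kasper Skårhøj wrote';
$pos = 0;
$words = [];
while (($hit = get_next_word($text, $pos)) !== null) {
    $words[] = $hit[0];
    $pos = $hit[2]; // continue right after the word just found
}
```

Returning the end offset lets the caller continue from there without
re-scanning the part of the string that has already been handled.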

> I'm about to look at utf-8 support in indexed_search within not so long
> so I will come back to it but I really hope you do some good research
> till then.

I'll have a look at a possible implementation. Huge arrays may be slow,
but perhaps this can be resolved by using ranges instead (i.e. from code
point x1 to y1 it's a letter, from x2 to y2 again, etc.).
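
The range idea could be sketched roughly as follows. The table is a tiny,
deliberately incomplete sample (Basic Latin and Latin-1 letters only), and
the name is_letter_cp is hypothetical. A binary search over sorted ranges
is one possible lookup strategy, not necessarily the fastest one.

```php
<?php
// Hypothetical sketch: each entry is [first code point, last code point],
// sorted ascending and non-overlapping. Sample data only, far from complete.
$letterRanges = [
    [0x0041, 0x005A], // A-Z
    [0x0061, 0x007A], // a-z
    [0x00C0, 0x00D6], // À-Ö
    [0x00D8, 0x00F6], // Ø-ö (skips U+00D7, the multiplication sign)
    [0x00F8, 0x00FF], // ø-ÿ (skips U+00F7, the division sign)
];

// Binary search: is code point $cp inside one of the ranges?
function is_letter_cp(int $cp, array $ranges): bool
{
    $lo = 0;
    $hi = count($ranges) - 1;
    while ($lo <= $hi) {
        $mid = intdiv($lo + $hi, 2);
        if ($cp < $ranges[$mid][0]) {
            $hi = $mid - 1;      // look in the lower half
        } elseif ($cp > $ranges[$mid][1]) {
            $lo = $mid + 1;      // look in the upper half
        } else {
            return true;         // $cp lies within this range
        }
    }
    return false;
}
```

With this layout the table needs one entry per contiguous block of letters
instead of one entry per code point.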

> Thanks for working so hard on this!!

You're welcome.

Masi




