[Typo3-dev] letters, word split and indexed_search, et.al.
Martin T. Kutschker
Martin-no5pam-Kutschker at blackbox.n0spam.net
Sun Aug 8 01:45:05 CEST 2004
Hi!
AFAIK, Typo3 needs a way for indexed search and other extensions to find
out whether a character (possibly a multi-byte UTF-8 one!) is a letter.
IMHO this is needed because word splitting cannot easily be done with
simple regexps for arbitrary single-byte charsets, and it's simply not
possible for UTF-8.
Because I need it in a private project I'm going to write a function
that will detect Latin, Cyrillic, Greek, Hebrew and Arabic letters.
Unicode contains many more letters, the largest group being the East
Asian ones. I'll ignore the East Asian ones because it has been reported
that CJK writers don't use spaces, and I'll ignore other scripts because
there is no language support in Typo3 for them anyway.
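A minimal sketch of such a letter check, for illustration only. It
assumes UTF-8 input and a PHP build whose PCRE supports Unicode
properties and script names (the /u modifier with \p{...}). The function
name is_letter_char() is hypothetical and not part of Typo3.

```php
<?php
// Hypothetical helper: returns true if $char is exactly one UTF-8 encoded
// letter from the Latin, Cyrillic, Greek, Hebrew or Arabic script.
// Assumption: PCRE was compiled with Unicode property support.
function is_letter_char($char)
{
    // (?=\p{L}) requires the "letter" category; the character class
    // restricts the match to the five scripts named in the text above.
    $pattern = '/^(?=\p{L})[\p{Latin}\p{Cyrillic}\p{Greek}\p{Hebrew}\p{Arabic}]$/u';

    // preg_match() returns 1 on a match, 0 on no match, and false on
    // error (for example invalid UTF-8), so compare against 1 explicitly.
    return preg_match($pattern, $char) === 1;
}
```

Digits, whitespace, punctuation and letters from any other script
(including the East Asian ones) are rejected by this check.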
Additionally I'll write a general purpose word split routine. I'm not
sure if what I have in mind is really necessary. So any input is
appreciated. This is my idea:
function wordsplit($content,$only_words=false,$rep_pos=false)
By default the function will return an array with all the content
split at word boundaries, i.e. the array will also contain any spaces,
numbers etc. A join('',$splitted_words) will yield the original string.
If $only_words is set to true, then any characters that are not
letters will be omitted.
If $rep_pos is true then the return value will be an array of arrays.
Each "row" will contain the word (index 0) and its starting position
(index 1) in the original string.
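The behaviour described above could be sketched roughly as follows.
This is only one possible reading of the proposal, not a finished
implementation. It assumes UTF-8 input and PCRE with Unicode property
support, treats a "word" as a run of letters of any script (plus
combining marks), and reports positions as byte offsets, which is an
assumption the proposal does not settle.

```php
<?php
// Sketch of the proposed wordsplit(); details beyond the signature are assumptions.
function wordsplit($content, $only_words = false, $rep_pos = false)
{
    // A word starts with a letter (\p{L}) and may continue with letters
    // or combining marks (\p{M}). Capturing the delimiter keeps the words
    // in the result; the pieces between them are the non-word runs.
    $flags = PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY | PREG_SPLIT_OFFSET_CAPTURE;
    $pieces = preg_split('/(\p{L}[\p{L}\p{M}]*)/u', $content, -1, $flags);
    if ($pieces === false) {
        return array(); // preg_split() failed, e.g. on invalid UTF-8
    }

    $result = array();
    foreach ($pieces as $piece) {
        // With PREG_SPLIT_OFFSET_CAPTURE each piece is array(text, byte offset).
        list($text, $offset) = $piece;

        // Non-word pieces never begin with a letter, so this test tells
        // words and separators apart.
        if ($only_words && preg_match('/^\p{L}/u', $text) !== 1) {
            continue;
        }
        $result[] = $rep_pos ? array($text, $offset) : $text;
    }
    return $result;
}
```

For example, wordsplit('Hi, Мир 42!') gives the pieces 'Hi', ', ',
'Мир' and ' 42!', which join back to the original string. With
$only_words set, only 'Hi' and 'Мир' remain. A character-based position
would need an extra conversion step, since the offsets here count bytes.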
Masi