[Typo3-dev] letters, word split and indexed_search, et.al.

Sun Aug 8 01:45:05 CEST 2004

Hi!

AFAIK, Typo3 needs a way for indexed search and other extensions to find
out whether a character (possibly a multi-byte UTF-8 one!) is a letter.
IMHO this is needed because word splitting cannot easily be done with
simple regexps for arbitrary single-byte charsets, and it's simply not
possible for UTF-8.

Because I need it in a private project, I'm going to write a function
that will detect Latin, Cyrillic, Greek, Hebrew and Arabic letters.
Unicode contains many more letters, the largest group being the East
Asian ones. I'll ignore the East Asian ones because it has been reported
that CJK writers don't use spaces, and I'll ignore other scripts because
there is no language support in Typo3 for them anyway.
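
A minimal sketch of such a letter test, to make the idea concrete. The
names utf8_codepoint() and is_letter() are hypothetical, and the code
point ranges are a coarse approximation of the five scripts (accented
Greek capital alpha, for example, is left out), so treat them as an
assumption rather than a complete table:

```php
<?php
// Hypothetical helper: code point of the first character of a UTF-8 string,
// or -1 if the string is empty, truncated, or starts with an invalid byte.
// Continuation bytes are not validated; well-formed input is assumed.
function utf8_codepoint(string $s): int
{
    if ($s === '') {
        return -1;
    }
    $b0 = ord($s[0]);
    if ($b0 < 0x80) {
        return $b0; // single byte: ASCII
    }
    if (($b0 & 0xE0) === 0xC0) {
        $len = 2;
        $cp = $b0 & 0x1F;
    } elseif (($b0 & 0xF0) === 0xE0) {
        $len = 3;
        $cp = $b0 & 0x0F;
    } elseif (($b0 & 0xF8) === 0xF0) {
        $len = 4;
        $cp = $b0 & 0x07;
    } else {
        return -1; // stray continuation byte or invalid lead byte
    }
    if (strlen($s) < $len) {
        return -1;
    }
    for ($i = 1; $i < $len; $i++) {
        $cp = ($cp << 6) | (ord($s[$i]) & 0x3F);
    }
    return $cp;
}

// Hypothetical letter test for the first character of $char.
function is_letter(string $char): bool
{
    // Approximate ranges (assumption, not an exhaustive Unicode table).
    static $ranges = [
        [0x0041, 0x005A], [0x0061, 0x007A], // basic Latin A-Z, a-z
        [0x00C0, 0x00D6], [0x00D8, 0x00F6], // Latin-1 letters, skipping U+00D7
        [0x00F8, 0x024F],                   // skipping U+00F7; Latin Extended-A/B
        [0x0388, 0x03CE],                   // Greek
        [0x0400, 0x0481],                   // Cyrillic
        [0x05D0, 0x05EA],                   // Hebrew
        [0x0621, 0x064A],                   // Arabic
    ];
    $cp = utf8_codepoint($char);
    foreach ($ranges as [$lo, $hi]) {
        if ($cp >= $lo && $cp <= $hi) {
            return true;
        }
    }
    return false;
}
```

With this sketch, is_letter("\xD0\x96") (Cyrillic capital zhe, U+0416)
returns true, while is_letter("1") and is_letter("\xC3\x97") (the
multiplication sign, U+00D7) return false.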

Additionally, I'll write a general-purpose word split routine. I'm not
sure if what I have in mind is really necessary, so any input is
appreciated. This is my idea:

function wordsplit($content,$only_words=false,$rep_pos=false)

By default the function will return an array with all the content
split at word boundaries, i.e. the array will also contain any spaces,
numbers etc. A join('',$splitted_words) will yield the original string.

If $only_words is set to true, then any characters that are not
letters will be omitted.

If $rep_pos is true, then the return value will be an array of arrays.
Each "row" will contain the word (index 0) and its starting position in
the original string (index 1).
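
The three modes could look roughly like the sketch below. It is only an
illustration under assumptions: it relies on PCRE's Unicode letter
property (\p{L} with the /u modifier) as a stand-in for the hand-written
letter test, so it matches letters of every script rather than only the
five listed; the position is stored at index 1 and is a byte offset, not
a character offset; and invalid UTF-8 simply yields an empty array.

```php
<?php
// Sketch of the proposed signature; behaviour details are assumptions.
function wordsplit(string $content, bool $only_words = false, bool $rep_pos = false): array
{
    // Split into alternating runs of letters and non-letters.
    // PREG_OFFSET_CAPTURE yields [text, byte offset] pairs.
    $ok = preg_match_all('/\p{L}+|\P{L}+/u', $content, $m, PREG_OFFSET_CAPTURE);
    if ($ok === false) {
        return []; // e.g. $content is not valid UTF-8
    }
    $result = [];
    foreach ($m[0] as [$chunk, $offset]) {
        // A chunk is either all letters or all non-letters,
        // so testing its first character is enough.
        if ($only_words && !preg_match('/^\p{L}/u', $chunk)) {
            continue;
        }
        $result[] = $rep_pos ? [$chunk, $offset] : $chunk;
    }
    return $result;
}

$parts = wordsplit('Hi, world!');
// $parts is ['Hi', ', ', 'world', '!'] and implode('', $parts) === 'Hi, world!'
$words = wordsplit('Hi, world!', true, true);
// $words is [['Hi', 0], ['world', 4]]
```

Reporting byte offsets keeps the positions directly usable with substr()
on the original string; character offsets would need an extra pass.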

Masi