[Typo3-dev] is_letter() for UTF-8 (and other charsets)
Martin T. Kutschker
Martin-no5pam-Kutschker at blackbox.n0spam.net
Thu Aug 19 17:44:56 CEST 2004
Hi!
These are the ranges of scripts/charsets supported by Typo3:
u0000 - u024F
Basic Latin
Latin-1 Supplement
Latin Extended-A
Latin Extended-B
u0370 - u06FF
Greek and Coptic
Cyrillic
Cyrillic Supplement
Armenian
Hebrew
Arabic
Within these ranges most characters are letters, so an array listing
all the non-letters would be relatively small. All characters outside of
these ranges can safely be regarded as non-letters.
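The idea above could be sketched roughly like this. This is only an illustration, not the code I mentioned: the names LETTER_RANGES, build_non_letters() and is_letter_codepoint() are made up for the example, it assumes PHP 7.1 or later, and the non-letter table is filled in for Basic Latin and Latin-1 Supplement only. The other blocks contain non-letters too and would need their own entries.

```php
<?php
// Inclusive [first, last] code point ranges, taken from the list above.
const LETTER_RANGES = [[0x0000, 0x024F], [0x0370, 0x06FF]];

// Build a lookup table of non-letters inside the ranges.
// Deliberately incomplete: only U+0000..U+00FF is covered here.
function build_non_letters(): array
{
    $non = [];
    for ($cp = 0x00; $cp <= 0xBF; $cp++) {
        $asciiLetter = ($cp >= 0x41 && $cp <= 0x5A) || ($cp >= 0x61 && $cp <= 0x7A);
        // U+00AA, U+00B5 and U+00BA are the only letters in U+0080..U+00BF.
        $latin1Letter = in_array($cp, [0xAA, 0xB5, 0xBA], true);
        if (!$asciiLetter && !$latin1Letter) {
            $non[$cp] = true;
        }
    }
    $non[0xD7] = true; // multiplication sign
    $non[0xF7] = true; // division sign
    return $non;
}

// A code point is a letter if it lies in a supported range
// and is not listed in the non-letter table.
function is_letter_codepoint(int $cp, array $nonLetters): bool
{
    foreach (LETTER_RANGES as [$first, $last]) {
        if ($cp >= $first && $cp <= $last) {
            return !isset($nonLetters[$cp]);
        }
    }
    return false; // outside all supported ranges
}

$nonLetters = build_non_letters();
var_dump(is_letter_codepoint(0x00E9, $nonLetters)); // e with acute
var_dump(is_letter_codepoint(0x00D7, $nonLetters)); // multiplication sign
var_dump(is_letter_codepoint(0x4E2D, $nonLetters)); // CJK, outside the ranges
```

Keying the table by code point keeps the lookup to a single isset() call, which is why a small exception table is attractive compared with listing every letter.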
I guess it will be ok to use an is_letter() written in PHP for processing
small strings (e.g. word splitting). For larger tasks like indexing a whole
page the performance still has to be tested.
I have written some functions along the lines I posted earlier:
is_letter($charset,&$string,&$len,$bytepos=0)
$string is passed as reference for speed
in $len the byte-length of the character is returned
$bytepos is the point within $string to test
returns true if a letter is found, otherwise false
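For illustration, a function with this interface might look something like the sketch below. It is not the actual code: is_letter_sketch(), utf8_decode_at() and cp_is_letter() are hypothetical names, only UTF-8 is handled, PHP 7 or later is assumed, the letter test covers the exceptions in U+0000..U+00FF only, and the decoder does not reject overlong encodings.

```php
<?php
// Decode the UTF-8 character starting at byte $pos.
// Returns the code point, or null for a malformed sequence; $len gets the byte length.
function utf8_decode_at(string &$s, int $pos, &$len): ?int
{
    $b0 = ord($s[$pos]);
    $len = 1;
    if ($b0 < 0x80) {
        return $b0;
    }
    if (($b0 & 0xE0) === 0xC0) {
        $need = 1; $cp = $b0 & 0x1F;
    } elseif (($b0 & 0xF0) === 0xE0) {
        $need = 2; $cp = $b0 & 0x0F;
    } elseif (($b0 & 0xF8) === 0xF0) {
        $need = 3; $cp = $b0 & 0x07;
    } else {
        return null; // stray continuation byte or invalid lead byte
    }
    if ($pos + $need >= strlen($s)) {
        return null; // sequence is cut off at the end of the string
    }
    for ($i = 1; $i <= $need; $i++) {
        $b = ord($s[$pos + $i]);
        if (($b & 0xC0) !== 0x80) {
            return null; // not a continuation byte
        }
        $cp = ($cp << 6) | ($b & 0x3F);
    }
    $len = $need + 1;
    return $cp;
}

// Simplified letter test: supported ranges minus a few known non-letters.
function cp_is_letter(int $cp): bool
{
    $inRange = ($cp <= 0x024F) || ($cp >= 0x0370 && $cp <= 0x06FF);
    if (!$inRange) {
        return false;
    }
    if ($cp < 0x80) {
        return ($cp >= 0x41 && $cp <= 0x5A) || ($cp >= 0x61 && $cp <= 0x7A);
    }
    if ($cp < 0xC0) {
        return $cp === 0xAA || $cp === 0xB5 || $cp === 0xBA;
    }
    return $cp !== 0xD7 && $cp !== 0xF7;
}

function is_letter_sketch(string $charset, string &$string, &$len, int $bytepos = 0): bool
{
    if (strtolower($charset) !== 'utf-8') {
        throw new InvalidArgumentException('this sketch handles utf-8 only');
    }
    if ($bytepos < 0 || $bytepos >= strlen($string)) {
        $len = 0;
        return false;
    }
    $cp = utf8_decode_at($string, $bytepos, $len);
    return $cp !== null && cp_is_letter($cp);
}
```

A caller can walk a string by adding $len to $bytepos after every call, which is the reason the byte length is handed back through the reference parameter.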
get_word($charset,&$string,$bytepos)
$string is passed as reference for speed
$bytepos is the point within $string to start search for word
returns an array with the start and the end point of the word
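As a cross-check for the word search, here is a rough sketch that uses a different technique from the one described above: PCRE's Unicode property \p{L} instead of a hand-written letter table. The name get_word_sketch() is hypothetical, the end point is assumed to be the byte offset just past the word, and null is returned when no word is found. Note that \p{L} matches letters of all scripts, so it is broader than the ranges listed earlier, and it needs a PHP build whose PCRE has Unicode property support.

```php
<?php
// Find the next word at or after byte offset $bytepos.
// Returns [start, end] as byte offsets (end is exclusive), or null if none is found.
function get_word_sketch(string $charset, string &$string, int $bytepos): ?array
{
    if (strtolower($charset) !== 'utf-8') {
        throw new InvalidArgumentException('this sketch handles utf-8 only');
    }
    // With PREG_OFFSET_CAPTURE the reported offsets are in bytes, even with /u.
    // An offset that is out of range or not on a character boundary makes
    // preg_match() fail, which is reported as null here.
    $found = preg_match('/\p{L}+/u', $string, $m, PREG_OFFSET_CAPTURE, $bytepos);
    if ($found !== 1) {
        return null;
    }
    $start = $m[0][1];
    $end = $start + strlen($m[0][0]);
    return [$start, $end];
}

// Usage: split a string into words by repeatedly searching from the last end point.
$text = "  h\u{00E9}llo, w\u{00F6}rld";
$pos = 0;
while (($word = get_word_sketch('utf-8', $text, $pos)) !== null) {
    echo substr($text, $word[0], $word[1] - $word[0]), "\n";
    $pos = $word[1];
}
```

Comparing the output of such a regex version with a table-driven one on real page content would be one cheap way to find non-letters missing from the table.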
I think it might be useful to have them in 3.7.
BTW, is there a date planned when 3.7-beta will be released? I'd rather
have the functions in the beta so more testing in the wild can be done.
Masi