[Typo3-dev] is_letter() for UTF-8 (and other charsets)

Martin T. Kutschker Martin-no5pam-Kutschker at blackbox.n0spam.net
Thu Aug 19 17:44:56 CEST 2004


Hi!

These are the Unicode ranges covering the languages/charsets supported by Typo3:

u0000 - u024F
Basic Latin
Latin-1 Supplement
Latin Extended-A
Latin Extended-B

u0370 - u06FF
Greek and Coptic
Cyrillic
Cyrillic Supplement
Armenian
Hebrew
Arabic

Within these ranges most characters are letters, so an array containing 
all the non-letters would be relatively small. All characters outside of 
these ranges can safely be regarded as non-letters.
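
To make the idea concrete, here is a rough sketch of such a lookup on 
Unicode code points. The two ranges are the ones listed above; the 
function names and the contents of the exception table are only 
assumptions for illustration, and the table is deliberately incomplete 
(it ignores combining marks, Hebrew points, unassigned code points, etc.).

```php
<?php
// Sketch only: the ranges come from the post, the exception table is a
// deliberately incomplete sample and all names are hypothetical.

function sketch_non_letters()
{
    $table = array();
    // Controls, digits and punctuation in Basic Latin / Latin-1 Supplement.
    $spans = array(array(0x00, 0x40), array(0x5B, 0x60), array(0x7B, 0xBF));
    foreach ($spans as $span) {
        for ($cp = $span[0]; $cp <= $span[1]; $cp++) {
            $table[$cp] = true;
        }
    }
    // Three letters live inside the last span: U+00AA, U+00B5 and U+00BA.
    unset($table[0xAA], $table[0xB5], $table[0xBA]);
    // A few scattered non-letters from the other blocks:
    // multiplication sign, division sign, Greek question mark,
    // Greek ano teleia, Armenian full stop, Arabic comma.
    foreach (array(0xD7, 0xF7, 0x037E, 0x0387, 0x0589, 0x060C) as $cp) {
        $table[$cp] = true;
    }
    return $table;
}

function sketch_is_letter_cp($cp, $nonLetters)
{
    // Anything outside the two supported ranges counts as a non-letter.
    $inRange = ($cp >= 0x0000 && $cp <= 0x024F)
            || ($cp >= 0x0370 && $cp <= 0x06FF);
    // Inside the ranges, everything not listed as an exception is a letter.
    return $inRange && !isset($nonLetters[$cp]);
}
```

The table is built once and then each test is a range check plus one 
isset(), which is why the exception list should stay cheap even in PHP.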

I guess it will be OK to use a PHP is_letter() for processing small 
strings (e.g. word splitting). For larger tasks like indexing a whole page 
the performance has to be tested.

I have written some functions along the lines I posted earlier:

is_letter($charset,&$string,&$len,$bytepos=0)
  $string is passed as reference for speed
  in $len the byte-length of the character is returned
  $bytepos is the point within $string to test
  returns true if a letter is found, otherwise false

get_word($charset,&$string,$bytepos)
  $string is passed as reference for speed
  $bytepos is the point within $string to start search for word
  returns an array with the start and the end point of the word
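
For illustration, the two signatures above could be filled in roughly 
like this. This is not the actual code, only a sketch under a few 
assumptions: the sketch_ names are made up, only UTF-8 is decoded (the 
$charset argument is accepted but ignored), the letter test is a 
simplified range check, the end point is taken to be the byte offset just 
past the word, and false is returned when no word is found.

```php
<?php
// Sketch only: hypothetical names, UTF-8 only, simplified letter test.

function sketch_is_letter($charset, &$string, &$len, $bytepos = 0)
{
    $len = 1;
    if ($bytepos >= strlen($string)) {
        return false;
    }
    $byte = ord($string[$bytepos]);
    if ($byte < 0x80) {
        // Single byte: only A-Z and a-z are letters in Basic Latin.
        return ($byte >= 0x41 && $byte <= 0x5A)
            || ($byte >= 0x61 && $byte <= 0x7A);
    }
    if (($byte & 0xE0) == 0xC0 && $bytepos + 1 < strlen($string)) {
        // Two-byte sequence: covers U+0080..U+07FF, i.e. all supported ranges.
        $len = 2;
        $cp = (($byte & 0x1F) << 6) | (ord($string[$bytepos + 1]) & 0x3F);
        if ($cp == 0xD7 || $cp == 0xF7) {
            return false; // multiplication and division sign
        }
        return ($cp >= 0x00C0 && $cp <= 0x024F)
            || ($cp >= 0x0370 && $cp <= 0x06FF);
    }
    // Longer sequences lie outside the supported ranges: report their
    // length so the caller can skip them, and treat them as non-letters.
    if (($byte & 0xF0) == 0xE0) {
        $len = 3;
    } elseif (($byte & 0xF8) == 0xF0) {
        $len = 4;
    }
    return false;
}

function sketch_get_word($charset, &$string, $bytepos)
{
    $n = strlen($string);
    $len = 1;
    // Skip non-letters until the first letter of the next word.
    while ($bytepos < $n && !sketch_is_letter($charset, $string, $len, $bytepos)) {
        $bytepos += $len;
    }
    if ($bytepos >= $n) {
        return false; // assumption: no word found
    }
    $start = $bytepos;
    // Consume letters until the word ends.
    while ($bytepos < $n && sketch_is_letter($charset, $string, $len, $bytepos)) {
        $bytepos += $len;
    }
    return array($start, $bytepos); // end is exclusive (assumption)
}
```

Word splitting is then a loop that calls sketch_get_word() again with the 
returned end point as the new $bytepos and cuts each word out with 
substr($string, $start, $end - $start).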

I think it might be useful to have them in 3.7.

BTW, is there a date planned when 3.7-beta will be released? I'd rather 
have the functions in the beta so more testing in the wild can be done.

Masi




