[TYPO3-Solr] Solr question: Protected words did not work as expected

Jigal van Hemert jigal.van.hemert at typo3.org
Wed Jan 2 12:16:05 CET 2013


Hi,

On 2-1-2013 10:54, Bernhard Kraft wrote:
> In our case (website in German) we finally decided to drop every word
> processing of solr except the lowercase filter factory and use wildcard
> matching automatically. The problem is mainly that the german stemmers
> and compound filters are not very good. In fact the problem is that the
> German language has a very unique and difficult to cope with feature:
> Word compounding. You can for example make a compound word of "Riff" and
> "Hai" and get "Riffhai" which is a valid German word. In English it
> simply would be something like "reef shark".

It's not unique to the German language. Dutch also has a lot of 
compound words, and some other languages are even better at gluing words 
together.
For the German language Solr already has a list of word parts which is 
used to break up compound words. The list obviously needs 
improvement, and we also need one for Dutch (and other languages).

<filter class="solr.DictionaryCompoundWordTokenFilterFactory"
   dictionary="german/german-common-nouns.txt"
   minWordSize="5"
   minSubwordSize="4"
   maxSubwordSize="15"
   onlyLongestMatch="false"
/>
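
To make clear what these attributes do, here is a rough Python sketch of 
dictionary-based decompounding. It is a simplification for illustration, 
not the actual Lucene code; the function name and the two-word dictionary 
are made up, and it only mimics the onlyLongestMatch="false" behaviour.

```python
def decompound(token, dictionary, min_word_size=5,
               min_subword_size=4, max_subword_size=15):
    """Return the dictionary words found inside a token.

    Simplified illustration: slide over every start position and test
    every substring length between the two subword limits.
    """
    word = token.lower()
    parts = []
    # Tokens shorter than min_word_size are never split.
    if len(word) < min_word_size:
        return parts
    for start in range(len(word) - min_subword_size + 1):
        for size in range(min_subword_size, max_subword_size + 1):
            if start + size > len(word):
                break
            candidate = word[start:start + size]
            if candidate in dictionary:
                parts.append(candidate)
    return parts


# Hypothetical two-entry dictionary, just for the example.
words = {"riff", "hai"}
print(decompound("Riffhai", words))
print(decompound("Riffhai", words, min_subword_size=3))
```

In this sketch the first call only finds "riff", because "hai" is 
shorter than the minimum subword size of 4. So next to the word list 
itself, the size limits also decide which parts of a compound can be 
found.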

> As I already wrote for a project we had here we decided to drop all word
> processing and wrote a little XCLASS which transforms a query for "Hai"
> into "*Hai*" which will yield all results including the 3 letter
> combination "hai". Of course this has a lot of disadvantages as it will
> also yield senseless results where simply those 3 adjacent letters
> appear in a word.

You can use a different solution: Solr can index all letter combinations 
of a word between a minimum and a maximum length for you.

If you're already adventurous with solr configurations you can add an 
NGram Filter to the schema:

<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>

This will add the parts of each word with a length between 3 and 15 
characters to the index. Without XCLASSing you can find a Riffhai by 
searching for 'hai' or 'iff' or 'ffha' :-)
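
To illustrate what ends up in the index, here is a small Python sketch 
that produces the same kind of character n-grams. It is only an 
illustration of the idea, not Solr's implementation; the function name is 
made up, and it assumes the token has already been lowercased by an 
earlier filter.

```python
def ngrams(token, min_gram_size=3, max_gram_size=15):
    """Return all substrings of token with a length between the limits."""
    word = token.lower()  # assumes a lowercase filter runs before the n-gram filter
    grams = []
    for start in range(len(word)):
        for size in range(min_gram_size, max_gram_size + 1):
            if start + size > len(word):
                break
            grams.append(word[start:start + size])
    return grams


grams = ngrams("Riffhai")
print(grams)
print("hai" in grams, "iff" in grams, "ffha" in grams)
```

A single 7-letter word already produces 15 entries here, so the index 
will grow noticeably with this filter.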

This is of course only an option if you know what you're doing, but that 
is also the case if you start XCLASSing the solr extension.

-- 
Jigal van Hemert
TYPO3 Core Team member

TYPO3 .... inspiring people to share!
Get involved: typo3.org

