[TYPO3-Solr] Solr question: Protected words did not work as expected

Jigal van Hemert jigal.van.hemert at typo3.org
Wed Jan 2 12:16:05 CET 2013


On 2-1-2013 10:54, Bernhard Kraft wrote:
> In our case (website in German) we finally decided to drop every word
> processing of solr except the lowercase filter factory and use wildcard
> matching automatically. The problem is mainly that the German stemmers
> and compound filters are not very good. In fact the problem is that the
> German language has a very unique and difficult to cope with feature:
> Word compounding. You can for example make a compound word of "Riff" and
> "Hai" and get "Riffhai" which is a valid German word. In English it
> simply would be something like "reef shark".

It's not unique to the German language. Dutch also knows a lot of 
compound words and some other languages are even better at glueing words 
together.

For the German language solr already has a list of word parts which is 
used to break up the compound words. The list obviously needs 
improvement and we also need one for Dutch words (and other languages).

<filter class="solr.DictionaryCompoundWordTokenFilterFactory"

> As I already wrote for a project we had here we decided to drop all word
> processing and wrote a little XCLASS which transforms a query for "Hai"
> into "*Hai*" which will yield all results including the 3 letter
> combination "hai". Of course this has a lot of disadvantages as it will
> also yield senseless results where simply those 3 adjacent letters
> appear in a word.

You can use a different solution: solr can index all letter combinations 
with a minimum and maximum length for you.

If you're already adventurous with solr configurations you can add an 
NGram Filter to the schema:

<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>

This will add parts of each word with a length between 3 and 15 
characters to the index. Without XCLASSing you can find a Riffhai by 
searching for 'hai' or 'iff' or 'ffha' :-)

This is of course only an option if you know what you're doing, but that 
is also the case if you start XCLASSing the solr extension.
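
For illustration, here is a rough sketch of how such a filter could sit 
inside a field type in schema.xml. The field type name "text_ngram" is 
made up for this example, and the choice to apply the NGram filter only 
at index time (so that the query itself is not split into parts) is an 
assumption you should check against your own setup:

```xml
<!-- Hypothetical field type; the name "text_ngram" is only an example -->
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <!-- Index time: lowercase each token, then store all parts of 3 to 15 characters -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
  </analyzer>
  <!-- Query time: only lowercase, so 'hai' is matched against the stored parts as typed -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Keep in mind that this makes the index considerably larger, and that you 
have to reindex your content after changing the schema.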

Jigal van Hemert
TYPO3 Core Team member

TYPO3 .... inspiring people to share!
Get involved: typo3.org
