[TYPO3-Solr] Solr question: Protected words did not work as expected

Bernhard Kraft kraft at web-consulting.at
Wed Jan 2 10:54:42 CET 2013


Hello!

On 27.12.2012 15:50, Hauke Meyer wrote:

> A person with a special last name was not found in the first place but a
> lot of "stemmed" words before. In this special case an exact match
> without "stemming" is necessary, so I think the protected word feature
> would be a hit. The protwords.txt is in the right place (I used the
> "make an XML error" technique to prove this).

I also noticed that Solr does a lot of processing on both the indexed
content and the search query. A good way to see what is done with your
content or query is the "Analyzer" tool of the Solr web admin
interface, found at localhost:8080 on your Solr machine.

In our case (a website in German) we finally decided to drop all of
Solr's word processing except the lowercase filter factory, and to
apply wildcard matching automatically. The main problem is that the
German stemmers and compound filters are not very good. The underlying
cause is a feature of the German language that is hard to cope with:
word compounding. For example, you can combine "Riff" and "Hai" into
"Riffhai", which is a valid German word. In English it would simply be
two words, something like "reef shark".

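So the problem in German is to decompose those compounds into their
constituent nouns. This is different from stemming/lemmatisation, which
reduces a single word to its base form instead of splitting it into
several words.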
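
To make the difference concrete, here is a minimal sketch of a
dictionary-based split in Python. It is only an illustration and not
what Solr's compound filters do internally; the function name
decompound, the tiny lexicon and the min_len cut-off are all made up
for this example.

```python
def decompound(word, lexicon, min_len=3):
    """Split a compound into known lexicon words.

    Returns a list of parts that concatenate to the lowercased word,
    or None if no complete split into lexicon entries exists.
    """
    w = word.lower()
    if w in lexicon:
        return [w]
    # Try every split point that leaves at least min_len letters on each side.
    for i in range(min_len, len(w) - min_len + 1):
        head = w[:i]
        if head in lexicon:
            rest = decompound(w[i:], lexicon, min_len)
            if rest is not None:
                return [head] + rest
    return None


# Hypothetical two-word lexicon, just enough for the example above.
lexicon = {"riff", "hai"}
print(decompound("Riffhai", lexicon))  # ['riff', 'hai']
print(decompound("Haus", lexicon))     # None
```

A real lexicon would have to be much larger, and linking elements such
as the "s" in "Arbeitszimmer" are not handled here at all.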

As I already wrote, for a project we had here we decided to drop all
word processing and wrote a small XCLASS which transforms a query for
"Hai" into "*Hai*". That query matches every document containing the
three-letter sequence "hai". Of course this has serious disadvantages,
as it also returns meaningless results in which those three adjacent
letters merely happen to appear inside a word.
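
The idea, and its main drawback, can be sketched in a few lines of
Python. This is not the XCLASS code itself, only an illustration: the
function names to_wildcard_query and matches are invented, and fnmatch
merely stands in for the wildcard matching that Solr would do on
lowercased terms.

```python
import fnmatch


def to_wildcard_query(user_query):
    """Wrap every whitespace-separated term in '*' on both sides."""
    return " ".join("*" + term + "*" for term in user_query.split())


def matches(pattern, word):
    """Case-insensitive wildcard match, mimicking a lowercase filter."""
    return fnmatch.fnmatchcase(word.lower(), pattern.lower())


query = to_wildcard_query("Hai")
print(query)                       # *Hai*
print(matches(query, "Riffhai"))   # True, the hit we want
print(matches(query, "Thailand"))  # True, an unwanted hit ("t-hai-land")
print(matches(query, "Riff"))      # False
```

Note that the sketch does not escape characters that already have a
special meaning in the query syntax.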

Anyway: the Analyzer tool of the Solr web admin interface is a good
place to start tracking down your problem.


Greetings,
Bernhard

