[TYPO3-Solr] Solr question: Protected words did not work as expected
Bernhard Kraft
kraft at web-consulting.at
Wed Jan 2 10:54:42 CET 2013
Hello!
On 27.12.2012 15:50, Hauke Meyer wrote:
> A person with a special last name was not found in the first place but a
> lot of "stemmed" words before. In this special case an exact match
> without "stemming" is necessary, so I think the protected word feature
> would be a hit. The protwords.txt is in the right place (I used the
> "make an XML error" technique to prove this).
I have also noticed that Solr does a lot of processing on both the indexed
content and the search query. A good way to see what is done to your
content/query is the "Analyzer" tool of the Solr web admin interface,
reachable at localhost:8080 on your Solr machine.
In our case (a website in German) we finally decided to drop all of
Solr's word processing except the lowercase filter factory and to apply
wildcard matching automatically. The main problem is that the German
stemmers and compound filters are not very good. The underlying
difficulty is a feature of the German language that is hard to cope
with: word compounding. For example, you can combine "Riff" and "Hai"
into "Riffhai", which is a valid German word. In English it would simply
be two words, "reef shark".
So the problem in German is to decompose those words into their
component nouns. This is different from stemming/lemmatisation.
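
To illustrate the difference, here is a minimal Python sketch of
dictionary-based decompounding. It is not what Solr's compound filters
do internally; the function name decompound, the tiny LEXICON and the
min_len parameter are all made up for this example.

```python
def decompound(word, lexicon, min_len=3):
    """Split `word` into known lexicon entries, or return None if impossible.

    Hypothetical helper for illustration only; real decompounders also
    handle linking elements and ambiguous splits.
    """
    word = word.lower()
    if word in lexicon:
        return [word]
    # Try every split point that leaves at least min_len letters on each side.
    for i in range(min_len, len(word) - min_len + 1):
        head = word[:i]
        if head in lexicon:
            rest = decompound(word[i:], lexicon, min_len)
            if rest is not None:
                return [head] + rest
    return None


# Assumed toy lexicon of lowercased base nouns.
LEXICON = {"riff", "hai"}

parts = decompound("Riffhai", LEXICON)   # splits into "riff" + "hai"
plural = decompound("Haie", LEXICON)     # no split found: this needs stemming
```

The plural "Haie" is not a compound, so the splitter finds nothing;
reducing it to "Hai" is the job of a stemmer, which is a separate step.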
As I already wrote, for a project we had here we decided to drop all
word processing and wrote a little XCLASS which transforms a query for
"Hai" into "*Hai*", which yields all results containing the three-letter
sequence "hai". Of course this has a lot of disadvantages, as it also
yields senseless results where those three adjacent letters merely
happen to appear in a word.
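
The query rewrite and its side effect can be sketched as follows. This
is a Python illustration only, not the actual XCLASS (which is PHP); the
names to_wildcard_query and substring_match and the word list are
hypothetical, and the sketch assumes the input contains no characters
that are special in the Solr query syntax.

```python
def to_wildcard_query(user_input):
    """Wrap every whitespace-separated term in '*' on both sides."""
    return " ".join("*" + term + "*" for term in user_input.split())


def substring_match(term, word):
    """Approximate what '*term*' matches after lowercasing: any substring hit."""
    return term.lower() in word.lower()


query = to_wildcard_query("Hai")

# Assumed sample vocabulary to show wanted and unwanted hits.
words = ["Riffhai", "Hai", "Riff", "Thailand", "Haifa"]
matches = [w for w in words if substring_match("Hai", w)]
```

Here "Riffhai" and "Hai" are the wanted hits, while "Thailand" and
"Haifa" match only because they happen to contain the letters "hai".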
Anyway: the Analyzer tool of the Solr web admin interface is a good
place to start tracking down your problem.
greetings,
Bernhard