[TYPO3-Solr] Solr question: Protected words did not work as expected
Jigal van Hemert
jigal.van.hemert at typo3.org
Wed Jan 2 12:16:05 CET 2013
Hi,
On 2-1-2013 10:54, Bernhard Kraft wrote:
> In our case (website in German) we finally decided to drop every word
> processing of solr except the lowercase filter factory and use wildcard
> matching automatically. The problem is mainly that the German stemmers
> and compound filters are not very good. In fact the problem is that the
> German language has a very unique and difficult to cope with feature:
> Word compounding. You can for example make a compound word of "Riff" and
> "Hai" and get "Riffhai" which is a valid German word. In English it
> simply would be something like "reef shark".
It's not unique to the German language. Dutch also has a lot of
compound words, and some other languages are even better at gluing words
together.
For the German language solr already has a list of word parts which is
used to break up the compound words. The list obviously needs
improvement and we also need one for Dutch words (and other languages).
<filter class="solr.DictionaryCompoundWordTokenFilterFactory"
        dictionary="german/german-common-nouns.txt"
        minWordSize="5"
        minSubwordSize="4"
        maxSubwordSize="15"
        onlyLongestMatch="false"
/>
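To illustrate roughly what the attributes of this filter control, here is a simplified Python sketch of dictionary-based decompounding. It is not the Lucene implementation; the function name `decompound` and the two-word dictionary are made up for the example and are not taken from german-common-nouns.txt. The sketch returns only the sub-words it finds and leaves out the original token.

```python
def decompound(token, dictionary, min_word_size=5, min_subword_size=4,
               max_subword_size=15, only_longest_match=False):
    """Return the dictionary words found inside token (simplified sketch)."""
    lowered = token.lower()
    # Tokens shorter than min_word_size are not decompounded at all.
    if len(lowered) < min_word_size:
        return []
    parts = []
    # Try every start position and every allowed sub-word length.
    for start in range(len(lowered) - min_subword_size + 1):
        longest = None
        for size in range(min_subword_size, max_subword_size + 1):
            if start + size > len(lowered):
                break
            candidate = lowered[start:start + size]
            if candidate in dictionary:
                if only_longest_match:
                    # Keep only the longest hit for this start position.
                    if longest is None or len(candidate) > len(longest):
                        longest = candidate
                else:
                    parts.append(candidate)
        if longest is not None:
            parts.append(longest)
    return parts


# Hypothetical dictionary entries, chosen only for this example.
words = {"riff", "hai"}

# With the settings shown above, "hai" is shorter than minSubwordSize="4".
print(decompound("Riffhai", words))
# Lowering the minimum sub-word size lets the three-letter part through.
print(decompound("Riffhai", words, min_subword_size=3))
```

In this simplified model, a three-letter part such as "Hai" is only split off when the minimum sub-word size is 3 or lower, which may be worth checking when tuning the dictionary and the size limits.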
> As I already wrote for a project we had here we decided to drop all word
> processing and wrote a little XCLASS which transforms a query for "Hai"
> into "*Hai*" which will yield all results including the 3 letter
> combination "hai". Of course this has a lot of disadvantages as it will
> also yield senseless results where simply those 3 adjacent letters
> appear in a word.
You can use a different solution: solr can index all letter combinations
with a minimum and maximum length for you.
If you're already adventurous with solr configurations you can add an
NGram Filter to the schema:
<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
This will add parts of each word with a length between 3 and 15
characters to the index. Without XCLASSing you can find a Riffhai by
searching for 'hai' or 'iff' or 'ffha' :-)
This is of course only an option if you know what you're doing, but that
is also the case if you start XCLASSing the solr extension.
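As a rough illustration of what the NGram filter puts into the index, the following Python sketch generates all letter combinations of a word between a minimum and a maximum length. It is only a model of the idea, not Solr code; the function name `ngrams` is made up, and the sketch assumes the word has already been lowercased by an earlier filter.

```python
def ngrams(token, min_gram_size=3, max_gram_size=15):
    """Return every substring of token with a length in the given range."""
    grams = set()
    # A gram can never be longer than the token itself.
    for size in range(min_gram_size, min(max_gram_size, len(token)) + 1):
        for start in range(len(token) - size + 1):
            grams.add(token[start:start + size])
    return grams


indexed = ngrams("riffhai")

# Each of the example searches matches one of the indexed parts.
for query in ("hai", "iff", "ffha"):
    print(query, query in indexed)

# Number of parts stored for this single seven-letter word.
print(len(indexed))
```

The sketch also shows the cost: a single seven-letter word already produces 15 index entries with these settings, so the index grows noticeably when the filter is applied to long texts.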
--
Jigal van Hemert
TYPO3 Core Team member
TYPO3 .... inspiring people to share!
Get involved: typo3.org