[TYPO3-Solr] Solr finds too many subwords

Irene Eglin irene.eglin at unibas.ch
Fri Jan 18 10:50:32 CET 2013


Hi Raka

Is this still an open issue for you?
Below I will describe what happens and how to change the behaviour.

 > in general it's likeable that you can search for wordparts but as it is
 > it's just over the top.
 > for example:
 > searching for "fisch" will result in many hits with
 > "spezifisch","demografisch","biografisch",...
 > or a search for "affe" will result in "schaffen","beschaffenheit",...
 > of course there are also some useful results like "menschenaffen",
 > "tintenfisch", ...
 > but as it is at the moment, it's just useless because there are far too
 > many results that are simply wrong.

 > is it possible to disable this behaviour so that solr only finds the
 > whole word, and perhaps the plural form and some grammatical cases?
 > or is it somehow possible to improve this matching?

I only use the German core, so I don't know about other languages.

In schema.xml, the "DictionaryCompoundWordTokenFilterFactory" is
configured to run at index time.

This factory takes each term to be indexed and looks in
"german-common-nouns.txt" for all words that occur as part of that term.
Those words are indexed as well.

For example, this results in:
- poster, kloster, osteria -> oster
- fortsetzung, reports -> orts
- spezifisch -> fisch
etc.
(oster, orts and fisch are all entries in german-common-nouns.txt)
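
The lookup described above can be sketched in a few lines of Python. This
is only a simplified model of a dictionary-based compound filter, not the
actual Solr/Lucene code: the function name "subwords" is made up for
illustration, and the size limits (5, 2, 15) are assumed values, so check
your own schema.xml for the ones really in use.

```python
def subwords(term, dictionary, min_word_size=5,
             min_subword_size=2, max_subword_size=15):
    """Return every dictionary word contained in `term` (simplified model).

    The size limits are assumptions for this sketch, not values read
    from any real schema.xml.
    """
    term = term.lower()
    # Very short terms are not decomposed at all.
    if len(term) < min_word_size:
        return []
    found = []
    # Try every substring within the allowed length range.
    for start in range(len(term) - min_subword_size + 1):
        for length in range(min_subword_size, max_subword_size + 1):
            if start + length > len(term):
                break
            candidate = term[start:start + length]
            if candidate in dictionary:
                found.append(candidate)
    return found


# Three entries standing in for german-common-nouns.txt.
nouns = {"oster", "orts", "fisch"}

for word in ["poster", "kloster", "osteria",
             "fortsetzung", "reports", "spezifisch"]:
    # Each subword found here would be indexed next to the original term.
    print(word, "->", subwords(word, nouns))
```

With an empty set as the dictionary the function returns nothing for any
term, which is the effect of replacing the word list with an empty file.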

How to change this:
- I just replaced the existing german-common-nouns with an empty file
	(don't forget to reindex)
	(now the "likeable" compound words are not found any more either,
		but for us precision is more important than recall)
- You could also look for a better word list (tell me if you find one ;-)
- If you know what you are doing: deactivate the
DictionaryCompoundWordTokenFilterFactory in schema.xml, or look for
a Solr filter factory that fits your needs better
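
To make the precision/recall trade-off of the first option concrete, here
is a small toy index in Python. It is a sketch under assumptions, not how
Solr stores anything: "index_terms" and "build_index" are hypothetical
helpers, each "document" is a single word, and the word list has just one
entry.

```python
from collections import defaultdict


def index_terms(word, nouns, min_subword=2):
    """Terms stored for `word`: the word itself plus contained nouns."""
    word = word.lower()
    terms = {word}
    for i in range(len(word)):
        for j in range(i + min_subword, len(word) + 1):
            if word[i:j] in nouns:
                terms.add(word[i:j])
    return terms


def build_index(docs, nouns):
    """Map each index term to the set of documents containing it."""
    index = defaultdict(set)
    for doc in docs:
        for term in index_terms(doc, nouns):
            index[term].add(doc)
    return index


docs = ["fisch", "tintenfisch", "spezifisch", "demografisch"]

with_list = build_index(docs, {"fisch"})   # word list contains "fisch"
empty_list = build_index(docs, set())      # word list replaced by empty file

# A search for "fisch" against each index:
print(sorted(with_list["fisch"]))
print(sorted(empty_list["fisch"]))
```

With the word list in place, "fisch" matches all four documents,
including the unwanted "spezifisch" and "demografisch". With the empty
list only the exact word matches, so "tintenfisch" is lost as well.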

Hope this helps

Irene

