[TYPO3-Solr] Problem: the same page appears twice in search results

Mon Mar 4 12:41:12 CET 2013

Hi!

Irene Eglin wrote:
> In our case we have about 1000 different combinations of usergroups.
> With indexed_search+crawler we were not able to handle this because we
> did not want to have so many crawlers. We only indexed public pages -
> and told people to search only when not logged in.

You do not need that many crawlers :) The page should be indexed only for 
those user group combinations that include page groups. Content groups are 
not important because if some content group is not in the page group, the 
user cannot see the page. The visibility of the content is defined as follows:
- current user's groups are compared to the page groups without considering 
the rootline (one difference from Solr)
- if there is a match, then content generation starts
- only content that matches user's groups is shown

An example. Suppose you have:
- user #1 groups: 1,5,13,68
- user #2 groups: 2,7,13,65
- page groups: 3,7,13,65,68
- content groups: 0,2,13,65

Nobody will ever see the content with group 2 on this page because page 
permissions prevent this. However, it will be in the Solr index.

The actual page that user #1 will see will consist of elements with 
groups 0 and 13 only (intersection of all groups + 0). However, in Solr 
there will be four entries, one for each content group. This is how 
tx_solr_indexqueue_PageIndexer::index() works: it adds an entry to the 
index for each content group, even if that entry can never be shown.

User #2 will see the content from groups 0, 13 and 65. This page variant 
should be in the index as well.
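
The rule above ("intersection of all groups + 0") can be checked with a 
few lines of code. This is only a sketch of the model described in this 
mail, not code from TYPO3 or EXT:solr; the function name 
visible_content_groups is made up for the illustration.

```python
def visible_content_groups(user_groups, page_groups, content_groups):
    """Content groups a user sees on a page, per the rule in this mail.

    Hypothetical helper: it models "intersection of all groups + 0" and
    is not taken from TYPO3 or EXT:solr.
    """
    user = set(user_groups)
    page = set(page_groups)
    content = set(content_groups)

    # No match between user groups and page groups: the page is not shown.
    if not (user & page):
        return set()

    # Groups shared by the user, the page and the content elements.
    visible = user & page & content

    # Content with group 0 is shown to everybody who can see the page.
    if 0 in content:
        visible.add(0)
    return visible


PAGE_GROUPS = [3, 7, 13, 65, 68]
CONTENT_GROUPS = [0, 2, 13, 65]

user1 = visible_content_groups([1, 5, 13, 68], PAGE_GROUPS, CONTENT_GROUPS)
user2 = visible_content_groups([2, 7, 13, 65], PAGE_GROUPS, CONTENT_GROUPS)

print(sorted(user1))  # [0, 13]
print(sorted(user2))  # [0, 13, 65]
```

Group 2 does not appear in either result, which is why an index entry 
for it can never be shown.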

So for the example above, there should be only two versions of the page in 
the Solr index, with the access field set to 1,5,13,68 and 2,7,13,65. When the 
logged-in user #1 searches, EXT:solr should just make a filter that looks 
like 'access:"1,5,13,68"' and it will get results for that user (page 
variant #1).
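
Building such a filter is plain string work. A minimal sketch, assuming 
the access field stores the group combination as a comma-separated list 
in ascending order; the helper name access_filter is hypothetical and is 
not part of the EXT:solr API.

```python
def access_filter(user_groups):
    """Build a filter string like access:"1,5,13,68" for a user.

    Hypothetical helper, not an EXT:solr function. Sorting and
    de-duplicating is an assumption, made so that the same combination
    of groups always produces the same string.
    """
    value = ",".join(str(group) for group in sorted(set(user_groups)))
    return f'access:"{value}"'


print(access_filter([1, 5, 13, 68]))  # access:"1,5,13,68"
print(access_filter([2, 7, 13, 65]))  # access:"2,7,13,65"
```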

In your case I do not think it will be 1000 results. It will be 0 or 1 fewer 
than you have now, but it will remove the duplication issue. The problem with 
the old crawler is that you had to specify those groups manually. We made 
an ext years ago @ snowflake that automated it. It is quite easy and it 
worked well.

Another problem with the old crawler and indexed search is that both exts 
load too much data into memory (for example, the whole page tree for all 
groups) and fail with out of memory errors. EXT:solr does a much better job 
with memory (many thanks to Ingo for the great job!). But I think that Solr could 
take that part of the user group handling from the old code. As I wrote 
before, I believe it would remove the duplicates *and* the necessity to have 
the access filter.

But I am ok if it stays like this. We noted the issue in our knowledge base 
and we will ask clients to create content in a certain way to avoid 
duplication issues. It cannot solve issues for existing clients but it can 
for new ones :)

In any case, I am happy to work with Solr. It is a great piece of software 
from DKD!

-- 
Dmitry Dulepov
TYPO3 CMS core & security teams member

Love gorillas.