[TYPO3-Solr] Problem: the same page appears twice in search results
Dmitry Dulepov
dmitry.dulepov at gmail.com
Mon Mar 4 12:41:12 CET 2013
Hi!
Irene Eglin wrote:
> In our case we have about 1000 different combinations of usergroups.
> With indexed_search+crawler we were not able to handle this because we
> did not want to have so many crawlers. We only indexed public pages -
> and told people to search only when not logged in.
You do not need that many crawlers :) The page should be indexed only for
those user group combinations that include page groups. Content groups are
not important because if some content group is not in the page group, the
user cannot see the page. The visibility of the content is defined as follows:
- current user's groups are compared to the page groups without considering
the rootline (one difference from Solr)
- if there is a match, then content generation starts
- only content that matches user's groups is shown
An example. Suppose you have:
- user #1 groups: 1,5,13,68
- user #2 groups: 2,7,13,65
- page groups: 3,7,13,65,68
- content groups: 0,2,13,65
Nobody will ever see the content with group 2 on this page because page
permissions prevent this. However, it will be in the Solr index.
The actual page that user #1 will see will consist of elements with
groups 0 and 13 only (intersection of all groups + 0). However, in Solr
there will be four entries for all content elements. This is how
tx_solr_indexqueue_PageIndexer::index() works: it adds an entry to the
index for each content group, even if that entry can never be shown.
User #2 will see the content from groups 0, 13 and 65. This page variant
should be in the index as well.
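The rules and the example above can be sketched in a few lines of Python.
This is only an illustration of the "intersection of all groups + 0" reading
used in this mail, not TYPO3 or EXT:solr code. The function name
visible_content_groups() is made up for the example, and the sketch assumes
the page is access-restricted (it has at least one page group).

```python
def visible_content_groups(user_groups, page_groups, content_groups):
    """Return the content groups a user sees on one page.

    Hypothetical helper, following the rules described in this mail:
    1. the user's groups are compared to the page groups (no rootline),
    2. without a match the page is not rendered at all,
    3. otherwise content is shown for the groups common to the user,
       the page and the content, plus group 0 content.
    """
    user = set(user_groups)
    page = set(page_groups)
    content = set(content_groups)

    # Step 1 and 2: no common group with the page means no access.
    if not (user & page):
        return set()

    # Step 3: intersection of all groups, plus group 0 if present.
    visible = user & page & content
    if 0 in content:
        visible.add(0)
    return visible


page_groups = [3, 7, 13, 65, 68]
content_groups = [0, 2, 13, 65]

# User #1 ends up with groups 0 and 13, user #2 with 0, 13 and 65.
print(sorted(visible_content_groups([1, 5, 13, 68], page_groups, content_groups)))
print(sorted(visible_content_groups([2, 7, 13, 65], page_groups, content_groups)))
```

Under this reading, group 2 never appears in any result for this page, which
is why the index entry for it is dead weight.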
So for the example above, there should be only two versions of the page in
the Solr index, with the access field set to 1,5,13,68 and 2,7,13,65. When
the logged-in user #1 searches, EXT:solr should just make a filter that
looks like 'access:"1,5,13,68"' and it will get results for that user (page
variant #1).
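A rough Python sketch of that lookup, to make the idea concrete. It shows
the exact-match filter proposed here, not how EXT:solr builds its access
filter today. The helper names access_value() and access_filter() and the
in-memory "index" dictionary are assumptions for illustration only.

```python
def access_value(user_groups):
    """Normalize a user group combination to a string like '1,5,13,68'."""
    return ",".join(str(g) for g in sorted(set(user_groups)))


def access_filter(user_groups):
    """Build the filter string proposed in this mail (hypothetical helper)."""
    return 'access:"{}"'.format(access_value(user_groups))


# Stand-in for the index: one page variant per user group combination.
index = {
    access_value([1, 5, 13, 68]): "page variant #1",
    access_value([2, 7, 13, 65]): "page variant #2",
}

print(access_filter([1, 5, 13, 68]))
print(index[access_value([68, 13, 5, 1])])
```

Sorting the groups before joining them keeps the value stable regardless of
the order in which the groups are stored for the user.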
In your case I do not think it will be 1000 results. It will be 0 or 1 less
than you have now but it will remove the duplication issue. The problem with
the old crawler is that you had to specify those groups manually. We made
an ext years ago @ snowflake that automated it. It is quite easy and it
worked well.
Another problem with the old crawler and indexed search is that both exts
load too much data into memory (for example, the whole page tree for all
groups) and fail with out-of-memory errors. EXT:solr does a much better job
with memory (many thanks to Ingo for the great job!). But I think that Solr
could take that part of the user group handling from the old code. As I
wrote before, I believe it would remove duplicates *and* the necessity to
have the access filter.
But I am ok if it stays like this. We noted the issue in our knowledge base
and we will ask clients to create content in a certain way to avoid
duplication issues. It cannot solve issues for existing clients but it can
for new ones :)
In any case, I am happy to work with Solr. It is a great piece of software
from DKD!
--
Dmitry Dulepov
TYPO3 CMS core & security teams member
Love gorillas.