[TYPO3-english] Crawler and external documents
Claudio Strizzolo
claudio.strizzolo at ts.nogarb.ageinfn.it
Wed Jan 28 09:26:51 CET 2009
Hi all
I'm trying to set up the crawler extension in order to index all the pages
in the site and the external documents (/fileadmin/...) linked by anchors
in the pages.
I read some documentation, including
http://wiki.typo3.org/index.php/Ext_crawler
and almost everything works: the pages are correctly indexed,
and the external documents are recognized. In the Crawler Log they are
listed in separate rows under the page which points to them.
However, their status is ".." and their contents are not indexed. If I
click on the "Read" icon (it looks more like a reload icon, imho) the
content is correctly indexed and the status becomes "OK", but I could not
find a way to make the crawler do this automatically.
I have a huge number of documents linked in this way, therefore I would
like to index them without having to click on the Read icon for each of
them.
Is there a way to do this? I probably missed something obvious in the
documentation, but I'm puzzled trying to figure it out.
This is the TS config in the root page of the site:
tx_crawler.crawlerCfg.paramSets {
  whole_site =
  whole_site {
    cHash = 1
    procInstrFilter = tx_indexedsearch_reindex, tx_indexedsearch_crawler
    baseUrl = http://www.example.com/
  }
  language = &L=[|_TABLE:pages_language_overlay;_FIELD:sys_language_uid]
  language {
    procInstrFilter = tx_indexedsearch_reindex, tx_indexedsearch_crawler
    baseUrl = http://www.example.com/
  }
  tt_news = &tx_ttnews[tt_news]=[_TABLE:tt_news;_PID:280]
  tt_news {
    procInstrFilter = tx_indexedsearch_reindex, tx_cachemgm_recache
    cHash = 1
    pidsOnly = 301
    baseUrl = http://www.example.com/
  }
}
This is how I run the crawler from the command line:
typo3/cli_dispatch.phpsh crawler_im 34 -d 99 \
  -proc tx_indexedsearch_reindex,tx_indexedsearch_crawler -n 2000 -o exec
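For reference, here is the same invocation with each argument named. The variable names are only illustrative, and the meanings given in the comments are my reading of the crawler documentation, so they may not be exact. The sketch only assembles and prints the command; it does not run the crawler.

```shell
#!/bin/sh
# Illustrative variable names; the values are the ones from the command above.
START_PAGE=34      # uid of the page where crawling starts
DEPTH=99           # -d: how many levels of the page tree to descend
PROC="tx_indexedsearch_reindex,tx_indexedsearch_crawler"  # -proc: processing instructions
PER_MINUTE=2000    # -n: number of items per minute (assumed meaning)
MODE=exec          # -o: output mode; "exec" is what I use to process immediately

# Assemble the command line as a single string and print it for inspection.
CRAWLER_CMD="typo3/cli_dispatch.phpsh crawler_im $START_PAGE -d $DEPTH -proc $PROC -n $PER_MINUTE -o $MODE"
echo "$CRAWLER_CMD"
```

The printed line should be identical to the command I run by hand, so any difference in behaviour would come from the configuration rather than from the arguments.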
Thanks in advance
Claudio