[TYPO3-english] "Crawler" and custom extensions
Victor Livakovsky
v-tyok at mail.ru
Tue Oct 13 23:25:07 CEST 2009
Hi, List.
Here is my situation: I'm using a slightly modified 'sici_damdl' extension, which
produces links to file assets in this way:
http://domain.tld/index.php?id=1&tx_sicidamdl_pi1[hash]=123456789abcdef&tx_sicidamdl_pi1[name]=asset.pdf
When a user clicks such a link, they receive the desired PDF file.
Everything works fine, except that I need the ability to index the contents of these PDF
files. With the default indexed_search behavior that's not possible, because the
search can't recognize that the link above is not a page, but a PDF file.
config.index_externals = 1 doesn't help either.
That's why I decided to use the 'crawler' extension for this case. I created a 'Crawler
Configuration' record on the root page, named it "Download assets",
checked "Re-indexing [tx_indexedsearch_reindex]", added this configuration:
"&tx_sicidamdl_pi1[hash]=[_TABLE:tx_dam;_PID:1;_FIELD:file_hash]&tx_sicidamdl_pi1[name]=[_TABLE:tx_dam;_PID:1;_FIELD:file_dl_name]",
and set the base URL.
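(For reference: I believe the same configuration could also be written as Page TSconfig
instead of a 'Crawler Configuration' record; the paramSet key 'damAssets' below is just
a name I made up:

tx_crawler.crawlerCfg.paramSets.damAssets = &tx_sicidamdl_pi1[hash]=[_TABLE:tx_dam;_PID:1;_FIELD:file_hash]&tx_sicidamdl_pi1[name]=[_TABLE:tx_dam;_PID:1;_FIELD:file_dl_name]
tx_crawler.crawlerCfg.paramSets.damAssets {
  procInstrFilter = tx_indexedsearch_reindex
  baseUrl = http://domain.tld/
  # pidsOnly = 1   <- the per-page restriction I mention further below
}

I mention it only in case the answer is easier to express in the TSconfig form.)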
The first problem I ran into: the crawler creates URLs from every combination
of the parameter values, not from the parameter values that belong to the same record.
For example, I have these records in the 'tx_dam' table:
file_hash   file_dl_name
12345       first_asset.pdf
67890       second_asset.pdf
abcdef      third_asset.pdf
And the crawler generates URLs like these:
http://domain.tld/index.php?id=1&tx_sicidamdl_pi1[hash]=12345&tx_sicidamdl_pi1[name]=first_asset.pdf
http://domain.tld/index.php?id=1&tx_sicidamdl_pi1[hash]=67890&tx_sicidamdl_pi1[name]=first_asset.pdf
http://domain.tld/index.php?id=1&tx_sicidamdl_pi1[hash]=abcdef&tx_sicidamdl_pi1[name]=first_asset.pdf
http://domain.tld/index.php?id=1&tx_sicidamdl_pi1[hash]=12345&tx_sicidamdl_pi1[name]=second_asset.pdf
http://domain.tld/index.php?id=1&tx_sicidamdl_pi1[hash]=67890&tx_sicidamdl_pi1[name]=second_asset.pdf
and so on...
How can I configure the crawler to create URLs not from all the different combinations
of parameters, but only from the parameters of the same record?
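To make it concrete: for the three example records above, the only URLs I would want in
the queue are these three, one per record:
http://domain.tld/index.php?id=1&tx_sicidamdl_pi1[hash]=12345&tx_sicidamdl_pi1[name]=first_asset.pdf
http://domain.tld/index.php?id=1&tx_sicidamdl_pi1[hash]=67890&tx_sicidamdl_pi1[name]=second_asset.pdf
http://domain.tld/index.php?id=1&tx_sicidamdl_pi1[hash]=abcdef&tx_sicidamdl_pi1[name]=third_asset.pdf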
Second problem. The crawler creates these URLs for every page of the site.
Yes, I could restrict URL creation to specific pages by setting 'pidsOnly' in the
parameters, but 'sici_damdl' may be placed on any page by a web editor, so I can't
use this restriction; at the same time I don't want the crawler to produce lots of
non-existing URLs.
Is it perhaps possible to make the crawler grab the URLs from the page instead of
generating them?
Third problem. Among the many generated URLs I found a correct one and pressed the "Read"
button, but got "Error: ..." in the 'Status' column. It seems that 'indexed_search' still
can't index this type of URL, or I'm doing something wrong.
TYPO3 4.2.8.
crawler 3.0.0
Sorry for such a long post and for the basic questions, but I can't find the answers on
my own...
And thank you in advance for any help or hints!