[TYPO3-english] "Crawler" and custom extensions

Victor Livakovsky v-tyok at mail.ru
Tue Oct 13 23:25:07 CEST 2009


Hi, List.

Here is my situation: I'm using a slightly modified 'sici_damdl' extension, which 
produces links to file assets like this:
http://domain.tld/index.php?id=1&tx_sicidamdl_pi1[hash]=123456789abcdef&tx_sicidamdl_pi1[name]=asset.pdf
When a user clicks such a link, he receives the desired PDF file.

Everything works fine, except that I also need the ability to index the contents of 
these PDF files. With the default indexed_search behavior this is not possible, 
because the search can't recognize that the link above is not a page, but a PDF file.
Setting config.index_externals = 1 doesn't help either.
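
For reference, the relevant part of my TypoScript template looks roughly like this 
(simplified; the index_enable line is what I assume has to be there anyway):

config.index_enable = 1
config.index_externals = 1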

That's why I decided to use the 'crawler' extension for this. I created a 'Crawler 
Configuration' record on the root page, named it "Download assets", checked 
"Re-indexing [tx_indexedsearch_reindex]", added this configuration: 
"&tx_sicidamdl_pi1[hash]=[_TABLE:tx_dam;_PID:1;_FIELD:file_hash]&tx_sicidamdl_pi1[name]=[_TABLE:tx_dam;_PID:1;_FIELD:file_dl_name]", 
and set the base URL.
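
If I read the crawler manual correctly, the same configuration could also be written 
as Page TSconfig, roughly like this (the key 'downloadAssets' is just my own name 
for it):

tx_crawler.crawlerCfg.paramSets.downloadAssets = &tx_sicidamdl_pi1[hash]=[_TABLE:tx_dam;_PID:1;_FIELD:file_hash]&tx_sicidamdl_pi1[name]=[_TABLE:tx_dam;_PID:1;_FIELD:file_dl_name]
tx_crawler.crawlerCfg.paramSets.downloadAssets {
  procInstrFilter = tx_indexedsearch_reindex
  baseUrl = http://domain.tld/
}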


And here is the first problem I ran into: the crawler creates URLs from all 
combinations of the parameter values, not from the parameters that belong to the 
same record.
For example, I have records like these in the 'tx_dam' table:
file_hash    file_dl_name
12345        first_asset.pdf
67890        second_asset.pdf
abcdef       third_asset.pdf

And I get URLs like these:
http://domain.tld/index.php?id=1&tx_sicidamdl_pi1[hash]=12345&tx_sicidamdl_pi1[name]=first_asset.pdf
http://domain.tld/index.php?id=1&tx_sicidamdl_pi1[hash]=67890&tx_sicidamdl_pi1[name]=first_asset.pdf
http://domain.tld/index.php?id=1&tx_sicidamdl_pi1[hash]=abcdef&tx_sicidamdl_pi1[name]=first_asset.pdf
http://domain.tld/index.php?id=1&tx_sicidamdl_pi1[hash]=12345&tx_sicidamdl_pi1[name]=second_asset.pdf
http://domain.tld/index.php?id=1&tx_sicidamdl_pi1[hash]=67890&tx_sicidamdl_pi1[name]=second_asset.pdf
and so on...

How can I configure the crawler to create URLs not from all possible combinations of 
the parameters, but from the parameters of one and the same record?
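
What I would like to get instead is exactly one URL per tx_dam record, i.e.:
http://domain.tld/index.php?id=1&tx_sicidamdl_pi1[hash]=12345&tx_sicidamdl_pi1[name]=first_asset.pdf
http://domain.tld/index.php?id=1&tx_sicidamdl_pi1[hash]=67890&tx_sicidamdl_pi1[name]=second_asset.pdf
http://domain.tld/index.php?id=1&tx_sicidamdl_pi1[hash]=abcdef&tx_sicidamdl_pi1[name]=third_asset.pdf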


Second problem: the crawler creates these URLs for every page of the site.
Yes, I could restrict URL generation to specific pages by setting 'pidsOnly' in the 
configuration (roughly as in the snippet below), but 'sici_damdl' may be placed on 
any page by a web editor, so I can't use that restriction; on the other hand, I also 
don't want the crawler to produce lots of non-existing URLs.
Is it perhaps possible to make the crawler pick up the URLs that actually appear on a 
page instead of generating them?
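
Just to illustrate, the restriction I mean (but can't use) would look roughly like 
this in the TSconfig variant, if I understand the manual right:

tx_crawler.crawlerCfg.paramSets.downloadAssets {
  # limit URL generation to these page uids only; not an option here,
  # because editors may place the plugin on any page
  pidsOnly = 1
}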


Third problem: among the many generated URLs I found a correct one and pressed the 
"Read" button, but got "Error: ..." in the 'Status' column. It seems that 
indexed_search still can't index this type of URL. Or maybe I'm doing something 
wrong.

TYPO3 4.2.8.
crawler 3.0.0

Sorry for such a long post and the newbie questions, but I can't find the answers on 
my own...

And thank you in advance for any help or hints! 


