[TYPO3] Indexing external files with crawler

Diego Pino Garcia dpino at igalia.com
Mon Aug 6 17:33:19 CEST 2007


Hi Jan!


>
> I am setting up a site with typo3 on ubuntu (both latest version). Ubuntu
> is loaded as a virtual machine if that makes any difference for my
> situation.
> All is working except indexing of external files. Now I don't know if I
> just don't understand how this is supposed to work, or if I misconfigured
> something.
>
> I have followed this tutorial: http://wiki.typo3.org/index.php/Ext_crawler
>
> Now:
> - I have installed all the programs for parsing (pdfinfo, unzip, ...)
> - I have installed php5-cli
> - I have setup the cron job for
> typo3conf/ext/crawler/cli/crawler_cli.phpsh, and guessing from the log files
> it is running
> - I have created the _cli_crawler BE user
> - I have put the TSconfig from the link above into my root page
> - I have created a not-in-menu page under the root page
> - I have created an indexing configuration (type=external files) on the
> above page, that points to "files/" under fileadmin (must I type
> "fileadmin/files/" or is "files/" enough?)
>
> Now I have tried something: I have created a simple content and created a
> link to a PDF file that is somewhere in fileadmin. If I then go to
> Web->Info->Crawler and click refresh next to the page that the content is on
> (and after that click refresh on all the entries that appear below that
> page), I can find that file using search in FE (so indexing of files works).

As far as I know, the crawler does not crawl system folders in your page tree for indexing. To check that:
Go to Info->Site Crawler.
Click on the root node of your tree and select Infinite.
Press Crawl URLs.

That will build all the URLs for your site (based on your TypoScript configuration). You will then see that no URL is built for system folders, so their contents are never indexed. I do not know whether there is any option or TypoScript setting that makes the crawler process system folders.

>
>
> But I can't figure out how to configure the crawler to index files under
> "fileadmin/files/" automatically (say every day at a given hour).
> Can somebody please help me with this? I have been struggling with this
> for a couple of days now without much success.
>

If you are OK with making new pages that link to the external content stored in your fileadmin/files/, you could simply add a new cron task to crawl and index your content every day at a certain time of day.

Instead of cli/crawler_cli.phpsh, use cli_dispatch.phpsh (this is preferred since TYPO3 4.1). Calling cli_dispatch.phpsh allows you to pass configuration parameters: you can set the script to start crawling from a certain PID and set the crawl depth. Please check http://typo3.org/documentation/document-library/extension-manuals/crawler/2.0.0/view/1/3/ for more info. For instance,

/typo3_src-4.1/typo3/cli_dispatch.phpsh crawler_im 3 -d 1 -proc tx_indexedsearch_reindex

(will crawl from PID 3, one level down)

Then add the parameter -o exec to process the queue right away.
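
Before putting the command into cron, it may help to print the full call and check it by eye. Here is a minimal sketch that only assembles the command line from a start PID and a depth; it does not run anything. The function name build_crawler_cmd and the DISPATCH variable are my own, for illustration only, and the dispatcher path is the one from the example above, so adjust it to your installation.

```shell
#!/bin/sh
# Sketch: assemble the crawler_im call from the parameters discussed above.
# DISPATCH defaults to the path used in the example; it is an assumption
# about where the TYPO3 source lives on your machine.
DISPATCH="${DISPATCH:-/typo3_src-4.1/typo3/cli_dispatch.phpsh}"

# build_crawler_cmd is a hypothetical helper name.
# $1 = start page ID (PID), $2 = depth (levels below the start page)
build_crawler_cmd() {
    printf '%s crawler_im %s -d %s -proc tx_indexedsearch_reindex -o exec\n' \
        "$DISPATCH" "$1" "$2"
}

# Print the command so it can be reviewed before it goes into a cron job.
build_crawler_cmd 3 1
```

Once the printed line looks right, you can paste it into a terminal and run it by hand first.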

Set up a new task in your crontab to run this every day at midnight, for example.


0 0 * * * /typo3_src-4.1/typo3/cli_dispatch.phpsh crawler_im 3 -d 1 -proc tx_indexedsearch_reindex -o exec
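
If you would rather keep the crontab line short and get a log of each run, one option is a small wrapper script that cron calls instead of the raw command. This is only a sketch under my own assumptions: the script name crawl_nightly.sh, the function run_crawl, and the log file location are all hypothetical, and the dispatcher path is again the one from the example.

```shell
#!/bin/sh
# Hypothetical wrapper (e.g. saved as crawl_nightly.sh) for the cron job.
# Both paths below are assumptions; adjust them to your installation.
DISPATCH="${DISPATCH:-/typo3_src-4.1/typo3/cli_dispatch.phpsh}"
LOGFILE="${LOGFILE:-/tmp/crawler_cron.log}"

# run_crawl is a hypothetical helper name.
# $1 = start page ID (PID), $2 = crawl depth
run_crawl() {
    # Refuse to run, and say so in the log, if the dispatcher is missing.
    if [ ! -x "$DISPATCH" ]; then
        echo "$(date) dispatcher not found or not executable: $DISPATCH" >> "$LOGFILE"
        return 1
    fi
    echo "$(date) starting crawl from PID $1, depth $2" >> "$LOGFILE"
    # Same call as in the example; stdout and stderr both go to the log.
    "$DISPATCH" crawler_im "$1" -d "$2" -proc tx_indexedsearch_reindex -o exec >> "$LOGFILE" 2>&1
}

# Crawl only when called with two arguments: crawl_nightly.sh 3 1
if [ "$#" -eq 2 ]; then
    run_crawl "$1" "$2"
fi
```

The crontab entry would then call the wrapper with the PID and depth as arguments (3 and 1 in the example), and the log file shows whether the nightly run actually started.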

And that's all. I do not know whether this is what you were asking for, but I hope it has at least shed some light.

Best regards,

Diego

