[TYPO3-Solr] Indexing external domains with Apache Nutch

Carola Midde midde at redkiwi.nl
Thu Nov 14 09:52:14 CET 2013


Hi all,

I'm using the Apache Nutch plugin for TYPO3 (from dkd), but I have some
problems with crawling external domains.

If I use the following command:
bin/nutch crawl urls -solr http://<ip>:8080/solr/ -depth 2 -dir <directory_name>
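
For completeness, this is how I understand the seed list and URL filter
need to look for external domains (a sketch only; the file names follow
the standard Nutch 1.x layout, and example.org stands in for one of the
real domains):

# urls/seed.txt -- one start URL per line
http://www.example.org/

# conf/regex-urlfilter.txt -- one accept rule per external domain,
# placed above the final catch-all rule
+^https?://([a-z0-9-]+\.)*example\.org/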

Running this command returns a Bad Gateway error after a few minutes; the
memory usage of Tomcat and Java seems to rise to over 4.2 GB.
Does anyone have an idea how this can happen, and how to fix it?
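
For reference, this is how I understand the heap sizes could be capped if
the defaults are the problem (a sketch only; it relies on the
NUTCH_HEAPSIZE variable, in MB, that the stock bin/nutch script reads,
and on Tomcat's usual CATALINA_OPTS mechanism; the sizes are just
examples):

# cap the heap of the local Nutch crawl JVM at 2 GB
export NUTCH_HEAPSIZE=2000
bin/nutch crawl urls -solr http://<ip>:8080/solr/ -depth 2 -dir <directory_name>

# cap Tomcat's heap for the Solr webapp (set before starting Tomcat)
export CATALINA_OPTS="-Xms512m -Xmx2048m"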

There is also another strange thing. Crawling was working fine a few weeks
ago, but when I run the crawl command now it shows this:

crawl started in: <directory_name>
rootUrlDir = urls
threads = 10
depth = 2
solrUrl=http://<ip>:8080/solr/
Injector: starting at 2013-11-14 09:42:30
Injector: crawlDb: <directory_name>/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-11-14 09:42:47, elapsed: 00:00:17
Generator: starting at 2013-11-14 09:42:47
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: <directory_name>

But if I use the same command with another <directory_name>, it starts to
crawl (and then returns the Bad Gateway error again).
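
My guess is that the crawldb inside the old <directory_name> still marks
every injected URL as recently fetched, so the generator finds nothing
that is due, while a fresh directory starts from an empty crawldb; but I
am not sure. This is how I would check (a sketch; readdb -stats is the
standard crawldb inspection command):

# show crawldb statistics, including fetched vs. unfetched counts
bin/nutch readdb <directory_name>/crawldb -stats

If that is the cause, lowering db.fetch.interval.default in
conf/nutch-site.xml (it defaults to 2592000 seconds, i.e. 30 days) should
make the URLs due for refetching sooner.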

I hope someone can help me :)

Greetings,

Carola Midde

