[TYPO3-Solr] Multi-core nutch

Jigal van Hemert jigal.van.hemert at typo3.org
Thu Aug 7 16:54:53 CEST 2014


Hi,

On 4-6-2014 9:08, Jigal van Hemert wrote:
> The pre-compiled apache-nutch-for-typo3 works great! Currently I have no
> idea how to use it with multiple cores (and multiple sites) on the same
> solr/nutch server installation.
>
> Is there a way to use multiple configurations on a single Nutch
> installation (at least use different nutch-site.xml and
> regex-urlfilter.txt)?

For the archives:

- make directories configuration/site1, configuration/site2, and so on
- copy contents of conf directory into each of the directories you just 
created. Now you can adjust all configuration files for each site
- make directories urls/site1, urls/site2, and so on
- put a seed.txt in each of the directories you just created. Now you 
can set the starting URLs for each configuration
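
As a rough sketch, the directory setup above could be scripted like this. The function name setup_sites and its arguments are made up for illustration, and it assumes the stock configuration lives in conf/ under the Nutch directory. It creates an empty seed.txt only when none exists, so you still have to add the starting URLs yourself:

```shell
#!/bin/sh
# Create per-site configuration and seed directories under a Nutch install.
# Usage: setup_sites NUTCH_HOME site1 [site2 ...]
# (setup_sites is a hypothetical helper, not part of Nutch.)
setup_sites() {
    nutch_home=$1
    shift
    for site in "$@"; do
        mkdir -p "$nutch_home/configuration/$site" "$nutch_home/urls/$site" || return 1
        # Copy the stock configuration so it can be adjusted per site.
        cp -R "$nutch_home/conf/." "$nutch_home/configuration/$site/" || return 1
        # Create an empty seed file only if none exists yet.
        [ -e "$nutch_home/urls/$site/seed.txt" ] || : > "$nutch_home/urls/$site/seed.txt"
    done
}
```

For example: setup_sites /opt/solr-tomcat/apache-nutch-for-typo3 site1 site2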

Crawling:

cd /opt/solr-tomcat/apache-nutch-for-typo3/
export JAVA_HOME=<<<PATH_TO_JRE>>>
export NUTCH_CONF_DIR=/opt/solr-tomcat/apache-nutch-for-typo3/configuration/site1
bin/crawl urls/site1 crawls/site1 http://127.0.0.1:8080/solr/site1core

The NUTCH_CONF_DIR variable tells the crawl script to use the 
configuration in the specified directory. Use a different crawls 
directory for each site to keep the crawl data of the sites separate.
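
If you have several sites, the commands above can be wrapped in a small function. This is only a sketch: crawl_site and the CRAWL_CMD override (handy for a dry run) are names I made up, it assumes JAVA_HOME is already exported, and it assumes every Solr core is named "<site>core" as in the example above:

```shell
#!/bin/sh
# Crawl one site with its own configuration, seed list and crawl data.
# Usage: crawl_site NUTCH_HOME SITE SOLR_BASE_URL
# (crawl_site and CRAWL_CMD are hypothetical names, not part of Nutch.)
crawl_site() {
    nutch_home=$1
    site=$2
    solr_base=$3
    (
        # Run in a subshell so the cd and the export do not leak out.
        cd "$nutch_home" || exit 1
        # Point the crawl script at this site's configuration directory.
        NUTCH_CONF_DIR="$nutch_home/configuration/$site"
        export NUTCH_CONF_DIR
        # Assumption: the Solr core for a site is called "<site>core".
        "${CRAWL_CMD:-bin/crawl}" "urls/$site" "crawls/$site" "$solr_base/${site}core"
    )
}
```

For example: crawl_site /opt/solr-tomcat/apache-nutch-for-typo3 site1 http://127.0.0.1:8080/solr
Setting CRAWL_CMD=echo first prints the arguments instead of starting a crawl.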

A crawl session can take quite some time (hours), so plan your cron jobs 
carefully :-)
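
One way to keep a long-running crawl from overlapping with the next cron run is a simple lock around the command. The following is a sketch, not something that ships with Nutch; run_exclusive and the lock path are made-up names. Note that a lock left behind by a killed process has to be removed by hand:

```shell
#!/bin/sh
# Run a command only if no other run currently holds the lock directory.
# Usage: run_exclusive LOCKDIR command [args...]
# (run_exclusive is a hypothetical helper.)
run_exclusive() {
    lockdir=$1
    shift
    # mkdir is atomic: it fails if the directory already exists.
    if ! mkdir "$lockdir" 2>/dev/null; then
        echo "previous run still active, skipping" >&2
        return 0
    fi
    "$@"
    status=$?
    # Release the lock and pass on the exit status of the command.
    rmdir "$lockdir"
    return $status
}
```

For example, a cron script could call: run_exclusive /tmp/nutch-site1.lock bin/crawl urls/site1 crawls/site1 http://127.0.0.1:8080/solr/site1core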

-- 
Jigal van Hemert
TYPO3 CMS Active Contributor

TYPO3 .... inspiring people to share!
Get involved: typo3.org

