[TYPO3-Solr] Multi-core nutch
Jigal van Hemert
jigal.van.hemert at typo3.org
Thu Aug 7 16:54:53 CEST 2014
Hi,
On 4-6-2014 9:08, Jigal van Hemert wrote:
> The pre-compiled apache-nutch-for-typo3 works great! Currently I have no
> idea how to use it with multiple cores (and multiple sites) on the same
> solr/nutch server installation.
>
> Is there a way to use multiple configurations on a single Nutch
> installation (at least use different nutch-site.xml and
> regex-urlfilter.txt)?
For the archives:
- make directories configuration/site1, configuration/site2, and so on
- copy the contents of the conf directory into each of the directories
you just created. Now you can adjust all configuration files (such as
nutch-site.xml and regex-urlfilter.txt) for each site
- make directories urls/site1, urls/site2, and so on
- put a seed.txt in each of the directories you just created. Now you
can set the starting URLs for each configuration
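The directory setup above can be sketched as a small shell function. This is only a sketch: the function name setup_nutch_sites and the site names passed to it are made up for illustration, and it assumes the stock configuration lives in conf/ below the Nutch directory. It creates an empty seed.txt only if none exists; the starting URLs still have to be added by hand.

```shell
#!/bin/sh
# Sketch: prepare per-site configuration and seed directories.
# setup_nutch_sites is a hypothetical helper, not part of Nutch.
# Arguments: <nutch directory> <site name>...
setup_nutch_sites() {
    nutch_home=$1
    shift
    for site in "$@"; do
        mkdir -p "$nutch_home/configuration/$site" "$nutch_home/urls/$site"
        # Copy the stock configuration as the starting point for this site.
        cp -R "$nutch_home/conf/." "$nutch_home/configuration/$site/"
        # Create an empty seed file if there is none yet.
        [ -f "$nutch_home/urls/$site/seed.txt" ] || : > "$nutch_home/urls/$site/seed.txt"
    done
}
```

It could be called as, for example, setup_nutch_sites /opt/solr-tomcat/apache-nutch-for-typo3 site1 site2.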
Crawling:
cd /opt/solr-tomcat/apache-nutch-for-typo3/
export JAVA_HOME=<<<PATH_TO_JRE>>>
export NUTCH_CONF_DIR=/opt/solr-tomcat/apache-nutch-for-typo3/configuration/site1
bin/crawl urls/site1 crawls/site1 http://127.0.0.1:8080/solr/site1core
The NUTCH_CONF_DIR environment variable tells the crawl script to use
the configuration in the specified directory. Use a different directory
below crawls for each site to keep the crawl data of the sites separate.
A crawl session can take quite some time (hours), so plan your cron jobs
carefully :-)
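For a cron job, one option is to crawl the sites one after another from a single script, so the sessions do not overlap. The following is a sketch under assumptions: crawl_sites is a made-up helper name, the Solr core for each site is assumed to be named <site>core as in the example above, and JAVA_HOME is assumed to be exported already.

```shell
#!/bin/sh
# Sketch: crawl several sites sequentially, each with its own configuration.
# crawl_sites is a hypothetical helper, not part of Nutch.
# Arguments: <nutch directory> <solr base URL> <site name>...
# The body runs in a subshell, so the cd and the export do not leak out.
crawl_sites() (
    nutch_home=$1
    solr_base=$2
    shift 2
    cd "$nutch_home" || exit 1
    for site in "$@"; do
        # Point the crawl script at this site's configuration directory.
        NUTCH_CONF_DIR="$nutch_home/configuration/$site"
        export NUTCH_CONF_DIR
        # Assumption: the core is named "<site>core", as in the example.
        bin/crawl "urls/$site" "crawls/$site" "$solr_base/${site}core" \
            || echo "crawl failed for $site" >&2
    done
)
```

A call could look like crawl_sites /opt/solr-tomcat/apache-nutch-for-typo3 http://127.0.0.1:8080/solr site1 site2. A failed crawl for one site is reported on stderr and the loop continues with the next site.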
--
Jigal van Hemert
TYPO3 CMS Active Contributor
TYPO3 .... inspiring people to share!
Get involved: typo3.org