[TYPO3-Solr] Multi-core nutch

Jigal van Hemert jigal.van.hemert at typo3.org
Wed Jun 4 11:16:17 CEST 2014


Hi,

On 4-6-2014 10:53, Lienhart Woitok wrote:
> I use a script that loops over all languages I want to index, replaces some
> config files and then runs bin/crawl with the correct solr core for a language.
>
> for LANGUAGE in $LANGUAGES ; do
>          SOLR_URL="${SOLR_BASE_URL}-${LANGUAGE}"
>          SEEDDIR="urls-${LANGUAGE}"
>          CRAWL_PATH="crawl-${LANGUAGE}"
>          REGEX_URLFILTER="conf/regex-urlfilter.txt"
>          REGEX_URLFILTER_LANGUAGE="${REGEX_URLFILTER}.${LANGUAGE}"
>          cp "${REGEX_URLFILTER_LANGUAGE}" "${REGEX_URLFILTER}"
>          bin/crawl "${SEEDDIR}" "${CRAWL_PATH}" "${SOLR_URL}" "${LIMIT}"
> done

So you're actually copying the per-language configuration file(s) over 
the one that Nutch reads. This may lead to problems if another job is 
started while an indexing job (which can take some time) is still 
running, because the second job overwrites the file the first one is 
using.
Also, being able to set a configuration directory for each job would 
give a lot more freedom in choosing which files to use per job.
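
As long as the copy approach is in use, overlapping jobs could be kept 
apart with a simple lock. This is only a minimal sketch: the helper name 
run_exclusive and the lock path /tmp/nutch-crawl.lock are assumptions of 
mine, not part of Nutch or of the script quoted above. It relies on 
mkdir being atomic, so only one job can create the lock directory.

```shell
#!/bin/sh
# Hypothetical helper: run a command only if no other job holds the lock.
# LOCKDIR is an assumed path; override it in the environment if needed.
LOCKDIR="${LOCKDIR:-/tmp/nutch-crawl.lock}"

run_exclusive() {
    # mkdir fails if the directory already exists, i.e. a job is running.
    if ! mkdir "$LOCKDIR" 2>/dev/null; then
        echo "another crawl job seems to be running ($LOCKDIR exists)" >&2
        return 1
    fi
    "$@"                 # run the wrapped command with its arguments
    status=$?
    rmdir "$LOCKDIR"     # release the lock when the command has finished
    return $status
}

# Intended use inside the loop (not executed here):
#   run_exclusive bin/crawl "${SEEDDIR}" "${CRAWL_PATH}" "${SOLR_URL}" "${LIMIT}"
```

If a job is killed before it reaches rmdir, the lock directory stays 
behind and has to be removed by hand.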

I found some articles/mails mentioning the variable NUTCH_CONF_DIR, but 
this doesn't seem to work:

export NUTCH_CONF_DIR=/opt/solr-tomcat/apache-nutch-for-typo3/configurations/myconf

bin/crawl urls/myconf crawls/myconf http://solr.server.ext/solr/mycore 3

In bin/nutch it seems to be used:
# CLASSPATH initially contains $NUTCH_CONF_DIR, or defaults to $NUTCH_HOME/conf
CLASSPATH=${NUTCH_CONF_DIR:=$NUTCH_HOME/conf}
CLASSPATH=${CLASSPATH}:$JAVA_HOME/lib/tools.jar

But for some reason this doesn't result in Nutch using this directory 
to pick up the configuration.
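
The ${NUTCH_CONF_DIR:=...} expansion itself can be checked in isolation. 
The sketch below only mirrors the CLASSPATH line quoted from bin/nutch; 
the function name first_classpath_entry and the paths /opt/nutch and 
/tmp/myconf are made up for the illustration. It shows what the shell 
does with the variable, not what Nutch later does with the classpath.

```shell
#!/bin/sh
# Hypothetical helper mirroring the CLASSPATH assignment in bin/nutch.
first_classpath_entry() {
    # ":=" uses NUTCH_CONF_DIR if it is set and non-empty; otherwise it
    # assigns the default to NUTCH_CONF_DIR and uses that value.
    CLASSPATH=${NUTCH_CONF_DIR:=$NUTCH_HOME/conf}
    printf '%s\n' "$CLASSPATH"
}

NUTCH_HOME=/opt/nutch          # assumed install path, for illustration only

unset NUTCH_CONF_DIR
first_classpath_entry          # prints /opt/nutch/conf

NUTCH_CONF_DIR=/tmp/myconf     # assumed custom configuration directory
first_classpath_entry          # prints /tmp/myconf
```

So at the shell level an exported NUTCH_CONF_DIR should end up first in 
CLASSPATH; whatever goes wrong presumably happens after that point.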

The alternative would be to have a Nutch installation per site/core, 
which isn't very appealing either :-(

-- 
Jigal van Hemert
TYPO3 CMS Active Contributor

TYPO3 .... inspiring people to share!
Get involved: typo3.org

