[TYPO3-Solr] Multi-core nutch
Jigal van Hemert
jigal.van.hemert at typo3.org
Wed Jun 4 11:16:17 CEST 2014
Hi,
On 4-6-2014 10:53, Lienhart Woitok wrote:
> I use a script that loops over all languages I want to index, replaces some
> config files and then runs bin/crawl with the correct solr core for a language.
>
> for LANGUAGE in $LANGUAGES ; do
> SOLR_URL="${SOLR_BASE_URL}-${LANGUAGE}"
> SEEDDIR="urls-${LANGUAGE}"
> CRAWL_PATH="crawl-${LANGUAGE}"
> REGEX_URLFILTER="conf/regex-urlfilter.txt"
> REGEX_URLFILTER_LANGUAGE="${REGEX_URLFILTER}.${LANGUAGE}"
> cp "${REGEX_URLFILTER_LANGUAGE}" "${REGEX_URLFILTER}"
> bin/crawl "${SEEDDIR}" "${CRAWL_PATH}" "${SOLR_URL}" "${LIMIT}"
> done
So, you're actually copying the configuration file(s) to the one to use.
This may lead to problems if during an indexing job (which can take some
time) another job is started.
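One way to at least rule out overlapping runs with the copy approach is a simple lock around the whole loop. A minimal sketch, assuming a hypothetical lock path and a hypothetical wrapper name run_exclusive (neither is part of Nutch):

```shell
#!/bin/sh
# Hypothetical lock location; any path that all jobs agree on would do.
LOCK_DIR="${LOCK_DIR:-/tmp/nutch-crawl.lock}"

# Run the given command only if no other job currently holds the lock.
# mkdir is atomic: it either creates the directory or fails.
run_exclusive() {
    if ! mkdir "$LOCK_DIR" 2>/dev/null; then
        echo "another crawl job appears to be running; skipping" >&2
        return 1
    fi
    "$@"
    status=$?
    # Release the lock and pass on the exit status of the command.
    rmdir "$LOCK_DIR"
    return "$status"
}
```

The language loop would then be started as something like `run_exclusive ./crawl-languages.sh`, where the script name is made up for this example. A job that gets killed leaves the lock directory behind, so it may have to be removed by hand.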
Also, being able to set a configuration directory per job would give a lot of
freedom in choosing which files to use.
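To illustrate what such a per-job directory could look like, here is a rough sketch that builds one from a shared base directory. The function name make_job_conf and the directory layout are assumptions for this example; it only prepares the files and says nothing about whether Nutch will read them:

```shell
#!/bin/sh
# Build a separate configuration directory for one language.
# Arguments: base conf dir, language code, parent dir for the job dirs.
make_job_conf() {
    base_conf=$1
    language=$2
    out_dir="$3/conf-$language"

    mkdir -p "$out_dir" || return 1
    # Copy the shared files, then overlay the language-specific URL filter.
    cp -R "$base_conf"/. "$out_dir"/ || return 1
    cp "$base_conf/regex-urlfilter.txt.$language" \
       "$out_dir/regex-urlfilter.txt" || return 1
    # Print the resulting directory so the caller can use it.
    echo "$out_dir"
}
```

Because the base directory is never modified, two jobs for different languages would not overwrite each other's files.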
I found some articles/mails mentioning the variable NUTCH_CONF_DIR, but
this doesn't seem to work:
export NUTCH_CONF_DIR=/opt/solr-tomcat/apache-nutch-for-typo3/configurations/myconf
bin/crawl urls/myconf crawls/myconf http://solr.server.ext/solr/mycore 3
In bin/nutch it seems to be used:
# CLASSPATH initially contains $NUTCH_CONF_DIR, or defaults to $NUTCH_HOME/conf
CLASSPATH=${NUTCH_CONF_DIR:=$NUTCH_HOME/conf}
CLASSPATH=${CLASSPATH}:$JAVA_HOME/lib/tools.jar
But for some reason it doesn't result in Nutch using this directory to
pick up the configuration.
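For reference, this is what the `:=` expansion in that line does in plain shell terms. The function name classpath_head is made up for this example, and the sketch only shows the shell behaviour, not which files Nutch loads afterwards:

```shell
#!/bin/sh
# Mimic the first CLASSPATH line from bin/nutch and print the result.
# ${VAR:=default} expands to VAR if it is set and non-empty; otherwise
# it assigns the default to VAR and expands to that.
classpath_head() {
    CLASSPATH=${NUTCH_CONF_DIR:=$NUTCH_HOME/conf}
    echo "$CLASSPATH"
}
```

Calling it with NUTCH_CONF_DIR unset prints $NUTCH_HOME/conf, and calling it after an export prints the exported directory, so as far as the shell is concerned the exported value should end up first on the classpath.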
An alternative would be to have a Nutch installation per site/core, which
isn't very appealing either :-(
--
Jigal van Hemert
TYPO3 CMS Active Contributor
TYPO3 .... inspiring people to share!
Get involved: typo3.org