[TYPO3-Solr] Multi-core nutch

Lienhart Woitok Lienhart.Woitok at netlogix.de
Wed Jun 4 11:42:55 CEST 2014


Hi Jigal,

True, it would lead to problems if several crawl jobs ran in parallel.
But I do not want that anyway, and I have a locking mechanism in place to
ensure that only one Nutch instance is running at any given time.
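
Such a lock could be sketched roughly as follows. This is only an illustration, not the actual script: it assumes the util-linux `flock` tool is installed, and the function name `run_exclusive` and the lock file path in the usage comment are hypothetical.

```shell
#!/bin/sh
# Run a command only if nobody else holds the lock file.
# Assumes util-linux `flock`; run_exclusive is a hypothetical helper name.
run_exclusive() {
    lockfile=$1
    shift
    (
        # Try to take an exclusive lock on fd 9 without waiting.
        # 75 (EX_TEMPFAIL) is an arbitrary "try again later" exit code.
        flock -n 9 || { echo "lock held: $lockfile" >&2; exit 75; }
        "$@"
    ) 9>"$lockfile"
}

# Hypothetical usage with the variables from the crawl loop:
# run_exclusive /tmp/nutch-crawl.lock bin/crawl "$SEEDDIR" "$CRAWL_PATH" "$SOLR_URL" "$LIMIT"
```

The lock is released automatically when the subshell exits, so a crashed crawl should not leave a stale lock behind.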

I chose to copy only specific configuration files instead of the whole
conf directory because that way I can change all other files in one central
place. In my case I do not need to vary nutch-site.xml, so I wanted to keep
just one copy of it.

Regards,


Lienhart Woitok
Web Developer

Phone: +49 (911) 539909 - 0
E-Mail: Lienhart.Woitok at netlogix.de
Website: media.netlogix.de









-----Original Message-----
From: typo3-project-solr-bounces at lists.typo3.org [mailto:typo3-project-solr-bounces at lists.typo3.org] On Behalf Of Jigal van Hemert
Sent: Wednesday, 4 June 2014 11:16
To: typo3-project-solr at lists.typo3.org
Subject: Re: [TYPO3-Solr] Multi-core nutch

Hi,

On 4-6-2014 10:53, Lienhart Woitok wrote:
> I use a script that loops over all languages I want to index, replaces some
> config files and then runs bin/crawl with the correct Solr core for each language.
>
> for LANGUAGE in $LANGUAGES ; do
>          SOLR_URL="${SOLR_BASE_URL}-${LANGUAGE}"
>          SEEDDIR="urls-${LANGUAGE}"
>          CRAWL_PATH="crawl-${LANGUAGE}"
>          REGEX_URLFILTER="conf/regex-urlfilter.txt"
>          REGEX_URLFILTER_LANGUAGE="${REGEX_URLFILTER}.${LANGUAGE}"
>          cp "${REGEX_URLFILTER_LANGUAGE}" "${REGEX_URLFILTER}"
>          bin/crawl "${SEEDDIR}" "${CRAWL_PATH}" "${SOLR_URL}" "${LIMIT}"
> done

So, you're actually copying the configuration file(s) over the one that
gets used. This may lead to problems if another job is started during an
indexing job (which can take some time).
Also, setting a configuration directory for each run would give a lot of
freedom in choosing which files to use for each job.

I found some articles/mails mentioning the variable NUTCH_CONF_DIR, but
this doesn't seem to work:

export NUTCH_CONF_DIR=/opt/solr-tomcat/apache-nutch-for-typo3/configurations/myconf

bin/crawl urls/myconf crawls/myconf http://solr.server.ext/solr/mycore 3

In bin/nutch it seems to be used:
# CLASSPATH initially contains $NUTCH_CONF_DIR, or defaults to $NUTCH_HOME/conf
CLASSPATH=${NUTCH_CONF_DIR:=$NUTCH_HOME/conf}
CLASSPATH=${CLASSPATH}:$JAVA_HOME/lib/tools.jar

But for some reason this doesn't result in Nutch using this directory to
pick up the configuration.
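
For reference, the `${VAR:=default}` expansion in that excerpt only assigns the default when the variable is unset or empty, so an exported NUTCH_CONF_DIR should survive that line. A minimal sketch of just this shell behaviour, assuming nothing about the rest of bin/nutch; the helper name `resolve_conf_dir` and the paths are hypothetical:

```shell
#!/bin/sh
# Mimic the CLASSPATH line from bin/nutch in isolation.
# resolve_conf_dir is a hypothetical helper; $1 stands in for NUTCH_HOME.
resolve_conf_dir() {
    NUTCH_HOME=$1
    # ':=' keeps an existing non-empty NUTCH_CONF_DIR,
    # otherwise assigns $NUTCH_HOME/conf to it.
    CLASSPATH=${NUTCH_CONF_DIR:=$NUTCH_HOME/conf}
    printf '%s\n' "$CLASSPATH"
}

unset NUTCH_CONF_DIR
resolve_conf_dir /opt/nutch        # prints /opt/nutch/conf

NUTCH_CONF_DIR=/opt/nutch/configurations/myconf
resolve_conf_dir /opt/nutch        # prints /opt/nutch/configurations/myconf
```

This only shows that the expansion itself honours an exported value; it does not explain why the configuration is still not picked up later on.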

An alternative would be to have a Nutch installation per site/core, which
isn't very appealing either :-(

--
Jigal van Hemert
TYPO3 CMS Active Contributor

TYPO3 .... inspiring people to share!
Get involved: typo3.org

