[TYPO3-Solr] Nutch and sitehash on a site with SSL
Jigal van Hemert
jigal.van.hemert at typo3.org
Wed Jan 7 11:38:48 CET 2015
Hi Olivier,
On 06/01/2015 18:01, Olivier Dobberkau wrote:
> On 05.01.15 at 16:43, Jigal van Hemert wrote:
>
>> Could it be the case that java.net.URL (which is used in the siteHash
>> plugin) can't handle https?
>
> Please see:
>
> https://issues.apache.org/jira/browse/NUTCH-1676
>
> Looks like there is a patch for Nutch 1.9.
That is a patch that lets protocol-http index https sites. The same
result is easy to get without it by adding protocol-httpclient to
plugin.includes in nutch-site.xml (I already use that on some sites),
because protocol-httpclient supports https.
> Our Version of Nutch: https://github.com/dkd/nutch-typo3-cms is based on
> nutch 1.8 and includes the following changes: NUTCH-585.
NUTCH-585 is indeed a very useful patch for removing content before it
is indexed. It's a pity that it is not yet included in Nutch itself.
The actual problem occurs when the siteHash needs to be fetched in [1].
According to the log, the URL that is constructed is correct. That URL
also works when requested with, for example, wget on the actual Solr
server machine: it returns a block of JSON with the site hash.
Yet the log also reports "ERROR! could not connect to <URL>". The
function in [1] uses java.net.URL to open a stream to that URL, and
somehow that step seems to fail.
I don't know Java well enough to debug or even compile this, so any
suggestions would be very welcome!
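One way to narrow this down might be a small standalone test that opens
the URL with java.net.URL outside of Nutch and reports the concrete
exception instead of a generic "could not connect" message. The sketch
below is an assumption about how such a test could look; it is not code
from the plugin in [1], and the class name SiteHashProbe and the method
probe are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of nutch-typo3-cms.
class SiteHashProbe {

    // Opens the address with java.net.URL.openStream() and returns either
    // the response body or the exception class and message that occurred.
    static String probe(String address) {
        try {
            URL url = new URL(address);
            try (InputStream in = url.openStream();
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(in, StandardCharsets.UTF_8))) {
                StringBuilder body = new StringBuilder();
                String line;
                while ((line = reader.readLine()) != null) {
                    body.append(line).append('\n');
                }
                return "OK: " + body;
            }
        } catch (IOException e) {
            // MalformedURLException, SSL and connection errors are all
            // IOExceptions; keep the class name so the cause is visible.
            return "FAILED: " + e.getClass().getName() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java SiteHashProbe <url>");
            return;
        }
        System.out.println(probe(args[0]));
    }
}
```

Compiled with javac and run as "java SiteHashProbe <URL>" on the machine
that runs Nutch, with the same Java installation, this should show
whether plain java.net.URL can read the URL at all. If it prints
something like javax.net.ssl.SSLHandshakeException, a possible
explanation is that this Java installation does not trust the site's
certificate, which would also explain why wget succeeds where Java
fails.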
[1]
https://github.com/dkd/nutch-typo3-cms/blob/master/src/plugin/typo3-sitehash/src/java/org/typo3/nutch/indexer/sitehash/SiteHashIndexingFilter.java
--
Jigal van Hemert
TYPO3 CMS Active Contributor
TYPO3 .... inspiring people to share!
Get involved: typo3.org