[TYPO3-Solr] Nutch and sitehash on a site with SSL

Wed Jan 7 11:38:48 CET 2015

Hi Olivier,

On 06/01/2015 18:01, Olivier Dobberkau wrote:
> On 05.01.15 at 16:43, Jigal van Hemert wrote:
>
>> Could it be the case that java.net.URL (which is used in the siteHash
>> plugin) can't handle https?
>
> Please see:
>
> https://issues.apache.org/jira/browse/NUTCH-1676
>
> looks like there is a patch for Nutch 1.9

That is a patch that lets protocol-http index https sites. The same
result is easily achieved without it by adding protocol-httpclient to
plugin.includes in nutch-site.xml (I already use that on some sites);
protocol-httpclient supports https.

> Our Version of Nutch: https://github.com/dkd/nutch-typo3-cms is based on
> nutch 1.8 and includes the following changes: NUTCH-585.

NUTCH-585 is indeed a very useful patch for removing content before it's
indexed. It's a pity that it is not yet included in Nutch itself.

The actual problem occurs when the siteHash needs to be fetched in [1].
The URL that is constructed, as shown in the log, is correct. That URL
also works when requested with, for example, wget on the Solr server
machine itself, and returns a block of JSON containing the site hash.
Yet the log also reports "ERROR! could not connect to <URL>". The
function in [1] uses java.net.URL to open a stream to that URL, and for
some reason this fails.
I don't know Java well enough to debug or even compile this. Any
suggestions would be helpful!
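
One way to narrow this down might be a small standalone program that
opens the same URL with java.net.URL and prints the exception class and
message instead of a generic error. The following is only a sketch: the
class name SiteHashProbe and the method probe are made up for
illustration and are not part of the plugin in [1]. It would need to be
run on the Solr server with the same Java installation that Nutch uses,
passing the URL from the log as the only argument. Java keeps its own
store of trusted certificates, so one possibility (not confirmed here)
is that wget accepts the site's certificate while Java does not.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic helper; not part of the typo3-sitehash plugin.
class SiteHashProbe {

    // Opens a stream to the address with java.net.URL and returns either
    // the response body or the class name and message of the exception.
    static String probe(String address) {
        try {
            URL url = new URL(address);
            try (InputStream in = url.openStream();
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(in, StandardCharsets.UTF_8))) {
                StringBuilder body = new StringBuilder();
                String line;
                while ((line = reader.readLine()) != null) {
                    body.append(line).append('\n');
                }
                return "OK: " + body;
            }
        } catch (IOException e) {
            // Malformed URLs, unknown hosts and failed SSL handshakes are
            // all reported as subclasses of IOException.
            return "FAILED: " + e.getClass().getName() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java SiteHashProbe <url>");
            return;
        }
        System.out.println(probe(args[0]));
    }
}
```

Compile with "javac SiteHashProbe.java" and run with
"java SiteHashProbe <url>". A line starting with "FAILED:" names the
exception type, which should say more than "could not connect".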

[1] 
https://github.com/dkd/nutch-typo3-cms/blob/master/src/plugin/typo3-sitehash/src/java/org/typo3/nutch/indexer/sitehash/SiteHashIndexingFilter.java

-- 
Jigal van Hemert
TYPO3 CMS Active Contributor

TYPO3 .... inspiring people to share!
Get involved: typo3.org