[TYPO3-core] RFC #13732: External URL only indexes first page [indexed_search]

Xavier Perseguers typo3 at perseguers.ch
Mon Apr 5 18:12:18 CEST 2010


Hi,

REMINDER #2

Questions and remarks of this sub-thread should be tackled with


On 03/23/10 8:59, Xavier Perseguers wrote:
> Hi,
>
> REMINDER #1
>
>
>
> On 08.03.2010 15:42, Xavier Perseguers wrote:
>> Hi,
>>
>> This is an SVN patch request.
>>
>> Type: Bugfix
>>
>> Branches: trunk, 4-3
>>
>> Bugtracker reference:
>> http://bugs.typo3.org/view.php?id=13732
>>
>> Problem:
>> When indexing an external URL/website, only the first page is indexed;
>> none of the subpages of the external website are.
>>
>> The problem is related to relative vs. absolute (with scheme) URLs in
>> hyperlinks. Today's websites often use relative links:
>>
>> <a href="some/relative/page.html">....
>>
>> instead of
>>
>> <a href="http://www.domain.tld/subsite/some/relative/page.html">
>>
>> The problem is that method indexExtUrl() in
>> EXT:indexed_search/class.crawler.php is not able to properly convert a
>> relative link to an absolute one when dealing with external websites.
>> In such cases, the relative link above is converted to
>>
>> http://www.domain.tld/some/relative/page.html
>>
>> Please note the missing "/subsite/" part in the computed full URL.
>>
>> Solution:
>> According to [1], conversion from a relative URL to a full URL should
>> first try to use a "base href" tag, if present, and otherwise resolve
>> the relative URL against the enclosing path of the current document.
>>
>> The patch tries to extract the base href, if present. Otherwise it uses
>> the same mechanism as before the patch, but no longer forgets to append
>> the path after the domain name (and makes sure to remove any ".html"
>> page that may be given as base URL in the indexing configuration, so
>> that only the enclosing path or "parent directory" is kept).
>>
>> Note:
>> The patch is against the latest version, even if the revision number
>> in it seems to refer to an old revision ;)
>>
>> Cheers
>>
>> [1] http://www.w3.org/TR/html401/struct/links.html#h-12.4
>>
>
>
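
For illustration only, the resolution order described in the quoted
"Solution" section (use the base href if present, otherwise the enclosing
path of the page URL) can be sketched in Python. This is not the patch
itself, which is PHP code in indexExtUrl(); the names BaseHrefParser and
resolve_link are hypothetical, and the sketch assumes the standard
library's urljoin() for the actual relative-to-absolute conversion.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseHrefParser(HTMLParser):
    """Hypothetical helper: remembers the href of the first <base> tag."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        # Only the first <base href="..."> in the document is used.
        if tag == "base" and self.base_href is None:
            self.base_href = dict(attrs).get("href")


def resolve_link(page_url, html, href):
    """Turn a link found in `html` (fetched from `page_url`) into a full URL."""
    parser = BaseHrefParser()
    parser.feed(html)
    # Prefer the explicit base href; otherwise fall back to the page URL.
    if parser.base_href:
        base = urljoin(page_url, parser.base_href)
    else:
        base = page_url
    # urljoin() drops the last path segment of the base (e.g. "index.html"),
    # so the enclosing path ("/subsite/") is kept in the result.
    return urljoin(base, href)


if __name__ == "__main__":
    page = "http://www.domain.tld/subsite/index.html"
    print(resolve_link(page, "<html><head></head></html>",
                       "some/relative/page.html"))
```

With the page URL above and no base tag, the relative link from the
example keeps its "/subsite/" part instead of being attached directly to
the domain name.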


-- 
Xavier Perseguers
http://xavier.perseguers.ch/en

