[TYPO3-core] RFC #13732: External URL only indexes first page [indexed_search]

Xavier Perseguers typo3 at perseguers.ch
Tue Mar 16 18:54:07 CET 2010


REMINDER #1



On 03/08/10 16:47 , Xavier Perseguers wrote:
> Hi,
>
> Sorry :-/ After testing with other websites, I found out that my patch
> did not handle "relative" links that use an absolute path:
>
> <a href="/somepage.html">
>
> That is, links starting with a slash '/', which are relative to the
> hostname. v2 takes care of this, so all links are now handled properly:
>
> Full link:
> <a href="http://www.domain.tld/subsite/some-page.html">
>
> Relative link (takes base href or computed base href into account):
> <a href="some-other-page.html">
>
> Absolute-path link (relative to the hostname):
> <a href="/subsite/some-other-page.html">
>
> Cheers
>
> On 08.03.2010 15:42, Xavier Perseguers wrote:
>> Hi,
>>
>> This is an SVN patch request.
>>
>> Type: Bugfix
>>
>> Branches: trunk, 4-3
>>
>> Bugtracker reference:
>> http://bugs.typo3.org/view.php?id=13732
>>
>> Problem:
>> When indexing an external URL/website, only the first page is indexed;
>> none of the subpages of the external website are.
>>
>> The problem is related to relative vs. absolute (with scheme) links in
>> hyperlinks. Today's websites often use relative links:
>>
>> <a href="some/relative/page.html">....
>>
>> instead of
>>
>> <a href="http://www.domain.tld/subsite/some/relative/page.html">
>>
>> The problem is that the method indexExtUrl() in
>> EXT:indexed_search/class.crawler.php is not able to properly convert a
>> relative link to an absolute one when dealing with external websites.
>> In such cases, the relative link above will be converted to
>>
>> http://www.domain.tld/some/relative/page.html
>>
>> Please note the missing "/subsite/" part in the computed full URL.
>>
>> Solution:
>> According to [1], conversion from a relative URL to a full URL should
>> first use the "base href" tag if present, and otherwise fall back to
>> the enclosing path of the document's own URL.
>>
>> The patch tries to extract the base href if present. Otherwise it uses
>> the same mechanism as before the patch, but no longer forgets to append
>> the path after the domain name. It also makes sure to remove any
>> ".html" page that may be given as base URL in the indexing
>> configuration, so that only the enclosing path ("parent directory") is
>> used.
>>
>> Note:
>> My patch was made against the latest version, even if its revision
>> number seems to refer to an old revision ;)
>>
>> Cheers
>>
>> [1] http://www.w3.org/TR/html401/struct/links.html#h-12.4
>>
>
>
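
For readers who want to check the resolution order described in the quoted
solution (base href first, otherwise the enclosing path of the indexed URL),
here is a minimal sketch in Python. It is not the patch itself, which changes
PHP code in indexExtUrl(); the helper name resolve_link, the regular
expression and the example page URL are assumptions made for illustration
only.

```python
import re
from urllib.parse import urljoin

# Hypothetical helper for illustration; this is NOT the code from the patch.
# Simplified pattern: finds the href attribute of a <base> tag, if any.
BASE_HREF_RE = re.compile(
    r'<base\s[^>]*href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE
)


def resolve_link(href, page_url, html=""):
    """Return the full URL for a link `href` found on `page_url`.

    If the page's HTML declares a base href, it is used as the base.
    Otherwise the URL of the page itself is the base, which means the
    last path segment (e.g. "index.html") is dropped and only the
    enclosing path is kept.
    """
    match = BASE_HREF_RE.search(html)
    base = match.group(1) if match else page_url
    return urljoin(base, href)


# Assumed example page, matching the domain used in the mails above.
page = "http://www.domain.tld/subsite/index.html"

# Relative link: resolved against the enclosing path "/subsite/".
print(resolve_link("some-other-page.html", page))
# Absolute-path link: resolved against the hostname only.
print(resolve_link("/subsite/some-other-page.html", page))
# Full link: returned unchanged.
print(resolve_link("http://www.domain.tld/subsite/some-page.html", page))
```

Note that urljoin treats the last path segment of the base as a file name, so
a configured URL such as http://www.domain.tld/subsite (without a trailing
slash) would resolve relative links against the hostname root.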


-- 
Xavier Perseguers
http://xavier.perseguers.ch/en

