[TYPO3-core] RFC #13732: External URL only indexes first page [indexed_search]

Xavier Perseguers typo3 at perseguers.ch
Mon Mar 8 16:47:02 CET 2010


Hi,

Sorry :-/ after testing with other websites, I found out that my patch
did not handle links that have an absolute path but no scheme or hostname:

<a href="/somepage.html">

That is, links starting with a slash '/', which are resolved against the
hostname. v2 takes care of this, so all three kinds of links are now
handled properly:

Full link:
<a href="http://www.domain.tld/subsite/some-page.html">

Relative link (takes base href or computed base href into account):
<a href="some-other-page.html">

Absolute link:
<a href="/subsite/some-other-page.html">
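
For illustration only, here is a rough sketch of the three cases in Python.
It is not the code from the attached patch (which is PHP); it relies on the
standard library's urllib.parse.urljoin, and the names BaseHrefFinder and
resolve_link are made up for this example:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseHrefFinder(HTMLParser):
    """Remember the href of the first <base> tag, if the page has one."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        if tag == "base" and self.base_href is None:
            self.base_href = dict(attrs).get("href")


def resolve_link(page_url, html, href):
    """Turn an href found on page_url into a full URL (hypothetical helper)."""
    finder = BaseHrefFinder()
    finder.feed(html)
    # An explicit <base href> wins; otherwise the page URL itself is the
    # base, and urljoin drops its last segment (e.g. "index.html") so that
    # only the enclosing path is used.
    base = urljoin(page_url, finder.base_href) if finder.base_href else page_url
    return urljoin(base, href)


page = "http://www.domain.tld/subsite/index.html"
for link in (
    "http://www.domain.tld/subsite/some-page.html",  # full link
    "some-other-page.html",                          # relative link
    "/subsite/some-other-page.html",                 # absolute link
):
    print(resolve_link(page, "", link))
```

With the page URL assumed above, the relative and the absolute link both
resolve to http://www.domain.tld/subsite/some-other-page.html, and the full
link is returned unchanged.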

Cheers

On 08.03.2010 15:42, Xavier Perseguers wrote:
> Hi,
>
> This is an SVN patch request.
>
> Type: Bugfix
>
> Branches: trunk, 4-3
>
> Bugtracker reference:
> http://bugs.typo3.org/view.php?id=13732
>
> Problem:
> When indexing an external URL/website, the first page is indexed but no
> subpage of the external website.
>
> Problem is related to relative links vs absolute (w/ scheme) in
> hyperlinks. Today's websites often use relative links:
>
> <a href="some/relative/page.html">....
>
> instead of
>
> <a href="http://www.domain.tld/subsite/some/relative/page.html">
>
> The problem is that EXT:indexed_search/class.crawler.php, in method
> indexExtUrl(), is not able to properly convert a relative link to an
> absolute one when dealing with external websites. In such cases, the
> relative link above will be converted to
>
> http://www.domain.tld/some/relative/page.html
>
> Please note the missing "/subsite/" part in the computed full url.
>
> Solution:
> According to [1], conversion from relative url to full url should first
> try to use a "base href" tag if present and then rely on implicit
> relative url with enclosing path.
>
> The patch extracts the base href, if present. Otherwise it uses the
> same mechanism as before the patch, but now appends the path after the
> domain name (and strips any ".html" page that may be given as base URL
> in the indexing configuration, so that only the enclosing path or
> "parent directory" is kept).
>
> Note:
> The patch is against the latest version, even if the revision number in
> it seems to point to an old revision ;)
>
> Cheers
>
> [1] http://www.w3.org/TR/html401/struct/links.html#h-12.4
>


-- 
Xavier Perseguers
http://xavier.perseguers.ch/en
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: 13732_v2.diff
URL: <http://lists.typo3.org/pipermail/typo3-team-core/attachments/20100308/9c82051b/attachment.txt>

