[TYPO3-core] RFC #13732: External URL only indexes first page [indexed_search]

Xavier Perseguers typo3 at perseguers.ch
Mon Mar 8 15:42:27 CET 2010


Hi,

This is an SVN patch request.

Type: Bugfix

Branches: trunk, 4-3

Bugtracker reference:
http://bugs.typo3.org/view.php?id=13732

Problem:
When indexing an external URL/website, only the first page is indexed; 
none of the subpages of the external website are.

The problem comes down to relative vs. absolute (with scheme) 
hyperlinks. Today's websites often use relative links:

<a href="some/relative/page.html">....

instead of

<a href="http://www.domain.tld/subsite/some/relative/page.html">

The method indexExtUrl() in EXT:indexed_search/class.crawler.php does 
not properly convert relative links to absolute ones when dealing with 
external websites. In such cases, the relative link above is converted 
to

http://www.domain.tld/some/relative/page.html

Please note the missing "/subsite/" part in the computed full URL.
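
To make the difference concrete, here is a minimal sketch in Python (not 
the PHP code of indexExtUrl()). The page URL 
"http://www.domain.tld/subsite/index.html" and the helper 
naive_absolute() are assumptions made for illustration; the helper only 
mimics the behaviour described above, where the relative link is 
appended directly after the domain name.

```python
from urllib.parse import urljoin, urlsplit

# Assumed URL of the indexed external page (hypothetical, for illustration).
page_url = "http://www.domain.tld/subsite/index.html"
link = "some/relative/page.html"


def naive_absolute(page_url, link):
    """Mimic the reported bug: keep only scheme and host, drop the path."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/{link}"


# Result of the buggy conversion: the "/subsite/" part is lost.
naive = naive_absolute(page_url, link)

# Standard relative-reference resolution keeps the enclosing path.
resolved = urljoin(page_url, link)

print(naive)
print(resolved)
```

The first value printed is the URL the crawler computes today, the 
second one is the URL a browser would follow.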

Solution:
According to [1], conversion from a relative URL to a full URL should 
first use the "base href" tag, if present, and otherwise resolve the 
relative URL against the enclosing path of the current document.

The patch tries to extract the base href, if present. Otherwise it uses 
the same mechanism as before, but no longer forgets to append the path 
after the domain name. It also removes any ".html" page that may be 
given as base URL in the indexing configuration, so that only the 
enclosing path ("parent directory") is kept.

Note:
The revision number in my patch looks like an old one, but the patch is 
against the latest version ;)

Cheers

[1] http://www.w3.org/TR/html401/struct/links.html#h-12.4

-- 
Xavier Perseguers
http://xavier.perseguers.ch/en
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: 13732.diff
URL: <http://lists.typo3.org/pipermail/typo3-team-core/attachments/20100308/a40f8902/attachment.txt>

