[TYPO3-core] RFC #13732: External URL only indexes first page [indexed_search]
Xavier Perseguers
typo3 at perseguers.ch
Mon Mar 8 15:42:27 CET 2010
Hi,
This is a SVN patch request.
Type: Bugfix
Branches: trunk, 4-3
Bugtracker reference:
http://bugs.typo3.org/view.php?id=13732
Problem:
When indexing an external URL/website, only the first page is indexed;
none of the external website's subpages are.
The problem is related to relative vs. absolute (with scheme) links in
hyperlinks. Today's websites often use relative links:
<a href="some/relative/page.html">....
instead of
<a href="http://www.domain.tld/subsite/some/relative/page.html">
The problem is that the method indexExtUrl() in
EXT:indexed_search/class.crawler.php is not able to properly convert a
relative link to an absolute one when dealing with external websites.
In such cases, the relative link above is converted to
http://www.domain.tld/some/relative/page.html
Please note the missing "/subsite/" part in the computed full URL.
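
To make the difference concrete, here is a minimal sketch in Python (not
the PHP code of indexed_search). The page URL ending in "index.html" is
an assumption made for illustration; the host-only join only mimics the
faulty result described above and is not taken from class.crawler.php.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical page being indexed; only the domain and "/subsite/" come
# from the example above, "index.html" is assumed.
page_url = "http://www.domain.tld/subsite/index.html"
href = "some/relative/page.html"

# Faulty conversion: keep only scheme and host, then append the link.
parts = urlsplit(page_url)
host_only = f"{parts.scheme}://{parts.netloc}/{href}"

# Standard resolution: the link is resolved against the enclosing path.
resolved = urljoin(page_url, href)

print(host_only)
print(resolved)
```

The first value loses "/subsite/", the second keeps it.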
Solution:
According to [1], conversion from a relative URL to a full URL should
first use the "base href" tag, if present, and otherwise resolve the
relative URL against the enclosing path of the current document.
The patch tries to extract the base href, if present. Otherwise it uses
the same mechanism as before the patch, but no longer forgets to append
the path after the domain name (and makes sure to remove any ".html"
page that may be given as base URL in the indexing configuration, so
that only the enclosing path or "parent directory" is used).
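
The resolution order described here can be sketched as follows. This is
an illustration in Python under stated assumptions, not the attached
patch: the names BaseHrefFinder and to_absolute are hypothetical, and
the standard library's urljoin takes care of dropping the trailing page
name from the base URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseHrefFinder(HTMLParser):
    """Records the href of the first <base> tag, if any (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        if tag == "base" and self.base_href is None:
            self.base_href = dict(attrs).get("href")


def to_absolute(html, page_url, href):
    """Resolve href against <base href> if present, else against page_url."""
    finder = BaseHrefFinder()
    finder.feed(html)
    if finder.base_href:
        # A base href may itself be relative to the page URL.
        base = urljoin(page_url, finder.base_href)
    else:
        base = page_url
    # urljoin strips the last path segment of base (e.g. "index.html")
    # before appending a relative href.
    return urljoin(base, href)
```

For example, calling to_absolute() with a page at
http://www.domain.tld/subsite/index.html and no base tag keeps the
"/subsite/" part, while a page declaring a base href resolves against
that base instead.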
Note:
The patch is against the latest version, even if its revision number
seems to refer to an old revision ;)
Cheers
[1] http://www.w3.org/TR/html401/struct/links.html#h-12.4
--
Xavier Perseguers
http://xavier.perseguers.ch/en
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: 13732.diff
URL: <http://lists.typo3.org/pipermail/typo3-team-core/attachments/20100308/a40f8902/attachment.txt>