[TYPO3-core] RFC #9400: Integrate jb_status_code in the TYPO3 core

Mon Sep 22 21:51:35 CEST 2008

Hi!

I apologize to the rest of the team for the very long post, but it is necessary to prevent a misunderstanding on this serious question about redirects.

Andreas Wolf wrote:
> Your statement would be correct - if it was against the specification.
> But it isn't: A temporary redirect (302) is interpreted as "the resource
> is at the redirect target now, but it may be somewhere else tomorrow".
> So the called URL itself is indexed, the target url is (often) deleted
> from the index.

So what did I say wrong? :)

> Example: I insert a redirect to typo3.org on
> www.mycoolhomepage.com/redirectme.php. Now $robot crawls this page,
> finds the redirect to typo3.org, with response code 302.
> As typo3.org could be j*****.org tomorrow [1], it only indexes
> www.mycoolhomepage.com/redirectme.php and - probably - deletes typo3.org
> from its index.

Completely wrong; see the explanation after the next quote.

> So there is in fact no problem for my own homepage, but for the site I
> redirect to. That's why this phenomenon is called URL hijacking.
> 
> And in fact the search engines are right according to the HTTP spec.
> Just have a look at RFC 2616, section 10.3.3:
> 
>   Since the redirection might be altered on occasion, the client SHOULD
>   continue to use the Request-URI for future requests.

I know the spec very well; I had to implement HTTP protocol stacks in Java and C++ in the past :) And I am afraid this interpretation is wrong. Nothing in the specification supports the above point of view.

According to the specification, the content of "www.mycoolhomepage.com/redirectme.php" *temporarily* resides at another place, namely "typo3.org". This status code means that the requester should continue fetching "www.mycoolhomepage.com/redirectme.php" because its current redirect is not stable over time. The specification clearly says so in your quote above. The redirect may cease to exist at any moment, and normal content may be served instead. For example, if I owned typo3.lv, I could 302 it to typo3.org, but this would never make typo3.lv more important than typo3.org.
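
To make the distinction concrete, here is a minimal sketch in Python of how a client that follows RFC 2616 might decide which URL to remember for future requests. The function name and constants are my own illustration (they are not taken from any library or from TYPO3), and real crawlers apply additional heuristics on top of this.

```python
# Illustrative sketch only: names here are hypothetical, not a real API.
PERMANENT_REDIRECTS = {301}

def url_for_future_requests(request_url, status, location):
    """Return the URL a client should use next time, given one response."""
    if status in PERMANENT_REDIRECTS and location:
        # 301: the resource has moved for good, so remember the new URL.
        return location
    # 302/307 and everything else: keep using the original Request-URI,
    # because the redirect may be altered or removed at any time.
    return request_url
```

With the typo3.lv example, `url_for_future_requests("http://typo3.lv/", 302, "http://typo3.org/")` keeps returning the typo3.lv address as the one to request again; only a 301 would switch the client over to typo3.org.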

Read the specification carefully: it neither says that "typo3.org" and "www.mycoolhomepage.com/redirectme.php" from your example are the same thing, nor does it say that "typo3.org" is made invalid by someone redirecting to it. "typo3.org" stays the main resource here and "www.mycoolhomepage.com/redirectme.php" is just a pointer to it. I do not know where that article in German got the idea, but I think the interpretation there is totally wrong. I have worked with the HTTP protocol since 1998 and I have never heard such nonsense.

Think about it yourself: if I now set up a 302 redirect from some cheap hosting to www.cnn.com, does it mean www.cnn.com will cease to exist in search engines due to this "URL hijacking"? Does that sound logical to you? If it worked that way, we would not have any real results in search engines these days. What would prevent hackers of all kinds from using it? Why would they need to register domains like www.vnn.com (v is next to c on the keyboard) if they could simply use 302 to catch domains?

* * *

After writing all of the above I went to Matt Cutts's web site. I was sure he would have described it there. And I found it. Here is the link:
http://www.mattcutts.com/blog/seo-advice-discussing-302-redirects/

If you do not know who Matt Cutts is, Google him ;)

Matt says:
======================
Google is moving to a set of heuristics that return the destination page more than 99% of the time. Why not 100% of the time? Most search engine reserve the right to make exceptions when we think the source page will be better for users, even though we’ll only do that rarely.
======================

As you can see, this confirms my view, although there are still rare "exceptions".

Next, Matt discusses two URLs: sfgiants.com, which uses a 302 to redirect to sanfrancisco.giants.mlb.com/NASApp/mlb/index.jsp?c_id=sf. He says:
======================
Remember that sfgiants.com does a 302 redirect to a url on a different domain (sanfrancisco.giants.mlb.com/NASApp/mlb/index.jsp?c_id=sf). And remember that reasonable people can disagree on which url should show up at #1. I’m not trying to criticize any search engine here, but rather trying to point out that this is a weird corner case.

Current Google behavior: we return sfgiants.com at #1. But we also return http://sanfrancisco.giants.mlb.com/NASApp/mlb/sf/homepage/sf_homepage.jsp at #3, as an uncrawled url, which is definitely poor/suboptimal.

Current Ask behavior: Ask returns giants.mlb.com/NASApp/mlb/sf/homepage/sf_homepage.jsp at #1, sanfrancisco.giants.mlb.com/NASApp/mlb/index.jsp?c_id=sf at #2, and sanfrancisco.giants.mlb.com/NASApp/mlb/sf/homepage/sf_homepage.jsp at #3.

Current MSN behavior: MSN returns giants.mlb.com/NASApp/mlb/sf/homepage/sf_homepage.jsp at #1 and sanfrancisco.giants.mlb.com/NASApp/mlb/index.jsp?c_id=sf at #2.

Current Yahoo! behavior: Yahoo! returns www.sfgiants.com at #1, but also returns sanfrancisco.giants.mlb.com/NASApp/mlb/index.jsp?c_id=sf at #6. You might think that returning sfgiants.com at #1 isn’t what Yahoo! said that they would do with 302 off-domain redirects (i.e. always go with the destination), but if you read carefully, Yahoo! also reserves the right to make exceptions in handling redirects. That allows them to show a nice url at #1.
======================

This "hijacking" thing applies only to some search engines, and only if a shorter URL redirects to a longer URL. Even then, every search engine still returns the right URL on the same results page. No one removes the URL from the index!

And as you can see from the description, "http://www.mycoolhomepage.com/redirectme.php" has no chance of stealing "http://typo3.org/" at all.

Given all this, I think we can forget about this hijacking thing. It is much more important to provide webmasters with a proper set of services, including 302. They may still need 302, and they will find it very inconvenient if we decide that we are cleverer than they are and forbid it. I always hate it when a machine thinks it is cleverer than I am and that it knows what I want better than I do.

Sometimes I want to open a new web site. I already know the name, I set it up, and it is available from my home IP (I work from home). For the rest of the world, however, it always redirects to my home page. So, if I register typo3-super-hero.com, I can develop it for as long as I wish. I will see it, and the rest of the world will get a 302 to typo3bloke.net. And typo3bloke.net will be perfectly healthy. I could even go to Google's webmaster console and set up typo3-super-hero.com as an alias for typo3bloke.net to prevent it from appearing in the index separately. It is all in my hands! I control it all. And I can still use 302.
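
A rough sketch of that setup, written in Python only for illustration: the function name, its parameters, and the example addresses are assumptions of mine, and a real site would more likely do this in the web server configuration or in TYPO3 itself.

```python
# Illustrative sketch only: respond() and its parameters are hypothetical.
def respond(client_ip, developer_ip, public_url):
    """Decide the status code and headers for one incoming request."""
    if client_ip == developer_ip:
        # The developer sees the unfinished site itself.
        return 200, {}
    # Everyone else gets a temporary redirect to the established site.
    # 302 (not 301) because the redirect goes away once the site launches.
    return 302, {"Location": public_url}
```

For example, `respond("198.51.100.1", "203.0.113.7", "http://typo3bloke.net/")` yields a 302 with a `Location` header pointing at typo3bloke.net, while a request coming from 203.0.113.7 is served normally. Both addresses are reserved documentation addresses, used here purely as placeholders.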

-- 
Dmitry Dulepov
TYPO3 Core team
My TYPO3 book: http://www.packtpub.com/typo3-extension-development/book
In the blog: http://typo3bloke.net/post-details/tag_your_typo3_extension_releases_in_svn/