[Typo3] Weirdness with google indexing

Fri Jun 17 02:21:49 CEST 2005

"Bernhard Kraft" <kraftb at kraftb.at> wrote in message 
news:mailman.1.1118951571.8810.typo3-english at lists.netfielders.de...
> Boris Senker wrote:
>
> Where have you got that deep insight into google indexing mechanisms from 
> ?
> Is that somewhere online to read ?
>

Hi Bernhard,

Um... more or less. I read about this some time ago (I spent a couple of 
months reading up on how search engines work) and later saw it in practice 
while submitting sites and watching the server logs.

The main difference between the two Googlebot visits is this: the first is a 
quick, shallow crawl in which the site is queued up for a deep crawl; the 
later one is the big, deep crawl in which full pages are actually fetched and 
the data is passed on to the big indexer.

There is also a description of how it works here:

http://www.googleguide.com/google_works.html

This page describes it pretty nicely; I'll quote an abridged version:

------------------------------------------------------------------------------------------------------------------------------
Google consists of three distinct parts, each of which is run on a 
distributed network of thousands of low-cost computers and can therefore 
carry out fast parallel processing. Parallel processing is a method of 
computation in which many calculations can be performed simultaneously, 
significantly speeding up data processing.

  * Googlebot, a web crawler that finds and fetches web pages.
  * The indexer that sorts every word on every page and stores the 
    resulting index of words in a huge database.
  * The query processor, which compares your search query to the index and 
    recommends the documents that it considers most relevant.
.....

Googlebot, Google's Web Crawler

Googlebot is Google's web crawling robot, which finds and retrieves pages on 
the web and hands them off to the Google indexer. It's easy to imagine 
Googlebot as a little spider scurrying across the strands of cyberspace, but 
in reality Googlebot doesn't traverse the web at all. It functions much like 
your web browser, by sending a request to a web server for a web page, 
downloading the entire page, then handing it off to Google's indexer.

.....

When Googlebot fetches a page, it culls all the links appearing on the page 
and adds them to a queue for subsequent crawling... (and so on; the rest is 
at the link above).

------------------------------------------------------------------------------------------------------------------------------
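
To make the "cull the links and queue them" step from the quote a bit more 
concrete, here is a minimal Python sketch. It is only an illustration and 
certainly not how Googlebot is really built: the names LinkCollector, 
extract_links and crawl are made up, and the "web" is a small in-memory 
dictionary (PAGES) on the placeholder domain example.org instead of real 
HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute URLs for all links found in the HTML text."""
    collector = LinkCollector()
    collector.feed(html)
    return [urljoin(base_url, href) for href in collector.links]


def crawl(start_url, fetch):
    """Breadth-first crawl. fetch(url) returns HTML text, or None if missing."""
    queue = deque([start_url])   # URLs waiting to be fetched
    seen = {start_url}           # URLs already queued, to avoid loops
    visited = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical three-page site standing in for a real web server.
PAGES = {
    "http://example.org/":
        '<a href="/news.html">News</a> <a href="about.html">About</a>',
    "http://example.org/news.html": '<a href="/">Home</a>',
    "http://example.org/about.html": '<a href="/news.html">News</a>',
}

if __name__ == "__main__":
    for url in crawl("http://example.org/", PAGES.get):
        print(url)
```

Every link found on a fetched page goes to the back of the queue, and the 
"seen" set keeps pages that link to each other from being queued twice.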

It sometimes takes less time, sometimes (lately rarely) more, but the pattern 
is regular: first a normal, shallow crawl, then after some time a big, deep 
crawl. The results of those big crawls are submitted to the indexer in runs, 
when the whole Google index is periodically updated in cycles known as the 
'Google Dance'.

http://dance.efactory.de/
http://www.google-dance-tool.com/

To quote one forum post I have saved: 'I submitted a page 3 weeks ago, and 
Google just finally crawled it yesterday.' He meant the big crawl here.

There is also a nice post on the WebmasterWorld forum concerning Google updates:

http://www.webmasterworld.com/forum3/3726.htm

Cycles used to be mostly like this:

      Q: [Google.com] Is there anyway to speed up Googles turn around time?
      A: Not really. Google works at its own (often lethargic) pace. Two 
         weeks to 90 days is the norm for new sites, and two weeks to 60 
         days for updated pages.

BTW, it looks like those cycles have been shortened; lately it is mostly 
within two to three weeks.

Usually we submit a site and see Googlebot arrive within a day or so. It 
takes a quick 'look' and queues the site for a deep crawl; this first visit 
is very short, so apparently it just collects links and prepares for the big 
crawl. The big crawl happens some days later, and in the meantime Googlebot 
makes a few short visits to the site (same bot? another bot? I don't know; 
the IPs change often and Google has many bots on many servers worldwide). 
After the big, deep crawl, the data is sent to the 'Big Pappa Indexer'. Then, 
within approximately two to three weeks (sometimes more), the Big Indexer 
does its Google Dance, refreshing its indexes and rankings, and sites appear 
on Google fully indexed.
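
If you want to watch this pattern in your own logs, a small script along 
these lines can help. It is only a sketch under a few assumptions: the log is 
in Apache 'combined' format, the crawler identifies itself with 'Googlebot' 
in the user-agent string, and the sample lines (with addresses from the 
192.0.2.x documentation range) are invented for the example. The function 
name googlebot_hits_per_day is made up as well.

```python
import re
from collections import Counter

# Apache "combined" format:
# ip ident user [day:time zone] "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] '
    r'"(?P<request>[^"]*)" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits_per_day(lines):
    """Count requests per day whose user-agent mentions Googlebot."""
    hits = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and "Googlebot" in match.group("agent"):
            hits[match.group("day")] += 1
    return hits


# Invented sample entries: a short first visit, then a bigger crawl later.
BOT = "Googlebot/2.1 (+http://www.google.com/bot.html)"
SAMPLE_LOG = [
    f'192.0.2.10 - - [10/Jun/2005:08:15:02 +0200] "GET /robots.txt HTTP/1.0" 200 68 "-" "{BOT}"',
    f'192.0.2.10 - - [10/Jun/2005:08:15:03 +0200] "GET / HTTP/1.0" 200 5120 "-" "{BOT}"',
    '192.0.2.99 - - [10/Jun/2005:09:00:00 +0200] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    f'192.0.2.11 - - [14/Jun/2005:03:01:10 +0200] "GET / HTTP/1.0" 200 5120 "-" "{BOT}"',
    f'192.0.2.11 - - [14/Jun/2005:03:01:12 +0200] "GET /news.html HTTP/1.0" 200 7300 "-" "{BOT}"',
    f'192.0.2.12 - - [14/Jun/2005:03:01:15 +0200] "GET /about.html HTTP/1.0" 200 2100 "-" "{BOT}"',
]

if __name__ == "__main__":
    for day, count in googlebot_hits_per_day(SAMPLE_LOG).items():
        print(day, count)
```

A day with only a handful of requests followed later by a day with many more 
would match the shallow-then-deep pattern described above. Note that the 
user-agent string can be faked, so this count is only a rough indicator.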

Some sources claim that the main Google index isn't refreshed fully until 
all queued domains have sent their information back. I can't confirm that.

High-ranking, frequently updated pages are refreshed in the Google index 
within a day, but that only goes for sites already indexed and ranked on 
Google.

One more tip, for Yahoo and for people who use tt_news: tt_news also 
provides an RSS feed, which is a very nice backdoor for getting into Yahoo 
quickly. Enable and adjust the tt_news RSS feed, then log in to your free My 
Yahoo account, where you can add content. In very fine print at the bottom 
of the Change Layout page there is a link, 'Publish RSS on My Yahoo!'. The 
FAQ that it opens contains another link, 'add by RSS URL'. Add your tt_news 
RSS URL there, and your own site's RSS feed will appear on your free Yahoo 
page when you log in. A welcome side effect is that this draws the Yahoo 
indexer to your site very quickly and adds your site to Yahoo, without 
waiting months for a free submission to go through.
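
Before adding the feed URL anywhere, it may be worth checking that the feed 
is well-formed and actually contains items. Here is a rough Python sketch, 
assuming the feed is configured to output RSS 2.0 style XML (rss > channel > 
item); the function name feed_items and the sample feed are made up for 
illustration and are not real tt_news output.

```python
import xml.etree.ElementTree as ET


def feed_items(xml_text):
    """Return (title, link) pairs for every item in an RSS 2.0 style feed."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    channel = root.find("channel")
    if root.tag != "rss" or channel is None:
        raise ValueError("not an RSS 2.0 style feed")
    items = []
    for item in channel.findall("item"):
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        items.append((title, link))
    return items


# Invented sample feed; a real check would read the text from your feed URL.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example news</title>
    <link>http://example.org/</link>
    <description>Latest news</description>
    <item>
      <title>First article</title>
      <link>http://example.org/news/1.html</link>
    </item>
    <item>
      <title>Second article</title>
      <link>http://example.org/news/2.html</link>
    </item>
  </channel>
</rss>"""

if __name__ == "__main__":
    for title, link in feed_items(SAMPLE_FEED):
        print(title, "->", link)
```

If the function raises an error or returns an empty list, the feed probably 
needs fixing before it is worth submitting.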