[Typo3] Weirdness with google indexing
Boris Senker
typo3 at dvotocka.hr
Fri Jun 17 02:21:49 CEST 2005
"Bernhard Kraft" <kraftb at kraftb.at> wrote in message
news:mailman.1.1118951571.8810.typo3-english at lists.netfielders.de...
> Boris Senker wrote:
>
> Where have you got that deep insight into google indexing mechanisms from
> ?
> Is that somewhere online to read ?
>
Hi Bernhard,
More or less. I read this some time ago (I spent a couple of months exploring
and reading up on how search engines work) and later saw it in practice while
submitting sites and watching the logs.
The main difference between the two Googlebot appearances is this: the first
is a quick, shallow crawl in which the site is queued for a deep crawl; the
later one is the big, deep crawl in which full pages are actually fetched and
the data is passed to the main indexer.
The way it works is also described here:
http://www.googleguide.com/google_works.html
This page describes it pretty nicely; I'll quote a shortened version:
------------------------------------------------------------------------------------------------------------------------------
Google consists of three distinct parts, each of which is run on a
distributed network of thousands of low-cost computers and can therefore
carry out fast parallel processing. Parallel processing is a method of
computation in which many calculations can be performed simultaneously,
significantly speeding up data processing.
a. Googlebot, a web crawler that finds and fetches web pages.
b. The indexer that sorts every word on every page and stores the
resulting index of words in a huge database.
c. The query processor, which compares your search query to the index and
recommends the documents that it considers most relevant.
.....
Googlebot, Google's web Crawler
Googlebot is Google's web crawling robot, which finds and retrieves pages on
the web and hands them off to the Google indexer. It's easy to imagine
Googlebot as a little spider scurrying across the strands of cyberspace, but
in reality Googlebot doesn't traverse the web at all. It functions much like
your web browser, by sending a request to a web server for a web page,
downloading the entire page, then handing it off to Google's indexer.
.....
When Googlebot fetches a page, it culls all the links appearing on the page
and adds them to a queue for subsequent crawling...
(There is more at the link I pasted above.)
------------------------------------------------------------------------------------------------------------------------------
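The "cull the links and add them to a queue" step from the quote can be sketched in Python. This is only an illustration of the general idea, not how Googlebot is actually implemented: the class and function names, the example.com URLs, and the sample HTML are all made up for the example, and no real page is fetched.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag seen in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any #fragment.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def enqueue_links(page_url, html, queue, seen):
    """Cull links from one fetched page and queue the ones not seen before."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    for link in collector.links:
        if link not in seen:
            seen.add(link)
            queue.append(link)


# Hypothetical page content, standing in for a real HTTP response.
sample_html = (
    '<a href="/news.html">News</a> '
    '<a href="about.html#team">About</a> '
    '<a href="/news.html">News again</a>'
)
start = "http://www.example.com/index.html"
crawl_queue = deque()
seen_urls = {start}
enqueue_links(start, sample_html, crawl_queue, seen_urls)
print(list(crawl_queue))
```

The `seen` set is what keeps the duplicate /news.html link from being queued twice; a real crawler would also have to honour robots.txt and pace its requests.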
Sometimes it takes less time, sometimes (lately rarely) more, but the pattern
is regular: first a normal, shallow crawl, then after some time a big, deep
crawl. Those big crawls are submitted to the indexer in runs, when the whole
Google index is periodically updated in cycles called the 'Google Dance'.
http://dance.efactory.de/
http://www.google-dance-tool.com/
To quote one forum post I have saved: 'I submitted a page 3 weeks ago, and
Google just finally crawled it yesterday.' He meant the big crawl here.
There is also a nice post on WebmasterWorld forum concerning Google updates:
http://www.webmasterworld.com/forum3/3726.htm
Cycles used to be mostly like this:
Q: [Google.com] Is there anyway to speed up Googles turn around time?
A: Not really. Google works at its own (often lethargic) pace. Two
weeks to 90 days is the norm for new sites, and two weeks to 60 days for
updated pages.
BTW, it looks like those cycles have shortened: lately they mostly seem to
complete within two to three weeks.
Usually we submit a site and see Googlebot arrive within a day or so, taking
a quick 'look' and queuing the site for a deep crawl (this first visit is
very short; apparently it just collects links and prepares for the big
crawl). After some days the big crawl happens, and in the meantime Googlebot
makes a few short visits to the site (same bot? another bot? I don't know;
IPs change often and Google has many bots on many servers worldwide). After
the big, deep crawl, the data is sent to the 'Big Pappa Indexer'. Then,
within approximately two to three weeks (sometimes more), the Big Indexer
does its Google Dance, refreshing its indexes and rankings, and sites appear
on Google fully indexed.
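For anyone who wants to watch this pattern in their own logs, here is a rough Python sketch that counts Googlebot requests per day and labels each day as a shallow or deep visit. It assumes the Apache "combined" log format with the user agent as the last quoted field; the function names, the sample log lines (using documentation-range IPs), and the cutoff of 50 requests are all assumptions for the example, not anything Google publishes.

```python
import re
from collections import Counter

# Captures the date inside [...] and the last quoted field (the user agent)
# of an Apache "combined" log line.
LOG_LINE = re.compile(r'\[(?P<day>[^:\]]+):[^\]]*\].*"(?P<agent>[^"]*)"\s*$')


def googlebot_hits_per_day(lines):
    """Count requests per day whose user agent mentions Googlebot."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and "googlebot" in match.group("agent").lower():
            hits[match.group("day")] += 1
    return hits


def label_visits(hits, deep_threshold=50):
    """Label each day 'deep' or 'shallow' using an assumed request cutoff."""
    return {
        day: ("deep" if count >= deep_threshold else "shallow")
        for day, count in hits.items()
    }


BOT = '"Googlebot/2.1 (+http://www.google.com/bot.html)"'

# Made-up log lines: a short first visit plus one ordinary browser hit...
sample = [
    '192.0.2.10 - - [14/Jun/2005:09:00:01 +0200] "GET /robots.txt HTTP/1.0" 200 68 "-" ' + BOT,
    '192.0.2.10 - - [14/Jun/2005:09:00:03 +0200] "GET / HTTP/1.0" 200 5120 "-" ' + BOT,
    '192.0.2.77 - - [14/Jun/2005:09:05:00 +0200] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
]
# ...and a later day on which the bot fetches 60 pages.
sample += [
    '192.0.2.11 - - [17/Jun/2005:02:00:00 +0200] "GET /page{}.html HTTP/1.0" 200 4096 "-" '.format(n) + BOT
    for n in range(60)
]

hits = googlebot_hits_per_day(sample)
labels = label_visits(hits)
print(dict(hits))
print(labels)
```

Matching on the user agent string alone can be fooled by anything that claims to be Googlebot, so treat the counts as a rough indication only.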
Some sources claim that the main Google index isn't refreshed fully until
all queued domains have sent their information back. I can't confirm that.
High-ranking, frequently updated pages are refreshed in the Google index
within a day, but that applies only to sites already indexed and ranked on
Google.
There is one more tip for Yahoo, for people who use tt_news: tt_news also
provides an RSS feed, and that is a very nice backdoor for getting into Yahoo
quickly. Enable and adjust the tt_news RSS feed, then log in to your free My
Yahoo account, where you can add content. In the fine print at the bottom of
the Change Layout page there is a link, 'Publish RSS on My Yahoo!'. The FAQ
that opens contains another link, 'add by RSS URL'. Add your tt_news RSS URL
there, and your own site's RSS feed will appear on your free Yahoo page when
you log in. The desired side effect is that this draws the Yahoo indexer to
your site very quickly and gets the site added to Yahoo without waiting
months for the free submission.
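Before pasting the feed URL into My Yahoo, it can be worth checking that the feed parses at all. The Python sketch below does that with the standard library only. It assumes the feed is a plain RSS document with an rss root and a channel element; the function name and the sample feed content are invented for the example and are not actual tt_news output.

```python
import xml.etree.ElementTree as ET


def summarize_rss(xml_text):
    """Return (channel title, item count) for an RSS document.

    Raises ValueError if the root is not <rss> with a <channel> child;
    malformed XML raises xml.etree.ElementTree.ParseError instead.
    """
    root = ET.fromstring(xml_text)
    channel = root.find("channel")
    if root.tag != "rss" or channel is None:
        raise ValueError("not an RSS feed")
    title = channel.findtext("title", default="")
    return title, len(channel.findall("item"))


# Hypothetical feed content; in practice this would be the body fetched
# from your own site's RSS URL.
sample_feed = """
<rss version="2.0">
  <channel>
    <title>Example site news</title>
    <link>http://www.example.com/</link>
    <description>Latest news</description>
    <item><title>First article</title><link>http://www.example.com/news/1</link></item>
    <item><title>Second article</title><link>http://www.example.com/news/2</link></item>
  </channel>
</rss>
""".strip()

title, item_count = summarize_rss(sample_feed)
print(title, item_count)
```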