
Is Google Broken?

By: Submissions / Press Releases
Thursday September 02 2004, 00:58:46
http://www.w3reports.com
Category: Search Engines




Between Aug 04, 2003 and Aug 25, 2003 (just 21 days), Google added a little over 1.2 billion Web pages to their index. But between Aug 25, 2003 and today, Google hasn't added a single Web page to their index (at least according to the count on Google's own home page).

Today, Google's home page states:

2004 Google - Searching 4,285,199,774 web pages

Now let's look at the history of Google's home page using "web.archive.org"
(they archive Web pages, which you can review online as they were back
when):

Aug 25, 2003

2003 Google - Searching 4,285,199,774 web pages

Now isn't this strange? The exact same number of Web pages one year ago as
today, and next week it will be the same. Let's go back and see when this
number was different:

Aug 04, 2003

2003 Google - Searching 3,083,324,652 web pages

So what does this mean? It means either Google is lying to us all or they
have been dropping as many pages as they have been adding them.

My guess is that on Aug 25, 2003 Google's index was full. Why do I say this?
Because Google's white papers were freely available to anyone. This meant
that you could access the actual documents published by Google's founders
before Google became public and get a glimpse of how Google was created.
According to these documents, Google was written in C and C++ using ANSI C
and Linux. The database was constructed using a Document_ID that is
associated with each Web page. This document_ID was published as being a
4-byte unsigned long integer. This means that for every single Web page
Google has in their index, an ID was created to identify this Web page. But
like everything, there is a limit: a 4-byte unsigned long integer can take
only 4,294,967,296 distinct values (0 through 4,294,967,295). So if no
changes were made to their database structure, it would mean Google has
probably reached this threshold, and as new pages are added, old pages are
removed (disappear). Quite alarming, isn't it?
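The arithmetic behind this guess can be checked with a few lines of Python. This is only a sketch: the page counts are the home-page figures quoted above, and the 32-bit width of the document ID is this article's assumption about Google's design, not a confirmed fact.

```python
# Illustrative arithmetic only. The page counts are the home-page figures
# quoted in the article; the 32-bit ID width is the article's assumption.
DOC_ID_BITS = 32
id_space = 2 ** DOC_ID_BITS      # number of distinct unsigned 32-bit values
max_id = id_space - 1            # largest value such an ID can hold

pages_aug_04_2003 = 3_083_324_652
pages_aug_25_2003 = 4_285_199_774

added = pages_aug_25_2003 - pages_aug_04_2003   # growth over the 21 days
headroom = id_space - pages_aug_25_2003         # IDs that would still be free

print(f"distinct IDs : {id_space:,}")
print(f"largest ID   : {max_id:,}")
print(f"pages added  : {added:,}")
print(f"IDs left     : {headroom:,}")
```

Under these assumptions fewer than ten million IDs would remain unused, which is tiny compared with the 1.2 billion pages added in the preceding three weeks.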

So Google may have a serious flaw in their database structure and design.
Google has used a 4-byte unsigned long integer to store the document ID
(one for every page in Google's index). In Linux (which is what Google
uses), this variable is 4 bytes long and has a maximum value of about 4.29
billion (4,294,967,295) before it rolls over to zero. This may also be one
of the reasons pages appear to be dropping from Google's index at an
alarming rate (tens of thousands of search results where I can prove this is
happening). They may have already run out of space, so that a document_ID is
no longer associated with the content stored in the database, which in turn
returns empty results for a particular URL.
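The rollover itself is easy to demonstrate. The sketch below imitates C's unsigned 32-bit arithmetic in Python; `next_doc_id` is a hypothetical allocator invented for illustration and says nothing about how Google actually assigns IDs.

```python
import ctypes

MAX_U32 = 0xFFFFFFFF  # 4,294,967,295, the largest unsigned 32-bit value


def next_doc_id(current: int) -> int:
    """Hypothetical ID allocator that mimics C unsigned 32-bit arithmetic."""
    return (current + 1) & MAX_U32


# One past the maximum wraps around to zero instead of growing.
print(next_doc_id(MAX_U32))
# ctypes.c_uint32 truncates the same way, with no overflow error raised.
print(ctypes.c_uint32(MAX_U32 + 1).value)
```

If an allocator like this wrapped silently, a new page would receive an ID that already belongs to an old page, which is one way the association between ID and content could break.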

Can this problem be corrected? Sure it can, but Google has 15,000+ Linux
servers and 4.2 billion document_IDs to convert. This is not going to be an
easy task at this point. Also every single word in their inverted index is
associated with a document ID so the conversion will probably take months if
not a great deal longer.

In addition to this major problem, there are other major flaws with Google.
One of these is with their PageRank algorithm.

According to a recent study, 75% of keyword searches on the Web are handled
by Google. First off, let me say that while Google may indeed handle 75% of
keyword searches, you also have to consider how many of these people looked
elsewhere as well. Yahoo! claims an Internet reach of over 80%, so they too
are handling these same requests, but probably delivering better, less
biased results.

The fact that Google returns currently "popular" pages at the top of search
results only proves Google is unfairly penalizing newly created pages that
are not yet "popular". While this statement may be an exaggeration, it does
contain an alarming bit of truth. To find a web page, many users go to
Google (or another search engine using Google's index, like AOL) and issue a
keyword query.


If the users cannot find relevant pages after several different keyword
queries, they are likely to give up and stop looking further. So a Web page
not indexed by Google or ranked poorly by Google (low PageRank - poor Google
popularity) will not likely be viewed by many users. And because of this, it
will never become popular, by Google's own admission.

While Google takes more than 100 different factors into account in
determining the final ranking of a Web page, the core of their ranking
algorithm is based on a metric called PageRank. PageRank is nothing more
than a "link popularity" metric, where a page is considered more important
if the page is linked by many other pages on the web that Google also
considers important (popular and already in Google's index). Google puts a
page at the top of search results when it contains the keywords the searcher
is looking for, or when those keywords appear in the anchor text of the pages
linking to it. The more popular the links pointing to this Web page, the more
popular this Web page will be. So the popular continue to get "more" popular
and the less fortunate ones that are new to the Web continue to be held back
from this popularity game Google is playing.
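To make the "link popularity" idea concrete, here is a minimal sketch of the textbook PageRank iteration on a five-page toy web. It is not Google's production code: the damping factor of 0.85, the iteration count, and the page names are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power iteration; `links` maps each page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal shares to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: everyone links to "established", nobody links to "newcomer".
toy_web = {
    "established": ["hub"],
    "hub": ["established"],
    "fan1": ["established"],
    "fan2": ["established"],
    "newcomer": ["established"],
}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page:12s} {score:.4f}")
```

In this toy web the page nobody links to never rises above the baseline score, no matter what its content is.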

It is important to understand the distinction between the "importance or
quality" of a Web page and the relevance of "popularity". What do you want,
I ask? Do you want to see the same old popular sites day in and day out, or
would you like to see relevant, content-rich, newly discovered Web pages? As
long as you continue to use Google you will be promoting this popularity
game and your competition will continue to rise above you, not to mention
that you will be missing out on the new stuff.

Since popular pages are repeatedly returned by Google as top results, they
are also the easiest for users to discover, which increases their popularity
even further. In contrast, a currently "unpopular" page is often not
returned by Google, so few new links will be created to the page, keeping
the page's ranking down. This "rich-get-richer" scheme can and does destroy
the quality of search results.
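The feedback loop can be illustrated with a deliberately crude simulation. Everything in it is an assumption made for the sake of the example: only the top three pages are ever shown, every shown page gains exactly one new link per round, and the starting link counts are invented.

```python
def simulate(link_counts, top_k=3, rounds=100):
    """Toy model: only the top_k most-linked pages are shown, and each shown
    page gains one new inbound link per round."""
    counts = dict(link_counts)
    for _ in range(rounds):
        # Rank by link count (ties broken alphabetically) and show the top_k.
        shown = sorted(counts, key=lambda page: (-counts[page], page))[:top_k]
        for page in shown:
            counts[page] += 1
    return counts


# Invented starting link counts; "new" is a freshly published page.
start = {"a": 50, "b": 40, "c": 30, "d": 2, "new": 0}
end = simulate(start)
print(end)
```

In this model the pages that start on top keep collecting links while the new page never gains one. Real users also find pages through other channels, so this shows the direction of the bias rather than its size.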

PageRank is an unfortunate algorithm for both users and Web page authors.
Useful information is being ignored by Google simply because a new page or
site has not had a chance to get noticed and, under the PageRank algorithm,
never will.

A recent article at motleyfool.com stated that 98% of Google's revenues come
from their advertisers. This would mostly consist of AdWords and AdSense.
But all it would take is for a firewall company, virus protection company,
AOL or Microsoft to simply create a Google ad blocker, and it would be the
end of Google overnight. These companies, as well as Google, already provide
pop-up and pop-under blockers, and writing a Google ad blocker would be even
simpler.

I have months of research to prove my statements.

Just my two cents for today!
Anthony Federico



Copyright © 2004-2007 Submissions / Press Releases. All Rights Reserved.

Comments

dzd wrote:
Do kindly post some of your 'months of research.'
09/02/04 16:38:47
Matt wrote:
You guys should look at this page/article in Safari. Amazing that a site for webmasters wouldn't be cross-platform/browser compatible.
09/03/04 16:10:46
James Velaquez wrote:
Reply to:Matt

Must be your computer, I am using Safari 1.2 and it works fine
09/03/04 16:54:46
Anthony Federico wrote:
What additional "research" do you need? Google clearly shows:

2004 Google - Searching 4,285,199,774 web pages

It is on their home page and has been for months now. Google has been boasting for the last couple of years about how "big" they are, so this statement on their home page is clearly important to them. What are they going to say, "I forgot to change it"? And if and when they do change this number, will that number also be a lie? What are we to believe now?

How does/did Google store DocID's?

http://dbpubs.stanford.edu:...

OR:

http://dbpubs.stanford.edu:...

In the link above you'll find the original publication for "The PageRank Citation Ranking: Bringing Order to the Web" by Page, Lawrence; Brin, Sergey; Motwani, Rajeev; Winograd, Terry. As stated in this publication:

We convert each URL into a unique integer, and store each hyperlink in a database using the integer IDs to identify pages.

So each document is an integer and....

Google was built on, and still uses, cheap Linux desktop machines (about 15,000 of them) and open source C and C++ as well as Python. These were, and most likely still are, 32-bit CPU machines. In effect you have 32 bits of data to play around with, and every document has a unique representation, its "DocID". Unfortunately, in an unsigned 32-bit integer you cannot represent fractions or numbers greater than 4294967295 (2^32 - 1).

This of course doesn't mean Google has not done something different since this publication and as stated above "So Google 'may' have a serious flaw in their database structure and design", but the history of total documents Google states as being available for search, makes one think that Google has indeed hit this limit.

Just do a search at Google on almost anything and I'm sure you'll find empty pages in the results. So you want some examples of pages that used to be in Google and still are, but for which Google now returns an empty entry as the result? (I am not affiliated with any of the sites I present below.) Try these Google searches:

site:Cre8pc.com

http://www.cre8pc.com/blog/...
Similar pages

http://www.cre8pc.com/blog/...
Similar pages

Do these pages not exist? Are they not indeed in Google's index? Do they not have a TITLE and content? Of course they do. Another search example:

site:www.liberty72.com

http://www.liberty72.com/L7...
Similar pages

http://www.liberty72.com/L7...
Similar pages

This of course shows just a few of the empty pages Google continues to display, and your searches will uncover many more. It does not prove that these appear empty in Google's SERPs because of an integer limit. But it does show Google is broken, and the majority of these broken links appeared around the time Google hit the 4 billion mark.

I don't think you need me to explain the faults of PageRank or how to stop Google Adsense and Adwords from showing up in your browser as you surf the Web.
09/03/04 22:24:30
arius wrote:
That is one scary article.
It's like Google could be going through its own private y2k crisis.

I think PR and link popularity is working against Google sometimes.

They should never have published how they define relevancy.

Now the algorithm is subject to abuse. I guess it's the fault of the Patent Office. Who told them to publish their registry on the internet?

All this link exchanging is just flooding the internet with mediocre content that is well promoted.

One solution would be just to add a second integer column to their database that will index all the pages on a site and associate it with a site_id.

There can't be 4 billion domains on the internet yet. And I bet no site has 4 billion pages.

The original document_id column should be the site_id.
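A rough sketch of that two-column idea follows. It assumes each part is its own unsigned 32-bit integer; `pack_doc_key` and `unpack_doc_key` are hypothetical names made up here and do not describe any real schema.

```python
import struct


def pack_doc_key(site_id: int, page_id: int) -> bytes:
    """Pack two unsigned 32-bit integers into one 8-byte composite key.

    struct.pack raises struct.error if either value does not fit in 32 bits.
    """
    return struct.pack(">II", site_id, page_id)


def unpack_doc_key(key: bytes):
    """Recover (site_id, page_id) from an 8-byte composite key."""
    return struct.unpack(">II", key)


key = pack_doc_key(1234, 56)        # page 56 of site 1234
print(key.hex(), unpack_doc_key(key))
print(f"addressable documents: {2 ** 32 * 2 ** 32:,}")
```

The trade-off is that every posting in the inverted index would then carry 8 bytes per document reference instead of 4.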

What do you think?
Reid Technologies Inc.
09/04/04 07:28:25
Anthony Federico wrote:
Google is severely broken indeed!

One HUGE problem is that Google's DocID reference to the page content is often lost in the cluster. You can find these broken links on almost any query in Google. I'm going to assume your site is reidtechnologies.ca. So let's check Google.com and see what they have lost of yours:

site:reidtechnologies.ca

reidtechnologies.ca/html/internet_info.php
Similar pages

Not too bad yet... But at the end of these results please click on Google's option for "repeat the search with the omitted results included." and see what we find:

reidtechnologies.ca/html/contact_us.html
Similar pages

reidtechnologies.ca/index.php
Similar pages

No TITLEs or excerpts, but yet the pages are there and they contain these. Google's links are broken... Although your site is not hurt too badly at least at this point in the Google corruption.

I'm sure if you review your Google bot visits you'll find Google continues to visit these empty pages, but never seems to include their content in search results. The content is in their barrels, but the association between the DocIDs and the point of where the content can be found in the barrel is broken.

In addition your Google PageRank also crumbles because of the lost DocID association. Since the reference to the content is gone, the PR is also gone. People linking to any of these pages will not help the PR or even recover the page within Google's index. All you can do is wait and one day you may see the pages back in Google's index. But.... just wait a little longer and you'll find the pages missing again and I can almost guarantee that.

If you have an old copy of Google search results lying around on your desktop in which your empty pages above are shown, you'll be able to click on the "Cache" link and you'll find that Google still has these pages in their cache index. They will gladly retrieve the content and present it in full glory. So this proves Google has the content, but can no longer establish the DocID link to the content. Why does it work with the old search result? Because the MD5 checksum of the DocID is still valid and the content still exists.
09/04/04 11:04:24
Daniel Brandt wrote:
I have written two articles on this very topic.

June 9, 2003: "Is Google broken?" at http://www.google-watch.org...

August 29, 2004: "Google is dying" at http://www.google-watch.org...

I'm happy to finally have some company on this topic. For 15 months now I have seen nothing but insults and totally mistaken technical arguments from Google cultists and wannabe geeks telling me that I have no idea what I'm writing about.

Anyone with more evidence on this topic can get in touch with me at Google Watch.
09/04/04 13:50:18
Panivino wrote:
Interesting!

Although the 32 bit limit has been mentioned in the past, you introduce another dimension which is the possibility that Google, in spite of their heavily hammered philosophy (don't be evil), could lie.

As a non English searcher I see some "surprising" statements on Google's localized sites.
Just an example: stemming has been noisily introduced as a great improvement which allows selecting some additional highly relevant pages otherwise ignored.

Curiously enough in all non English local versions I use Google systematically explains that it does not stem in order to produce better results.

E.G.: http://www.google.com.br/in...
"Para conseguir resultados mais precisos, o Google não utiliza radicais de palavras nem caracteres curingas..."
Basically: In order to get sharper results, Google does not use stemming.

- or stemming is not good in itself (poor American users),
- or Google is unable to stem in other languages than English but prefers letting non English speaking users think that stemming is bad (we can't help feeling they consider us as idiots).

"Stemming in your language is not (yet) available" would be less confusing and more sincere.

Many examples questioning sincerity can be found, but this is not your point.

The 4,285,199,774 pages figure has been displayed for many months on any local version I use, just as if it was part of the logo.
Whether .com, .com.br, .es, .de, .fr, .it, .se, .co.uk, etc. it's always the same figure.

When searching local versions with the default option (the entire web) the number of hits returned may slightly vary (data bases cannot be exactly synchronized) but is updated each time, unlike the figure Google produces to promote its might.

So let's take it as a safe limit Google would have set for their index.

As Google discovers new pages every while, your statement makes sense: to display new pages they would have to push out older pages.

If this was true I suppose Google would have exclusion rules, for instance a dramatic criterion such as PR, blindly unrelated to the topic popularity itself, which seems hardly possible as it would end with a 'what's hot today?' search engine.

Google keeps incredible amounts of pointless pages just created for the sake of spamming it and probably making some click through business (including Adsense), while content rich and very focussed pages sometimes disappear.

If Google's capacity to identify links had a top, why would they keep so many duplicates in their index at different domains?

Beyond content duplication, Google is the only engine which can afford displaying aliases (http://domain and http://www.domain) for those sites which deliver on both paths.

Furthermore, I could (from time to time) observe a phenomenon I can't explain: Google is able to display two versions of the same url, one can be found on the corresponding web site but the other one is really looking like an older version (different copyright year for instance).

Would a search engine near the limit of its index capacity accumulate pages that don't exist anymore, broken links, different versions of the same url and the like?
Would it eradicate pages with hundreds and even thousands of inbound links and keep tons of pages from totally unpopular sites?

Do you have a technically reasonable explanation that would not ruin your 32 bit theory?

You point out a phenomenon you call 'empty pages'.
Last February I saw so many that I played a little with single word queries. I was really impressed! I could find words with up to 23 hits in the top 100 displaying only the url. 7 to 12% was absolutely common.

I thought these were urls awaiting crawling that Google displayed to show off before the IPO.
In my daily searches I also noticed that long standing pages had disappeared totally, i.e. not found with their url as query or the 'site:' search. I tracked some urls and could see some pages back after some weeks or months, then some disappeared again, others became what you call 'empty pages' and others replaced with an older version.

As a matter of fact I noticed this phenomenon by the end of last year, and it spread very quickly, to sites of any size.

Your input concerning saved search results is just great, I made a few tests.

So your suggestion is that certain urls lose their "DocID" and from there whichever thing happens, creating a chaotic situation.

Could this explain the different versions of a same url I could observe in a same result page?

I agree with you upon the bias introduced by a self fed popularity factor and link manipulation, and upon the vulnerability of a revenue based on a script which other parties could block like a vulgar pop-up.

Sorry for this long comment, but I don't come across an interesting article everyday. And if you could provide some explanations to the strange behaviors I mentioned, I would be a more than happy reader.
09/04/04 15:47:13
Anthony Federico wrote:
Reply to Panivino part 1:

Panivino, you bring up some additional Google problems that I'm afraid I simply can not answer in the time I presently have. I will however give you my thoughts on some of these below.

Panivino wrote:
it would end with a 'what's hot today?' search engine.

Well that's what Google is and they freely make this clear:

http://www.google.com/webma...

"if no other site links to yours, it may be difficult for our crawler to find you. Conversely, if many sites link to your page, there is a good chance we will find you"

and:

"If we have not picked up your site and it has been several months, then it is likely that our spiders are not able to find your site. If you increase the links pointing to the page, Google will likely find your site in the future."

So even though Google says to NOT request links in order to increase your PageRank, they also state that increasing your Links may be the only way to be included in Google.

So if you are a new site with tons of useful information, chances are you won't be found in Google until someone else links to you who are already in Google's index. Now that's what makes a great search engine right?

Panivino wrote:
"If Google's capacity to identify links had a top, why would they keep so many duplicates in their index at different domains?"

This was not on purpose, but instead is part of their design and implementation. To figure out how and why duplications appear, you have to go back and see how Google was designed. One thing we have to consider is the publication "Dynamic Data Mining: Exploring Large Rule Spaces by Sampling." (by Founders Sergey Brin and Lawrence Page):

http://dbpubs.stanford.edu:...

and the actual crawling process "Efficient Crawling Through URL Ordering" (by Junghoo Cho, Hector Garcia-Molina and Co-Founder Lawrence Page):

http://dbpubs.stanford.edu:...

As stated in the publication:

"it is important for the crawler to visit "important" pages first, so that the fraction of the Web that is visited (and kept up to date) is more meaningful."

So basically only a "fraction" of the index is kept up to date and only the "popular" pages get refreshed often. In addition, new pages are discovered during these crawls and also placed in the queue. This method of crawling can cause chaos simply because a single Web page can often be found in many different ways. Example:

http://123.456.789.012/
http://www.somedomain.com/
http://somedomain.com/

Let's take this example and see how Google may discover and index these.

Google starts out possibly finding "http://123.456.789.012/" first. It does not matter here how Google discovered this page; all that matters is that Google has and is about to visit the page. Google now visits this page and indexes it. Days, weeks or even months may go by, and Google now discovers "http://www.somedomain.com/". By the time Google visits this page, the author has made some text changes. Maybe something as small as a copyright year in the footer of the page. An MD5 checksum of this page does NOT find that this is a clone or duplication of "http://123.456.789.012/", simply because the content is now different. And because only "popular" pages are revisited often, "http://123.456.789.012/" may not be re-indexed or crawled for months or even years.

Next Google discovers the "http://somedomain.com/", but as in our first example, the author made some text changes. Because of this Google again does not find that this is a duplication of either of the other two it has already indexed. This now causes their index to store three different versions of the same page. And if you continue to make changes, you may never find their index cleaning up and removing the duplications.
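The weakness of exact checksums as a duplicate test is easy to show. In the sketch below, the three HTML snippets are invented stand-ins for the three crawls described above, and grouping URLs by MD5 digest is only an assumption about how such a check might work.

```python
import hashlib


def checksum(html: str) -> str:
    """MD5 digest of a page body, as an exact-duplicate fingerprint."""
    return hashlib.md5(html.encode("utf-8")).hexdigest()


# Invented page bodies: the same page, fetched at three different times.
crawls = [
    ("http://123.456.789.012/", "<html><body>Welcome!<p>Copyright 2002</p></body></html>"),
    ("http://www.somedomain.com/", "<html><body>Welcome!<p>Copyright 2003</p></body></html>"),
    ("http://somedomain.com/", "<html><body>Welcome!<p>Copyright 2004</p></body></html>"),
]

# Group URLs by checksum; exact duplicates would share a group.
groups = {}
for url, html in crawls:
    groups.setdefault(checksum(html), []).append(url)

for digest, urls in groups.items():
    print(digest, urls)
print("distinct versions stored:", len(groups))
```

A one-character change in the footer produces a completely different digest, so all three URLs are stored as separate documents. Catching such near-duplicates needs a similarity measure rather than an exact checksum.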

This problem can still exist even if you never make changes to the Web page. Why? Google could easily consider two of the pages above as clones. It will then decide based on PageRank and content computations, which is the original page and instead deliver that particular page in the results. And because Google does NOT actually delete duplicate content, all three URLs, while really the same, are still in Google's index and only the one with the highest PageRank ever gets re-visited.
09/05/04 14:58:34
Anthony Federico wrote:
Reply to Panivino part 2:

There is also the issue of an IP change which I could go on forever discussing, but instead I will only briefly discuss this here.

Often people will register their domain and first be parked at the registrar's site. Often, as a free service, these registrars will submit your home page to various search engines. Then the author moves the domain to a hosting company, which of course has a different IP than that of the registrar. This opens a whole new can of worms that I call "the IP/domain bug", which appears to cause the dropping of pages from the domain because the association of the IP to the domain is broken. The problem is that the OLD IP still exists and therefore is not considered dead, but the host at that IP can no longer serve the domain because it has moved. This in turn returns an error to the Google crawler. But this error is not your standard "404 Page Not Found" error (after which Google would remove the page from their index). Instead many servers will simply return the "302 Moved Temporarily" header, which means:

"Moved Temporarily - The server did not fulfill the request because the URI has temporarily changed. Along with this code, the server sends a Location header to indicate the temporary URI of the requested document. The client continues to use the old URI in future requests."

So Google continues to use the "old URI" for future requests and may never actually remove the URI from their index. Now if instead these servers returned a "301 Moved Permanently" header which means:

"Moved Permanently - The server did not fulfill the request because the URI no longer exists. Along with this code, the server sends a Location header that indicates the requested document's new URI. The client directs all future requests to the new URI."

Google will then replace the URI they have to the new location which will refresh their index (at least that's what we hope will happen).
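The difference between the three responses can be sketched as a small piece of crawler bookkeeping. `update_index` is a hypothetical helper written for this illustration; it follows the status code definitions quoted above and is not a description of how Google's crawler actually behaves.

```python
from http import HTTPStatus


def update_index(index: set, url: str, status: int, location: str = None) -> set:
    """Hypothetical bookkeeping for one crawl response."""
    if status == HTTPStatus.NOT_FOUND:
        # 404: the page is gone, so drop it from the index.
        index.discard(url)
    elif status == HTTPStatus.MOVED_PERMANENTLY and location:
        # 301: replace the old URI with the new one.
        index.discard(url)
        index.add(location)
    elif status == HTTPStatus.FOUND:
        # 302: the move is temporary, so the old URI stays in the index.
        pass
    return index


index = {"http://www.somedomain.com/"}
update_index(index, "http://www.somedomain.com/", 302, "http://parking.example/")
print(index)   # the old URI is still there
```

With a 302 the stale entry survives every crawl, which matches the behavior described above.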

Sometimes, however, the old server (a name-based Web server) will still have the old domain configured and/or the old Web pages still available. In this case the URI and content will never be updated.

Sometimes people move their site to a different hosting provider. What often happens is that the old hosting provider never removed the domain name from their DNS servers. This again can cause serious problems with Google. Search engines like Google will often store the IP address of the domain name (possibly a SiteID) when it discovers it. So taking an example from above, Google may first discover this URL:

http://www.somedomain.com/

And Google stores and associates the IP "123.456.789.012" with this domain. Why might they do this? To save on DNS queries, which are the bottleneck of running a Web crawler of this size. What Google may do for future requests is first connect to the IP (which is much more efficient than doing a domain name lookup) and, once connected, make a request for the host name (domain name) and path. Let's see how this type of request might be made using telnet:

telnet 123.456.789.012 80
GET / HTTP/1.0
HOST: somedomain.com

Now if this server still contains the old domain in its name-based system (virtual host table), the server will try to resolve this request. But because the domain and path no longer reside on this server, the server or Google simply times out and Google moves on. This does NOT cause Google to remove this URI from their index. Instead it figures the server was down, overloaded or some other problem occurred, and the old content continues to be in Google's index.
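The same request as the telnet session above can be sketched in Python. The function names, the port, and the timeout are assumptions made for illustration; note also that "123.456.789.012" is only a placeholder and not a valid IPv4 address, so a real address would have to be supplied to `fetch_from_ip`.

```python
import socket


def build_request(host: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.0 request; the Host header selects the virtual host."""
    return f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")


def fetch_from_ip(ip: str, host: str, path: str = "/", timeout: float = 10.0) -> bytes:
    """Connect to a stored IP directly (no DNS lookup) and ask for `host`.

    A timeout is raised as an exception, which a crawler might treat as
    "server down, try again later" rather than "page gone".
    """
    with socket.create_connection((ip, 80), timeout=timeout) as sock:
        sock.sendall(build_request(host, path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


# Show the bytes that would go over the wire, without touching the network.
print(build_request("somedomain.com").decode("ascii"))
```

Because the connection goes to a cached IP, a domain that has moved to another host is never noticed by this kind of request.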

There is also the possibility of another reason for these dropped pages. This reason could be driven by the need to show profit as they moved to go public. As you may or may not know, a search engine without additional products and services cannot survive and certainly won't show huge profitability. This has been proven over the years.

Now if the founders of Google are to "get rich", they need to diversify. This became more important to Google when Yahoo! announced they were no longer going to retrieve search results from Google, and MSN and others were going to go head-to-head with them. For Google to go public, they needed to show profit to their future investors. And one way of doing this is to develop a revenue-based AdWords and AdSense system and, to help make this profitable, return poor filtered results. So you don't believe that this could be a motive? Well, let's see what the founders of Google have to say about this very subject (continued in part 3):
09/05/04 15:00:51
Anthony Federico wrote:
Reply to Panivino part 3 final:

In a paper published by Google Co-Founders Sergey Brin and Lawrence Page called "The Anatomy of a Large-Scale Hypertextual Web Search Engine":

http://dbpubs.stanford.edu:...

You'll find some interesting reading:

8 Appendix A: Advertising and Mixed Motives

"Furthermore, advertising income often provides an incentive to provide poor quality search results. For example, we noticed a major search engine would not return a large airline's homepage when the airline's name was given as a query. It so happened that the airline had placed an expensive ad, linked to the query that was its name. A better search engine would not have required this ad, and possibly resulted in the loss of the revenue from the airline to the search engine. In general, it could be argued from the consumer point of view that the better the search engine is, the fewer advertisements will be needed for the consumer to find what they want. This of course erodes the advertising supported business model of the existing search engines. However, there will always be money from advertisers who want a customer to switch products, or have something that is genuinely new. But we believe the issue of advertising causes enough mixed incentives that it is crucial to have a competitive search engine that is transparent and in the academic realm."

What I found most interesting about this comment is the statement "But we believe the issue of advertising causes enough mixed incentives that it is crucial to have a competitive search engine that is transparent and in the academic realm.". And the comment "from the consumer point of view that the better the search engine is, the fewer advertisements will be needed for the consumer to find what they want.".

So then how could it be that their advertising revenue has risen so dramatically if Google always returns top relevant results? It's very possible that the dropping of URLs has something to do with this dramatic increase in revenues. You be the judge... Either way you look at it, I feel Google should be required to publicly address these problems and tell us the real reason behind these issues.
09/05/04 15:01:48
Panivino wrote:
Reply to:Anthony Federico


Part 1 of your answer

First, thank you very much for having taken the time to answer side questions in such a detailed manner.

From a user point of view I also agree that there is something ambiguous in Google's position concerning links which on one side are a cruel necessity for a site to be seen and on the other side should only result from a spontaneous action (I like this page, have a look at it).

On most broad commercial topics I search, link manipulation often stinks.
On the other hand very narrow topics, with rather uncommon words, often show pages that use 'natural' text for links.

When being on the first result page is so critical to many, it seems hardly possible site owners gently wait for others to link to them just to please Google, especially if they fear the kind of text generally used in spontaneous linking (XYZ Company, click here, more info etc.) won't be of much help to reinforce their topic relevancy.

Then you provide a wonderful explanation to a phenomenon I saw well before an index size limit could exist (spring 2001 - spring 2003):
a same web site appearing in the top ten under its IP address and as a totally different site under its domain name.

From time to time I had seen sites listed under their IP address, and I had assumed that there was no domain name at this address, as in the old times. But when I saw the same site listed under its IP, with cluster (2 results), then under its domain name, with cluster too, I had no explanation for that.

I could even see in early 2003, in a single occasion and for a couple of months only, a site hitting six times in Google top ten, twice with its IP, twice as http://domain and twice as http://www.domain.
I first thought it was a clever trick, but when visiting the pages I could find absolutely nothing strange, rather long pages and no effort to repeat keywords. The topic was very little competitive and this additional visibility absolutely not necessary.

Now that you have provided this bright explanation, I am convinced it is a flaw, and I am just afraid we could get even more spam if unscrupulous site owners became aware of it.

Anybody could open a server, promote content by linking to the IP address, then after a certain while associate a domain name to this IP address and link to it as http://domain and finally when both appear in results, promote http://www.domain in order to get up to 6 hits before Google realizes anything.
09/06/04 17:59:52
Panivino wrote:
Reply to:Anthony Federico


Part 2 of your answer

This one is just frightening!

And unfortunately it perfectly explains the 'old version / current version' dance I could observe many times (on Google only) in my repeated searches for given products.

So it would be enough for a site to move from one hosting service to another, or even from one server to a new one, for it to have visibility problems with Google.

Even if you restrict the causality to DNS servers or virtual host tables not updated, or not updated in a manner Google can understand (a clear 'nothing here' if I understood your point), this leaves a lot of room for terrible wanderings due to Google's foundation: linking.

Google indexed the inbound links to a page before an IP change occurs. After the IP change it cannot find the page at the old IP address, but still considers there is something there, later it discovers the same page or a newer version at the new address and does not know that it is linked to. This page is displayed for some time and then replaced by the "old" one which would have a better PR.

The same kind of confusion could occur for outbound links.

So each new PR computation could become a bad joke.
Pages indexed at the old IP would end up losing their PR, and pages at the new IP might never acquire PR because Google remembers inbound links as links to the old IP.

If a significant proportion of web sites were moving from an IP to another, let's say 15% each year, there would be a permanent loss in the linking map. Relevancy as per the algorithm they use would be at permanent risk. Am I correct?

Maybe pages could then be dropped because, as Google understands it, they would have lost all interest?

Finally the only bad neighborhood (a term I saw in some articles used to explain why pages were castigated) would be linking to and/or to be linked from web sites changing their IP address (under the restrictions you explain).

This really sounds scary! (and would explain better than mere crawling frequency why results change more deeply on Google than on any other search engine, especially if in addition DocIDs are lost due to limit).

And I didn't mention the case where Google really has a filter to punish exact duplicate content. The same page moved to a new IP (no duplication in fact) would "compete" with the non-deleted record for the old IP and one of them could be castigated.

In the past two years when searching in non English western languages, I saw in Google results a multiplication of pseudo directories only going out to generate at any cost click through traffic. This is extremely annoying because those people trigger almost any concept, not only very competitive ones such as travel for instance.

I saw that automated pseudo directories by reproducing Google's results most often kill the pages they target.
I saw that some well established pages I used to find for months or years disappear while a brutal copy by thieves appears and sits where the original was.

As you so kindly answered all comments posted here, I can't resist asking a new question:
since it is possible to give a page a topic from a distance (the funny or less funny Google bombs), would it be possible to steal relevancy by just linking to relevant sites, or widely including content of other sites to get them out and would it be possible to "acquire" PR by just acting as a thief in some manner?
09/06/04 18:08:21
Panivino wrote:
Reply to:Anthony Federico


Part 3 of your answer

I easily admit that a search engine must be supported by a profitable activity. For me a fee based search service would make sense, but I understand it would probably not cover expenses nor allow huge profits.

Once again I must acknowledge that GoogleAds look more and more relevant and finally, the worse the results the better the ads look, which is also true on any other search engine because advertisers cannot afford being irrelevant.

As a non English speaking user I can tell you Google made much more efforts in language management for their ads than for their search engine. They track your IP (country), your interface language (a hint upon your preferred language) and your query language and they push you rather focussed stuff in one or more languages at a time, not defaulting systematically to English as it was initially.
The least one can say is that they didn't spare efforts to make ads irresistible.

Your point is humorous. Empty hits, as you call them, do not look good at all and can favor ads reading, just like heavy spam and unrelated hits pushed at top by popularity in spite of absurd distance between keywords.
All this could perfectly explain the dramatic increase in revenue, unless people prefer classified ads rather than the result of powerful theories implemented by angelic big brains.

I read Google's recommendations to webmasters that you pointed to and found them rather pathetic.

"Don't be evil, don't turn style sheets off, don't disable frames, don't turn javascript off!" should be added to their recommendations to users.
This would spare us from seeing invisible huge text, coming across extreme keyword stuffing, being smartly redirected before we could read convincing stuff, and being stupidly unable to access GoogleAds.

Well, I would like to thank you again for this in depth article and for your further comments that a simple search engine user finds extremely consistent, totally compatible with your 'reached limit' theory, and absolutely terrifying because beyond a simply defective implementation remains the possibility of a collective illusion.

Now I will pay more attention to details when I feel uncomfortable with results and of course I wish your article could contribute to make users more demanding, search engines more reliable and Google less ambiguous.
09/06/04 18:13:23
Anthony Federico wrote:
Before I comment further I would like to mention that it appears Google may be watching. A couple of pages that I have quoted above as being missing in Google's index (for several weeks at least) are now strangely reappearing. But that's OK because you all know how to uncover these empty pages on your own and there are tens of thousands of them I'm sure.

Panivino wrote:
"As you so kindly answered all comments posted here, I can't resist asking a new question:
since it is possible to give a page a topic from a distance (the funny or less funny Google bombs), would it be possible to steal relevancy by just linking to relevant sites, or widely including content of other sites to get them out and would it be possible to "acquire" PR by just acting as a thief in some manner?"

I'm glad you asked that question because it is easy to steal PR right from under your competition. To see how this can be done, let's go to Google.com and enter the following search term:

http://www.yav.com/MTM/MTM0...

(direct link is) http://www.google.com/searc...

The above will return:

Meta Tag Manager - Online Help
Meta Tag Manager® 2.0 Main window - Online Help. ...

But if you look real close, you'll find the actual URL of the returned result is NOT what we asked Google to fetch. The URL for this returned listing is actually from a completely different site:

http://www.xs4all.nl/~yavel...

But Google goes on to say:

Show Google's cache of http://www.yav.com/MTM/MTM0...
Find web pages that are similar to http://www.yav.com/MTM/MTM0...
Find web pages that link to http://www.yav.com/MTM/MTM0...
Find web pages that contain the term "http://www.yav.com/MTM/MTM0..."

So Google associates "http://www.xs4all.nl/~yavel..." as being "http://www.yav.com/MTM/MTM0..." simply because xs4all.nl duplicated their page.

Now to prove that this page was stolen (not that xs4all.nl didn't have every right to copy this page), let's search Google on the site:

site:www.yav.com

(direct link is) http://www.google.com/searc...

And you won't find the "http://www.yav.com/MTM/MTM0..." page anywhere within Google's results. But I can guarantee you that in Feb 2004 (I have a snapshot of this result) this URL was indeed in Google's index and the URL still exists today. In fact this page was created back in 1999 (see copyright footer). Google simply turned the page over to xs4all.nl because they copied it and Google gave them all the PR that went along with it. This in turn boosted the overall PR of xs4all.nl. The reason? Google simply decided that xs4all.nl was the original publisher of this content and as you can see, Google now associates the "http://www.yav.com/MTM/MTM0..." as being "http://www.xs4all.nl/~yavel...". So the xs4all.nl site inherited the PR from this site, making xs4all.nl even more popular in Google's results.

This of course is only one example of many that I have.

I could go on forever with examples demonstrating the many flaws of Google, but until Google publicly addresses these problems, you and I will never know the real reasons behind the Google madness.
09/07/04 01:07:27
John wrote:
I sit here reading all this and I have a hard time believing Google would be so stupid as to run out of space with this 4-byte integer thing. Although at the same time, how could a site like Google continue to display 4,285,199,774 on their home page for several months now without a change? What I find really compelling about your article is that Google was able to find the time to create a new Google logo every single day during the Olympics. I guess if they can make a stupid mistake like this, they can certainly have a problem with running out of space.

I also want to add that all the information you presented here is excellent. All the Google bugs you show make sense and I have tested my own site and found the same kinds of problems. I have pages that were created back in 1998 that are content rich with NO keyword hammering or any sign of spam that were always in Google's index, and since about February of this year Google, for no reason at all, only shows the URLs in the results pages. I have contacted Google several times and they just give the same canned response stating that my pages have not been fully fetched. But this is complete nonsense because for 6 years they were in Google and now they are empty pages.

I also check my server log files daily and Google visits these pages sometimes twice in one day and they have done so for 6 months. It's not a firewall problem or something where we are blocking Google, because Google seems to find and index several other pages on the site. These pages have no duplicates either. Google is just really messed up and I thank you for exposing these problems to the masses.
09/07/04 03:05:10
thePhysicist wrote:
another explanation: they forgot to switch the display over to the new index, which uses 64bit docids and the information is still coming from some remnant old index that doesn't get updated anymore...
09/07/04 05:52:55
sz wrote:
"So a Web page not indexed by Google or ranked poorly by Google (low PageRank - poor Google
popularity) will not likely be viewed by many users."

now this is true to a certain extent. hits thru search engines can add up. although that really depends on the content / purpose of a site. now, in most cases it's not that if it's not googleble it doesn't "exist". there are more ways to attract visitors to a site than just search engines.
09/07/04 06:21:56
thePhysicist wrote:
Reply to:Anthony Federico

there is a subtle misconception in your analysis. when you ask google to search for "http://www.yav.com/MTM/MTM0..." it actually searches for "www", "yav", "com", "MTM" and "MTM09". and if you scrutinize google's output, it just re-displays your search term and links it to the pages it found, which happen to lie on xs4all.nl
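A rough sketch of the keyword-splitting idea described above, in Python. Splitting on every non-alphanumeric character is an assumption made for illustration (Google's real tokenizer is not public), and `tokenize_query` is a hypothetical helper name. The path used is the "MTM/MTM09.html" one quoted in this thread.

```python
import re

def tokenize_query(query: str) -> list[str]:
    """Split a query into alphanumeric terms, the way a plain keyword
    search might treat it (an assumption, not Google's actual logic)."""
    return re.findall(r"[A-Za-z0-9]+", query)

# The host and path below are the ones quoted in this thread.
terms = tokenize_query("www.yav.com/MTM/MTM09.html")
print(terms)  # ['www', 'yav', 'com', 'MTM', 'MTM09', 'html']
```

Under this assumption, any page containing those terms could match, whichever host it lives on.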

if you search for just "MTM/MTM09.html", you get both sites, xs4all.nl and yav.com.

what happened in this case is that the pages were initially hosted on xs4all.nl/~yavelow and they later purchased a separate domain and moved the pages to a different server, but still within xs4all's ip-range.

so, nothing fishy here...
09/07/04 06:34:39
sjb wrote:
Nuke 'em all.
09/07/04 09:23:52
thePhysicist wrote:
some alternative answers to some of the questions that arose above:

"In last February I saw so many that I played a little with single word queries. I was really impressed! I could find words with up to 23 hits in the top 100 only displaying the url. 7 to 12% was absolutely common."

Could be that the webmaster has set "no-cache", which google adheres to. Fluctuations can be explained by webmasters fiddling with robots.txt or meta-tags; google could have had a sw-bug somewhere in dealing with it. Without a concrete example it's hard to judge.

"So then how could it be that their advertising revenue has risen so dramatically if Google always returns top relevant results? It's very possible that the dropping of URLs could easily have something to do with this dramatic increase in revenues. You be the judge..."

Simple. It's much easier to buy top adword ranks than to optimize your website so it can compete with its peers... :o)

But I concur: when existing pages drop from the cache, then there is something bad going on there...
09/07/04 10:26:38
Kris Driessen wrote:
I added a page to my website 9/5/04 and it is indexed already.
09/07/04 10:55:47
JM UK wrote:
We have created sites and submitted them last month, two of them were indexed and added to Google listings within a week !!!
09/07/04 12:10:35
Anthony Federico wrote:
thePhysicist wrote:
so, nothing fishy here..

You've got to be kidding right? Google does not represent this as a "keyword" search. They clearly represent this as being a URL. Just like my example shows:

Google can show you the following information for this URL:
Show Google's cache of http://www.yav.com/MTM/MTM0...
Find web pages that are similar to http://www.yav.com/MTM/MTM0...
Find web pages that link to http://www.yav.com/MTM/MTM0...
Find web pages that contain the term "http://www.yav.com/MTM/MTM0..."

Am I missing something here? Does Google not give us a cache to this URL (which is clearly not correct)? Does Google also not let you "Find web pages that link to" this URL? Maybe I'm missing something. I didn't realize Google often gave us a cache result for all the keywords we enter or gives us a search option for pages that link to our keywords?

So if your theory were correct and Google was searching for keywords rather than the URL, then why when doing a search on this term:

http://www.sex.com/

Google doesn't display a billion URLs? You can't tell me the keywords www, sex and com are not found on millions of other pages and URLs.

The point here is that Google associated the yav.com/MTM/MTM09.html page as being that of xs4all.nl/~yavelow/MTM/MTM09.html. The real point here is how this search of the .com page produces the .nl site. You can't say that Google is searching on the keywords entered in the query because if this were so, it should certainly have produced the .com URL instead, don't you think? Wouldn't the .com site be more relevant in this regard?

You also have to look at the real results at Google using real keyword searching and see if you can find both of these two pages. For example, let's enter a search term that would clearly represent both pages:

Meta Tag Manager - Online Help
(direct link to google results) http://www.google.com/searc...

Google will allow you to search through 777 of about 159,000 pages. But you won't find the yav.com/MTM/MTM09.html page among these results. The ONLY way you'll find this page is if you go to the last page in Google's results and click on the "repeat the search with the omitted results included.". When you do this, the yav.com/MTM/MTM09.html page is listed in the #2 spot behind the xs4all.nl/~yavelow/MTM/MTM09.html result.

Of course this is nothing new, and other search engines that filter clones work similarly, but it does mean the yav.com/MTM/MTM09.html page is hidden and would not likely be found through Google, simply because the average user is not going to click through to the last page of results.

The other point here is that Google made the decision that the xs4all.nl/~yavelow/MTM/MTM09.html page would be the one they displayed in search results. Now this may have been the correct choice this time I can't say for sure, but what if it went the other way and the page was actually stolen and the site stealing the content now is listed in the search results instead and your pages were dropped?

thePhysicist wrote:
you search for just "MTM/MTM09.html", you get both sites, xs4all.nl and yav.com.

The only reason this happens is because there are only two results and this search clearly is not for a particular URL like my example. But if you were searching like a real human for content on the page, you'll never see the yav.com/MTM/MTM09.html and instead, Google associates the xs4all.nl/~yavelow/MTM/MTM09.html page as that of yav.com.

The PR theft example I present may not have been a good choice simply because these two sites could be one and the same. But the method of stealing PR works exactly the same. People do this every day in Google simply because it is so easy to do.
09/07/04 12:22:40
Vi wrote:
This is all nonsense. Think of SPAM. Do you know anything better than PR to keep the spam out? Let me know if you do. 2 billion pages out of Google's 4 billion are just trash anyway. Replacing the trash sounds like a good policy to me.
09/07/04 12:30:53
Anthony Federico wrote:
Kris Driessen wrote:
I added a page to my website 9/5/04 and it is indexed already.

JM UK wrote:
We have created sites and submitted them last month, two of them were indexed and added to Google listings within a week !!!

But how could this be? According to Google since late last year:

Google - Searching 4,285,199,774 web pages

So if your Web pages were added which I'm sure they were, I can only wonder who was bumped out?

I never said new pages were not added to the Google index. What I said is that for a year now Google claims you are Searching 4,285,199,774 web pages at their site. I also speculate that Google has hit the 4.2 billion record limit and that each new page added only removes another to make room.

A search engine that claims to be the biggest would certainly be telling us as they have in the past. Now I could see if they said something similar to their search results like "about 4 billion", but that's not what they are saying. They are very clear on this figure. It is exactly 4,285,199,774 web pages and it has been this figure for about a year. It is not 4,285,199,775 web pages or even 4,285,199,776 web pages. It is exactly 4,285,199,774 web pages.

Does this not mean anything to you? Like I said, either Google is lying to us all, or possibly they hit a 4.2 billion limit and they aren't lying to us at all. Or they just "forgot", but as John said above, how in the world could they find the time to update their Google logo daily during the two weeks of the Olympics?
09/07/04 12:33:43
Panivino wrote:
Reply to:thePhysicist


thePhysicist wrote:
"Could be that the webmaster has set "no-cache", which google adheres to."

Look, I am not a computer person, just a search engine user.
For me February and March were the peak of 'empty' hits. I just got puzzled because such results are not very explicit nor interesting.

First it appears to me that Google can't help providing pointers to pages that are "recommended", especially if their url contains keywords, but not necessarily. I learnt that there was a file called robots.txt to exclude robots from certain parts of the site.
Well, Google displays those urls that it cannot crawl since it is compliant with this convention.
As a user I find this attitude more like spreading rumors than journalism cross checking the info.

You can see such examples by searching for: player.

There is a windowsmedia.com site which has a robots.txt excluding everybody:
# Make changes for all web spiders
User-agent: *
Disallow: /

Google displays the 'empty' hit windowsmedia.com/download around #16
But as you can imagine this site must be very popular.

This very precise case is not a drama, but when you search for something less popular the empty hit becomes annoying very often.

For instance a couple of hours ago while making a search on some types of filters in Spanish I searched for: filtro manga
and was returned as #9:
http://www.veto.cl/proceso/...*Valvula%20solenoides*Limpieza%20filtro%20manga&idgrupo=328&opcion=G

1k - En caché - Páginas similares

The link didn't answer. So I clicked the cache link to be told that the query words only appear in links, and the page google had in cache said "Untitled document".

Do you find that a sufficiently content rich document, even if popular, to answer my question?

Here is the cache link I used: http://www.google.com/searc...*Valvula%2520solenoides*Limpieza%2520filtro%2520manga%26idgrupo%3D328%26opcion%3DG+filtro+manga&hl=es&ie=UTF-8

The source shows it is a frameset with absolutely not a single word except 'untitled document', kind of thing I am able to search by myself the day I have nothing better to do. Whether there are coding errors or not, there is no content.

Another example, search for: bundas
(In Brazilian Portuguese butts, also a type of cat, also a kind of anorak).
You will get 3 in the top twenty. Two are dead and one is working.

http://www.bundasnet.matrix... is cached but with no code

http://www.bundas.kit.net/ has no cache and seems dead (many of those were dead 6 months ago)

bundasweb.cjb.net/ is live and has no cache.

We already found 3 in top 20 (15%) for a query returning 40,000 hits. Those three must be especially relevant, except we didn't see a butt yet.

So Google also displays empty links which are not broken, not blocked, some with cache, some without.

And this rather often corresponded for me to sites that had one of the problems exposed by Federico, sites which had been there near the top for a long time and that almost completely disappeared or partially went into the strange supplemental results category, while being found at the top of all other search engines.

I read some time ago that these empty hits meant Google was about to drop the page. But does Google warn you by displaying next month's position for each result it shows you now?

I don't believe in this explanation, and Federico's statements make more sense to me. If Google has lost track, maybe one day or another it will drop this link, although the windowsmedia.com/download page has been there forever and Google still has not decided to drop it.

But I saw multiple yo-yo old version / new version / empty hit / old version etc. If you drop because you consider too bad, you drop.

When you are fed up with your old ball point pen, you trash it. When you got a new one, will you trash it and pull out the old one that does not work anymore? Or trash both but put a sticker on your monitor to remember you trashed them?

I don't understand which positive point would cause Google to display 'empty results', especially if they correspond to pages that their authors wanted to block by means you suggest.
And if they have not yet been crawled, once again I would like Google not to invite me to play poker with expected relevancy.

The web is changing all the time, today Google for me returned variable results, so it is difficult to give you twenty examples in a few minutes, but you can try by yourself and you will certainly find some good examples. The less competitive your queries the easier.
09/07/04 14:26:26
Anthony Federico wrote:
(Part one)
It doesn't matter if the 4,285,199,774 figure hasn't changed for 12 months, 10 months or even in the past 6 months. The main point is that since Google reached 4,285,199,774 web pages, the number hasn't changed. There is also the question of why 4,285,199,774 web pages and not the figure I show above of 4,294,967,296? Doesn't this leave about 9,767,522 DocIDs? Well not really. What about all those robots.txt files they also need to keep track of? Don't you think these too need a DocID? Of course they do.
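The arithmetic above can be checked with a few lines of Python. This is only a sketch of the numbers under the article's assumption of a 4-byte unsigned DocID; the variable names are made up for illustration and nothing here reflects how Google actually stores its index.

```python
import struct

DOCID_BYTES = 4                      # DocID width assumed by the article
id_space = 2 ** (8 * DOCID_BYTES)    # number of distinct 4-byte values
reported_pages = 4_285_199_774       # count shown on Google's home page

unused_ids = id_space - reported_pages
print(f"{id_space:,}")               # 4,294,967,296
print(f"{unused_ids:,}")             # 9,767,522

# The largest value that fits is id_space - 1; one more cannot be packed.
struct.pack("<I", id_space - 1)
try:
    struct.pack("<I", id_space)
except struct.error:
    print("2**32 does not fit in a 4-byte unsigned integer")
```

Note that 4,294,967,296 is the count of distinct values; the largest single value is one less, since numbering starts at zero.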

The robots.txt file and the posting by Panivino above bring us to another topic of discussion: Google's handling of the robots.txt file. Panivino mentions that the less competitive your queries, the easier it is to find these empty pages and problems with the robots.txt file. Here are some examples showing how empty pages exist in the index, which will also show how Google will sometimes index pages that robots are explicitly denied access to via the robots.txt file.

First the empty pages. During the Olympics I was searching to find everything I could possibly find about the Olympics and I turned to Google for assistance. Sites like cnn.com and usatoday.com would certainly contain a lot of authoritative information about this subject. So instead of doing a blind search and just entering Olympics in the search box, I restricted my search to specific sites. The first one I tried was usatoday.com by entering the following query:

site:www.usatoday.com olympics
(direct link) http://www.google.com/searc...

And 50% of the top 10 results were and still are empty results with no titles, descriptions or excerpts. So I decided to do some research to figure out why these empty pages were showing up. What I found is that usatoday.com explicitly stated in their robots.txt that these pages were off limits to all search engines and robots:

User-agent:*
Disallow:/olympics

So Google didn't index them which is a good thing. But why in the world does Google show the URLs in their search results and why are these more important than the other 29,590 Google says are available? Didn't usatoday.com already tell Google NOT to do anything with these URLs?

Now this in itself is not a big deal for usatoday.com, but it is a big deal to me the searcher. Now instead of getting 10 results of something I could quickly scan and decide if I want to visit or not, I had to click the "next" button to see more content which of course displayed more Google ads (which by the way were much better targeted to my query). As a Web site owner, would you like Google showing URLs that you told search engines not to fetch? I know I wouldn't. The site owner must have his reasons to block search engines access.
09/07/04 18:15:25
Anthony Federico wrote:
(part two of two)
Now speaking of Google's problems with the robots.txt file. Let's look at another example that shows just how stupid Google's robots are and show that Google will display the actual content of pages that are off limits to all search engines. Let's go to Google.com and enter the search query:

site:hotbot.com
(direct link) http://www.google.com/searc...

The above link will display "Results 1 - 100 of about 7,930 (as of this posting)". Now at first, this may not seem alarming to you. But if you look at the robots.txt file at hotbot.com, you'll find that they exclude all robots from indexing anything on their site. Here's their actual robots.txt file:

# No robot will spider the domain
User-agent: *
Disallow: /

Which means ALL robots are to stay out! But yet Google has about 7,930 of their links in their index. Now these are not all just empty pages, but many of the results contain content that was supposed to be off limits to Google.
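To see how a compliant robot would read the two robots.txt files quoted in this thread, here is a small sketch using Python's standard `urllib.robotparser`. The `allowed` helper and the example page paths are hypothetical, chosen only for illustration, and this says nothing about how Googlebot itself parses the file.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the given rules let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

# Rules as quoted in the posts above.
HOTBOT_RULES = """\
# No robot will spider the domain
User-agent: *
Disallow: /
"""

USATODAY_RULES = """\
User-agent:*
Disallow:/olympics
"""

# Hypothetical page URLs, used only to exercise the rules.
print(allowed(HOTBOT_RULES, "Googlebot", "http://hotbot.com/"))                    # False
print(allowed(USATODAY_RULES, "Googlebot", "http://www.usatoday.com/olympics/"))   # False
print(allowed(USATODAY_RULES, "Googlebot", "http://www.usatoday.com/sports/"))     # True
```

The check only answers whether a robot may fetch a URL; whether a search engine still lists the bare URL is a separate question, which is the behavior being discussed here.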

Now I know, there will be those here who might say Google indexed these pages before hotbot.com had this robots.txt file published. But that is NOT true. In fact, if you know how to fetch the HEADERS of a request, you'll see that this file has NOT changed since:

Last-Modified: Thu, 26 Feb 2004 17:37:14 GMT

which is certainly way before Google last indexed the pages I'll present to you below.

http://216.239.57.104/searc...

If you look closely at the top of the Google cached results, you'll see Google proudly saying "as retrieved on Aug 26, 2004 20:01:19 GMT.". This is not Feb 26, 2004, but just the other day. Let's look at one more Google has for hotbot:

http://216.239.57.104/searc...

and Google again states "as retrieved on Aug 31, 2004 18:19:25 GMT.".

Now I know exactly why Google did not follow the robots.txt as instructed here. It is because Google did not send the correct request HEADER when requesting the robots.txt file. And because of this Google was unable to parse this file correctly if at all. To fix this small problem would take less than 2 minutes to correct. So if Google hasn't fixed this little bug which I have brought to their attention months ago, what makes you so sure they have updated their database of 4.2 billion URLs and the billions of indexed words to make room for the next 4 billion?
09/07/04 18:16:15
Richard Marks wrote:
Hi everyone. I just want to say that I love this article!!!!

After reading all this I decided to do some of my own research. What I did was look at some google snapshots I had from 2 years ago which had a few sites that were important to me. And what I found is this search.

http://www.google.com/searc...

which now showed me this result:

vancouver-webpages.com/vanlug/1998-5/0070.html

Google can show you the following information for this URL:

Find web pages that are similar to vancouver-webpages.com/vanlug/1998-5/0070.html
Find web pages that link to vancouver-webpages.com/vanlug/1998-5/0070.html
Find web pages that contain the term "vancouver-webpages.com/vanlug/1998-5/0070.html"

And just like you said Anthony, Google did not show a title, description or even an excerpt for this result. But I can absolutely guarantee this was in Google 2 years ago. Then I did as you recommend and reviewed the robots.txt file.

http://vancouver-webpages.c...

as well as reviewed the HTML for any sign of a meta robots tag and the page is not being restricted from google. Then I used lwp-request to get the headers of the page which returned.

Last-Modified: Thu, 07 Sep 2000 05:41:41 GMT

This is fantastic! The page has not been modified since 2000 and I know for a fact this page used to be in Google. Also I'm sure everyone will agree with me the vancouver-webpages.com site is a well respected site so keyword spamming or any other non-ethical search engine spamming technique would not have been applied to the page.

So just like you said Anthony, how can you argue with this? Google or any Google lover can not say that Google hasn't had time to index this page! That will be the biggest joke on the Internet. Not only have they had 4 years to index it, but two years ago the page was there. So this only proves things are broken in my opinion.

I have been playing around with this stuff all day and have found hundreds of others just like this one. I decided to share my results of vancouver-webpages.com because they are a very well respected site on the Internet and the Google lovers can't argue too easily about this one.
09/07/04 20:43:14
Kim Krause wrote:
RE: The Cre8pc site

The blog archives weren't indexed because I was using an old javascript link that was in use when I set up that site (using BloggerPro). The new archive tag has been out for almost a year but I wanted the older one because it looks much nicer than the long list of links. But, it didn't allow Google to index those pages.

I've since given in and applied the new code as of today. Google should gobble up those archives soon.
09/07/04 21:56:53
Richard Marks wrote:
Reply to:Kim Krause

Neither Google nor any other major search engine reads Javascript, so don't count on these pages being re-indexed anytime soon because of the removal of Javascript. If you can show me a reputable SEO that says differently, I'll show you an SEO that should be washing dishes instead of giving SEO advice.

09/07/04 22:56:44
Bill wrote:
Reply to:Richard Marks

Richard, there are some limitations to reading javascript. For instance, the javascript implementation that blogger was using to display archives on the front page of a blog only showed the last four or so archives pages, and required that someone click on a link to expand the list of links.

The result was that all of the other pages weren't being followed by Googlebot and spidered, and are not indexed.
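To illustrate the point, here is a minimal sketch of a link extractor that reads only the static HTML and never runs scripts. It is not Googlebot's implementation, and the page content and file names are hypothetical. Links that a script would write into the page in a browser are never seen by this kind of parser.

```python
from html.parser import HTMLParser


class StaticLinkExtractor(HTMLParser):
    """Collect href values from <a> tags present in the raw HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical page: one plain link plus one link that only exists
# after a script runs in a browser.
page = """
<html><body>
<a href="/archive_week1.html">Week 1</a>
<script type="text/javascript">
document.write('<a href="/archive_week2.html">Week 2</a>');
</script>
</body></html>
"""

extractor = StaticLinkExtractor()
extractor.feed(page)
print(extractor.links)
```

Only the first link is collected, because the script body is treated as text rather than markup.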

There's another peculiarity about the weekly archives pages from blogger (like the ones linked to above). They tend to have all the same title, and if you search for the page itself, it usually just grabs the first sentence or two as the descriptive text. The links to the cre8pc blog above do that when you throw the links into the Google search box. Give it a try yourself. The archive pages that were displayed on the front page of the blog are indeed in Google's index.

Regardless of whether or not Google truly is broken, the example of the Cre8pc site doesn't seem to show what the author of this article intended.
09/08/04 08:43:48
Dan Efergan wrote:
It's not just his computer, this site is not working correctly in Safari 1.2.3 (G5)
09/08/04 11:27:21
Anthony Federico wrote:
Bill wrote:
The result was that all of the other pages weren't being followed by Googlebot and spidered, and are not indexed.

The problem with the Cre8pc pages as described above is not that Google was unable to find/discover them. Google has indeed discovered them, as we can clearly see in Google's SERPs, but it has not indexed them (or they were indexed sometime before and the content was later dropped). So the fact that the HREFs to these pages may have been in Javascript doesn't answer the question of why, after discovering them, Google does not index them.

According to Google's own search results, they have indeed already discovered these pages as these show up in Google results. The question now is posed as to why Google fails to index them. You may find these pages in the results a month later, but then..... they disappear all over again. This is ongoing at Google and you'll find tens of thousands of posts on the Net complaining about the same thing.

I think we are getting a little off track here. One of the biggest questions here is why 4,285,199,774 web pages for almost a year? As stated in another post here, Google displays this number as if it were part of their own logo. Even yesterday Google finds the time to update their logos to celebrate their 6th birthday, but yet they fail to update this 4,285,199,774 figure. This in my opinion is quite alarming and can only prove Google is either stupid, lying to us all or has been telling us the truth and URLs are added just as fast as they are dropped.

The questions posed here are:

1. Why after several months does Google proudly display 4,285,199,774 web pages, but yet they seem to have the time to update their Logos on a daily basis?

2. Why are still valid and active pages dropped after being in Google's index for years?

3. Why does Google give us results for empty pages of the same URLs for months at a time?

4. Why do empty pages Google claims to not have indexed and are empty results, rank higher than pages which have been indexed?

5. Why does Google crawl sites that are clearly restricted to robots?

6. Why does Google include URLs in their SERPs that again, are restricted from all robots including Google? This only hurts the quality of search results.

I encourage you all to write Google at webmaster@google.com and ask them to address these questions. Give them a direct link to this page if you wish and see if they'll respond. Most likely all you'll get back from them is a canned standard response like this one:

Thank you for writing to Google. This automated response is just to let
you know that we've received your email, and you'll hear from us soon.

Thank you for using Google.

Regards,
The Google Team

And that's the last email you'll receive even though they say "you'll hear from us soon". But if you're lucky, you might even get another canned response giving you links to their Webmaster pages which, again, will not address these questions.
09/08/04 13:05:41
Anonymous at this time wrote:
Thanks for the excellent article Anthony.

Starting in November of 2003 we started noticing several web pages of ours at Google showing this empty page content you are discussing here. At first we just thought it was a Google hiccup. But by January of 2004 our firm became aware that this was not just a hiccup because these pages continued to show empty in Google's results. The first week of January we contacted Google and they returned that canned response you refer to.

After a couple of weeks of waiting for that Google reply we were promised, we decided to monitor all traffic on our server very closely and have recorded every single visitor in order to see if we could log all Google visits to these pages. Needless to say we captured all GoogleBot visits and verified all IP's were Google.

After two additional months of monitoring hits to these pages, we then presented this data to Google and again asked the question why. Still no response from Google.

Our business Web site has been on the Internet since 1995 and we have always been in Google's index since they were launched in 1998. We are in all the other search engines without problems. The only one that appears to have trouble is Google. Roughly 10% of our pages are now empty in Google's results while the other 90% of the site remain. Our site is not huge and under 70 pages. All pages throughout the entire site are template based. By this I mean that all headers and footers are identical except for the titles which are completely different on every page. All pages are hand coded so there is nothing fishy going on here. We do not use any form of meta tags other than a brief description of no more than 75 characters which only describes the document. This description would be a perfectly acceptable extraction that a site like Yahoo and the DMOZ would use for their own directory listings. We are in both Yahoo! text search as well as in both the DMOZ and Yahoo's directory and have a very high Google PageRank. We do not block robot traffic to any Web pages either.

Since January of this year and every month since, we have contacted Google and requested an explanation. Still to this day we have not received a response. Because of this our IT department has been continually monitoring our site for all signs of Google traffic or any sign of errors on our end that might prevent Google's visits. We now have logs dating back from January 9th 2004 to this very moment. Google cannot say they have not visited these pages or say that they have not fully downloaded them. We have also watched all TCP downloads to Google as well.

Below I will share with you over 126 Googlebot visits for one single page of which Google continues to show as empty in their search results. Instead of presenting all 126 hits, I will only show the first couple of months and the last couple of months. The google visits between are similar. The total number of hits to our site by the Googlebot since early Jan 2004 to this very moment exceeds 7,000 visits and still, many empty pages at Google, but no other engine has trouble.

I'm sorry, but I had to post the hits with the following post.
09/09/04 01:02:42
Anonymous at this time wrote:
1/12/2004 19:32 64.68.82.136 /services.html
1/13/2004 16:39 216.239.39.5 /services.html
1/16/2004 3:05 64.68.87.69 /services.html
1/17/2004 2:04 64.68.82.27 /services.html
1/24/2004 12:06 64.68.86.57 /services.html
1/24/2004 22:44 64.68.82.159 /services.html
1/29/2004 18:24 216.239.45.4 /services.html
2/1/2004 13:46 64.68.86.38 /services.html
2/2/2004 2:32 216.239.37.5 /services.html
2/9/2004 9:19 64.68.86.38 /services.html
2/12/2004 5:59 64.68.82.135 /services.html
2/15/2004 19:36 216.239.37.5 /services.html
2/16/2004 19:58 216.239.37.5 /services.html
2/16/2004 21:26 216.239.39.5 /services.html
2/17/2004 18:35 64.68.86.61 /services.html
2/23/2004 3:03 64.68.82.44 /services.html
2/25/2004 10:05 64.68.87.69 /services.html
2/25/2004 22:31 64.68.86.54 /services.html
3/5/2004 3:58 216.239.39.5 /services.html
3/5/2004 7:51 64.68.86.149 /services.html
3/6/2004 1:20 216.239.45.59 /services.html
3/12/2004 16:49 64.68.86.138 /services.html
3/13/2004 6:32 64.68.82.46 /services.html
3/16/2004 13:14 64.68.86.138 /services.html
3/19/2004 18:58 64.68.86.38 /services.html
3/24/2004 4:19 216.239.37.5 /services.html
3/25/2004 8:29 64.68.82.79 /services.html
3/27/2004 8:12 216.239.37.5 /services.html
3/29/2004 12:34 64.68.82.79 /services.html
3/30/2004 13:30 64.68.87.41 /services.html
4/4/2004 16:55 216.239.37.5 /services.html
4/6/2004 14:57 64.68.82.46 /services.html
4/8/2004 8:09 64.68.86.38 /services.html
4/8/2004 14:56 216.239.37.5 /services.html
4/9/2004 15:48 216.239.39.5 /services.html
4/10/2004 2:52 64.68.82.55 /services.html
4/10/2004 10:42 216.239.37.5 /services.html
4/15/2004 8:18 216.239.39.5 /services.html
4/16/2004 7:32 216.239.39.5 /services.html
4/16/2004 12:54 64.68.86.15 /services.html
4/17/2004 19:12 64.68.92.183 /services.html
4/19/2004 2:16 64.68.82.159 /services.html
4/20/2004 23:16 64.68.82.55 /services.html
4/22/2004 11:57 64.68.86.149 /services.html
4/23/2004 0:55 64.68.82.164 /services.html
4/23/2004 10:13 216.239.37.5 /services.html
4/28/2004 8:22 216.239.39.5 /services.html
4/29/2004 13:01 64.68.87.41 /services.html
4/30/2004 3:10 64.68.82.174 /services.html

---- break for smaller post, but data between these dates is similar

8/1/2004 16:07 64.68.81.140 /services.html
8/3/2004 15:43 216.239.39.5 /services.html
8/4/2004 2:12 216.239.39.5 /services.html
8/4/2004 22:12 216.239.39.5 /services.html
8/6/2004 7:29 216.239.39.5 /services.html
8/9/2004 19:59 64.68.82.181 /services.html
8/10/2004 0:37 64.68.82.159 /services.html
8/10/2004 3:42 216.239.39.5 /services.html
8/10/2004 19:47 216.239.39.5 /services.html
8/10/2004 19:48 64.68.82.181 /services.html
8/11/2004 14:57 64.68.83.41 /services.html
8/11/2004 20:00 216.239.39.5 /services.html
8/13/2004 2:39 216.239.39.5 /services.html
8/13/2004 11:47 64.68.81.196 /services.html
8/15/2004 2:53 216.239.39.5 /services.html
8/15/2004 3:05 216.239.39.5 /services.html
8/16/2004 23:36 216.239.39.5 /services.html
8/21/2004 3:08 216.239.51.5 /services.html
8/24/2004 4:19 64.68.82.27 /services.html
8/27/2004 11:17 216.239.37.5 /services.html
8/27/2004 11:17 216.239.37.5 /services.html
8/28/2004 10:01 216.239.39.5 /services.html
8/31/2004 19:17 64.68.83.153 /services.html
8/31/2004 19:17 64.68.83.153 /services.html
9/2/2004 12:36 216.239.39.5 /services.html
9/3/2004 3:24 64.68.83.173 /services.html
9/6/2004 10:32 64.68.82.142 /services.html
9/6/2004 18:14 216.239.39.5 /services.html
9/6/2004 18:15 216.239.51.5 /services.html
9/7/2004 4:18 64.68.82.79 /services.html
9/8/2004 14:45 216.239.39.5 /services.html
9/8/2004 16:18 64.68.83.140 /services.html
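
A listing in this "date time IP path" layout is easy to summarize. The sketch below parses a handful of lines copied from the listing above and counts visits per month and per address block. The grouping by the first two octets is only a rough convenience chosen here; it does not by itself verify who owns an address.

```python
from collections import Counter
from datetime import datetime

# A few lines copied from the listing above.
sample_log = """\
1/12/2004 19:32 64.68.82.136 /services.html
1/13/2004 16:39 216.239.39.5 /services.html
2/1/2004 13:46 64.68.86.38 /services.html
8/27/2004 11:17 216.239.37.5 /services.html
8/27/2004 11:17 216.239.37.5 /services.html
9/8/2004 16:18 64.68.83.140 /services.html
"""


def parse_visit(line):
    """Split one 'date time ip path' line into typed fields."""
    date_part, time_part, ip, path = line.split()
    when = datetime.strptime(f"{date_part} {time_part}", "%m/%d/%Y %H:%M")
    return when, ip, path


visits = [parse_visit(line) for line in sample_log.splitlines() if line.strip()]

# Visits per calendar month.
per_month = Counter(when.strftime("%Y-%m") for when, _, _ in visits)

# Visits per address block (first two octets), a rough grouping only.
per_block = Counter(".".join(ip.split(".")[:2]) for _, ip, _ in visits)

print(per_month)
print(per_block)
```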
09/09/04 01:05:18
Pro Google wrote:
Damn. Personally I think Google rocks, and has helped me as a developer for a very long time. If they are indeed experiencing problems (which I don't think are because they're evil) then I wish we could help them sort em out. A lot of us can't imagine a world without Google and this would be in our best interest if it were possible.

A lot of focus on the negative here, which is good in identifying the pains, but how about some solutions?

Can anyone think of a fast and clever way to upgrade their document identifiers? Maybe post a few ideas in the hopes that someone will have a really smart idea and a Google engineer will see it? Pretty hard without knowing more of the details of their design, but maybe a few ideas could help anyways.

Too bad they're not opensource yet.
http://www.webpronews.com/n...
09/09/04 10:26:56
Nathan Weinberg wrote:
There are some explanations just coming out now about this subject. I've written a post about them on my blog, InsideGoogle (http://insidegoogle.blogspo...). Hopefully that answers some of the questions.

One question, Anthony, are you the same Anthony Federico who codes for ScrubTheWeb.com?
09/09/04 12:56:40
Daniel Brandt wrote:
Chris Ridings, the owner of Search Guild, has written an article critical of me at http://www.searchguild.com/...

I tried to point out the deficiencies in his technical assumptions on his forum over a year ago and got nowhere.

Then I tried again last week, and after a few exchanges he locked the thread. Now he has written the above article, which I feel deserves an answer. Since he locked that thread, I can't post my answer on his site.

First of all, Chris has made remarkable progress in his understanding of inverted indexes. A year ago his arguments were quite absurd. Now his description of how an inverted index works is fairly good.

However, he's obviously spinning things. I'd be less suspicious of his motives if Google Adsense ads weren't plastered all over his site. But I'll leave that aside for now.

Chris and I finally agree that if Google is still using a 4-byte docID, then they have to go to 5 bytes. I suggested in my piece over a year ago that they could read in an extra byte, mask off the bits they don't need for the new docID, and use this as a multiplier for the old 4-byte integer.

Chris says I'm full of it because he would use a "long long" integer of 8 bytes, and strip off the unused 3 bytes on read and write.

Both methods require several extra lines of code for reading and writing. Sure, I'll do it the way Chris suggests. Six of one and a half-dozen of the other. No big deal. Undoubtedly Google would study the CPU cycles for each method and pick the one that is most efficient. I wouldn't hazard a guess about the relative efficiency of each method without experimenting.
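
The two approaches can be compared in a small sketch. It is written in Python for brevity rather than C, and it assumes a little-endian layout with docIDs packed back-to-back. The function names and the mask value are illustrative only, and nothing here reflects Google's actual code.

```python
import struct

DOCID_BYTES = 5
HIGH_MASK = 0xFF  # keep all 8 bits of the extra byte; narrow it if fewer are needed


def pack_docid(docid: int) -> bytes:
    """Write a docID as an 8-byte integer and drop the 3 unused high bytes."""
    if not 0 <= docid < 1 << 40:
        raise ValueError("docID does not fit in 5 bytes")
    return struct.pack("<Q", docid)[:DOCID_BYTES]


def unpack_padded(buf: bytes, offset: int = 0) -> int:
    """8-byte method: pad the 5 stored bytes back to 8 and read one integer."""
    chunk = buf[offset:offset + DOCID_BYTES]
    return struct.unpack("<Q", chunk + b"\x00" * 3)[0]


def unpack_multiplier(buf: bytes, offset: int = 0) -> int:
    """Extra-byte method: old 4-byte integer plus a masked byte as multiplier."""
    low, high = struct.unpack_from("<IB", buf, offset)
    return (high & HIGH_MASK) * (1 << 32) + low


docids = [1, 2**32 - 1, 2**32, 2**32 + 5]
packed = b"".join(pack_docid(d) for d in docids)

# Entry i now starts at offset i * 5 instead of i * 4.
via_padding = [unpack_padded(packed, i * DOCID_BYTES) for i in range(len(docids))]
via_multiplier = [unpack_multiplier(packed, i * DOCID_BYTES) for i in range(len(docids))]
print(via_padding == docids, via_multiplier == docids)
```

Both readers recover the same values, and in both cases every offset calculation moves from multiples of 4 to multiples of 5.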

This issue of methodology that Chris uses as a basis for his entire criticism is a red herring. Each method requires extra code, and that is the essential point. Lots of code has to be changed. Since the docIDs have now gone from 4 to 5 bytes, and they are all packed tightly back-to-back, all the offsets have now shifted. That means all the code that scans the inverted index for docIDs has to be changed.

And the space problem is not trivial, as Chris wants us to believe. Every word on every web page in Google's index gets its own docID. In fact, since Google uses two inverted indexes, this means that every word on every web page uses, on average, two docIDs. That's a lot of space -- about 2.4 terabytes of added space per copy of the inverted index. There are many copies of this index in RAM on Google's distributed system at any one time, because it's the first index that has to be consulted when handling any search request.
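
The post does not show how the 2.4-terabyte figure is derived. As a rough check, assuming one added byte per docID entry, the page count shown on Google's home page, and 1 TB taken as 10**12 bytes, the figure implies an average of about 280 indexed words per page. That per-page number is inferred here and is not stated in the post.

```python
PAGES = 4_285_199_774        # page count shown on Google's home page
EXTRA_BYTES_PER_DOCID = 1    # growing each docID from 4 to 5 bytes
INDEXES_PER_WORD = 2         # "two inverted indexes", per the post
ADDED_BYTES = 2.4e12         # "about 2.4 terabytes", assuming 1 TB = 10**12 bytes

docid_entries = ADDED_BYTES / EXTRA_BYTES_PER_DOCID
implied_words_per_page = docid_entries / (PAGES * INDEXES_PER_WORD)
print(round(implied_words_per_page))
```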

Chris's argument is an exercise in generating fog. I expect this sort of thing from the Googleplex, but not from a mere fan of Google's Adsense program.
09/09/04 15:05:05
Anthony Federico wrote:
Nathan Weinberg wrote:
There are some explanations just coming out now about this subject.

It's great that you have addressed how Google might expand and, as I have even pointed out here, it is certainly possible to change the docid. But this does not answer the questions I present here, which is what all of us are after. The theory of the 4-byte problem is what I said could explain why they have not changed their statement of "Searching 4,285,199,774 web pages" for about a year now. But it does not answer the question "Is Google Broken?". Now if you can answer the questions presented with authority, we'll all be happy to hear. One of these questions you have answered:

1. Why after several months does Google proudly display 4,285,199,774 web pages, but yet they seem to have the time to update their Logos on a daily basis?

Your Google query +the, which returns about 5,700,000,000 results, does indeed show that Google has been lying to us. Your answer to this is that Google will not change this until AlltheWeb or Yahoo or somebody else has announced a bigger index. But don't you think this is rather ridiculous? After all, Google proudly announces this on their home page day in and day out. It is an exact number.

It is the issues below that make "Google Broken". If you can answer these questions then we are all ears:

Q: Why after several months does Google proudly display 4,285,199,774 web pages, but yet they seem to have the time to update their Logos on a daily basis?

A: We forgot. Or we are the biggest so it doesn't matter what number we use even if it is not true.

Why not something more vague like "Searching the largest database in the world"?

Q: Why are still valid and active pages dropped after being in Google's index for years?

A: I don't know???? Maybe some filter Google added caused this with about 1 billion URLs.

Q: Why does Google give us results for empty pages of the same URLs for months at a time?

A: Your page is currently partially indexed, which means that although we know about your site, our robots have not read all the content on your page(s) in past crawls. And http://www.google.com/webma...

5. How long does the Google robot take to index a URL once it's been submitted?

Depending on the timing of the submission and of our crawl, the entire process can take between six and eight weeks.

So "Anonymous at this time", you should have seen this corrected 8 months ago, but you may need to wait a little longer.

Q: Why do empty pages Google claims to not have indexed and are empty results, rank higher than pages which have been indexed?

A: Unlike many search engines, Googlebot can return results for pages that are known but haven't been crawled yet. Since we haven't looked at those pages yet, their titles aren't shown; the Google results page displays the URL instead.

Q: Why does Google crawl sites that are clearly restricted to robots?

A: Unlike many search engines, Googlebot can return results for pages that are known but haven't been crawled yet and never will. So we thought we would keep the URLs in our index and clutter results just in case you change your mind later.

Q: Why does Google include URLs in their SERPs that again, are restricted from all robots including Google? This only hurts the quality of search results.

A: Unlike many search engines, Googlebot can return results for pages that are known but haven't been crawled yet and never will. So we thought we would keep the URLs in our index and clutter results just in case you change your mind later.

I realize you are not a Google spokesperson, but these are the questions people want answers to. The docid theory is just a theory based on their never changing figure of 4,285,199,774 and materials published at Stanford written by the Co-Founders of Google.

It's not like we are asking Google to share their trade secrets or anything like that. We just want to know why Google is lying and hurting so many businesses that rely on Google traffic. Google placed themselves in the public and asked us all to invest in them which we have. Google should now answer to the public and tell us why they are destroying our businesses.
09/09/04 15:50:03
Mr. Yuan wrote:
Reply to Daniel Brandt and Nathan Weinberg:

I agree with Daniel. One thing to think about here is data alignment and C. Some processors require that data be aligned (like the ones used by Google). That is, two byte quantities must start on byte addresses that are multiples of two; four byte quantities must start on byte addresses that are multiples of four; etc. The general rule follows a progression of powers of two (2, 4, 8, 16, ...). Some processors allow data to be unaligned, but this almost always results in a severe slow down of performance. There is also a very large cost for processing 8-byte double precision numbers that are not aligned in memory if you want to go to the 8-byte arena.

Some CPUs (like the ones used by Google) require strict alignment, at least of the usual integer and floating-point data, i.e., an 8-byte object must be aligned on 8-byte boundaries, else there is a trap and considered an error. Objects in the managed heap are forced to pointer sized alignment (4-byte on x86, 8-byte on x64/IA64). So to take 4-byte to 5-byte is not what one would normally do. Instead you go from 4-byte to 8-byte and move everything to a 64 bit CPU.

However, this would require a 64 bit architecture would it not? I don't believe Google has swapped out their 10,000+ "inexpensive" desktops for 64 bit CPUs. If so, we then have to consider many other areas like bit shifts and bit masks for the conversion process like Daniel pointed out which will not be a quick and dirty port like Nathan suggests.

On an Intel or an Intel-compatible PC, will it not use little endian byte order? The 4-byte integer 66051 is written as 0x00010203 on a big endian system (64 bit) and as 0x03020100 on a little endian system (32 bit). Mucho conversions needed here I'm afraid.
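
The two byte layouts for 66051 can be checked with a short sketch using Python's `struct` module. One caveat worth stating: byte order is a property of the CPU family rather than of word size, so 32-bit x86 and 64-bit x86-64 are both little-endian. The last line prints the byte order of whatever machine runs the snippet.

```python
import struct
import sys

value = 66051  # 0x00010203

big_endian = struct.pack(">I", value)     # most significant byte first
little_endian = struct.pack("<I", value)  # least significant byte first

print(big_endian.hex())     # 00010203
print(little_endian.hex())  # 03020100
print(sys.byteorder)        # byte order of the machine running this
```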
09/09/04 16:05:03
ILoveJackDaniels wrote:
This is the funniest thing I've read in months. The number on Google's front page hasn't changed in a year, and automatically a bunch of paranoid controversy-courting "writers" come up with the idea that Google must have run out of space! That's a logical step!

Question 1. Why would Google change that number regularly? Does it matter? Does it affect the running of the site? Do users care? Or more likely do they just say "Oh wow, 4 billion pages ... one of them must have the info I'm after". They have tens of thousands of PCs running their site, all of which would need to be updated with a new front page. Changing the front page is probably the least important thing in the World to them.

Question 2. What makes you think this would be a problem? Let's say for example that this load of tripe is even close to the mark and that there is an upper limit to the document ids and that they reached that a year ago. According to this, it's stopping them from indexing more pages. Do you really believe that with all the PhDs they have there that they would have a difficult time coming up with a solution? And that it would take over a year?

Question 3. What about the 14-million-odd missing pages?

Question 4. How do you explain the fact that their docids don't appear to be numeric?

http://www.google.co.uk/sea...
http://www.google.co.uk/sea...

The two links above provide the same page. Since the URL has been edited in the second link to point elsewhere, we can safely assume it is the "X3GuszbKqvwJ" that identifies the URL to view. It is reasonable to assume that that is the document id of that page - there would be little point in using anything else. Assuming that Google uses a 12-character ID for every URL (which seems reasonable), and that the ID is a case-sensitive alphanumeric ID (which it appears to be), Google can index up to 3,226,266,762,397,899,821,056 URLs before running out of doc ids.
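
The capacity figure follows directly from the stated assumptions: 62 possible characters (26 upper-case, 26 lower-case, 10 digits) in each of 12 positions. Whether the cache token really is the document id is the assumption made above, not something this arithmetic can confirm.

```python
import string

alphabet = string.ascii_letters + string.digits  # 26 + 26 + 10 = 62 symbols
id_length = 12

capacity = len(alphabet) ** id_length
print(f"{capacity:,}")  # 3,226,266,762,397,899,821,056
```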

Why do people always look to the most exciting explanation first? Google haven't bothered updating their front page. Oh well, I'm sure millions of people the world over are losing sleep at night wondering when it will be changed.
09/15/04 10:18:08
Anthony Federico wrote:
To answer ILoveJackDaniels who said on 09/15/04 10:18:08:

Answer to "Question 1. -Why would Google change that number regularly? ....". Why not? They state an exact number of 4,285,199,774. Now if they said something like "more than 4 billion" then this would not be a concern for many people. But instead they are very specific to the number of available documents to search. My question is why lie? If they can update their logo on a daily basis, why not update this number or just remove the freaking thing?

Answer to "Question 2. What makes you think this would be a problem? ....". Because it means Google is either lying or not lying which is something they should have disclosed before going public. If I were to invest in a company and they gave me an exact figure for anything, then I rely on that number not being a lie. This number is an exact figure which confirms Google as being the biggest. But when that number does not change, you then have to believe there is a problem with adding new documents or that Google is lying. Both of which are a concern for investors. I believe new pages are added and old pages are not removed which increases the index's size, which ultimately corrupts the doc_id and references/links to the doc_id which will confirm their ability to properly count the number of documents in the index.

Answer to "Question 3. What about the 14-million-odd missing pages?". I'm guessing you are referring to the difference between 4,294,967,296 and 4,285,199,774, which is 9,767,522 doc_ids. This has already been answered, but to answer it again, it's because robots.txt files have doc_ids too. Don't believe me, read on.
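
The arithmetic behind these figures is easy to reproduce. A 4-byte unsigned integer has 2**32 distinct values, running from 0 to 2**32 - 1. The sketch only checks the numbers; it says nothing about how Google actually assigns doc_ids.

```python
DOCID_BITS = 32

id_count = 2 ** DOCID_BITS  # number of distinct 4-byte values
max_id = id_count - 1       # largest value a 4-byte unsigned integer can hold
displayed = 4_285_199_774   # page count shown on Google's home page

print(f"{id_count:,}")              # 4,294,967,296
print(f"{max_id:,}")                # 4,294,967,295
print(f"{id_count - displayed:,}")  # 9,767,522
```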

Answer to "Question 4. How do explain the fact that their docids don't appear to be numeric?". This is a very simple question to answer. If you read the documents and understood what they say, you would realize that this alpha-numeric string for the cache is the MD5 checksum of the "URL" and has NOTHING at all to do with the doc_id.

Let's say this doc_id issue has been solved. OK, now we can just put this behind us and concentrate on the "real" issues: the "Google is broken" issues. As stated above, answer these questions:

1. Why after several months does Google proudly display 4,285,199,774 web pages, but yet they seem to have the time to update their Logos on a daily basis?

2. Why are still valid and active pages dropped after being in Google's index for years?

3. Why does Google give us results for empty pages of the same URLs for months at a time?

4. Why do empty pages Google claims to not have indexed and are empty results, rank higher than pages which have been indexed?

5. Why does Google crawl sites that are clearly restricted to robots?

6. Why does Google include URLs in their SERPs that again, are restricted from all robots including Google? This only hurts the quality of search results.

To show an example of the robots.txt problem for the site you mention here "ilovejackdaniels.com", let's do a search at Google:

http://www.google.co.uk/sea...

Why does Google show all these empty URLs that the robots.txt for this site tells Google to stay away from? Why does Google clutter the index with these empty pages? What value do they have to anyone other than possibly to hackers?

http://www.ilovejackdaniels...

User-agent: *
Disallow: /404.php
Disallow: /friend.php

Is Google not told to stay away from the /friend.php? So why include them here?
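
What these rules mean for a compliant crawler can be checked with Python's standard `urllib.robotparser`. In this sketch `example.com` is a placeholder host (the site's real URL is truncated above) and only the path matters for the match. The check answers whether a robot may fetch a URL; it does not say whether a search engine may list the URL in its results.

```python
from urllib.robotparser import RobotFileParser

# Rules exactly as quoted above.
robots_lines = [
    "User-agent: *",
    "Disallow: /404.php",
    "Disallow: /friend.php",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# example.com is a placeholder host; only the path is matched against the rules.
may_fetch_friend = parser.can_fetch("Googlebot", "http://example.com/friend.php")
may_fetch_index = parser.can_fetch("Googlebot", "http://example.com/index.html")
print(may_fetch_friend, may_fetch_index)  # False True
```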

And why does Google index and present the robots.txt for this site?:

http://www.google.co.uk/sea...

Or would you rather see it this way?:

http://www.google.co.uk/sea...

Now all these things are totally nuts! No one in their right mind would say this is normal and that this is not a sign that Google is broken.
09/15/04 13:18:12
Chris Rutledge wrote:
Is it not more likely that in the scripting of Google's home page, the variable used to display the number of webpages in the index has reached its limit?
09/15/04 16:34:11
Richard Romero wrote:
Chris Rutledge, why not forget about the reasons Google continues to display the number 4,285,199,774 on their home page and concentrate more on their well known and clearly documented problems like those pointed out here by Anthony?

I've been following this article and watched other forums make comments about this article. The Google Lemmings will always have CraZY answers to these problems, but not one I found was intelligent and thought out. I read one post saying that Google showed the robots.txt file in their search results because the owner of the site doesn't know how to create a robots.txt. What a stupid assumption. For those who think the robots.txt files being displayed in Google's index is because the site owners don't know what they are doing, I have to ask you to visit this Google result:

http://216.239.39.104/searc...

That's Google's robots.txt file. Now please tell me Google doesn't know how to create a robots.txt either.

Google is so terribly broken now that a fix for the docid issue is not what it will take to fix the many flaws of Google and probably is not much of a concern for the daily Google user. But these flaws are quickly eating up the Google index. These days you can't find anything interesting on Google. It's just one huge mess of empty pages, listings of restricted pages, meta redirects, javascript redirects, pagerank thefts, pagerank manipulations and so on. My latest searches show almost 50% are empty page results. That's terrible in my opinion and should be of great concern for anyone doing business on the Internet. This type of index and search results do little good for you the page author or you the searcher.
09/15/04 17:27:14
Daniel Brandt wrote:
I agree that the docID issue is much less important than the fact that half of Google's index is empty, and there's this Supplemental Index that helps no one and makes no sense at all. There are other issues as well -- the disconnect between Google's crawling and its indexing is dramatic. They crawl probably ten times more pages -- including the repeat crawls for the same page -- per site than they index. By now Google's crawler is almost like spam for webmasters who need to watch their bandwidth. You can disallow them, but then you get your URLs listed anyway, so there's much less privacy benefit to disallowing Googlebot than there should be.

Also, I continue to be amazed at the excellent pages that get filtered out, as demonstrated by Scroogle at http://www.scroogle.org/scr...

It was really shocking to webmasters last November, and Google was forced to turn the knob way down on their filter within three weeks of the instant uproar. But they didn't turn it off, and excellent pages continue to get caught in the filter. Why not turn the darn thing off? It clearly doesn't work.

But I still think the docID issue is interesting, for several reasons:

1) It brings the Google cultists out of the woodwork -- the ones who are technically uninformed or haven't read up on Google's technical history. Either there's a fascinating mass-psychology issue here with respect to Google's branding, or there's something else going on. As someone who studied cultism in grad school, this interests me.

2) If there's something else going on, it's possible that there's some sort of disinformation campaign being encouraged at the Googleplex. If this is happening, it's illegal now that Google is a public company, and someone belongs in jail.

3) The docID issue is a succinct description of one possibility that explains Google's deterioration. If it's true, it's exactly the sort of concrete situation that should have been disclosed in the pre-IPO prospectus. The prospectus is a liability document. If you can prove the docID issue, the fact that Google failed to disclose it in the prospectus makes Google potentially liable to shareholders who lose money. All it takes to prove it is an inside whistle-blower. All it takes to disprove it is full disclosure by Google, which they should be doing anyway. That's why it is interesting to push the docID issue. Tension is building. The only way we will ever get answers about the decline of Google is to increase this tension to the breaking point.
09/15/04 18:04:11
Anonymous at this time wrote:
Since the first week of January 2004 we contacted Google every month (see my post - Anonymous at this time 09/09/04) and all we received was a canned response. Then on September 8 2004 I wrote Google again. I included the same very strong data proving Google has serious problems. This includes snapshots of Google results with our web pages indexed and their thousands of Googlebot visits. In addition I pointed them to this page. Today they finally answered:

-----

Thank you for your note. We apologize for our delay in responding to your email. The Google index contains two types of pages: fully indexed and partially indexed pages. It appears that some of your pages are currently partially indexed. Because our robots were unable to completely review your site's content during our last crawl, your pages appear without cached copies or detailed titles. Instead, they're listed by their URLs.

We understand the frustration this situation may cause you. We are always working to increase the number of fully indexed pages in our index. You may be able to improve these pages' visibility in our search results by ensuring that a number of high-quality sites link to them.

Please note that although our robots may visit your site, we cannot guarantee that your pages will be thoroughly crawled or indexed. However, 'crawler-friendly' pages have a greater chance of being fully indexed. Guidelines for creating a 'crawler-friendly' site are available at http://www.google.com/webma...

Please keep in mind that our search results change regularly as we update our index. Normal changes you observe may include, but are not limited to, addition of new sites, changes in the ranking of existing sites, sites falling out of the index or getting dropped for particular keywords, and fluctuation between old and new webpage content.

We realize these changes can be confusing. However, these processes are completely automated and not indicative of wrong-doing or penalization of individual sites. We currently include over four billion pages in our index, and it is certainly our intent to represent the content of the internet fairly and accurately.

In regard to the pages for which access has been restricted with a robots.txt file, please note that although a robots.txt file prevents our robots from crawling your pages, it will not prevent our robots from adding a link to your pages without crawling them. It is likely that our robots found your pages because other pages linked to them. Our robots added the links to our index without actually visiting or crawling the pages.

Although a robots.txt file usually prevents pages from appearing in our search results, the only fool-proof ways to keep them out of our index are to make sure that no sites link to them, password protect them, or remove the robots.txt file and use a NOINDEX meta tag instead.

You can also remove the pages in question by submitting your robots.txt file for immediate review at http://services.google.com:... However, please note that this will only temporarily remove your pages from our search results. If sites continue to link to them, they may be included again in our search results.
Regards,
The Google Team

-----

It took them 7 months and 7 days to send me this?

Did you notice they said "We currently include over four billion pages in our index"? They didn't say over 5 billion did they? Hmmmmm, looks like Anthony is right.

No other major search engine includes empty links in their index which point to pages that are restricted by the robots.txt. No other engine indexes pages that are restricted by the robots.txt either. So Google, why do you do this?

Google is a complete mess! They blame everything on automation and on you, the page owner, for their problems. So Google, where is the quality control in your Googleplex?

I encourage everyone to post in every SEO and webmaster forum they can showing these Google problems. Get the word out and don't let the Google lovers discourage you. If you're an AOL or Earthlink customer, ask them why the search results suck (delivered by Google). CNN dropped Google so why not AOL, Earthlink? Get your ISPs asking Google these questions. Power to the people!
09/15/04 20:08:13
Daniel Brandt wrote:
You still got a canned response. It's just a bigger can this time. Every paragraph in that response is from some paragraph somewhere on their site. I recognize all of it. The mailbot has been unleashed again after someone did some cut and paste on the canned response, to make it longer and seemingly more responsive. The cone of silence lives on at Google. They've got everyone so snookered about how great their search engine is, that the farthest thing from the minds of Silicon Valley pundits and Wall Street analysts is to ask hard questions.
09/15/04 23:30:25
Never mind... wrote:
Here's more you can add to your Google robots.txt bug.

w3.org:
http://216.239.41.104/searc...

searchengineworld.com
http://216.239.39.104/searc...

seo-guy.com
http://216.239.39.104/searc...

codestyle.org
http://216.239.39.104/searc...

webcrawler.com
http://216.239.39.104/searc...

slashdot.org
http://216.239.39.104/searc...

cisco.com
http://216.239.39.104/searc...

webmasterworld.com
http://216.239.39.104/searc...

ibm.com
http://216.239.39.104/searc...

searchguild.com the guy who wrote this http://www.searchguild.com/...
http://64.233.179.104/searc...

Looks like some Google defenders have problems with Google and don't even know it. I agree 100% that Google is a mess.
09/16/04 13:20:27
Daniel Brandt wrote:
One problem is that for files with an extension of .txt, Google goes in and tries to appropriate anything that looks like a link. I used to have a hundred .txt files on one of my sites. Google went in and grabbed everything it could find that looked like a web link, and turned it into a clickable anchor in the cache copy. There was no way I could put in a meta to tell Google to not cache these pages, because in a text file there's no header format at all. So they all got cached, even though as a matter of policy I don't permit any caching of any of my pages on any of my sites. I put in a NOARCHIVE meta to do this (you need one on every page, unfortunately). Now I don't have any text files, because I don't like giving up this much control.

Google's toolbar is another problem. It passes along any links it doesn't know about to the Googlebot when the toolbar phones home to grab PageRank. This is one way it finds those directories with all the credit card info. Sure, those directories shouldn't even be there, but how many webmasters know how aggressive Google is when it comes to grabbing links? All it takes is for one person with a toolbar to wander into a semi-private directory that was never intended to be posted on the web, and Googlebot swoops down like a hawk, indexes and caches it, and that's the end of privacy. If you asked that webmaster what a "robots.txt" file was, he'd probably look back at you with a blank expression on his face.
09/16/04 17:29:52
Never mind... wrote:
Hi Daniel Brandt, of course any file called robots.txt (case sensitive) in the root directory is ALWAYS considered the robots.txt file and the use of this file is by robots only. That's just how ALL search engines, LWP, RobotRules and so on work. It's common knowledge that this is what that file is. No other search engine on the planet that I know of indexes and makes this file and the contents found in this file searchable. Google's interpretation of this file is not correct. They should not be so stupid as to index the robots.txt. Now if they find this file instead:

/somedirectory/robots.txt

Then by all means, that is an indexable file and is perfectly acceptable for indexing. Google is full of flaws and this stupid mistake only shows how infantile Google is. If they can't fix something this stupid, what makes people think that their almighty search engine isn't full of bugs on a greater scale?
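
For anyone who wants to see the root-only convention in action, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The host `www.example.com`, the crawler name `ExampleBot`, the rules, and the two helper functions are all made up for illustration; this shows how a well-behaved robot treats the file, not how Google's crawler is actually written.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def robots_url_for(page_url):
    """Return the one place a robot looks for rules: /robots.txt at the host root."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def is_root_robots_file(url):
    """True only for the exact, case-sensitive path /robots.txt in the root directory."""
    return urlsplit(url).path == "/robots.txt"


# Hypothetical rules, not copied from any real site.
EXAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(EXAMPLE_RULES)

# The rules file for any page on the host lives at the root.
print(robots_url_for("http://www.example.com/somedirectory/page.html"))

# Only the root file is the control file; a copy in a subdirectory is an ordinary document.
print(is_root_robots_file("http://www.example.com/robots.txt"))
print(is_root_robots_file("http://www.example.com/somedirectory/robots.txt"))

# A polite robot checks the rules before fetching a page.
print(parser.can_fetch("ExampleBot", "http://www.example.com/private/data.html"))
print(parser.can_fetch("ExampleBot", "http://www.example.com/public/index.html"))
```

Note that the rules only say what a robot may fetch. Whether an engine lists a blocked URL it learned about from links elsewhere is a separate policy decision, which is the behavior being complained about here.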
09/16/04 18:26:23
THe12 year old problem solver wrote:
we shouldn't all email them... we could find some way to check how many they actually have and yah...
09/16/04 21:17:48
Chris Rutledge wrote:
PS

I don't see Google changing its name to 4,285,199,774. Do you?
09/19/04 16:07:17
Chris Rutledge wrote:
Reply to:Richard Romero

[b]Quote[/b] Chris Rutledge, why not forget about the reasons Google continues to display the number 4,285,199,774 on their home page and concentrate more on their well known and clearly documented problems like those pointed out here by Anthony? [b]End Quote[/b]

Because I was replying to the original question of the number, and whether this related to Google being "broken" or not.
09/20/04 09:00:58
Brian B wrote:
5,790,000,000 as of 27/09/2004
09/27/04 05:57:00
Anthony Federico wrote:
Brian, Google's home page still reads - ©2004 Google - Searching 4,285,199,774 web pages and has for a very long time.

You are probably referring to a returned search query like "+the", for which Google will return "about 5,700,000,000", but this figure is not correct at all. In fact, if you read Google's "Advanced Search" page:

http://www.google.com/help/...

You'll find that you can search using the OR operator too, which will help you discover the Google flaw. Let's do a search query test:

apple <- returns about 42,300,000
http://www.google.com/searc...

pie <- returns about 8,590,000
http://www.google.com/searc...

apple OR pie <- returns about 6,560,000
http://www.google.com/searc...

Now how screwed up is that? A search for "apple OR pie" should match at least as many pages as either word alone, yet it returns fewer than both. The point here is that you can NOT take the "about ...." figure as having any relationship to actual documents. This figure is NOT correct and never was. The figure here is thrown out there to make you happy that you got so many pages to look at. Of course Google will only show you the top 1,000 anyway so they know any figure they throw out at you is OK.
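
To spell out the arithmetic behind that test, here is a small Python sketch. The three counts are the "about" figures quoted above; the function names are just illustrative. It only checks basic set logic: the pages matching "A OR B" must number at least as many as the larger single-word result and at most the sum of the two.

```python
from typing import Tuple


def or_count_bounds(count_a: int, count_b: int) -> Tuple[int, int]:
    """Bounds on the size of (A OR B): max(|A|, |B|) <= |A OR B| <= |A| + |B|."""
    return max(count_a, count_b), count_a + count_b


def is_consistent(count_a: int, count_b: int, count_or: int) -> bool:
    """True if the reported OR count falls inside the bounds set theory allows."""
    low, high = or_count_bounds(count_a, count_b)
    return low <= count_or <= high


# "About" figures quoted in the post above.
apple = 42_300_000
pie = 8_590_000
apple_or_pie = 6_560_000

low, high = or_count_bounds(apple, pie)
print(f"'apple OR pie' should fall between {low:,} and {high:,}")
print("reported figure is consistent:", is_consistent(apple, pie, apple_or_pie))
```

The reported 6,560,000 sits below the lower bound of 42,300,000, so at least one of the three figures cannot be an actual document count.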

I believe Google purposely displays the 5.7 billion figure in search results to cover up and satisfy easily convinced users that Google has more searchable documents than they really do. Many people take Google's word for the 5.7 billion figure without any research at all, but yet those same people won't also take Google's exact figure of "4,285,199,774" being displayed on their home page.

If the 5.7 billion figure is correct (which it's not), then it would mean that Google's advanced search syntax is also broken (which it is). Basically Google is a broken mess and you can see this with almost every search query.
09/27/04 10:42:34
Daniel Brandt wrote:
I agree completely with Anthony about Google's numbers. Any count over 1,000 is unverifiable, because Google and Yahoo never show more than 1,000 links. It makes no sense whatsoever to spend the CPU cycles to generate an accurate count if the number is over 1,000. Instead they guess. Maybe their guess is based on a very crude extrapolation of how deep they had to go at the point where they realized it was above 1,000.

I have a site with about 250 directories. About a dozen of these have over 1,000 static files in that directory, and the rest have fewer. I've found that the counts are fairly good under 1,000, and when the directory is over 1,000 files, the count is off by 50 to 100 percent.

I've stopped watching the total count numbers for my big site. Instead, I add a carefully-selected keyword after the site: command so that I'm basically getting a cross-section of the entire site, but the total is around 500. This total is verifiable by scrolling down the results, and is easy to compare with the total that I know Google or Yahoo should have if they indexed me 100 percent.

You'll go insane if you watch numbers larger than 1,000, because they're bogus.
09/28/04 21:46:00
Dilbert wrote:
Reply to:arius

You are a moron. Of course the page everyone uses to find stuff on the internet should be 100% transparent in its workings and subject to intense scrutiny and criticism. Otherwise, how could it ever get better or live up to its (currently false) claims? It could only get worse if it was cloaked in secrecy... like our government, for instance.
10/23/04 12:54:20
arius wrote:
Reply to:Dilbert


Dilbert, have you never heard that people see others as they see themselves? You should never resort to name-calling as it reveals too much about you. As for Google transparency have you never heard of trade secrets? It’s the foundation of a company’s competitive ability in any marketplace. Google would be completely subject to manipulation if how it worked was widely published. You’re saying that we should return to the days when keyword stuffing allowed porn sites to rank high for arbitrary keywords. Search relevancy would be lost. We would all lose a valuable resource if anyone could manipulate their way to the top of the rankings. The voting system of link popularity has quality built in. Everyone who links to a site is voting with their own reputation and credibility. Anyway, Google is not broken. As of Nov 10th 2004 they index over 8,000,000,000 pages. I hope they continue to provide fast, relevant search for a long time. We all benefit from it.
11/11/04 14:43:54
Does it matter? wrote:
Arius, Google is a very simple search engine to manipulate regardless of their PageRank algorithm. PageRank is only one easily manipulated factor and the so-called "porn" sites have many methods in place to rank #1 even for terms that are completely separate from what they are actually selling. Link manipulation is a simple task these days with named/IP virtual hosting, keyword generation via referring URLs, cloaking and cheap $4-a-month hosting services. These are all simple methods used every day to manipulate not only Google, but every other search engine out there. And because of PageRank it is easier in Google than any other search engine. Visit Stanford and you'll see the Google founder's professor even discussing how flawed this algo is.

Have you not heard about the bombs people set up to manipulate Google results? Maybe you would like to read this article:

http://searchenginewatch.co...

I can create an empty or frames based page in Google (even use Javascript redirections), point a couple of links from some of my other domains to it and I'll get to the top position every time. Yes it's that easy. Of course as Anthony has pointed out here, the pages could get dropped just as your non Google manipulated pages can get dropped.

Google finally changed their 2004 Google - Searching 4,285,199,774 web pages to 2004 Google - Searching 8,058,044,651 web pages on November 10th as you mentioned. But Google also now includes a link explaining what this 8 billion+ figure means:

http://www.google.com/googl...

According to Google on this very page, their previous figure of 4,285,199,774 web pages was accurate and that was indeed the total number of available Web pages for searching. So again it appears Anthony was correct. Google did not increase the total number of available documents since Aug 2003.

According to the Google author of this page, 8 billion pages is a milestone worth noting. What they should have added was that they just added 4 billion worthless documents to their index that you'll never see. The results are just as terrible as they were with 4 billion. It's the same sites at the top and will continue to be the same pages at the top. Why? Because Google's PageRank is a seriously flawed algorithm that is simple to manipulate. Their index still presents empty results and the SEO people who know how to manipulate Google either through link manipulation or PR theft, will continue to be the same sites you see day in and day out.

Why do I like Google? Because I can rank in the top 10 on any term I want. I can steal the PR from any page. I can feed Google what Google wants to see, but feed my human visitor anything else I want. I can use a META Refresh to point Google to a high PR and steal it, but give my visitor my stuff. So instead of ranking on my efforts, all I do is steal your PR. Much easier in Google than any other search engine. With the other engines I must at least do some keyword stuffing, but that's getting harder to do. PageRank is there for your taking and for those competitive search terms, their PR is being stolen every day.
11/13/04 11:42:24
Chris Rutledge wrote:
Over 8 Billion Now

Ha Ha
11/14/04 11:51:04
atul (http://geocities.com/atul_bnd) wrote:
precisely now searching..

<quote>
©2004 Google - Searching 8,058,044,651 web pages
</quote>

google is damn quick ..
11/17/04 06:17:25
Google Sucks! wrote:
Reply to:atul (http://geocities.com/atul_bnd)



Google should just remove this stupid and inaccurate statement. Will this EXACT figure be displayed on their home page for another year? Why don't they just say "over 8 billion"? Instead stupid Google gives you an exact figure of 8,058,044,651 web pages which is NOT CORRECT and very misleading! Google is evil.

Personally I use any of the Yahoo! Network search engines to search. Why? Because Yahoo! handles different languages much better. Yahoo! provides a higher relevancy to my search query. If Google applied their Adsense technology (which they admit they stole from Overture) to search results they too would have a better engine. Google's PageRank is the most ridiculous and easily manipulated algo ever used. Not to mention their empty results (no title and no description) are just a waste of my time.

I think I'll steal someone's Google PageRank today!
11/17/04 11:57:49
Arius wrote:
Be sure to check out this article on what makes Google tick:
http://www.zdnet.com.au/ins...

One quote was they can "double performance by doubling the hardware" they throw at the indexing problem.

What do you think?
http://www.reidtechnologies.ca
12/03/04 11:38:06
Google sucks! wrote:
Anthony Federico above on 09/28/04 21:46:00 said that this query:

+the

on Google returns about 5,700,000,000. Today that same query now only produces this statement:

about 2,890,000,000 for +the.

This only proves one of two things. One, given the 4.2 billion that Anthony showed us to be Google's max, anything over this amount is coming from the supplemental index, which is a completely different index, and if you are in that index, your Web page is forever lost. Or two, Google's returned figure is as bogus as Google's statement of having 8,058,044,651 web pages indexed.
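
For reference, here is the arithmetic behind the "4.2 billion max" as a short Python sketch. It assumes the 4-byte unsigned docID described in the article at the top of this page; whether Google's production index ever worked that way is the article's claim, not something this code can confirm. The function names are made up for illustration.

```python
DOCID_BITS = 32  # assumption: the 4-byte unsigned docID described in the article
MAX_DISTINCT_IDS = 2 ** DOCID_BITS  # 4,294,967,296 values, numbered 0 to 4,294,967,295


def fits_in_docid_space(page_count, bits=DOCID_BITS):
    """True if every page could get its own ID using the given number of bits."""
    return 0 <= page_count <= 2 ** bits


def min_bits_needed(page_count):
    """Smallest ID width, in bits, that gives each of page_count pages a distinct ID."""
    return max(1, (page_count - 1).bit_length())


# Home page figures quoted in this thread.
for pages in (4_285_199_774, 8_058_044_651):
    print(f"{pages:,} pages: fits in {DOCID_BITS} bits = {fits_in_docid_space(pages)}, "
          f"needs at least {min_bits_needed(pages)} bits")

# Headroom left under a 32-bit ID at the old home page figure.
print(f"{MAX_DISTINCT_IDS - 4_285_199_774:,} IDs to spare")
```

Under that assumption the old home page figure sits just under the 32-bit ceiling, while the 8 billion figure would need at least 33 bits, so it could not all live in a single 32-bit ID space.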
03/24/05 23:47:51
