Farm Development

Googlebot's Fatal Flaw And How You Can Fix It (or Get Rich Trying)

I came across this article today on Coding Horror about how Google has a monopoly on search engines and how something must be done about it. I'm not one who falls into the "Google Is Evil" camp; I actually think they are a benevolent force in the world :) However, as with any monopoly, the lack of competition stifles progress. And when I think about the state of today's technology, I can't help but wonder why Google has not fixed the most fatal flaw in their Googlebot:

It does not behave like a web browser.

Search engines are made for people and the majority of people browse the Internet with a web browser. The first comment on the article is a cry for help: "What can we do?" I have an answer to that question. And you can take my answer and turn it into a business plan and climb the golden staircase to success. Any smart investor would be begging you to take their money. Google generated $5.37 billion in Q2 of 2008 and their flagship product doesn't even work! In fact, I'm going to give this to you all for free; all I ask is that you visit me one day and say thanks. Are you ready?


If you build what I am about to propose, Google will soil their pants. You will invoke the mighty forces of the free market and perhaps Google will fix their own Googlebot. That is, if they don't buy yours first.

Before I get into the technical details, let's consider the problem. Here is a little hobby site I built for browsing and listening to records on eBay: http://aintjustsoul.net/ As you click around you'll notice it uses lots of sexy Ajax to load images and play sounds. This is good for users with web browsers but not good for Googlebots. In fact, before I added static representations of content, you could not find my website by searching Google. As a webmaster (I love that word), I should not have to produce a static, non-sneaky version of my site just for Google. Humans can already use it, right?! Whether Google wants to admit it or not, Ajax-enabled websites are the future of the Internet. There are just so many usability issues that can be solved with Ajax. Gmail is the best example of how Ajax improves user experience.

If Google doesn't learn how to crawl the web like a real person IT WILL FAIL.

Here is my recipe for a browser-like search indexer. I'll sheepishly point out that I gave myself two hours to build a prototype of this and I failed. However, I am confident that someone with more experience fighting cross-domain browser limitations could build one in two hours or less! That is your challenge. Digg this, slashdot it, do whatever it takes. This is how you can help.

Ingredients:

  1. Indexer: A server that accepts a POST with three parameters: url, link_clicked, and text. This service saves data for the search index. The link_clicked would be the text of a link that might have been clicked while at the given URL (the problem with URLs that do not change is that there is no way to send a person back to the page from a search engine; however, people use anchor-based navigation to work around this). A sketch of such an endpoint follows this list.
  2. Crawler: An HTML file that you can load in a web browser and give it a URL to start at. It loads the page, posts the text to the Indexer then clicks each link, posting a "snapshot" after each click.
  3. Database: A very big database. I'd suggest the Amazon Simple Storage Service (S3).
  4. Grid: A way to run many web browsers in parallel, like, at least 1,000 at once. The Web is big but don't let that intimidate you! I'd suggest the Amazon Elastic Compute Cloud (EC2) and taking a look at setting up Selenium Grid on EC2 for ideas on how to automate web browsers. The Windmill project may also be useful. The Saucelabs Selenium service might even be great for this.

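To make the Indexer ingredient concrete, here is a minimal sketch of that endpoint, assuming Node.js and no particular framework (the recipe doesn't prescribe an implementation, so treat the details as illustrative). It accepts the POST described above and hands the data off to storage:

var http = require("http");
var querystring = require("querystring");

http.createServer(function(req, res) {
    if (req.method !== "POST") {
        res.writeHead(405);
        return res.end();
    }
    var body = "";
    req.on("data", function(chunk) { body += chunk; });
    req.on("end", function() {
        // jQuery's $.post sends application/x-www-form-urlencoded by default.
        var doc = querystring.parse(body);
        // doc.url, doc.link_clicked (may be empty for the start page), doc.text
        save_to_index(doc);
        res.writeHead(200, {"Content-Type": "text/plain"});
        res.end("indexed");
    });
}).listen(8000);

function save_to_index(doc) {
    // Placeholder: persist {url, link_clicked, text} to the Database ingredient (e.g. S3).
    console.log("indexing", doc.url, "via link:", doc.link_clicked || "(start page)");
}
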
There you have it. Using these ingredients, I cannot see any technical limitations to building a search engine indexer that behaves like a real web browser. The Crawler is a little complicated so I'll point out some approaches. Conceptually, you want to do something like this (a JavaScript example using jQuery):

$(document).ready(function() {
    // Start at the first URL, then snapshot it and everything linked from it.
    load_url("http://aintjustsoul.net/");
    take_snapshot("http://aintjustsoul.net/", null);
});

function take_snapshot(url, link_clicked) {
    // save the text:
    $.post(
        "/path/to/indexer", {
            url: window.location.href, // includes the hash #
            link_clicked: link_clicked && link_clicked.text(),
            text: $("body").text()
        });
    // For every <a> tag (a link), click it and take another snapshot.
    // Note that this query will probably need to be done on an iframe (see below).
    $("a").each(function(i, link) {
        var clicked = $(link);
        clicked.click(); // trigger the click so any Ajax handlers run
        take_snapshot(url, clicked);
    });
}

function load_url(url) {
    // FIXME: cross-domain compatible Ajax load (see below)
}

This code obviously will not work as is, mainly because cross-domain security will force you to load the target page into an iframe or use some similar approach. But you get the idea. There are several solutions to the cross-domain issues; one is detailed here using iframes, and the dojox.io module solves it like this:

dojox.io.xhrPlugins.addCrossSiteXhr("http://aintjustsoul.net/");
dojo.xhrGet({url:"http://aintjustsoul.net/", ...});
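
If you go the iframe route instead, a bare-bones load_url() might look something like the sketch below (note that the frame's document is only readable when the crawler page and the target share an origin, for example when everything is served through the proxy described next); take_snapshot() would then query the frame's body instead of the top window:

function load_url(url, onload) {
    var frame = document.createElement("iframe");
    frame.style.display = "none";
    frame.onload = function() {
        // Hand the loaded document back so take_snapshot() can query it.
        onload($(frame.contentDocument).find("body"));
    };
    frame.src = url;
    document.body.appendChild(frame);
}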

If you want to be boring you could make your own proxy server in Python (or whatever) that loads URLs locally and passes through the content. (It would be slightly more exciting if the proxy had a bacon feature, like http://bacolicio.us/.) The take_snapshot() function would then need to pull some tricks to rewrite link URLs before clicking on them.
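
Here is a rough sketch of that pass-through proxy, written in Node.js rather than Python simply to keep all the examples in JavaScript; the /proxy?url=... shape is just an illustration:

var http = require("http");
var https = require("https");
var urllib = require("url");

http.createServer(function(req, res) {
    var target = urllib.parse(req.url, true).query.url;
    if (!target) {
        res.writeHead(400);
        return res.end("missing ?url= parameter");
    }
    var client = target.indexOf("https:") === 0 ? https : http;
    client.get(target, function(upstream) {
        res.writeHead(upstream.statusCode, {
            "Content-Type": upstream.headers["content-type"] || "text/html"
        });
        // Pass the content straight through; a real crawler would also
        // rewrite <a href> values to point back at /proxy?url=...
        upstream.pipe(res);
    }).on("error", function() {
        res.writeHead(502);
        res.end("could not fetch " + target);
    });
}).listen(8080);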

I'm still convinced this is easy. I have no idea why Google isn't doing it already. Some other things to consider: you'd probably end up wrestling a little bit with popup windows and JavaScript conflicts, but a window.onerror handler can help you log these problems for analysis. You'd also need a comprehensive browser farm, but Firefox is a great place to start since you can run it cheaply on EC2 using Linux. Most sites seem to work in Firefox these days, so it might even be sufficient for indexing purposes.
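
To give a flavor of what one worker in that farm could look like, here is a sketch using the selenium-webdriver Node.js bindings pointed at a grid hub; the hub address and browser choice are placeholders, not a prescription:

var webdriver = require("selenium-webdriver");

function crawl_with_firefox(startUrl) {
    var driver = new webdriver.Builder()
        .forBrowser("firefox")
        .usingServer("http://my-grid-hub:4444/wd/hub") // hypothetical hub address
        .build();

    return driver.get(startUrl)
        .then(function() {
            // Same idea as take_snapshot(): grab the rendered text, not the raw HTML.
            return driver.findElement(webdriver.By.css("body")).getText();
        })
        .then(function(text) {
            console.log("indexed", startUrl + ":", text.length, "characters of text");
        })
        .then(function() {
            return driver.quit();
        });
}

crawl_with_firefox("http://aintjustsoul.net/");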

  • Re: Googlebot's Fatal Flaw And How You Can Fix It (or Get Rich Trying)

    When I visit http://aintjustsoul.net/, all I see is a broken site. Then I temporarily allow JS and it errors about Flash. Then, nothing. I click around a dozen times and don't get it; JS goes back to blocked and bye.

    Textual content seems to have a much higher ROI than anything else right now. And their massive multimedia browser farm is called youtube ;)

  • Re: Googlebot's Fatal Flaw And How You Can Fix It (or Get Rich Trying)

    Oops, I don't have much time to work on aintjustsoul.net these days, but you can substitute many other Ajax-enabled sites for it. Also, I think the majority of people use the web with JavaScript turned on (especially for Gmail, Facebook, etc).

    As for textual content, Google is going to miss that content if it doesn't learn how to crawl like a browser. As for youtube being a browser farm, I'm not sure what you mean by that. I have no doubt though that Google has a browser farm, especially for Gmail testing which they do in a custom Selenium Grid (engineers talk about this at conferences like GTAC).

    I also have no doubt that Google is probably already working on a browser-like crawler although if so, why doesn't it exist? ;)

  • Re: Googlebot's Fatal Flaw And How You Can Fix It (or Get Rich Trying)

    Googlebot does understand at least some CSS and JavaScript. It has done so for years.

  • Re: Googlebot's Fatal Flaw And How You Can Fix It (or Get Rich Trying)

    Kumar (OR ANYONE ELSE): if you can be cashflow-positive in 60 days Mark Cuban might just fund you! See http://blogmaverick.com/2009/02/09/the-mark-cuban-stimulus-plan-open-source-funding/, and good luck ;-)

  • Re: Googlebot's Fatal Flaw And How You Can Fix It (or Get Rich Trying)

    What's Google's incentive to make their crawler work with sites which don't make their content accessible, and how does it compare to the magnitude of your incentive in doing the work instead? In addition, emulating a Web browser is most likely to get their crawler stuck in a "honeypot" as it pulls up endless variations of slightly different pages at the same location. Where's the attraction in developing something and having large tracts of your machine farm burning CPU and bandwidth for little real gain?

    The problem with AJAX and all other attempts to make the Web behave like something else is that these technologies seek to deny what makes the Web interesting from the point of view of information publishing and retrieval. If the Web had been high on the AJAX from the start, there'd probably be no Google.

  • Re: Googlebot's Fatal Flaw And How You Can Fix It (or Get Rich Trying)

    The only way to know if this is going to be worth it is to look at the numbers of how many pages are missed by current technologies, and who has those numbers best? Google. If it is worth it, they'll do it. They already have the technology and infrastructure in place. They already have a browser that they've established can run headless across thousands of machines, crawling across the internet. That is how their automated tests for Chrome work. They could easily adapt this to compare against their current search results and determine the validity of the approach, and then get a whole new bang for their chromium bucks.

    Or, maybe that was the whole point all along.

  • Re: Googlebot's Fatal Flaw And How You Can Fix It (or Get Rich Trying)

    Perhaps I am missing something, but doesn't this just crawl "a" tags?

    Isn't the challenge of crawling web applications that plenty of non-a things also provide functionality?

    Which of course brings me to the point that it doesn't make sense to "crawl" web applications. The value of a web application to most users isn't the content it contains, but the interface it provides to that content. For Google to "rank" web applications on merit, it would have to evaluate the interface, not the content. It would be tremendously fascinating to see a search engine that could tell me that GMail was a better email program than Hotmail, by actually evaluating the merits of the application. But I suspect the effort involved wouldn't be worth it, especially since the results would likely be similar to evaluating the inbound links to the respective applications.

    But I appreciate you putting the idea out, just for the discussion opportunity.

  • Re: Googlebot's Fatal Flaw And How You Can Fix It (or Get Rich Trying)

    @Alec Munro: yes, the "a" crawl is only the simplest example; there are many actions a user could perform on a website to see new content. Gmail is a poor example in relation to search engines. There is no reason to index Gmail for search purposes. But imagine how that app behaves, then think of an Ajax app that catalogs a book collection. Such a site would be useful to crawl, but unless there was a static version, Google could not crawl it.

    @Paul Boddie: I'm not saying this approach is efficient :) And yeah, Chrome has a lot of headless features and I'm sure we'll see some nice lib versions of its fast JS engine for command line usage too. Chrome is probably Google's own answer to this problem, no doubt.

    The "honeypot" problem exists with static sites just the same (Google probably has many algorithm to detect circular traps, doesn't seem too hard).

    As for Ajax making the web less informative, I disagree. I think there are plenty of Ajax implementations that exist purely for "enhanced functionality" (Facebook chat, Google Reader, Gmail, etc) but why compromise functionality for dissemination of information? I think the two can and *should* be combined. Then what does Google do? Throw its hands up and say "we don't care about that information, forget them." The minute someone comes along and builds a search crawler that *can* crawl the information seen in an Ajax app, that crawler will have something that Google does not have.

    It's hard to imagine this scenario now because it's a Catch 22; no one will sacrifice their Google Rank by building a non-googlebot-accessible site just because they can. Still, I am surprised that Google has not rushed towards this goal sooner.

    @Calvin Spealman has the best guess; perhaps Google has an experimental "browser like" crawler and through analysis has determined that there is not enough content their static crawler misses out on. Again, the Catch 22!

