Deconstructing Google

There are quite a few hits for the text “Reverse engineering Google” on Google. The ones I have looked at try to guess how to get attention from Google’s robot and how to get position is search results. This note is about an entirely different question:
“How does Google answer questions so efficiently?”.

I have heard that Google employs about 14,000 Pentium III processors (late 2002), probably each having upwards of 1GB of RAM (the latter is my guess). I have heard that search engines build a slowly changing data base that records facts such as “Word xxx occurs at offset nnn in document zzz.” and that this information, coded somehow, composes the bulk of the data that is used to identify the appropriate pages. I have heard that a given Pentium specializes in a particular set of sites and each site is covered by just one Pentium. Here is a way to compress and organize such data so as to make Google searches plausible, perhaps even possible.

Word frequency is critical. 50 words like “of” might be omitted from the data base. Some words appear just once on the whole web. With misspellings I would guess that there are about 500,000,000 distinct words on the web. I would guess that 95% of words, by count of occurrence, are from a vocabulary of 5000 words at least if Google is sensitive to language.

See some C code for my speculation about the inner loop for what I imagine to be the most time consuming questions: Searching for documents with both of two fairly common words. The byte arrays thru which cp and dp scan are each for some particular word W in some web neighborhood. The sum S of the first k such bytes of the array is the dense code for the Sth page in this neighborhood with at least one occurrence of W. Well there is one caveat: when that sum is 0 mod 256, then there is no designated page. The while is the inner loop and about 6 instructions are required for each page in the web with one word but not the other. A dense code is an integer of about 21 bits that is an index into an array of pages. Elsewhere there are data structures that can get to the page from the dense index.

This is a pretty good data structure for words that appear at least once in every 200 pages. For the next less frequent range of words 16 bit bytes would do. I will call these two byte words. For the common one byte words, we can build an array, the nth element of which indexes into the big array to get immediately to the place where the sum is n2¹². We then need not pass over every array element for the more common word.

There remains the significant problem of preferring those pages where the two words are near or adjacent. The data described so far does not help. For some searches it would be feasible to search the page or some compression thereof. If we want to do better we might build for each word in each page: a list of positions of that word within the page. The position would be by word count, not character count. The list would again be coded by delta entry as above. These lists would be typically very short. A modification of the loops described would detect when the two sought words occurred near each other within a given page.

New notes as of 2010; rather like above.

Consider words that appear on the web just once. I suspect that there are roughly a billion of these, plus or minus an order of magnitude. There is at least one such word on my web site and most search engines can locate it. If I ask Google about this word I presume it consults a list, CWL, of roughly one million words and, if found, substitutes a 20 bit index for the word. I presume that CWL is replicated in every Google machine that examines raw queries. If a query word is not in CWL I would presume that the word is forwarded to some machine that specializes in obscure words.

How many such specialist machines there are in Google’s system depends, I presume, on the frequency of obscure words in queries. These specialists may be collocated and share RAM, or may be distributed to save transmission and latency costs. Hashing techniques and NUMA at one geographical site would serve Google’s entire obscure word load I would guess. Transmission and latency costs are minuscule. Typing errors and plain misspellings in queries are probably the foremost source; yet Google dutifully reports the sites that spell it as you ask, along with the sites you probably wanted that spell the word right.

Perhaps Google is savvy of common word pairs such as “good bye”. It would seem more productive to view “new york” as a single word than “mulct” for instance.