There are quite a few hits for the text “Reverse engineering Google” on Google. The ones I have looked at try to guess how to get attention from Google’s robot and how to get position. This note is about an entirely different question:
“How does Google answer questions so efficiently?”.

I have heard that Google employs about 14,000 Pentium III processors (late 2002), probably each having upwards of 1GB of RAM (the latter is my guess). I have heard that search engines build a slowly changing data base that records facts such as Word xxx occurs at offset nnn in document zzz. and that this information, coded somehow, composes the bulk of the data that is used to identify the appropriate pages. I have heard that a given Pentium specializes in a particular set of sites and each site is covered by just one Pentium. Here is a way to compress and organize such data so as to make Google searches plausible, perhaps even possible.

Word frequency is critical. 50 words like “of” might be omitted from the data base. Some words appear just once on the whole web. With misspellings I would guess that there are about 500,000,000 distinct words on the web. I would guess that 95% of words, by count of occurrence, are from a vocabulary of 5000 words at least if Google is sensitive to language.

See some C code for my speculation about the inner loop for what I imagine to be the most time consuming questions: Searching for documents with both of two fairly common words. The byte arrays thru which cp and dp scan are each for some particular word W in some web neighborhood. The sum S of the first k such bytes of the array is the dense code for the Sth page in this neighborhood with at least one occurrence of W. Well there is one caveat: when that sum is 0 mod 256, then there is no designated page. The while is the inner loop and about 6 instructions are required for each page in the web with one word but not the other. A dense code is an integer of about 21 bits that is an index into an array of pages. Elsewhere there are data structures that can get to the page from the dense index.

This is a pretty good data structure for words that appear at least once in every 200 pages. For the next less frequent range of words 16 bit bytes would do. I will call these two byte words. For the common one byte words, we can build an array, the nth element of which indexes into the big array to get immediately to the place where the sum is n212. We then need not pass over every array element for the more common word.

There remains the significant problem of preferring those pages where the two words are near or adjacent. The data described so far does not help. For some searches it would be feasible to search the page or some compression thereof. If we want to do better we might build for each word in each page: a list of positions of that word within the page. The position would be by word count, not character count. The list would again be coded by delta entry as above. These lists would be typically very short. A modification of the loops described would detect when the two sought words occurred near each other within a given page.