I take this paper as the official introduction to Google’s MapReduce system.
Below, by ‘bag’ I mean an unordered ‘set’ of values, but with repeated values allowed.
They describe map as a user-provided function that takes a key-value pair and returns a bag of key-value pairs.
The role of the key argument to map is unclear to me.
Is the map routine required to pay attention to it?
Perhaps the unstated assumption is that the ultimate purpose is to compute something that depends on a ‘bag’ of a great many pre-existing key-value pairs.
I cannot imagine how the type of the bag elements bears on the logic of MapReduce.
Perhaps the key of each output pair of map must be the same as the key of its argument.
The map function in their first sample code in section 2.1 indeed ignores its key argument.
So far the paper has not described how to install these two functions into the hardware, which must somehow determine the arguments to map.
The sample implies that MapReduce does decimal arithmetic on the intermediate values.
That seems bizarre.
Ah! We come to the MapReduce specification object, where we learn that MapReduce assumes that map takes its input from files.
I have seen no use of the key argument to map in any of the tutorials.
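The word-count sample from section 2.1 can be paraphrased in a few lines of Python; this single-machine toy only illustrates the signatures, and the file-reading driver is my own guess at how the framework determines the arguments to map. Note that the key carries the file name yet the map body never looks at it, and the intermediate counts travel as decimal strings — the two points that puzzled me above.

```python
from collections import defaultdict
from pathlib import Path

def word_count_map(key, value):
    # key: document name (ignored, just as in the paper's 2.1 sample)
    # value: document contents
    for word in value.split():
        yield (word, "1")                     # counts travel as decimal strings

def word_count_reduce(key, values):
    # key: a word; values: an iterator over decimal-string partial counts
    yield str(sum(int(v) for v in values))

def run(input_files):
    # The framework, not the user, chooses the arguments to map:
    # here the key is the file name and the value is the file's contents.
    intermediate = defaultdict(list)
    for name in input_files:
        for k, v in word_count_map(name, Path(name).read_text()):
            intermediate[k].append(v)
    return {k: next(word_count_reduce(k, vs)) for k, vs in intermediate.items()}
```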
I think that the magic is to distribute code so as to be near the data, which today is most likely on many disks each attached directly to computers that can execute such code.
Then route the results, as intermediate keyed packets, via topology-savvy mechanisms to places where packets with the same key come together for subsequent application processing.
A final, typically insignificant, stage collects the yields of this subsequent processing.
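That routing step amounts to a partition of the intermediate key space across the reduce sites; the paper's default partitioning function is hash(key) mod R. A minimal single-process sketch of the grouping (the value of R is illustrative; a real implementation ships the buckets over the network with the topology awareness described above):

```python
R = 4                                   # number of reduce sites (illustrative)

def partition(key, R=R):
    # All packets bearing the same key land at the same reduce site.
    return hash(key) % R

def shuffle(intermediate_pairs, R=R):
    buckets = [[] for _ in range(R)]
    for k, v in intermediate_pairs:
        buckets[partition(k, R)].append((k, v))
    return buckets                      # bucket i is shipped to reduce site i
```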
That it can transparently hide hardware failures is an extra strategic advantage.
MapReduce is a peculiar sort of magic.
It cannot be adequately described as a normal software abstraction device.
It cuts across abstractions in an unconventional manner.
That is why it can be so effective.
An implementation that gains these possible advantages would be deep in bed with packet-routing decisions, which for some applications is where most of the performance advantage arises.
I suspect that MapReduce optimizes data transmission within the data center for most applications where the cost of such transmission is significant.
MapReduce is a framework in which to efficiently map a significant number of applications onto the massive computer hardware that we know how to build today.
Distributed in Time (batched web crawl)
There is an interesting significant variation on MapReduce that is enough like it to allow some common interfaces.
I will call it “time-MapReduce”.
The idea is to distribute the application code not across machines and space, but across time, and perhaps machines.
There are two variations:
- We have one or a few machines that crawl the web, or some such large varying data collection.
In the case of the web these machines parse the HTML into DOM-like data structures such as those employed in browsers.
Each application from the set of applications submitted since the last web crawl is then run with the DOM-like structure mapped read-only into its address space.
These applications also learn the URL via which the material was fetched, and the time of fetching.
The cost of fetching and parsing the web data is thus amortized over several applications (see the sketch after this list).
- There are many disks full of data at one site.
For each disk there are just a few CPUs with efficient access to that disk.
Periodically the system scans the disk content and gives control to a set of applications, running on the disk-local CPU, that have read-only in memory access to the data.
For common disk technology the period of this scan might be hours.
The cost of reading the disk and perhaps decompression is amortized.
The cost reduction is even more dramatic if the storage medium is modern magnetic tape whose access is inherently linear.
In either variation the application can emit a ‘packet’ that will be efficiently accumulated, perhaps sorted, and routed to a second later phase of the same application which can then report results to the customer.
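Here is a sketch of the first variation under my own assumptions about the interface; the names Page, App, and crawl_pass are all hypothetical. Pages are fetched and parsed once, every application submitted since the last crawl runs against the read-only parse plus the URL and fetch time, and its emitted packets are accumulated, perhaps sorted, for the later second phase.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Tuple

@dataclass(frozen=True)                 # frozen: applications see the parse read-only
class Page:
    url: str                            # metadata exposed to the application
    fetch_time: float                   # the time of fetching
    dom: Any                            # DOM-like parse of the HTML

# An application maps one page to zero or more keyed packets for its 2nd phase.
App = Callable[[Page], Iterable[Tuple[str, bytes]]]

def crawl_pass(pages: Iterable[Page], apps: List[App]) -> List[List[Tuple[str, bytes]]]:
    """One scan of the crawled data; the fetch/parse cost is shared by every app."""
    outboxes: List[List[Tuple[str, bytes]]] = [[] for _ in apps]
    for page in pages:                  # parse once ...
        for i, app in enumerate(apps):  # ... run every pending application on it
            outboxes[i].extend(app(page))
    for box in outboxes:
        box.sort()                      # accumulated, perhaps sorted, then routed
    return outboxes                     # each outbox feeds that app's 2nd phase
```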
With an equal investment in tape cartridges and drives (100 c/d), the scan time is about 300 hours or two weeks.
A petabyte costs about $20,000 of cartridges.
A web crawl machine costs about $10^3 and lasts 10^8 seconds.
A context switch costs 10^3 machine cycles or 10^-6 seconds.
The cost of one app looking at one page, not including processing the page, is about $10^-11.
One cent pays for an app to glimpse the entire web.
Latency, however, is days, though samples come very soon.
This cost excludes:
- Shared overhead for crawling the web
- Profit
- Power and Cooling
- Scanning the crawled page’s DOM
- Sending the app-specific report to a central processing point for the 2nd phase of the app.
- Running the 2nd phase of the app.
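Treating the figures above as rough assumptions, and taking the web to be about 10^9 pages (which is what the one-cent total implies), the arithmetic can be checked in a few lines:

```python
machine_cost  = 1e3     # dollars per crawl machine
machine_life  = 1e8     # seconds, roughly three years
switch_time   = 1e-6    # seconds per context switch (10^3 cycles at ~1 GHz)
web_pages     = 1e9     # assumed size of the web, implied by the one-cent figure

dollars_per_second = machine_cost / machine_life       # 1e-05
cost_per_glimpse   = dollars_per_second * switch_time  # 1e-11 dollars per app per page
cost_whole_web     = cost_per_glimpse * web_pages      # 0.01 dollars: one cent

print(f"{cost_per_glimpse:.0e} $ per page, {cost_whole_web:.2f} $ for the whole web")
```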
Degrees of Privacy
The customers of these applications will often want some privacy of their work.
The CIA would require a very high degree of assurance that other customers would gain no knowledge of the nature of their queries.
It would seem that they would have to trust the operators of the service.
In the spirit of capability design the applications have no access to anything except as listed here:
- RO access to the code that defines the application logic and any language support code.
- RO access to the parsed and perhaps raw page data.
- RO access to metadata:
- URL,
- IP address,
- Fetch time,
- Relevant http header information.
- RW access to an initially zero segment that will be delivered to the 2nd application phase.
Each RW segment will be delivered, sorted by IP address, to the 2nd phase of the application together with the metadata.
Under broad circumstances this sorted data can be shipped directly to the customer without further processing.
Note that the first-pass application code has no access to the clock, only to the time at which the page was fetched.
This precludes a complex set of timing attacks designed to analyze the behavior of other applications.
Sorting the data precludes other such attacks.
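In the spirit of that capability list, here is a minimal sketch of the interface a first-pass application might be handed. Every name is hypothetical, and Python cannot by itself enforce the confinement, so this only records the intended shape: read-only page data and metadata, a private initially empty report segment, and no clock other than the fetch time.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

@dataclass(frozen=True)
class PageCapability:
    """All a first-pass app may read: page data and metadata, but no clock."""
    page: Any                     # RO: parsed (and perhaps raw) page data
    url: str                      # RO metadata
    ip_address: str
    fetch_time: float             # the only notion of time the app ever sees
    http_headers: Dict[str, str]  # relevant http header information

@dataclass
class ReportSegment:
    """RW, initially empty; later delivered, sorted by IP address, to phase 2."""
    records: List[bytes] = field(default_factory=list)

    def emit(self, record: bytes) -> None:
        self.records.append(record)

def run_first_pass(app: Callable[[PageCapability, ReportSegment], None],
                   cap: PageCapability) -> ReportSegment:
    seg = ReportSegment()
    app(cap, seg)                 # the app receives only the capability and its segment
    return seg
```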