Section 4.2 of Intel’s “Intel® 64 and IA-32 Architectures Software Developers Manual, Volume 3 (3A & 3B): System Programming Guide” ( here?) describes the IA-64 interpretation by hardware of the 64 bit ‘linear address’ produced by program logic. Implementations are presumed to implement k bits of a linear address and then demand that the high (65−k) bits all be the same. A 64 bit word satisfying this is called ‘canonical’. Non canonical numbers cause faults when used as addresses. The hardware assumes that the software has created a tree of mapping tables. A mapping table is a page (4KB) in size. Each entry in a table is 8 bytes long and provides the address of either a page or another mapping table. The privileged register CR3 holds the physical address of the level 0 mapping table which is the root of the mapping table tree. An entry of a mapping table at level n locates a mapping table of level n+1. There are roundup((k−12)/9) levels of tables. The left k−12 bits of the address is divided into 9 bit portions with perhaps the left most portion less than 9 bits. The portions, from left to right are used to index into the successive levels of the mapping tree. According to bits found in a mapping table entry, the entry may:
gcc a.c -Wmost tbx86.s -fnested-functions time ./a.out 46 P=70368744177664 329992749 malloc: 40599 0 test: 32166 0 test: 29817 0 test: 28152 0 test: 29889 0 test: 54 0 test rep addr: 63 store in area: 2750230224 free: 278035515 malloc: 69075 0 test: 32598 0 test: 29520 0 test: 32643 0 test: 31554 0 test: 45 0 test rep addr: 72 store in area: 2759717952 free: 282390579 malloc: 80118 0 test: 34911 0 test: 31896 0 test: 33075 0 test: 24642 0 test: 63 0 test rep addr: 72 store in area: 2768732793 free: Total clocks = 9169660836 real 0m3.842s user 0m0.001s sys 0m3.840sThe units are processor clock cycles as counted by Intel's ‘Time Stamp Counter. It is hard to see how this might cause disk IO. At least some full executions of the program leave both disk read and disk write counts, as reported by the ‘Activity Monitor’, unmodified. The total cycle count is about 9.6 giga clocks which is about right for the 2.4 GHz CPU. The program can allocate 246 but not 247 bytes. That is about right for a CPU that implements 48 bit virtual addresses. I have not found an authoritative source of such information for the Core 2 Duo in my Mac. 10 Giga clocks (243) for ‘zeroing’ 64 tera bytes would be impressive if it were not lazy, and thus almost entirely avoided. 246 bytes is 234 pages requiring 237 bytes of page tables. My Mac has only 233 bytes of real memory and so they create page tables lazily too as does Keykos. It is plausible that they build all the 228 bytes of 2nd level page tables. I think that they don’t because the Activity Monitor shows no memory usage blip. It is curious that they spend 1.4 Gops freeing memory—nearly 10 times that required to allocate it. Both the malloc and free times are proportional to the size of allocated storage—down to a tiny 100 GB of allocated storage. Time for malloc is proportional too. I suspect that there is an avoidable kernel loop considering some level in the Hierarchy.
In Keykos the segment keeper would need to traverse a tree with 5 full length branches. Each branch would have 8 nodes plus the root node. That’s about 40 nodes. Each slot of each node must be examined because the keeper has no other record of the shape of the tree. That is 640 slot queries each of which might take about 1000 clocks including the query to see if it is a data key. Those 40 nodes would be sold back to the space bank and each sale might require about 2000 clock cycles. That would be about 700,000 clock cycles to sell the segment for scrap. Reclaiming the page tables is a side effect of reclaiming the segment nodes that produced them—already included.
Creating and deleting a segment keeper is something like 10000 clocks. Populating the segment would be one tree branch per fault. That would be about the cost of tearing it down. These are probably somewhat optimistic estimates.