This is about safe allocation of stack frames.
One limit to the size of a stack frame is the size of the address space.
Safe compiled code should not use only the low 32 bits of a computed frame size
to ask for more stack space; a size just over 2^32 bytes would then appear tiny, and the request would be far too small.
Here is an economical way to answer, in the prelude of a routine, the more frequent question of whether stack space is available.
We suppose that the kernel can invalidate one page of addresses at the end of the space for each stack in an address space.
It may suffice to make such a page read-only, if that is easier.
A routine’s prelude presumably stores into a newly allocated frame.
If the stack frame size is known at compile time to be less than the size of a page, then there is no chance that a frame allocation will skip over the invalid page at the end of stack space.
When the compiler cannot be sure that the new frame is less than the size of a page, it can spend extra instructions in the prelude to ask whether there is space.
But how does the compiled code know the end of the stack space?
This is the interesting question of thread local storage.
Thread local storage could hold that limit, but how is that storage found?
When several threads share all or nearly all of an address space, where is the limit found against which the stack pointer is compared to detect stack overflow?
This would seem to require one of:
- a register allocated conventionally during calls, devoted to locating where the stack end is stored, and perhaps additionally other “thread” parameters.
This may practically require always allocating that register.
Allocating such a register impacts the calling convention of an architecture, which becomes difficult once software development for a new machine has begun; these days such conventions are fixed early in an architecture’s life.
- For uniprocessor multiprogramming, a location fixed at load time can locate thread local storage.
The kernel can modify this location upon thread switch.
- The page that appears at some virtual address, set at load time, accesses a different physical page for each thread.
This address can now live in the instruction stream.
This does for the thread what the multi-processor 370 hardware did for each 370 CPU that might be obeying unmapped privileged code; each CPU saw a different page zero.
See the 370’s prefix register.
This provides thread local storage if the kernel provides this function via its control of the memory map.
This costs little in address-mapping overhead on most hardware architectures, for most of the map is still shared among these several related spaces.
This precludes TLB sharing, however, in those machines whose TLBs can remember more than one space at a time.
(Among common modern architectures, only x86 fails to do so.)
- A kernel call to fetch a value unique to this thread, such as a pointer to the thread’s local storage.
This can be a minimal call.
- A new special hardware register, readable by user code, optimizes the above.
The privileged code is in a position to load this register upon context switch.
Accessing this register need not be fast; a whole instruction can be specified to do so.
The register need not live in the general register file.
The alternative, the kernel call above, costs two context switches.
The Wikipedia article on thread local storage describes several APIs for this but does not say how they are implemented.
With a bit of coordination with memory trap logic, a civilized report can be made of exhaustion of stack space.
Early PL/I compilers for the 370 allocated such a register for thread specific information and used a “segmented stack”.
Every call would test whether there was still room in the current stack segment and allocate another segment if necessary.
Upon such an allocation, a stub frame was placed in the new segment; returning through it would deallocate that segment.