There are CPUs now with hardware multithreading, whereby a cache miss does not leave most of the CPU’s hardware resources standing idle.
Other threads can profitably occupy that hardware and make headway.
It is a scandal that today’s kernels (including mine) take so long to create a new thread.
I think a thread sticks to one core on current multicore chips.
Standing back a long way from the problem, it would seem that a compiler could identify pieces of code as small as 100 instructions and emit yet-to-be-invented instructions that would create a new thread to perform those pieces concurrently with the main line, as a sub-task.
These new instructions would not normally require intervention of privileged code.
Perhaps it is not fruitful to distinguish between main-line and sub-task.
The above pairs two very difficult problems: designing a compiler to discover such parallelism, and designing instructions to create threads.
Perhaps it is better to think of it as one problem.
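For concreteness, here is roughly the shape of code such a compiler might emit. This is a sketch only: no user-mode thread-create instruction exists, so a pthread call stands in for it, and `checksum_block` is a made-up subtask.

```c
#include <pthread.h>
#include <stdio.h>

static long sum;                         /* result of the spawned subtask */

/* A subtask of roughly 100 instructions; the name is made up. */
static void *checksum_block(void *p)
{
    for (const char *s = p; *s; s++)
        sum += *s;
    return NULL;
}

int main(void)
{
    static char buf[] = "a block of data to be checksummed";
    pthread_t sub;

    /* Today this line costs a trip through the kernel; the proposal is
     * that it become one unprivileged instruction. */
    pthread_create(&sub, NULL, checksum_block, buf);

    /* ... main-line work that touches neither buf nor sum ... */

    pthread_join(sub, NULL);             /* the continuation point */
    printf("checksum %ld\n", sum);
    return 0;
}
```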
Machines with a small user state, such as the X86, suffer thereby from an inefficient instruction set, but here, small state may be an advantage.
Modern multiprogramming schemes need a ‘thread local storage’ register whose value can be read in a few cycles.
If registers are numerous it is satisfactory to allocate a general register for this purpose.
General registers are not numerous in X86 architectures.
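Today’s compilers work around the scarcity by addressing thread-local data through a segment register; on x86-64 Linux, gcc and clang use %fs for this. A minimal example:

```c
/* Thread-local storage as compilers provide it today.  On x86-64
 * Linux, gcc and clang address this variable relative to the %fs
 * segment register rather than spending a general register on it. */
#include <stdio.h>
#include <pthread.h>

static __thread int per_thread_counter = 0;  /* one instance per thread */

static void *worker(void *arg)
{
    per_thread_counter++;   /* compiles to an %fs-relative increment */
    printf("my counter: %d\n", per_thread_counter);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    return 0;
}
```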
I thus propose to combine two hard problems; but there is yet another problem to include:
There are several architectures, such as Infiniband, for high bandwidth data paths between machines (boards with a few chip sockets, each with several cores, each with several threads).
These provide RAM to RAM transfer, perhaps without mediation by the kernel.
Some claim latency of about 1 µsec.
There are proposals for such architectures, but they all have problems.
The problems arise in unblocking, or creating, some thread to attend to data that has just arrived from another machine.
See Remote Direct Memory Access for instance.
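For concreteness, posting a one-sided write with today’s libibverbs looks roughly like this. Everything but the post itself is omitted; the surrounding setup (device, protection domain, queue pair, connection) is substantial.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Post a one-sided RDMA write.  Assumes the queue pair (qp) is already
 * connected, the local buffer is registered (mr), and the peer's
 * remote_addr and rkey were exchanged out of band.  The write lands in
 * remote RAM with no remote kernel or CPU involvement, which is
 * exactly why some thread must then be found to attend to the data. */
int rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
               void *local, uint32_t len,
               uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof wr);
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);   /* user mode; no system call */
}
```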
I think that the direct access problem and the threading problem are one big messy problem, because a logical thread must frequently yield its CPU assets as the task at hand shifts from issuing instructions to transmitting data.
Another software architecture issue here is the tension between passing data by value and passing it by reference.
C makes passing strings by reference efficient but leaves storage lifetimes up to the programmer.
Passing by reference works only if sender and receiver are in the same address space.
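A few lines illustrate the lifetime half of that bargain (`greet` is a made-up function):

```c
/* Passing by reference is cheap, but it ties the receiver to the
 * sender's storage lifetime; C leaves that bookkeeping entirely to
 * the programmer. */
const char *greet(void)
{
    char buf[] = "hello";
    return buf;     /* bug: buf dies when greet returns */
}
```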
The idea of quickly created threads implicitly assumes a shared address space.
Current threading hardware does not require this.
I do not know what the security situation is in these cases, but a common assumption is that one application owns the whole of the larger system.
Even applications need boundaries, however.
Questions:
- Dare we hope for a programming paradigm where passing by value and passing by reference are unified in the language semantics?
- Functional languages achieve this.
OCaml is fairly efficient for many applications and is functional enough to provide this abstraction.
- Need applications be savvy of:
- Cache and cache identity
- Thread and thread identity
- Core and core identity
- Chip and chip identity
- RAM system and identity
- Yes, but it may not be necessary that the language syntax address these distinctions.
Most cache-friendly algorithms do not visibly speak to the cache.
A matrix routine that is cache-friendly when applied to a 20 by 20 matrix is likely to be page-fault-friendly when applied to a 200 by 200 matrix.
‘Locality of reference’ wins for caches.
Does it win for these other issues?
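By way of illustration, a blocked matrix multiply never names the cache, yet its tile size quietly determines which level of the memory hierarchy it is friendly to. A sketch, with sizes echoing those above (it assumes c starts zeroed):

```c
/* Blocked (tiled) matrix multiply.  Nothing here names the cache, yet
 * choosing the tile size B so that a few B-by-B tiles fit in cache
 * makes the inner loops reuse data while it is still resident; at
 * larger scales the same shape is kind to the paging system. */
enum { N = 200, B = 20 };

void matmul(const double a[N][N], const double b[N][N], double c[N][N])
{
    for (int i0 = 0; i0 < N; i0 += B)
        for (int j0 = 0; j0 < N; j0 += B)
            for (int k0 = 0; k0 < N; k0 += B)
                for (int i = i0; i < i0 + B; i++)
                    for (int j = j0; j < j0 + B; j++)
                        for (int k = k0; k < k0 + B; k++)
                            c[i][j] += a[i][k] * b[k][j];
}
```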
The above rosy picture ignores several issues.
Several design questions immediately intrude:
- Presumably the hardware that holds a user state (thread) is attached to the only core that can quickly execute instructions in that stream.
Perhaps there are data paths to move a user state from one core to another on the same chip.
- Is the application code aware of affinity of two threads to the same core?
Yes if the compiler invents the thread as envisioned above.
Each ‘machine’ has its own RAM; the division of cores into chips then disappears from application purview.
- How much privileged state may be shared between threads?
- Most of it; in particular, a new thread can share the address space with its parent.
Any user-mode instruction to create a thread would share the address space.
- How to coordinate?
In diamond control flow, the hardware resources holding a thread’s state should be freed as soon as the first branch finishes, and the logical processor running the other branch should take up the ‘continuation’ when it finishes its branch; a sketch of such a join follows this list.
- How much user state is needed in a particular thread?
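Here is a sketch of that diamond join, using C11 atomics in place of whatever the hardware would provide; `branch_done` and `continuation` are made-up names.

```c
#include <stdatomic.h>
#include <stdio.h>

static atomic_int branches_left = 2;   /* one count per diamond branch */

static void continuation(void) { puts("both branches done"); }

/* Each branch calls this as its last act.  The decrement is the whole
 * coordination cost: the first finisher simply returns, freeing the
 * hardware resources that held its state, and the last one through
 * takes up the continuation. */
static void branch_done(void)
{
    if (atomic_fetch_sub(&branches_left, 1) == 1)
        continuation();
}
```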
I have implicitly assumed that the application is aware of the allocation of the machines’ respective RAM to application data.
These ideas also ignore the problem of congestion-based migration of work to other machines.
Intel’s hyper-threading technology makes the ‘logical processor’ equivalent to the CPU chip of the previous generation.
The HT hardware is thus suitable for largely unmodified MP kernels.
It is clear that sometimes application logic must initiate data movement or thread movement, and at other times kernel congestion logic must cause this.
How are they to coordinate?
The reader will note that there are more questions here than answers.
Following section 3.2 (p. 65) of Intel’s “Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1: Basic Architecture”, we identify several useful ‘user state’ components appropriate to X86 programs:
- Address space
- This is actually a reference to cached physical memory (RAM) and consists of a couple of control registers that locate paging tables in cached real RAM.
- Basic program execution registers
- These are little more than the user state of the early X86 chips: 8 GP registers, the program counter, and the flags.
About 320 bits hold this state; roughly 1K bits for 64-bit programs, which have 16 general registers of 64 bits each.
I assume here that the segment registers are not used.
Today most compiler output needs no more state than this.
- x87 FPU registers
- These 640 bits are aliased to the MMX architecture q.v.
- XMM registers
- These hold operands of the SSE instructions.
Size is about 1K bits (8 registers of 128 bits each; 64-bit mode doubles the count to 16, for 2K bits).
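Gathering those figures, a user-mode thread-create instruction would have to initialize something like the following. This struct is only an inventory for sizing, not any real hardware save format; 32-bit figures are used.

```c
#include <stdint.h>

/* An inventory of the user state named above, for sizing only. */
struct x86_user_state {
    uint32_t gp[8];          /* general registers:            256 bits */
    uint32_t eip, eflags;    /* program counter and flags:     64 bits */
    /* 320 bits so far, enough for most compiler output                */
    uint8_t  x87_mmx[8][10]; /* 8 x 80-bit x87/MMX registers: 640 bits */
    uint8_t  xmm[8][16];     /* 8 x 128-bit XMM registers:     1K bits */
    /* The address space is a couple of control registers locating
     * page tables; a spawned thread would share the parent's.         */
};
```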
Communicating Data Between Threads
If threads are in the same address space then references will do.
Data lifetime is an issue, but one that has more or less been solved (by reference counting or garbage collection, for instance).
If threads use the same RAM, or caches thereof, then address spaces of respective threads can share RAM subspaces, cached to various degrees.
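Today such sharing is typically arranged with something like POSIX shared memory: two address spaces map the same RAM subspace, and references into it are meaningful to both only if they map it at the same address or exchange offsets instead. A minimal sketch (the name "/scratch" is illustrative):

```c
/* Two address spaces sharing a RAM subspace via POSIX shared memory.
 * Pointers into the mapping are private to each address space, so
 * processes exchange offsets unless they map at the same address. */
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

int main(void)
{
    int fd = shm_open("/scratch", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, 4096);
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    strcpy(p, "visible to any process that maps /scratch");
    munmap(p, 4096);
    close(fd);
    return 0;
}
```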
I don’t think Infiniband claims to move data between caches faster than normal cache snooping.
NUMA is an intermediate step towards isolated RAM.
Kernel software is then necessary to move data closer to the processing, not for logical reasons, but for performance.
Isolated RAM with some explicitly programmed data transmissions using hardware such as Infiniband is possible.
All of these plans impact application design if their performance advantages are to be achieved.