Hardware-Software Tradeoffs

There are many methods to reduce hardware cost. One method is to integrate the communication assist and network less tightly into the processing node, at the price of higher communication latency and occupancy.

Another method is to provide automatic replication and coherence in software rather than in hardware. This approach provides replication and coherence in main memory and can operate at a variety of granularities. It allows the use of off-the-shelf commodity parts for the nodes and interconnect, minimizing hardware cost, but it puts pressure on the programmer to achieve good performance.

Relaxed Memory Consistency Models

The memory consistency model for a shared address space defines the constraints on the order in which memory operations to the same or different locations appear to execute with respect to one another. In fact, any system layer that supports a shared address space naming model must have a memory consistency model. This includes the programmer’s interface, the user-system interface, and the hardware-software interface. Software that interacts with a layer must be aware of that layer's memory consistency model.

System Specifications

The system specification of an architecture states which orderings among memory operations are preserved and which operations may be reordered, and this in turn determines how much performance can actually be gained from the relaxation.

The following classes of specification models relax program order in different ways −

Relaxing the Write-to-Read Program Order − This class of models allows the hardware to hide the latency of write operations that miss in the first-level cache. While the write miss is in the write buffer and not yet visible to other processors, the processor can complete reads that hit in its cache, or even a single read that misses in its cache.
Relaxing the Write-to-Read and Write-to-Write Program Orders − Allowing writes to bypass previous outstanding writes to different locations lets multiple writes be merged in the write buffer before main memory is updated. Multiple write misses can thus be overlapped and can become visible out of order. The motivation is to further reduce the impact of write latency on processor stall time, and to raise communication efficiency among the processors by making new data values visible to other processors sooner.
Relaxing All Program Orders − No program orders are assured by default except data and control dependences within a process. The benefit is that multiple read requests can be outstanding at the same time, can be bypassed by writes that come later in program order, and can themselves complete out of order, which allows read latency to be hidden. These models are particularly useful for dynamically scheduled processors, which can continue past read misses to other memory references. They also permit many of the reorderings, and even the elimination of accesses, that compiler optimizations perform.
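The effect of relaxing the write-to-read order can be seen in a small simulation. The sketch below is illustrative only. It assumes two processors that each write 1 to their own flag and then read the other processor's flag, it models the write buffer as a separate "drain" event that makes the write visible in memory, and it enumerates every legal interleaving. The names PROGRAMS and outcomes are hypothetical and do not come from any real system.

```python
from itertools import permutations

# Assumed test program: processor 0 writes x then reads y,
# processor 1 writes y then reads x. Both flags start at 0.
PROGRAMS = {0: ("x", "y"), 1: ("y", "x")}  # proc -> (written var, read var)


def outcomes(relax_write_to_read):
    """Return the set of (r0, r1) read results over all legal interleavings."""
    # Per processor: "issue" puts the write in the write buffer, "drain"
    # makes it visible in memory, "read" performs the following read.
    events = [(p, e) for p in PROGRAMS for e in ("issue", "drain", "read")]
    results = set()
    for order in permutations(events):
        pos = {ev: i for i, ev in enumerate(order)}
        legal = True
        for p in PROGRAMS:
            # A write is issued before it drains and before the later read.
            if pos[(p, "issue")] > pos[(p, "drain")]:
                legal = False
            if pos[(p, "issue")] > pos[(p, "read")]:
                legal = False
            # Without the relaxation the write must reach memory
            # before the read that follows it in program order.
            if not relax_write_to_read and pos[(p, "drain")] > pos[(p, "read")]:
                legal = False
        if not legal:
            continue
        memory = {"x": 0, "y": 0}
        regs = {}
        for p, e in order:
            written, read = PROGRAMS[p]
            if e == "drain":
                memory[written] = 1
            elif e == "read":
                regs[p] = memory[read]
        results.add((regs[0], regs[1]))
    return results


strict = outcomes(relax_write_to_read=False)
relaxed = outcomes(relax_write_to_read=True)
print(sorted(strict))
print(sorted(relaxed))
```

With the write-to-read order enforced, the result (0, 0) never appears, because each write reaches memory before the read that follows it. Once a read may complete while the earlier write is still buffered, both processors can read the old value, so (0, 0) becomes a possible outcome.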

The Programming Interface

The programming interfaces assume that program orders do not have to be maintained at all between synchronization operations. The programmer must ensure that all synchronization operations are explicitly labeled or identified as such. The compiler or runtime library then translates these synchronization operations into the suitable order-preserving operations called for by the system specification.

The system then assures sequentially consistent executions even though it may reorder operations between the synchronization operations in any way it desires, as long as it does not disrupt dependences to a location within a process. This gives the compiler sufficient flexibility between synchronization points for the reorderings it desires, and also allows the processor to perform as many reorderings as its memory model permits. At the programmer’s interface, the consistency model should be at least as weak as that of the hardware interface, but need not be the same.

Translation Mechanisms

In most microprocessors, translating labels to order-maintaining mechanisms amounts to inserting a suitable memory barrier instruction before and/or after each operation labeled as a synchronization. Extra instructions could be avoided if individual loads and stores indicated which orderings to enforce. However, since synchronization operations are usually infrequent, most microprocessors have not taken this approach so far.
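A minimal sketch of such a translation pass follows. It assumes that each operation carries an optional label ("acquire", "release", or a generic "sync"), that a barrier is placed before a release and after an acquire, and that the barrier is written with the placeholder mnemonic MEMBAR. The function name translate, the labels, and the instruction text are illustrative choices, not the output of any particular compiler.

```python
BARRIER = "MEMBAR"  # placeholder mnemonic, not tied to a specific instruction set


def translate(ops):
    """Insert barriers around operations labeled as synchronization.

    ops is a list of (instruction_text, label) pairs, where label is
    None, "acquire", "release", or "sync" (barrier on both sides).
    """
    out = []
    for text, label in ops:
        if label in ("release", "sync"):
            out.append(BARRIER)  # earlier accesses must complete first
        out.append(text)
        if label in ("acquire", "sync"):
            out.append(BARRIER)  # later accesses must not start early
    return out


# Hypothetical critical section that increments a shared counter.
program = [
    ("TEST&SET lock", "acquire"),
    ("LOAD r1, counter", None),
    ("ADD r1, r1, 1", None),
    ("STORE counter, r1", None),
    ("STORE lock, 0", "release"),
]

for line in translate(program):
    print(line)
```

Only the two labeled operations receive a barrier, so the ordinary loads and stores inside the critical section remain free to be reordered among themselves.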

Overcoming Capacity Limitations

We have discussed systems that provide automatic replication and coherence in hardware only in the processor cache memory. In such systems, a processor cache replicates remotely allocated data directly upon reference, without the data being replicated in the local main memory first.

A problem with these systems is that the scope for local replication is limited to the hardware cache. If a block is replaced from the cache memory, it has to be fetched from remote memory when it is needed again. The main purpose of the systems discussed in this section is to solve the replication capacity problem while still providing coherence in hardware and at the fine granularity of cache blocks for efficiency.

Tertiary Caches

To solve the replication capacity problem, one method is to use a large but slower remote access cache. Such a cache is needed for functionality when the nodes of the machine are themselves small-scale multiprocessors, and it can simply be made larger for performance. It then also holds replicated remote blocks that have been replaced from the local processor cache memory.
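The benefit of a remote access cache can be illustrated with a rough simulation. The sketch below assumes LRU replacement in both levels, treats every block in the trace as remotely allocated, and keeps a copy of each fetched block in the remote access cache so that it is still available locally after the processor cache replaces it. The class LRU and the function remote_fetches are hypothetical names used only for this illustration.

```python
from collections import OrderedDict


class LRU:
    """Tiny LRU set of block identifiers with a fixed capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def hit(self, block):
        if block in self.blocks:
            self.blocks.move_to_end(block)  # mark as most recently used
            return True
        return False

    def insert(self, block):
        self.blocks[block] = True
        self.blocks.move_to_end(block)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # drop the least recently used block


def remote_fetches(trace, cache_size, rac_size=0):
    """Count fetches from remote memory for a trace of remote block references."""
    cache = LRU(cache_size)
    rac = LRU(rac_size) if rac_size > 0 else None
    fetches = 0
    for block in trace:
        if cache.hit(block):
            continue
        if rac is None or not rac.hit(block):
            fetches += 1  # block must come from remote memory
            if rac is not None:
                rac.insert(block)
        cache.insert(block)
    return fetches


trace = [0, 1, 2, 3] * 3  # working set larger than the processor cache
print(remote_fetches(trace, cache_size=2))
print(remote_fetches(trace, cache_size=2, rac_size=8))
```

In this trace the working set of four blocks does not fit in the two-block processor cache, so without a remote access cache every reference goes to remote memory. With the larger remote access cache, only the first reference to each block does.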

Cache-only Memory Architectures (COMA)

In COMA machines, every memory block in the entire main memory has a hardware tag associated with it. There is no fixed node where space is always guaranteed to be allocated for a memory block. Data dynamically migrates to, or is replicated in, the main memories of the nodes that access (attract) it; these main memories are therefore called attraction memories. When a remote block is accessed, it is replicated in the attraction memory and brought into the cache, and the hardware keeps it consistent in both places. A data block may reside in any attraction memory and may move freely from one to another.
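The migration and replication behavior can be sketched in a few lines. The model below is a deliberate simplification: it assumes an invalidation-based scheme in which a read replicates a block and a write removes all other copies, it finds the current holders by scanning all nodes instead of using a directory or hierarchy, and it ignores capacity limits and the replacement of the last copy of a block. The class ComaMachine and its methods are hypothetical.

```python
class ComaMachine:
    """Toy model of attraction memories with no fixed home node."""

    def __init__(self, num_nodes):
        # One attraction memory per node: block name -> value.
        self.am = [dict() for _ in range(num_nodes)]

    def holders(self, block):
        """Nodes whose attraction memory currently holds the block."""
        return [n for n, mem in enumerate(self.am) if block in mem]

    def read(self, node, block):
        if block not in self.am[node]:
            sources = self.holders(block)
            if not sources:
                raise KeyError(block)  # block was never written
            # Replicate the block into the reader's attraction memory.
            self.am[node][block] = self.am[sources[0]][block]
        return self.am[node][block]

    def write(self, node, block, value):
        # Invalidate all other copies, so the block migrates to the writer.
        for other in self.holders(block):
            if other != node:
                del self.am[other][block]
        self.am[node][block] = value


m = ComaMachine(num_nodes=3)
m.write(0, "A", 7)       # block A first lives on node 0
m.read(1, "A")           # node 1 attracts a replica
print(m.holders("A"))
m.write(2, "A", 9)       # node 2 writes, so A migrates there
print(m.holders("A"))
```

After the read, the block is held by nodes 0 and 1. After the write by node 2, only node 2 holds it, which shows that no node acts as a permanent home for the block.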

Reducing Hardware Cost

Reducing cost means moving some functionality from specialized hardware to software running on the existing hardware. It is much easier for software to manage replication and coherence in the main memory than in the hardware cache, so the low-cost methods tend to provide replication and coherence in the main memory. For coherence to be controlled efficiently, each of the other functional components of the assist can still benefit from hardware specialization and integration.

Research efforts aim to lower the cost with different approaches. One approach performs access control in specialized hardware but assigns the other activities to software and commodity hardware. Another approach performs access control in software as well, and is designed to provide a coherent shared address space abstraction on commodity nodes and networks with no specialized hardware support.
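As a rough illustration of access control in software, the sketch below checks a per-block state table on every load and store and calls a software protocol handler only when the check fails. It assumes three block states (invalid, shared, exclusive) and a caller-supplied fetch_block function that stands in for obtaining a block from its remote owner. All names are hypothetical, and invalidations from other nodes are left out.

```python
INVALID, SHARED, EXCLUSIVE = "invalid", "shared", "exclusive"


class SoftwareSharedMemory:
    """Toy software access control for one node."""

    def __init__(self, fetch_block):
        self.state = {}                 # block -> access state on this node
        self.data = {}                  # local copies in main memory
        self.fetch_block = fetch_block  # assumed callback to the remote owner
        self.protocol_calls = 0         # how often the slow path ran

    def load(self, block):
        # Access check performed in software before every load.
        if self.state.get(block, INVALID) == INVALID:
            self._handle_miss(block, SHARED)
        return self.data[block]

    def store(self, block, value):
        # A store needs exclusive access, not just a readable copy.
        if self.state.get(block, INVALID) != EXCLUSIVE:
            self._handle_miss(block, EXCLUSIVE)
        self.data[block] = value

    def _handle_miss(self, block, wanted):
        self.protocol_calls += 1
        if self.state.get(block, INVALID) == INVALID:
            self.data[block] = self.fetch_block(block)  # bring in a copy
        self.state[block] = wanted


remote = {"A": 5}                        # pretend remote memory
mem = SoftwareSharedMemory(remote.get)
mem.load("A")                            # miss: handler fetches the block
mem.load("A")                            # check passes, no handler call
mem.store("A", 6)                        # upgrade from shared to exclusive
print(mem.load("A"), mem.protocol_calls)
```

The common case costs only the state lookup, while misses and upgrades pay for a call into the protocol handler. This is the kind of overhead that a hardware access control mechanism would avoid.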

Implications for Parallel Software

A relaxed memory consistency model requires that parallel programs label the desired conflicting accesses as synchronization points. A programming language may provide support for labeling some variables as synchronization variables, which the compiler then translates into suitable order-preserving instructions. The compiler can also use these labels to restrict its own reordering of accesses to shared memory.