Hardware techniques to reduce communication costs in multiprocessors

Huh, Jaehyuk

Hardware techniques to reduce communication costs in multiprocessors

Access full-text files

huhj17875.pdf (736.39 KB)

Date

2006

Authors

Huh, Jaehyuk

Abstract

This dissertation explores techniques for reducing the costs of inter-processor communication in shared memory multiprocessors (MP). We seek to improve MP performance by enhancing three aspects of multiprocessor cache designs: miss reduction, low communication latency, and high coherence bandwidth. In this dissertation, we propose three techniques to enhance the three factors: shared non-uniform cache architecture, coherence decoupling, and subspace snooping. As a miss reduction technique, we investigate shared cache designs for future Chip-Multiprocessors (CMPs). Cache sharing can reduce cache misses by eliminating unnecessary data duplication and by reallocating the cache capacity dynamically. We propose a reconfigurable shared non-uniform cache architecture and evaluate the trade-offs of cache sharing with varied sharing degrees. Although shared caches can improve caching efficiency, the most significant disadvantage of shared caches is the increase of cache hit latencies. To mitigate the effect of the long latencies, we evaluate two latency management techniques, dynamic block migration and L1 prefetching. However, improving the caching efficiency does not reduce the cache misses induced by MP communication. For such communication misses, the latencies of cache coherence should be either reduced or hidden and the coherence bandwidth should scale with the number of processors. To mitigate long communication latencies, coherence decoupling uses speculation for communication data. Coherence decoupling allows processors to run speculatively at communication misses with predicted values. Our prediction mechanism, called Speculative Cache Lookup (SCL) protocol, uses stale values in the local caches. We show that the SCL read component can hide false sharing and silent store misses effectively. We also investigate the SCL update component to hide the latencies of truly shared misses by updating invalid blocks speculatively. To improve the coherence bandwidth, we propose subspace snooping, which improves the snooping bandwidth with future large-scale shared-memory machines. Even with huge optical bus bandwidth, traditional snooping protocols may not scale to hundreds of processors, since all processors should respond to every bus access. Subspace snooping allows only a subset of processors to be snooped for a bus access, thus increasing the effective snoop tag bandwidth. We evaluate subspace snooping with a large broadcasting bandwidth provided by optical interconnects.