Thesis (Ph. D.)--University of Rochester. Dept. of Electrical and Computer Engineering, 2016.
Energy efficiency and reliability are two main concerns in future many-core
systems design. On the one hand, technology scaling makes it possible to integrate an
unprecedented number of transistors on a single chip, enabling many-core system
design. On the other hand, this integration introduces energy and reliability issues for
the entire system. Moreover, as the speed gap between processor and external
memory increases, large-volume and high-density caches are required, worsening
reliability and energy issues further. Supply voltage scaling is one of the most
effective ways to reduce system energy consumption at the cost of performance and
reliability. Low supply voltages increase the impact of process and other variations on
circuit functionality and performance. Eventually, the system will fail below a
minimum supply voltage, VDDMIN. Typically, the cache arrays in the system limit the VDDMIN
of the whole processor. Consequently, it is necessary to provide fault tolerance for
the cache under low supply voltages in order to improve system energy efficiency
while maintaining reliability.
We can classify cache errors as hard or soft errors. Hard errors may be caused
by manufacturing defects, threshold or supply voltage variations, or device aging, and
soft errors are introduced by external particle strikes or other random noise.
Traditionally, most soft errors manifest as single-event upsets. However, as we
move into the nanometer era, the probability of multi-bit upsets increases
significantly because a single particle strike can upset multiple cache cells. To
address both single-bit and multi-bit upsets, we propose two-layer error control
codes, combining the error detection capability of a rectangular code and the error
correction capability of a Hamming product code in an efficient way, to significantly
improve system reliability while maintaining low area, power, and latency overhead.
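The detection layer can be illustrated with a generic rectangular (two-dimensional parity) code. This is a minimal sketch, not the thesis's exact construction: the row-major grid layout, the function names `rect_parity` and `has_error`, and the 16-bit example line are all assumptions made for illustration.

```python
from typing import List, Tuple

Parity = Tuple[List[int], List[int]]


def rect_parity(bits: List[int], cols: int) -> Parity:
    """Return (row parities, column parities) for bits laid out row-major.

    Assumption: the line is viewed as a grid with `cols` bits per row.
    """
    if cols <= 0 or len(bits) % cols != 0:
        raise ValueError("bit count must be a positive multiple of cols")
    rows = [bits[i:i + cols] for i in range(0, len(bits), cols)]
    row_par = [sum(r) % 2 for r in rows]
    col_par = [sum(r[c] for r in rows) % 2 for c in range(cols)]
    return row_par, col_par


def has_error(bits: List[int], cols: int, stored: Parity) -> bool:
    """Detection only: report whether recomputed parities differ from stored ones."""
    return rect_parity(bits, cols) != stored


# Hypothetical 16-bit line arranged as a 4 x 4 grid.
line = [1, 0, 1, 1,
        0, 0, 1, 0,
        1, 1, 0, 0,
        0, 1, 0, 1]
stored = rect_parity(line, 4)

# A two-bit burst in one row leaves the row parity intact,
# but the two affected column parities expose it.
burst = line.copy()
burst[5] ^= 1
burst[6] ^= 1
burst_detected = has_error(burst, 4, stored)
```

In this sketch, a horizontal burst that a single per-row parity bit would miss is still caught by the column parities. Patterns that flip the four corners of a rectangle in the grid leave every parity unchanged, so a code of this kind only detects errors and relies on a stronger second layer for correction.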
To reduce the supply voltage below the normally acceptable VDDMIN while maintaining
appropriate yield and reliability, we combine existing double-error correcting triple-error
detecting (DECTED) codes with cache line disabling to handle both hard and soft errors
efficiently. The proposed method uses DECTED codes
for each cache line: one bit of error correction for hard errors, and the other bit of error
correction for soft errors. When a line contains multiple faulty cells, it is
disabled. This approach can reduce the supply voltage below the normally acceptable
VDDMIN while maintaining appropriate yield and reliability. To further improve energy
efficiency, an adaptive fault-tolerant cache architecture, which provides appropriate
error control capability for each cache line based on the number of faulty cells
detected, is proposed. We use single-error correcting double-error detecting
(SECDED) codes for each cache line to address soft errors, and extra parity bits are
used when there are hard errors. Our experimental results show that the proposed
method can further reduce supply voltage and increase cache reliability.
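The DECTED-with-line-disabling policy described above can be sketched as a small decision function. This is an illustrative reading of the abstract, not the thesis's implementation: the names `LinePlan` and `plan_line` are hypothetical, and the sketch assumes that "multiple faulty cells" means two or more hard faults in a line and that a fault-free line keeps both correctable bits available for soft errors.

```python
from dataclasses import dataclass

DECTED_CORRECTABLE_BITS = 2  # a DECTED code corrects up to two bit errors


@dataclass(frozen=True)
class LinePlan:
    enabled: bool          # False means the cache line is disabled
    hard_corrections: int  # correction capability consumed by faulty cells
    soft_corrections: int  # correction capability left for soft errors


def plan_line(faulty_cells: int) -> LinePlan:
    """Decide how a DECTED-protected line is used, given its hard-fault count."""
    if faulty_cells < 0:
        raise ValueError("faulty_cells must be non-negative")
    if faulty_cells >= 2:
        # Assumption: two or more faulty cells would leave no margin
        # for a soft error, so the line is disabled.
        return LinePlan(enabled=False, hard_corrections=0, soft_corrections=0)
    return LinePlan(
        enabled=True,
        hard_corrections=faulty_cells,
        soft_corrections=DECTED_CORRECTABLE_BITS - faulty_cells,
    )
```

Under these assumptions, `plan_line(1)` keeps the line enabled with one correctable bit reserved for the faulty cell and one for a soft error, while `plan_line(2)` disables the line.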
We also propose a two-layer error control code, combining the error detection
capability of rectangular codes and the error correction capability of Hamming product
codes in an efficient way, in order to increase cache error resilience for many-core
systems while maintaining low power, area, and latency overhead. Exploiting the low
latency and overhead of rectangular codes and the high error control capability of
Hamming product codes, two-layer error control codes employ a simple rectangular
code for each cache line to detect cache errors and load the extra Hamming
product code check bits only when an error is detected, thus enabling reliable large-scale
cache operations. Analysis and experiments are conducted to evaluate the cache
fault-tolerant capability of various existing solutions and the proposed approach. The
results show that the proposed approach can significantly increase Mean-Error-To-Failure
(METF) and Mean-Time-To-Failure (MTTF) by up to 2.8×, reduce storage
overhead by over 57%, and increase instructions per cycle (IPC) by up to 7%, compared
to complex four-way 4EC5ED; and it increases METF and MTTF up to 133×,
reduces storage overhead by over 11%, and achieves a similar IPC compared to
simple eight-way SECDED. The cost of the proposed approach is no more than 4%
external memory access overhead. To improve system reliability in the
presence of a cache coherence protocol, two different approaches are proposed: a pre-write-back
policy and uneven error-protection. The pre-write-back cache policy can
reduce the number of cache lines with “irrecoverable” cache states, and uneven error-protection
provides appropriate error control mechanisms for each cache line based
on its cache state. Our analysis and experimental results show that the proposed
uneven error-protection approach with pre-write-back policy can improve system
reliability significantly.