Reliable ultra-low-voltage cache design for many-core systems

Zhang, Meilin

URL to cite or link to: http://hdl.handle.net/1802/30887

Zhang_rochester_0188E_11199.pdf (6.40 MB)

PDF of thesis.

Description

Thesis (Ph. D.)--University of Rochester. Dept. of Electrical and Computer Engineering, 2016.

Abstract

Energy efficiency and reliability are two main concerns in the design of future many-core systems. On the one hand, technology scaling makes it possible to integrate an unprecedented number of transistors on a single chip, enabling many-core system design. On the other hand, this integration introduces energy and reliability issues for the entire system. Moreover, as the speed gap between processor and external memory increases, large-volume and high-density caches are required, worsening reliability and energy issues further. Supply voltage scaling is one of the most effective ways to reduce system energy consumption, at the cost of performance and reliability. Low supply voltages increase the impact of process and other variations on circuit functionality and performance. Eventually, the system will fail below a minimum supply voltage, VDDMIN. Typically, the cache arrays in the system limit the VDDMIN of the whole processor. Consequently, it is necessary to provide fault tolerance for the cache under low supply voltages in order to improve system energy efficiency while maintaining reliability. We can classify cache errors as hard or soft errors. Hard errors may be caused by manufacturing defects, threshold or supply voltage variations, or device aging, while soft errors are introduced by external particle strikes or other random noise. Traditionally, most soft errors manifest as single-event upsets. However, as we enter the nanometer era, the probability of multi-bit upsets increases significantly because a single particle strike can upset more than one cache cell. To address both single-bit and multi-bit upsets, we propose two-layer error control codes, combining the error detection capability of a rectangular code and the error correction capability of a Hamming product code in an efficient way, to significantly improve system reliability while maintaining low area, power, and latency overhead.
To reduce the supply voltage below the normally acceptable VDDMIN and maintain appropriate yield and reliability, we exploit existing double-error correcting triple-error detecting (DECTED) codes, together with cache line disabling, to handle both hard and soft errors efficiently. The proposed method uses a DECTED code for each cache line: one bit of error correction is reserved for hard errors and the other for soft errors. When a cache line contains multiple faulty cells, it is disabled. This approach allows the supply voltage to drop below the normally acceptable VDDMIN while maintaining appropriate yield and reliability. To further improve energy efficiency, an adaptive fault-tolerant cache architecture is proposed, which provides appropriate error control capability for each cache line based on the number of faulty cells detected. We use single-error correcting double-error detecting (SECDED) codes for each cache line to address soft errors, and extra parity bits are used when there are hard errors. Our experimental results show that the proposed method can further reduce supply voltage and increase cache reliability. We also propose a two-layer error control code, combining the error detection capability of rectangular codes and the error correction capability of Hamming product codes in an efficient way, in order to increase cache error resilience for many-core systems while maintaining low power, area, and latency overhead. Exploiting the low latency and overhead of rectangular codes and the high error control capability of Hamming product codes, two-layer error control codes employ simple rectangular codes for each cache line to detect cache errors, and load the extra Hamming product code check bits only when an error is detected, thus enabling reliable large-scale cache operation. Analysis and experiments are conducted to evaluate the cache fault-tolerant capability of various existing solutions and the proposed approach.
The results show that the proposed approach can increase Mean-Error-To-Failure (METF) and Mean-Time-To-Failure (MTTF) by up to 2.8×, reduce storage overhead by over 57%, and increase instructions per cycle (IPC) by up to 7%, compared to complex four-way 4EC5ED; and that it increases METF and MTTF by up to 133×, reduces storage overhead by over 11%, and achieves a similar IPC, compared to simple eight-way SECDED. The cost of the proposed approach is no more than 4% external memory access overhead. To improve system reliability when a cache coherence protocol is in use, two different approaches are proposed: a pre-write-back policy and uneven error-protection. The pre-write-back cache policy can reduce the number of cache lines with “irrecoverable” cache states, and uneven error-protection provides appropriate error control mechanisms for each cache line based on its cache state. Our analysis and experimental results show that the proposed uneven error-protection approach with the pre-write-back policy can improve system reliability significantly.
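
The detection layer described in the abstract can be illustrated with a minimal sketch. It assumes a plain two-dimensional (row and column) parity arrangement as the rectangular code; the block dimensions, the function names `encode`, `check`, and `error_detected`, and the parity layout are illustrative assumptions and are not taken from the thesis. The Hamming product code correction layer is not shown.

```python
from typing import List, Tuple

Bits = List[List[int]]  # data bits arranged as rows x columns (assumed layout)


def encode(block: Bits) -> Tuple[List[int], List[int]]:
    """Return (row_parity, col_parity) for a rectangular block of bits."""
    row_parity = [sum(row) % 2 for row in block]
    col_parity = [sum(col) % 2 for col in zip(*block)]
    return row_parity, col_parity


def check(block: Bits, row_parity: List[int],
          col_parity: List[int]) -> Tuple[List[int], List[int]]:
    """Return the indices of rows and columns whose stored parity mismatches."""
    new_rows, new_cols = encode(block)
    bad_rows = [i for i, (a, b) in enumerate(zip(new_rows, row_parity)) if a != b]
    bad_cols = [j for j, (a, b) in enumerate(zip(new_cols, col_parity)) if a != b]
    return bad_rows, bad_cols


def error_detected(block: Bits, row_parity: List[int],
                   col_parity: List[int]) -> bool:
    """True if any row or column parity check fails."""
    bad_rows, bad_cols = check(block, row_parity, col_parity)
    return bool(bad_rows or bad_cols)
```

In this sketch, a single flipped bit shows up as one failing row and one failing column, while two flipped bits in the same row cancel in the row parity but are still caught by two failing columns. In a two-layer scheme of the kind the abstract describes, a failed check would be the trigger for loading the stronger check bits.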

Contributor(s):

Meilin Zhang (1982 - ) - Author

Paul Ampadu - Thesis Advisor

Primary Item Type:

Thesis

Identifiers:

Local Call No. AS38.698

Language:

English

Subject Keywords:

Cache; Many-Core; Reliability; Ultra-Low-Voltage

Sponsor - Description:

National Science Foundation (NSF) - ECCS-0925993; ECCS-0903448; ECCS-0954999

First presented to the public:

5/14/2017

Originally created:

2016

Date will be made available to public:

2017-05-14

Original Publication Date:

2016

Previously Published By:

University of Rochester

Place Of Publication:

Rochester, N.Y.

Citation:

Extents:

Number of Pages - xxii, 212 pages

Illustrations - illustrations (some color)

License Grantor / Date Granted:

Catherine Barber / 2016-05-31 11:20:09.695

Date Deposited

2016-05-31 11:20:09.695

Submitter:

Catherine Barber

All Versions

Thumbnail	Name	Version	Created Date
	Reliable ultra-low-voltage cache design for many-core systems	1	2016-05-31 11:20:09.695
