Recovery scheme for hardening system on programmable chips
Abstract:
Checkpoint and rollback recovery techniques enable a system to survive failures by periodically saving a known-good snapshot of the system’s state and rolling back to it when a failure is detected. The approach is particularly interesting for developing critical systems on programmable chips, which today offer multiple embedded processor cores as well as configurable fabric that can be used to implement error detection and correction mechanisms. This paper presents an approach for developing safety- or mission-critical systems on programmable chips that tolerate soft errors by exploiting processor duplication to implement error detection, and checkpoint and rollback recovery to correct errors in a cost-efficient manner. We developed a prototypical implementation of the proposed approach targeting the Leon processor core, and we collected preliminary results that outline the capability of the technique to tolerate soft errors affecting the processor’s internal registers. This paper is the first step toward the definition of an automatic design flow for hardening processor cores (either hard or soft) embedded in programmable chips such as SRAM-based FPGAs.
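The combination of duplication-based detection and checkpoint/rollback correction can be illustrated with a small software model. The sketch below is not the paper's hardware implementation: the toy `step` function, the `run_hardened` driver, the `faults` injection format, and the two-register state (`pc`, `acc`) are all hypothetical names introduced for illustration, and the stored checkpoint is assumed to be fault-free.

```python
import copy


def step(state):
    """One deterministic step of a toy 'processor' (hypothetical model)."""
    return {"pc": state["pc"] + 1, "acc": state["acc"] + state["pc"]}


def run_hardened(n_steps, checkpoint_interval, faults=None):
    """Run two replicas in lockstep and roll back on any mismatch.

    faults maps a step-attempt index to (replica, register, bitmask),
    modelling a transient bit flip in one replica's register.
    The checkpoint itself is assumed to be protected from upsets.
    """
    faults = dict(faults or {})
    a = {"pc": 0, "acc": 0}
    b = copy.deepcopy(a)
    checkpoint = copy.deepcopy(a)  # known-good snapshot
    rollbacks = 0
    attempt = 0
    while a["pc"] < n_steps:
        a, b = step(a), step(b)
        # Optionally inject a soft error into one replica.
        fault = faults.pop(attempt, None)
        if fault is not None:
            replica, reg, mask = fault
            target = a if replica == 0 else b
            target[reg] ^= mask
        attempt += 1
        if a != b:
            # Detection: the duplicated replicas disagree.
            # Correction: restore both from the last checkpoint.
            a = copy.deepcopy(checkpoint)
            b = copy.deepcopy(checkpoint)
            rollbacks += 1
        elif a["pc"] % checkpoint_interval == 0:
            # Replicas agree: save a new known-good snapshot.
            checkpoint = copy.deepcopy(a)
    return a, rollbacks


if __name__ == "__main__":
    clean, _ = run_hardened(10, 4)
    faulty, n = run_hardened(10, 4, faults={6: (0, "acc", 0x4)})
    print(clean, faulty, n)
```

In this model, a single bit flip injected into one replica's `acc` register is detected by the comparison, both replicas are restored from the last snapshot, and the run ends in the same final state as the fault-free run after one rollback. Comparing after every step is a simplification chosen for clarity; where the comparison and the snapshot are actually taken is a design decision this sketch does not try to reflect.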