Samurai / Flikker: Selectively tolerating errors

We are in the age of “good enough”, where imperfect and cheap trumps perfect and expensive, as with MP3s, netbooks, and IP telephony. We know that hardware and software systems experience errors, yet we continue to use these systems for our day-to-day needs. Rather than eradicate errors, our goal is to build robust systems that tolerate both hardware and software errors and still provide acceptable outputs.

We present two systems that embody our philosophy: (1) Samurai, which tolerates software memory-corruption errors in C/C++ programs, and (2) Flikker, which tolerates hardware memory errors introduced by lowering DRAM refresh rates to save power. Samurai allows programs written in unsafe languages (i.e., C and C++) to continue executing with sound semantics despite memory errors. Similarly, Flikker allows DRAM to be refreshed at a far lower rate than it is today, thereby reducing power consumption.

Both Samurai and Flikker emphasize the protection of critical data in programs. We define critical data as any data that cannot be regenerated if the application crashes (i.e., its persistent state) and that is important for the application to produce correct or acceptable outputs. For example, the document data is critical in a word-processing application, and the score and user data are critical in a computer game. While both Samurai and Flikker require the programmer to explicitly identify critical data, the two systems differ in how they protect it. Samurai protects critical data by replicating it within the process’s address space, while Flikker protects critical data by allocating it in a separate, high-refresh memory partition. We describe each system in detail below.

Samurai: Protecting Critical Data in Type-Unsafe Languages
Samurai is a runtime system that protects critical data from accidental overwrites caused by memory-corruption errors in type-unsafe programs. Samurai assumes that the portion of the program that manipulates the critical data is type-safe, and hence that legitimate reads and writes of critical data can be identified in the program. Samurai replaces loads and stores of critical data with cload and cstore operations, respectively. This can be done either automatically by the compiler or manually by the programmer. The protection provided by Samurai is currently limited to heap objects (i.e., dynamically allocated data).

The goal of Samurai is to prevent illegitimate pointer writes in the program from overwriting critical heap data. It achieves this goal probabilistically by replicating critical heap objects at random locations in the heap, which minimizes the probability of correlated corruption of the replicas. The information about an object’s replicas is stored as part of its metadata, which the cstore and cload operations use to update and compare the replicas, respectively. Mismatches detected during the comparison are corrected by majority voting among the object’s replicas. Figure 1 shows the operation of Samurai.

Figure 1: Samurai system for protecting critical data
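
The replicate-and-vote idea described above can be sketched in a few lines of C++. This is only an illustration under stated assumptions, not Samurai's implementation: the class name CriticalObject, its member functions, and the choice of three copies are hypothetical, and the sketch uses ordinary heap allocations rather than placing replicas at random heap locations.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>

// Illustrative sketch only: CriticalObject and its members are hypothetical
// names, not Samurai's actual interface. T must support operator==.
template <typename T>
class CriticalObject {
public:
    explicit CriticalObject(const T& initial) {
        // Assumption: three copies, so one corrupted copy can be outvoted.
        // Each copy gets its own heap allocation; unlike Samurai, this
        // sketch does not randomize where the copies are placed.
        for (auto& r : replicas_) r.reset(new T(initial));
    }

    // cstore: a legitimate write updates every replica.
    void cstore(const T& value) {
        for (auto& r : replicas_) *r = value;
    }

    // cload: compare the replicas, repair a minority copy if one exists,
    // and return the majority value.
    T cload() {
        T& a = *replicas_[0];
        T& b = *replicas_[1];
        T& c = *replicas_[2];
        if (a == b) { if (!(c == a)) c = a; return a; }
        if (a == c) { b = a; return a; }
        if (b == c) { a = b; return b; }
        throw std::runtime_error("no majority among replicas");
    }

    // Test hook that stands in for an illegitimate pointer write
    // reaching one replica.
    T* replica_for_fault_injection(std::size_t i) {
        assert(i < replicas_.size());
        return replicas_[i].get();
    }

private:
    std::array<std::unique_ptr<T>, 3> replicas_;
};
```

In this sketch, overwriting a single replica through the fault-injection hook is detected and repaired on the next cload, while corrupting all three copies with different values raises an exception because no majority exists.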

We have deployed Samurai on four applications from the SPEC Benchmark suite, plus a ray-tracing program and a webserver. We have also protected critical data in collection classes and memory allocation libraries using Samurai. Our results show that the overhead of Samurai is within 10% for most applications and libraries considered. Further, we find through fault-injection experiments that Samurai is highly effective at preventing corruption of critical data.

Flikker: Saving DRAM Refresh Power through Critical Data Partitioning

DRAM refresh is a significant consumer of power in mobile systems. DRAM must be refreshed constantly, even when not in use, or it will lose its data. Memory manufacturers conservatively set the refresh rate to the rate required by the fastest-leaking cells. However, leakage rates vary considerably among the memory cells in a DRAM, and hence many cells retain their data even if the refresh rate is lowered.

The Flikker system assigns critical and non-critical data to different parts of DRAM, and lowers the refresh rate of the part containing non-critical data, at the cost of introducing a modest number of errors in it. The part containing critical data is still refreshed at the regular rate and hence remains error-free. This differentiated allocation strategy allows Flikker to obtain power savings (up to 25%) with almost no reduction in the program’s reliability.
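
As a rough, back-of-the-envelope illustration of why differentiated refresh helps, the sketch below counts refresh operations under a two-partition scheme relative to refreshing all of memory at the regular period. The function name and its parameters are hypothetical, and the model is an assumption made for illustration: it counts refresh operations only. Refresh is just one component of DRAM power, so this ratio is not a power-saving figure and should not be compared directly with the measured savings reported here.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: refresh operations per unit time under a
// two-partition scheme, relative to refreshing all of memory at the
// regular period. A simplified counting model, not a power model.
double relative_refresh_operations(double critical_fraction,
                                   double regular_period_ms,
                                   double lowered_period_ms) {
    assert(critical_fraction >= 0.0 && critical_fraction <= 1.0);
    assert(regular_period_ms > 0.0 && lowered_period_ms >= regular_period_ms);
    double noncritical_fraction = 1.0 - critical_fraction;
    // The critical part keeps its full refresh count; the non-critical part
    // is refreshed (regular_period / lowered_period) times as often.
    return critical_fraction +
           noncritical_fraction * (regular_period_ms / lowered_period_ms);
}
```

For example, with a hypothetical critical fraction of 0.25 and the sleep-mode periods quoted later in this section (32 ms and 1 s), the function returns about 0.274, i.e. roughly a quarter of the baseline number of refresh operations.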

To the best of our knowledge, Flikker is the first system to intentionally lower hardware reliability and expose hardware errors to the software in order to save power. This represents a novel point in systems design, namely trading off hardware reliability for power savings, as hardware only needs to be as reliable as software.

Figure 2: Steps in operation of Flikker

Figure 2 shows the steps in the operation of Flikker. First, the programmer identifies critical data at the granularity of program objects. Second, the Flikker allocator assigns these objects to separate virtual pages, and does not mix critical and non-critical data on the same page. Third, the operating system (OS) maps the virtual pages containing critical data to the high-refresh portion of the DRAM. Finally, the DRAM chip is partitioned into a high-refresh and a low-refresh portion, which the OS can configure before putting the mobile device to sleep. In sleep mode, the high-refresh portion is refreshed at the regular rate (once every 32 milliseconds), while the rest of memory is refreshed at a drastically lower rate (once every second). The hardware changes required by Flikker are minimal and are based on the existing Partial Array Self-Refresh (PASR) feature of mobile DRAMs.
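
The second step above, keeping critical and non-critical objects on separate pages, can be sketched as a pair of page-aligned pools. This is a minimal illustration under stated assumptions, not Flikker's allocator: the names Pool, PartitionedAllocator, Criticality and page_number are hypothetical, the 4096-byte page size is assumed, memory is never freed, and the OS mapping of pages to the high-refresh DRAM portion is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kPageSize = 4096;  // assumed page size

// One page-aligned region served by a simple bump allocator (hypothetical).
class Pool {
public:
    explicit Pool(std::size_t pages)
        : storage_((pages + 1) * kPageSize),
          capacity_(pages * kPageSize),
          used_(0) {
        // Over-allocate by one page, then round the start up to a page
        // boundary so the pool covers whole pages only.
        std::uintptr_t raw = reinterpret_cast<std::uintptr_t>(storage_.data());
        std::uintptr_t aligned = (raw + kPageSize - 1) / kPageSize * kPageSize;
        base_ = storage_.data() + (aligned - raw);
    }
    Pool(const Pool&) = delete;
    Pool& operator=(const Pool&) = delete;

    // Returns nullptr when the pool cannot satisfy the request.
    void* allocate(std::size_t bytes) {
        if (bytes > capacity_ - used_) return nullptr;
        std::size_t rounded = (bytes + 15) / 16 * 16;  // keep 16-byte alignment
        void* p = base_ + used_;
        used_ += rounded;
        return p;
    }

private:
    std::vector<unsigned char> storage_;
    unsigned char* base_;
    std::size_t capacity_;
    std::size_t used_;
};

enum class Criticality { Critical, NonCritical };

// Routes each request to the pool matching the programmer's annotation, so
// critical and non-critical objects never share a page.
class PartitionedAllocator {
public:
    explicit PartitionedAllocator(std::size_t pages_per_pool)
        : critical_(pages_per_pool), noncritical_(pages_per_pool) {}

    void* allocate(std::size_t bytes, Criticality c) {
        Pool& pool = (c == Criticality::Critical) ? critical_ : noncritical_;
        return pool.allocate(bytes);
    }

private:
    Pool critical_;
    Pool noncritical_;
};

// Page index of an address, used to check that the two kinds of data
// do not share a page.
inline std::uintptr_t page_number(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) / kPageSize;
}
```

Because each pool starts on a page boundary and spans a whole number of pages, any page holds objects of one criticality class only, which is the property the OS would need in order to map critical pages to the high-refresh portion.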

We have deployed Flikker on five applications chosen to represent a range of workloads on mobile platforms. We evaluate the performance overheads and power consumption of Flikker through trace-driven architectural simulations, and the degradation in applications’ reliability through fault-injection experiments. We find that Flikker can reduce overall DRAM power consumption by 20 to 25% on average, with negligible degradation in reliability and performance degradation of less than 1% for the applications considered.
