Eliminating Duplicated Primary Data

By Douglas Gantenbein

The amount of data created and stored in the world doubles about every 18 months. Some of that data is distinctive—but by no means all of it. A PowerPoint presentation might start bouncing around a work group, and within a week, many nearly identical copies could be scattered across an enterprise’s desktops or servers.

Eliminating redundant data is a process called data deduplication. It’s not new—appliances that scrub hard drives and delete duplicate data have existed for years. But those appliances work across multiple backup copies of the data. They do not touch the primary copy of the data, which exists on a live file server.

Deduplicating as data is created and accessed—primary data, as opposed to backup data—is challenging. The process of deduplicating data consumes processing power, memory, and disk resources, and deduplication can slow data storage and retrieval when operating on live file systems.

Sudipta Sengupta, Principal Research Scientist, Microsoft Research

“People access lots of data stored on servers,” says Sudipta Sengupta, a senior researcher with Microsoft Research Redmond. “They need fast access to that data—ideally, as fast as without deduplication—so it’s a challenge to deduplicate data while also serving it in real time.”

Sengupta, along with Jin Li, a principal researcher at the Redmond facility, has cracked that nut. In a partnership with the Windows Server 8 team, they have developed a fast, effective approach to deduplicating primary data, and they have delivered production code that will ship with Windows Server 8.

The researchers began work on data deduplication about four years ago. Sengupta and Li believed there were big opportunities for reducing redundancies within primary data, an area that hadn’t really been examined because of the impact deduplication could have on a server managing live data. They built a tool that would crawl directories on servers and analyze the data for deduplication savings. This showed that there were significant redundancies in primary data.

Sengupta and Li next tackled the problem of detecting duplicated data. That required building and maintaining an index of existing data fragments—also called “chunks”—in the system. Their goal was to make the indexing process perform well with low resource usage. The Microsoft Research team’s solution is based on a technology they designed called ChunkStash, for “chunk metadata store on flash.” ChunkStash stores the chunk metadata on flash memory in a log-structured manner, accesses it using a low RAM-footprint index, and exploits the fast-random-access nature of the flash device to determine whether new data is unique or duplicate. Not all of the performance benefits of ChunkStash are dependent on the use of flash memory, and ChunkStash also greatly accelerates deduplication when hard disks alone are used for storage, which is the case in most server farms.
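The general pattern can be seen in a minimal sketch. The Python snippet below appends chunk metadata records to a log file (standing in for flash) and keeps only a compact hash-to-offset map in memory; the class name ChunkIndexSketch, the file name chunk_metadata.log, and the record format are all illustrative assumptions, and this is a simplified picture of the idea, not the ChunkStash implementation or its specialized low-memory index structures.

```python
import hashlib


class ChunkIndexSketch:
    """Toy log-structured chunk metadata store with an in-memory index.

    Chunk metadata records are appended to a log file (standing in for
    flash), and a small in-memory dictionary maps each chunk's content
    hash to its record's offset in that log. A sketch of the general
    idea only, not ChunkStash's actual data structures.
    """

    def __init__(self, log_path="chunk_metadata.log"):
        self.log_path = log_path
        self.ram_index = {}  # content hash -> offset of the metadata record in the log
        open(self.log_path, "ab").close()  # make sure the log exists

    def lookup_or_insert(self, chunk: bytes) -> bool:
        """Return True if the chunk is a duplicate, False if it is new."""
        digest = hashlib.sha256(chunk).hexdigest()
        if digest in self.ram_index:
            return True  # duplicate: its metadata is already in the log
        # New chunk: append a metadata record to the log and index its offset.
        with open(self.log_path, "ab") as log:
            offset = log.tell()
            log.write(f"{digest},{len(chunk)}\n".encode())
        self.ram_index[digest] = offset
        return False
```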

Product-Team Engagement

Sengupta and Li’s work on deduplication caught the eye of the Windows Server team, which was in the early stages of working on Windows Server 8. The opportunity to include deduplication in the release was compelling, driven by customer needs and industry trends.

“Storage deduplication,” says Thomas Pfenning, general manager for Windows Server, “is the No. 1 technology customers are considering when investing in file-based storage solutions.”

The deduplication process breaks data into smaller fragments, which then become the targets of deduplication. These fragments could be entire files or “chunks” of a few kilobytes. Because data is subject to edits and modifications over time, breaking data into smaller chunks and deduplicating those smaller pieces can be more effective than finding and deduplicating entire files.
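To make the chunking idea concrete, here is a minimal Python sketch of content-defined chunking: a chunk boundary is declared when a hash of the trailing bytes matches a fixed pattern, so identical content tends to produce identical chunks even after insertions elsewhere in a file. The window size, size thresholds, and the use of BLAKE2 are illustrative assumptions; production chunkers typically use fast rolling hashes such as Rabin fingerprints, and the Windows Server chunker is not shown here.

```python
import hashlib

WINDOW = 48  # bytes of trailing context used to decide chunk boundaries


def content_defined_chunks(data: bytes, min_size=2048, avg_size=8192, max_size=65536):
    """Split data into variable-size chunks at content-defined boundaries.

    A boundary is declared when the hash of the trailing WINDOW bytes has
    its low bits equal to zero, so the same content tends to yield the
    same chunks even when surrounding bytes are inserted or deleted.
    Real chunkers use rolling hashes instead of rehashing the window at
    every position; this is a sketch of the idea, not an efficient one.
    """
    mask = avg_size - 1  # avg_size is assumed to be a power of two
    chunks, start = [], 0
    for i in range(len(data)):
        length = i - start + 1
        if length >= min_size:
            window = data[max(0, i - WINDOW + 1):i + 1]
            fingerprint = int.from_bytes(
                hashlib.blake2b(window, digest_size=4).digest(), "big")
            if (fingerprint & mask) == 0 or length >= max_size:
                chunks.append(data[start:i + 1])
                start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # final partial chunk
    return chunks
```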

Take a PowerPoint presentation, for instance. A dozen slightly different versions might exist on a server. Is it better to find entire files that are identical and toss out the spares, or to unearth the pieces multiple files might have in common, and remove those duplicates?

To find out, the Microsoft team analyzed data from 15 globally distributed servers within Microsoft. These servers contained data folders of single users’ Office files, music, and photos; files shared by workgroups; SharePoint team sites; software-deployment tools; and more.

They discovered that chunking data resulted in significantly larger savings compared with deduplication of entire files.
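A rough way to reproduce that kind of comparison is to measure, over the same set of files, how many bytes deduplication removes when the unit is a whole file versus a chunk. The harness below is purely illustrative: the function name dedup_savings and the corpus variable are assumptions, content_defined_chunks refers to the chunking sketch above, and none of this is the analysis tool the researchers actually used.

```python
import hashlib


def dedup_savings(files, chunker=None):
    """Fraction of bytes eliminated by deduplication over a corpus.

    With chunker=None, whole files are the unit of deduplication;
    otherwise each file is split with the given chunking function and
    duplicate chunks are stored only once.
    """
    total = stored = 0
    seen = set()
    for content in files:  # files: an iterable of byte strings
        total += len(content)
        pieces = chunker(content) if chunker else [content]
        for piece in pieces:
            digest = hashlib.sha256(piece).digest()
            if digest not in seen:
                seen.add(digest)
                stored += len(piece)
    return 1 - stored / total if total else 0.0


# Compare the two strategies on the same corpus of file contents:
# whole_file_savings = dedup_savings(corpus)
# chunk_savings = dedup_savings(corpus, chunker=content_defined_chunks)
```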

Jin Li, Partner Research Manager of the Cloud Computing and Storage (CCS) group in Microsoft Research – Technologies

“If people are working on a PowerPoint presentation, the file can be edited lots of times,” Li says. “That generates a lot of different versions, and although the files are not the same, they have a large amount of common data.”

They also found that the use of higher average chunk sizes, in the range of 70 to 80 kilobytes, together with chunk compression, could preserve the high deduplication savings typically associated with the much smaller chunk sizes (4 to 8 kilobytes) that have previously been used for backup data deduplication. This has huge implications for a primary data server, because larger chunk sizes reduce chunk metadata and the number of chunks stored in the system, leading to increased efficiencies in many parts of the pipeline, from deduplicating data to serving data.
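The shape of that trade-off can be put in rough numbers. In the hypothetical accounting below, the savings fraction, compression ratio, and per-chunk metadata size are all assumed values chosen only for illustration, not measurements from the study: compression on larger chunks recovers much of the savings that smaller chunks would have found, while the number of chunks, and therefore the metadata, drops by roughly an order of magnitude.

```python
def space_accounting(dedup_savings, compression_ratio, avg_chunk_size,
                     metadata_per_chunk=64):
    """Back-of-the-envelope space accounting for dedup plus chunk compression.

    dedup_savings: fraction of bytes removed as duplicates (e.g., 0.40)
    compression_ratio: compressed size / original size for unique chunks
    metadata_per_chunk: assumed bytes of index metadata per stored chunk
    Returns (fraction of the original space still consumed,
             metadata bytes kept per megabyte of logical data).
    """
    data_fraction = (1 - dedup_savings) * compression_ratio
    metadata_per_mb = (1_000_000 / avg_chunk_size) * metadata_per_chunk
    return data_fraction, metadata_per_mb


# With ~80 KB chunks the system tracks roughly a tenth as many chunks, and
# therefore roughly a tenth of the metadata, as with 8 KB chunks:
# space_accounting(0.40, 0.6, 80_000)   # larger chunks plus compression
# space_accounting(0.45, 0.6, 8_000)    # smaller chunks, ~10x more metadata
```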

From Research to Production

The Microsoft Research team contributed in three key areas in designing and building a production-quality data deduplication feature in Windows Server 8: data chunking, indexing for detecting duplicate data, and data partitioning.

For the first contribution, Sengupta and Li devised a new data-chunking algorithm, called regression chunking, that achieves a more uniform chunk-size distribution and increased deduplication savings.
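The sketch below captures only the general shape such a scheme could take, under our own assumptions (the function name, bit counts, and relaxation depth are all illustrative): prefer a boundary that satisfies the normal match condition, and if none appears before the maximum chunk size, “regress” to a boundary that satisfied a weaker condition rather than forcing an arbitrary cut. It is a rough reading of the idea, not the algorithm that ships in Windows Server 8.

```python
def regression_chunk_cut(fingerprints, match_bits=13, relax_levels=3,
                         max_len=131072):
    """Choose a chunk boundary, regressing to weaker match conditions if needed.

    fingerprints: one integer rolling fingerprint per byte position.
    Cut at the first position satisfying the full match condition; if no
    such position appears before max_len, fall back to the most recent
    position that satisfied a progressively relaxed condition (fewer
    matching bits), and only force a cut at max_len as a last resort.
    """
    last_match = [None] * (relax_levels + 1)  # last position matching each condition
    for pos, fp in enumerate(fingerprints[:max_len]):
        for level in range(relax_levels + 1):
            bits = match_bits - level  # level 0 is the full condition
            if fp & ((1 << bits) - 1) == 0:
                last_match[level] = pos
        if last_match[0] == pos:
            return pos + 1  # full condition matched: cut here
    for level in range(1, relax_levels + 1):  # regress to weaker conditions
        if last_match[level] is not None:
            return last_match[level] + 1
    return min(len(fingerprints), max_len)  # forced cut at the maximum length
```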

Their second contribution was the indexing system for detecting duplicate data. The researchers used ideas from their ChunkStash research project to deliver a highly efficient chunk-indexing module that makes light use of CPU, memory, and disk resources.

“Indexing is usually a big bottleneck in deduplication,” Sengupta says. “It’s a big challenge to build a scalable, high-performance index to identify the duplicate chunks without slowing down performance.”

In the third contribution, the Microsoft researchers worked with the product team to devise a data-partitioning technique that scales up as a data set grows. Partitioning data enables the deduplication process to work across a smaller set of files, reducing resource consumption.

Through data analysis, they found that two partitioning strategies (by file type or by file-system-directory hierarchy) work well, with only a negligible to marginal loss in deduplication quality from partitioned processing. The system also includes an optional reconciliation process that can deduplicate across partitions when significant additional space savings can be extracted.
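As a deliberately simplified illustration of the file-type strategy, the sketch below groups files by extension so that each group can be deduplicated against its own, smaller index; the function name and grouping rule are assumptions for illustration, not the Windows Server partitioning logic.

```python
import os
from collections import defaultdict


def partition_by_file_type(paths):
    """Group file paths into deduplication partitions by file extension.

    Duplicate chunks tend to occur between files of the same type, so
    deduplicating within each partition keeps the working index small
    with little loss in savings; partitioning by directory subtree works
    the same way with a different grouping key.
    """
    partitions = defaultdict(list)
    for path in paths:
        ext = os.path.splitext(path)[1].lower() or "<no extension>"
        partitions[ext].append(path)
    return partitions


# Each partition can be chunked and indexed independently; an optional
# reconciliation pass can later deduplicate chunks that span partitions.
```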

Finally, the Microsoft Research team worked with the Windows Server team to write production code for the new data deduplication feature in Windows Server 8. For a time, Sengupta and Li wore two hats, continuing their deduplication research while also writing code for Windows Server 8. They were joined by Microsoft Research colleagues Kirk Olynyk, a senior research software-design engineer, and Sanjeev Mehrotra, a principal software architect, in shipping production code to the Windows Server 8 team.

“It was a great collaboration between Microsoft Research and the product team,” Li says. “We got very good feedback from the team, and some of the challenges they posed to us helped make the product better.”

Evidence of that can be seen in the storage savings Windows Server 8 will deliver. In one recent demo, a virtual-hard-drive (VHD) store holding 10 terabytes of VHD files consumed only 400 gigabytes of disk space. In this instance, as much as 96 percent of the data was detected as duplicate and then eliminated, because the VHDs contained identical or slightly different operating-system and application-software files.

Windows Server 8 was shown in mid-September during Microsoft’s BUILD conference, and a preview edition is available to developers, with final release expected in 2012. Its data deduplication capability has been widely praised. A “killer feature,” wrote ITWorld. “An impressive feature that should do wonders for storage efficiency and network utilization,” added Windows IT Pro. Ars Technica added to the chorus: “Microsoft demonstrations of the technology reduced the disk footprint of a [virtual desktop infrastructure] server by some 96 percent.”

The Windows Server 8 team also is happy with the new addition to their product.

“We are very pleased with the end result of the collaboration with Microsoft Research,” Pfenning says. “It’s great to see research work coming through in a product that we expect to bring tremendous customer value.”
