Data Deduplication


Data deduplication is a technique for reducing the amount of storage space an organization needs to save its data. In most organizations, the storage systems contain duplicate copies of many pieces of data. For example, the same file may be saved in several different places by different users, or two or more files that aren’t identical may still include much of the same data. Deduplication eliminates these extra copies by saving just one copy of the data and replacing the other copies with pointers that lead back to the original copy. Companies frequently use deduplication in backup and disaster recovery applications, but it can be used to free up space in primary storage as well.

In its simplest form, deduplication takes place on the file level; that is, it eliminates duplicate copies of the same file. This kind of deduplication is sometimes called file-level deduplication or single instance storage (SIS). Deduplication can also take place on the block level, eliminating duplicated blocks of data that occur in non-identical files. Block-level deduplication frees up more space than SIS, and a particular type known as variable block or variable length deduplication has become very popular. Often the phrase “data deduplication” is used as a synonym for block-level or variable length deduplication.
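The block-level approach described above can be sketched in a few lines. This is a simplified illustration using fixed-size blocks and SHA-256 hashes (real products typically use variable-length chunking and their own proprietary methods); the function and variable names are hypothetical:

```python
import hashlib

def dedupe_blocks(data, store, block_size=4096):
    """Split data into fixed-size blocks, keeping one copy of each unique block.

    Returns a "recipe" of block hashes; the store maps hash -> block and is
    shared across files, so blocks duplicated across files are stored once.
    """
    recipe = []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # save the block only if it is new
        recipe.append(digest)            # the recipe is the list of "pointers"
    return recipe

def rehydrate(store, recipe):
    """Reconstruct the original data by following the recipe's pointers."""
    return b"".join(store[d] for d in recipe)

# Two non-identical files that share their first 8 KB of content:
store = {}
file_a = b"A" * 8192 + b"tail one"
file_b = b"A" * 8192 + b"tail two"
recipe_a = dedupe_blocks(file_a, store)
recipe_b = dedupe_blocks(file_b, store)
assert rehydrate(store, recipe_a) == file_a
assert rehydrate(store, recipe_b) == file_b
assert len(recipe_a) + len(recipe_b) == 6  # six block references in total...
assert len(store) == 3                     # ...but only three unique blocks stored
```

Note that deduplication happens both within a file (the two identical 4 KB runs of `file_a`) and across files (the runs shared with `file_b`), which is exactly what file-level SIS cannot do.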

Deduplication Benefits

The primary benefit of data deduplication is that it reduces the amount of disk or tape that organizations need to buy, which in turn reduces costs. NetApp reports that in some cases, deduplication can reduce storage requirements up to 95 percent, but the type of data you’re trying to deduplicate and the amount of file sharing your organization does will influence your own deduplication ratio. While deduplication can be applied to data stored on tape, the relatively high costs of disk storage make deduplication a very popular option for disk-based systems. Eliminating extra copies of data saves money not only on direct disk hardware costs, but also on related costs, like electricity, cooling, maintenance, floor space, etc.
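Vendors express these savings either as a percentage or as a reduction ratio, and it can help to see that the two are the same arithmetic. A quick sketch (the function names are illustrative, not any vendor's API):

```python
def dedup_ratio(logical_bytes, physical_bytes):
    """Data stored versus space consumed; 20.0 means a 20:1 ratio."""
    return logical_bytes / physical_bytes

def space_savings(logical_bytes, physical_bytes):
    """Fraction of capacity saved; 0.95 means 95 percent."""
    return 1 - physical_bytes / logical_bytes

# A 95 percent space saving is the same thing as a 20:1 deduplication ratio:
assert dedup_ratio(100, 5) == 20.0
assert abs(space_savings(100, 5) - 0.95) < 1e-12
```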

Deduplication can also reduce the amount of network bandwidth required for backup processes, and in some cases, it can speed up the backup and recovery process.

Deduplication vs. Compression

Deduplication is sometimes confused with compression, another technique for reducing storage requirements. While deduplication eliminates redundant data, compression uses algorithms to save data more concisely. Some compression is lossless, meaning that no data is lost in the process, but “lossy” compression, which is frequently used with audio and video files, actually deletes some of the less-important data included in a file in order to save space. By contrast, deduplication only eliminates extra copies of data; none of the original data is lost. Also, compression doesn’t get rid of duplicated data — the storage system could still contain multiple copies of compressed files.

Deduplication often has a larger impact on backup file size than compression. In a typical enterprise backup situation, compression may reduce backup size by a ratio of 2:1 or 3:1, while deduplication can reduce backup size by up to 25:1, depending on how much duplicate data is in the systems. Often enterprises utilize deduplication and compression together in order to maximize their savings.
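One common way to combine the two techniques is to deduplicate first and then compress only the unique blocks that survive. A minimal sketch, assuming fixed-size blocks and zlib compression (hypothetical function names, not a specific product's design):

```python
import hashlib
import zlib

def backup(data, store, block_size=4096):
    """Deduplicate into a shared store, compressing each unique block once."""
    recipe = []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:
            store[digest] = zlib.compress(block)  # compression on top of dedup
        recipe.append(digest)
    return recipe

def restore(store, recipe):
    """Decompress and reassemble the blocks named by the recipe."""
    return b"".join(zlib.decompress(store[d]) for d in recipe)

store = {}
full = b"server log entry: request handled OK\n" * 500
first = backup(full, store)            # first full backup: all blocks are new
unique_after_first = len(store)
second = backup(full, store)           # identical second backup adds no blocks
assert len(store) == unique_after_first
assert restore(store, second) == full  # lossless: the original data comes back
assert sum(len(b) for b in store.values()) < len(full)  # compression shrank the blocks
```

Here deduplication eliminates the entire second backup, while compression shrinks the blocks that remain, which is why the two savings multiply rather than overlap.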

Dedupe Implementation

The process for implementing data deduplication technology varies widely depending on the type of product and the vendor. For example, if deduplication technology is included in a backup appliance or storage solution, the implementation process will be much different than for standalone deduplication software.

In general, deduplication technology can be deployed in one of two basic ways: at the source or at the target. In source deduplication, duplicate data is eliminated in primary storage before the data is sent to the backup system. The advantage of source deduplication is that it reduces the bandwidth requirements and time necessary for backing up data. On the downside, source deduplication consumes more processor resources, and it can be difficult to integrate with existing systems and applications.

By contrast, target deduplication takes place within the backup system and is often much easier to deploy. Target deduplication comes in two types: in-line or post-process. In-line deduplication takes place before the backup copy is written to disk or tape. The benefit of in-line deduplication is that it requires less storage space than post-process deduplication, but it can slow down the backup process. Post-process deduplication takes place after the backup has been written, so it requires that organizations have a great deal of storage space available for the original backup. However, post-process deduplication is usually faster than in-line deduplication.
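The in-line versus post-process trade-off can be made concrete with a small sketch. Both approaches end up storing the same unique blocks; the difference is whether duplicates ever consume space (all names below are illustrative):

```python
import hashlib

def inline_backup(blocks, target):
    """In-line: deduplicate before writing, so duplicates never reach disk."""
    recipe = []
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in target:
            target[digest] = block       # only new blocks are ever written
        recipe.append(digest)
    return recipe

def post_process_backup(blocks, target):
    """Post-process: land the full backup first, then deduplicate it."""
    staging = list(blocks)               # full copy written at backup speed
    peak_blocks = len(staging)           # extra capacity needed temporarily
    recipe = []
    for block in staging:
        digest = hashlib.sha256(block).hexdigest()
        target.setdefault(digest, block)
        recipe.append(digest)
    staging.clear()                      # staging space is reclaimed afterwards
    return recipe, peak_blocks

blocks = [b"alpha", b"beta", b"alpha", b"alpha"]
inline_target, post_target = {}, {}
inline_recipe = inline_backup(blocks, inline_target)
post_recipe, peak = post_process_backup(blocks, post_target)
assert inline_target == post_target      # the same end state either way
assert peak == 4                         # post-process briefly held all 4 blocks
assert len(inline_target) == 2           # in-line never stored the duplicates
```

The `peak_blocks` figure is the sketch's stand-in for the extra storage a post-process system needs; in exchange, the initial write is a plain sequential copy with no hashing in the critical path.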

Deduplication Technology

Data deduplication is a highly proprietary technology. Deduplication methods vary widely from vendor to vendor, and many of those methods are patented. For example, Microsoft has a patent on single instance storage. In addition, Quantum owns a patent on variable length deduplication. Many other vendors also own patents related to deduplication technology.
