TrueNAS + ZFS: data compression and deduplication

 

 

That TrueNAS and ZFS are a great pair is already well known. I have talked before about TrueNAS as such, about how ZFS itself works and why it is so cool, and about how to optimize it and when it is worth doing. It's time to optimize our disk space usage. We'll try to make it so that we need fewer disks than we otherwise would, and again look at when it is worth it and how much can be saved.

ZFS Deduplication

Let's start with deduplication. We are talking about a situation where we store repetitive data. The simplest case is when the same movies, photos, documents, etc. are lying around on our TrueNAS several times. When writing to disk, the deduplication mechanism detects that we already have such data on disk and, instead of saving it again, only stores a reference to the already saved data, which saves space. Cool, right?

ZFS Deduplication - how it works

The way it works is that, with deduplication enabled, data being written is divided into blocks of a predetermined size, from 4 KB to 1 MB, and these blocks are written to disk. For each block a hash is calculated: a hash function which, based on the contents, generates a practically unique identifier for a given block of data. Metadata in the form of the block's address on disk and its hash is stored in a separate table. Before any new block is written, its hash is calculated first and the metadata is checked to see whether we already have a block with that hash. If so, the new block of data does not land on disk; only information pointing to the location of the existing block is stored in the metadata table.
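
For illustration, deduplication is switched on per Dataset with a single ZFS property; the pool and Dataset names below (tank, tank/users) are just placeholders:

zfs get dedup tank/users    # check the current setting (off by default)
zfs set dedup=on tank/users # every new block written here goes through the hash lookup described above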

 

Please note that we are not talking about whole files here, but about blocks of data. This means it is enough for files to overlap partially, for example to differ only at the end, and deduplication will already work for us. Only the differing part, and not the whole file, will be written to disk. At this point block size becomes important. By reducing the block size, we increase the probability that two blocks are identical, and thus the probability of deduplication, but at the same time the amount of metadata to handle grows. The metadata is collected per pool, which means that if we have repeated blocks in different Datasets, they will still be deduplicated.
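
The block size in question is the Dataset's recordsize property. A minimal sketch of inspecting and lowering it, again with placeholder names; like other property changes, it only affects files written afterwards:

zfs get recordsize tank/users      # 128K is the usual default
zfs set recordsize=64K tank/users  # smaller blocks: more chances for identical blocks, but more metadata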

ZFS Deduplication - advantages

The only, but indisputable, advantage of deduplication is the reduction in disk space occupancy. If ten users put the same disk images, movies, documents, PDFs or other files into their folders on the server, the data reduction can reach tenfold. Of course, this is an extremely optimistic and rather rare case 🙂

ZFS Deduplication - Disadvantages

Everything is nice and beautiful, but what can go wrong? 🙂 Well, at least a few things. The first drawback is the processor resources needed to calculate the hash of each block of data. It is worth mentioning that a big help here is AES-NI, a set of processor instructions for hardware-accelerated encryption, which also comes in handy when calculating hashes. A processor without AES-NI can calculate hash checksums several times slower. If the performance of our data server is not crucial, or the server is simply lightly loaded, this does not matter. If the server is heavily loaded, it is worth checking before buying a used server whether the processor has this feature. Practically all new processors support AES-NI.
In short, deduplication increases the load on the processor.
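
To check whether a given processor advertises AES-NI, on TrueNAS SCALE (Linux) a quick look at /proc/cpuinfo is enough; on TrueNAS CORE (FreeBSD) the flag shows up among the CPU features in the boot messages instead. A sketch for the Linux case:

grep -m1 -wo aes /proc/cpuinfo    # prints "aes" if the CPU supports AES-NI, nothing otherwise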

 

The second disadvantage of deduplication is the need to write additional metadata. This means additional space and additional load on the disks during writes, i.e. slower writing. If the data is unique, we have to write more data to disk: the data itself plus the metadata. The amount of metadata depends, obviously, on the amount of data and, less obviously, on the size of the blocks: with small blocks, relatively more metadata is produced. As a result, with unique data we can increase the disk fill by several tens of percent. And that is not what we were after 🙂
A good idea to speed things up is to use a dedicated NVMe disk for the deduplication metadata. It is fast, and operations on it do not compete with operations on the data disks. Remember, however, that the deduplication metadata is non-reproducible: if you lose the deduplication metadata, you also lose the data itself. That is, the vDev for deduplication metadata must be redundant, a minimum of two disks.
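
In OpenZFS this takes the form of a dedicated "dedup" vDev class. A sketch of adding a mirrored pair of NVMe disks for the deduplication metadata; the pool and device names are examples, so double-check them before running anything like this:

zpool add tank dedup mirror /dev/nvme0n1 /dev/nvme1n1
zpool status tank    # the new mirror is listed in its own dedup section
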
In short, if we have unique data, the effect of enabling deduplication will be to increase disk occupancy and write time.

 

An additional disadvantage of deduplication is that it also increases the demand for RAM, because the metadata lives there too. That is, by enabling deduplication, we automatically reduce the available size of the ARC cache, i.e. the data cache; the details of this mechanism are described in a separate episode.
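
How big the deduplication table actually gets, and how much of it sits in RAM, can be checked on a running pool (the pool name is a placeholder):

zpool status -D tank    # prints the number of DDT entries and their size on disk and in core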

ZFS Deduplication - summary

To sum up deduplication: it is a very interesting solution, but it certainly should not be enabled by default on all installations. The basis for considering deduplication is to answer the question: do I have repetitive data on the disk? If we do, it is definitely worth considering enabling it. If, on the other hand, we don't have repetitive data, we will only increase disk occupancy, slow down writes and load the CPU. Keep in mind that deduplication can be enabled per Dataset, meaning that if we decide we have repeating data in only one Dataset, we enable it only in that Dataset. Places where we can generally expect deduplication to be effective are those with files repeated across several users, like system files of similar virtual machines, WordPress engine files in hosting, etc.

ZFS Compression

Compression in ZFS is probably a completely self-descriptive term. The mechanism involves on-the-fly compression of data before writing to disk and decompression on reading. It is worth mentioning that compression in TrueNAS is enabled by default when creating new pools. Compression is configurable per Dataset, meaning we can choose which data to compress and which not. In addition, we can choose the compression algorithm, such as ZSTD, GZIP or LZ4, and there are many more. The default one is LZ4. Each of them has a slightly different set of parameters, such as compression efficiency and CPU load when compressing and when decompressing. As a general rule of thumb, the better the compression, the higher the CPU load, but this is just a general rule; sometimes some algorithms are simply better than others. A comparison of the performance and compression of the different algorithms is definitely material for a separate episode.
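
For illustration, both the switch and the algorithm are a single Dataset property; the names below are placeholders, and the available algorithms depend on the OpenZFS version:

zfs get compression tank/docs
zfs set compression=zstd tank/docs    # stronger compression, more CPU
zfs set compression=lz4 tank/vms      # the light default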

ZFS Compression - Advantages

The advantage of compression is, not surprisingly, that it compresses the data and reduces the amount of space required, in extreme cases even several or a dozen times over. An additional advantage in TrueNAS is that it is enabled out of the box and just works in the background.
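
How much compression actually saves on a given Dataset can be read from its properties (names are placeholders): compressratio is the achieved ratio, and logicalused versus used shows roughly the size of the data before and after compression:

zfs get compressratio,used,logicalused tank/docs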

ZFS Compression - Disadvantages

The disadvantage of compression is the load on the processor every time you write to or read from the disk. There is data that in practice cannot be compressed any further, because it is already optimally compressed, like movies and photos. In this case, every attempt to compress and decompress it is just a waste of CPU time, power, released heat, CO2 and whatnot.

ZFS Compression - summary

In summary, ZFS compression is generally a good solution and in most cases will give you disk space savings. The efficiency of compression again depends mainly on the type of data and the compression method chosen. Any data stored in text form compresses very well, and documents also compress quite well. As far as contraindications are concerned: if, for example, we create a Dataset for storing photos or videos, then I recommend turning off compression, because this data generally cannot be compressed anyway. A general contraindication is compressing data that is already compressed: photos, videos, compressed backups, compressed virtual machine images, etc.
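
So for a hypothetical Dataset holding only photos or videos, that would simply be:

zfs set compression=off tank/photos    # new writes to this Dataset skip compression entirely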

Summary

It is worth mentioning that the ARC cache stores the most frequently used data in RAM. This means that access to this data is not delayed by the deduplication and decompression processes; in practice it is available immediately, without any delay.
I would also add that all changes made to a Dataset take effect from the next write. That is, if we change some settings regarding compression or deduplication, every subsequent write will be performed according to the new settings. No automatic data conversion process fires off in the background or anything like that. From what I know, although I'm not sure, the easiest way to deduplicate or compress existing data after changing the settings is to write it there again: you simply move it to another Dataset and move it back. This way the data is run through the deduplication and compression mechanisms again. The same applies if there were a need to de-deduplicate or de-compress data already on the disks; deduplication or compression of data once saved simply does not reverse itself.
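
A minimal sketch of that move-out-and-back trick, assuming a spare scratch Dataset exists on the same pool (the paths are examples). Because the two Datasets are separate filesystems, mv actually copies the data, which is what pushes it through the current settings:

mv /mnt/tank/users/reports /mnt/tank/scratch/
mv /mnt/tank/scratch/reports /mnt/tank/users/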

 

Fortunately, changes to the compression and deduplication settings are "backward compatible". The idea is that we can change the settings for newly written data, and data previously saved under other settings will still be read without problems, regardless of the current write settings.

 

As for universally predicting the effectiveness of deduplication and compression... well, this is simply not possible, because there is no single representative data set: everyone has different data, and the effectiveness of these mechanisms depends mainly on the data stored.

 

One more note. It's probably clear, but for the sake of completeness I'll say that both compression and deduplication take place at the server level, which means the data must first get there before it is analyzed. If the bottleneck is the Internet connection, none of the mechanisms presented here will speed up our writes or reads or reduce the amount of data transferred.

 

To sum up deduplication and compression in ZFS, again one would have to say that these are nice solutions, but, as usual, you first have to understand what you are doing before you start clicking around on a production machine. Turning these features on "on the spur of the moment" admittedly shouldn't break anything, but it might simply slow down our server significantly.

 

I INVITE YOU TO WATCH

Recommendations

Simulation of the benefits of deduplication

zdb -U /data/zfs/zpool.cache -S [pool name]

Status of deduplication

zpool list [pool name]
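
Status of compression (an analogous check, added here as a suggestion: the compressratio property can be read per pool or per Dataset)

zfs get compressratio [pool name]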