Use Your Storage Space More Effectively with ZFS: Introduction

By Alexandru Andrei, Alibaba Cloud Tech Share Author. Tech Share is Alibaba Cloud’s incentive program to encourage the sharing of technical knowledge and best practices within the cloud community.

Warning: If you’re going to use ZFS on your Cloud Disks, don’t use the snapshot feature included in the Alibaba console. When you roll back to those backups, the potential for corruption exists, especially when using multiple disks. This is because ZFS keeps structural data on these devices and if the snapshots aren’t all taken at the exact same time, the discrepancies may lead to missing data or corruption. ZFS includes its own snapshot feature and also has the ability to easily replicate these to other systems with commands such as zfs send and zfs receive.
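The native workflow the warning refers to looks roughly like this. A minimal sketch, assuming a hypothetical pool named `tank` with a dataset `data`, a remote host `backuphost` reachable over SSH, and a destination pool `backup` (all names are placeholders; requires ZFS installed and root privileges on both ends):

```shell
# Take an atomic, ZFS-native snapshot of a dataset.
zfs snapshot tank/data@before-maintenance

# Replicate the snapshot to another machine over SSH.
zfs send tank/data@before-maintenance | ssh backuphost zfs receive backup/data

# Later snapshots can be sent incrementally (-i): only the blocks
# that changed since the previous snapshot are transferred.
zfs snapshot tank/data@daily-1
zfs send -i tank/data@before-maintenance tank/data@daily-1 | ssh backuphost zfs receive backup/data
```

Because the snapshot is taken atomically across the whole pool's view of the dataset, the replica is always internally consistent, unlike console snapshots taken disk by disk.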

While the default ext4 filesystem used by most Linux distributions might cover basic needs, advanced features require looking at alternatives. ZFS is one of the most important and most capable of these, packing a long list of features and sophisticated mechanisms for ensuring data integrity. Since it’s a complete data storage solution, ZFS includes not only a filesystem but also a volume manager, similar in scope to the Logical Volume Manager (LVM) but more efficient in certain areas. To decide whether you actually need this tool in your infrastructure, here’s a list of things it can do, which will be detailed in the next section:

  1. Makes administering storage devices much easier and more flexible. Most operations, such as expanding pools, replacing devices, and correcting errors, can be done online (on a running system, without requiring a reboot).
  2. Data compression can significantly boost performance and reduce disk space requirements.
  3. Snapshots can be used to easily rollback undesired or unexpected changes. They also provide a means to create fast, off-site, consistent, incremental or differential backups (zfs send, zfs receive).
  4. Clones allow users to branch off/fork content, working simultaneously on multiple versions, without affecting the original. Only the differences are stored, reducing disk space requirements.
  5. Overall read/write performance can be significantly increased in certain setups.
  6. Redundant ZFS pools can self-heal in the event of errors. Extremely reliable checksumming of data, Merkle trees and many other techniques ensure that not even a single bit of information can change unnoticed.
  7. Atomic transactions — they either entirely succeed or they are entirely cancelled/aborted. This protects you against partial writes/changes that leave other filesystems in inconsistent states. If power is lost during a write, you only lose data that didn’t have time to reach the disk, but you won’t get corruption.
  8. More efficient cache (ARC — Adaptive Replacement Cache).
  9. Being a copy-on-write system, it rarely overwrites data in place, which gives many windows of opportunity to recover from mistakes. It also makes it SSD-friendly.
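Several of the points above map directly onto one-line commands. A hedged sketch (the pool name `tank` and device `/dev/vdb` are hypothetical; assumes ZFS is installed and you have root privileges):

```shell
# Create a pool named "tank" from a single disk.
zpool create tank /dev/vdb

# Turn on transparent compression (lz4 is cheap enough to enable almost everywhere).
zfs set compression=lz4 tank

# Take a snapshot, then roll back an unwanted change.
zfs snapshot tank@known-good
# ... make changes, then decide you want them gone ...
zfs rollback tank@known-good

# Scrub the pool: read everything, verify checksums, and repair
# silently corrupted blocks if the pool has redundancy.
zpool scrub tank
zpool status tank
```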

Advantages of Using ZFS

  1. You can take snapshots of datasets. It’s a way to “freeze” the state of the dataset at a point in time, keeping it on the system, unchanged, for as long as you require it. The origin dataset continues to operate as normal and, initially, no additional disk space is required to keep the frozen copy; only the differences subsequently written to the origin are stored. Snapshots can be useful for taking consistent backups, because data cannot change while you are copying it somewhere else. Another use is providing a safe point you can return to (rollback). For example, you can take a snapshot before risky operations such as cleaning up or optimizing a database. If something goes wrong, you can quickly restore the dataset containing your database. Because of the way ZFS is built (copy-on-write), rolling back is fast, even if you have to revert hundreds of GB of changes.
  2. Snapshots are very efficient and don’t require preallocating space for storing changes (as is the case with LVM). There’s also virtually no limit to the number of snapshots you can create and no noticeable performance degradation even when a large number of these are created. This allows freedom to implement extreme use-cases, such as taking a snapshot of a filesystem every minute (and also backing up off-site if required).
  3. High-frequency, off-site backups that may take only seconds to upload become possible. Snapshot replication is very efficient because, unlike other tools, ZFS doesn’t need to waste time first scanning data and comparing what has changed. Simply put, a tool like rsync has to scan, file by file, looking for changes, then also write, file by file, on the destination server, which besides taking more time results in more I/O operations on the disks. ZFS just “knows” that “these data blocks here are what changed between snapshot1 and snapshot2”. Copying raw data blocks, instead of recreating files one by one, results in less I/O.
  4. Clones can be used to save disk space on very similar objects or to work on multiple versions of a product at the same time. Example: you can snapshot an entire website (code files, images, resources, database, etc.). Afterwards, you can create 10 clones which will initially use no additional disk space (except for metadata, but that is negligible). Only the changes written to each clone are stored (the differences between the origin and the clones). Then, you can have different teams work on improvements, each with its own approach. They all have their own isolated environments (clones), so you can easily compare results at the end of the project. After testing each clone, you can promote the most successful one to take over the original dataset. Of course, clones can be used in many other creative ways.
  5. When storage pools are set up with redundancy (mirrored devices or devices grouped in Raid-Z arrays), ZFS has very sophisticated and efficient algorithms for ensuring data integrity and is also able to self-heal when errors are detected.
  6. In certain setups, reads and writes can be distributed, in parallel, across multiple storage devices, speeding up these operations.
  7. Data can be compressed on the fly, speeding up reads and writes and saving disk space. This should be activated on every dataset except those that store data that is already compressed. For example, there is no point in turning on compression when storing video files, zip archives or jpeg images, since all of these are already compressed.
  8. Adaptive Replacement Cache (ARC) is much better than the default Least Recently Used (LRU) caching mechanism that operates when we use the ext4 filesystem. Files that we read/write from/to storage are cached (copied) in random access memory (RAM), which is much faster than hard drives or SSDs. This greatly speeds up subsequent reads of the same data, since the system gets it directly from memory and bypasses the physical device. The LRU’s problem is that it’s very rudimentary, caching the most recently read data unconditionally and potentially evicting (deleting) useful files from memory. For example, we might have useful files in the cache, at which point we download a 7GB file. If we only have about 8GB of RAM, all of the previous cache will be flushed to make room for the downloaded file. This is detrimental, because the previous cache contained files the system reads often, while the new cache contains a file we downloaded and may not need again in the near future. The ARC is much smarter and doesn’t evict data that it sees is being accessed often.
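The clone workflow from point 4 can be sketched in a few commands. Dataset names here (`tank/www` and the team clones) are hypothetical:

```shell
# Snapshot the dataset holding the website.
zfs snapshot tank/www@baseline

# Create writable clones from the snapshot; each starts at zero extra space.
zfs clone tank/www@baseline tank/www-team-a
zfs clone tank/www@baseline tank/www-team-b

# After evaluation, promote the winning clone so it no longer
# depends on the origin snapshot, then retire the others.
zfs promote tank/www-team-a
zfs destroy tank/www-team-b
```

Promotion reverses the parent/child relationship between the origin and the clone, which is what lets the chosen version "take over" as described above.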

Structure of ZFS Pool

Here are some of the components that go into making a ZFS storage pool:

  1. pool — One or more vdevs build a storage pool. You can think of it as one large virtual disk made up of all the virtual devices added to it.
  2. vdev (virtual device) — One or more physical storage devices grouped together build a vdev. There are multiple types of virtual devices:
  3. disk — When the vdev consists of a single storage device. Multiple single-disk vdevs can be added to a single pool. This will increase pool capacity and make ZFS stripe (split and spread) data across all of its vdevs, so reading and writing will be faster. To illustrate, in a pool consisting of four single-disk vdevs, each having the same capacity, when you write 4GB of data it will be split (striped) into four equal parts and 1GB will be sent to each disk at the same time.
  4. mirror — Made from two or more disks. In this type of vdev, ZFS keeps data mirrored (identical) on all of the physical devices it contains. You can add multiple mirror vdevs to a pool to stripe data across them, in a similar fashion to the scenario described earlier. Example: you add disk1 and disk2 to mirror1, disk3 and disk4 to mirror2. Writing 4GB of data will send 2GB to mirror1 and 2GB to mirror2. You can lose disk1 and disk3 and still recover all data. But if you lose all disks in the same mirror (disk1 and disk2), then the whole pool is lost.
  5. Raid-Z — Two or more devices can be added to this type of vdev and data can be recovered if sufficient recovery information, called parity, survives. Raid-Z1 dedicates one disk’s worth of capacity to parity, Raid-Z2 two, and so on. In a Raid-Z2 vdev, we can lose a maximum of 2 devices and still be able to rebuild the data. In a Raid-Z1 setup we can only lose one.
  6. log — The ZFS Intent Log (ZIL) is normally stored on the same devices where the rest of the data is stored. But a Separate ZFS Intent Log (SLOG) can be configured. When hard-disks are used to store data, a log vdev backed by faster storage media, such as SSDs, can help increase performance and reliability. Here’s an overly-simplified illustration to help you understand how it works. Most writes are delayed so that the system can do some optimizations before physically storing data. Other types of writes are considered urgent, so a program can instruct the system to flush data as fast as possible. A database server can say: “Write this customer data now since it’s important and we want to make sure we don’t lose it”. This is called a synchronous write. Without a SLOG, ZFS will send the ZIL to the hard-disks. Then, the data in ZIL will be organized and written to its final destination. At the same time, other write jobs may be active on this disk and mechanical hard drives are inefficient when trying to write to multiple locations at the same time, which can result in degraded performance. With a SLOG, ZIL data can be sent to an SSD while the hard-disk completes other jobs, resulting in faster response times and higher throughput when dealing with a lot of synchronous writes. Furthermore, the SLOG vdev is now faster, so, in case of a power failure, chances are higher that it will finish capturing a synchronous write. When the system reboots, the ZIL can be replayed, and the data that hit the SLOG, but didn’t have time to go to the hard-disk, can now be stored properly. This helps eliminate data loss or at least minimize it when write operations are abruptly interrupted. And, as mentioned, operations are atomic, meaning that a “half-write” will be discarded, so you don’t get inconsistent data and corruption. In the worst case, you will have older data on disk (what was written in the last few seconds before power loss won’t be available).
  7. cache — A vdev that can be used to cache frequently accessed data. It only makes sense when the cache device is faster than the devices used to store data. Example: create a cache vdev on an SSD when your storage pool is made out of (slower) mechanical hard-disks. If you already use SSDs in your storage pools, then a cache device doesn’t help. In the cloud, this hybrid structure can be used to optimize costs: build your pools out of cheaper storage devices and add a more expensive, faster device with more IOPS as the cache vdev.
  8. spare — In this vdev, you add a device that you want to designate as a replacement for any drive that fails in your ZFS array. By default, you have to manually use a command to replace a failed drive with your spare but steps can be taken to instruct ZFS to automatically do so when needed.
  9. dataset — This can be a ZFS filesystem, snapshot, clone or volume.
  10. volume — The storage pool, in its entirety or in part, can be exposed as one or more volumes. These are virtual block devices that can be used just like real partitions or disks. You can think of a volume as a ZFS-backed partition that is not formatted with the ZFS filesystem. It’s useful when you need the advantages offered by ZFS (volumes can still be snapshotted, cloned, etc.) but have to use a different filesystem on the storage pool, or when you need raw (virtual) devices to feed to other applications, for example as virtual disks for virtual machines.
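The vdev types above translate into `zpool create` / `zpool add` syntax along these lines. A hedged sketch; all device paths and the pool name `tank` are placeholders, and each `zpool create` assumes the pool doesn’t already exist:

```shell
# Striped pool of two single-disk vdevs.
zpool create tank /dev/vdb /dev/vdc

# Two mirror vdevs; writes are striped across the mirrors.
zpool create tank mirror /dev/vdb /dev/vdc mirror /dev/vdd /dev/vde

# A Raid-Z2 vdev: survives the loss of any two member disks.
zpool create tank raidz2 /dev/vdb /dev/vdc /dev/vdd /dev/vde

# Special vdevs: a separate intent log (SLOG), a read cache, and a hot spare.
zpool add tank log /dev/nvme0n1
zpool add tank cache /dev/nvme0n2
zpool add tank spare /dev/vdf

# A 10GB volume (zvol): a virtual block device you can format with any filesystem.
zfs create -V 10G tank/vol1
```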

How to Choose the Right Type of ZFS vdev

  1. How much redundancy can you afford? Say you need to store 10 terabytes of information. You would have to pay for 20 terabytes of space to create a mirror with two devices, where you can lose one of them and still recover. And while you pay for 20 terabytes, you can only use 10. If you want to be able to recover after losing 2 devices, you would have to pay for 30 terabytes of space, and so on. Raid-Z can help here. In Raid-Z2, for example, if you buy 12 one-terabyte disks, you are able to store 10 terabytes of data and only 2 terabytes are used for redundancy, instead of the 10 used by the mirror. And you can still lose two devices and recover. But though it can save money/resources, Raid-Z has costs of its own: recovery (resilvering) takes longer than with mirrors and puts more stress on the devices, read performance is slightly lower, and you can’t add or remove disks to/from a Raid-Z vdev after it has been created. Every vdev type comes with its own advantages and disadvantages, so you will have to decide what works best for you.
  2. Do you prefer speed over reliability or vice versa? Is read speed more important than write speed or vice versa? Example: adding more devices to a mirror vdev doesn’t increase write throughput but it does increase read bandwidth and data reliability. Adding more regular disk vdevs increases both read/write speeds but decreases reliability. Raid-Z has better write performance than mirrors but it’s slower on reads.
  3. Flexibility: certain types of vdevs can be more flexible than others. You can add devices to a mirror vdev or remove them (if more than 2 devices are used). But the number of devices in a Raid-Z vdev cannot be changed after creation.
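The flexibility point can be illustrated with the mirror-management commands. A sketch with placeholder names (an existing pool `tank` whose mirror currently contains `/dev/vdb` and `/dev/vdc`):

```shell
# Attach a third disk to an existing two-way mirror; ZFS resilvers
# it online, turning the vdev into a three-way mirror.
zpool attach tank /dev/vdb /dev/vdd

# Detach a disk later to shrink the mirror back.
zpool detach tank /dev/vdd

# A failed disk in any redundant vdev is swapped out with "replace".
zpool replace tank /dev/vdc /dev/vde
```

No equivalent of `attach`/`detach` exists for Raid-Z vdevs, which is exactly the limitation described above.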

In the next tutorial, Use Your Storage Space More Effectively with ZFS — Exploring vdevs, we will configure ZFS on an Ubuntu instance, create a pool and learn how to use each type of vdev.


Follow me to keep abreast of the latest technology news, industry insights, and developer trends.
