Benchmarking OpenZFS vs EXT4 for my NAS

I’m building a NAS for myself and I was curious to see how OpenZFS would perform against ext4. My server will have full disk encryption on RAID1 and I couldn’t find a benchmark on a similar setup.

My hardware:

Intel i3-14100 CPU
2x32 GB DDR4 2666 MT/s RAM
1 TB NVMe for the OS
2x2 TB HDD in RAID1 for the main storage pool

I’m running NixOS with Linux Kernel 6.18.33.

The storage stack is different for the two filesystems I compared. For ext4, the filesystem sits on top of LVM, on top of LUKS, on top of software RAID1. For ZFS, I used its native mirroring and encryption, and also enabled compression for the whole pool.

What I want to see is how these two filesystems compare for my use case:

backups: large (~10s GiB), streaming writes and reads
thumbnail/sidecar/metadata generation: small size (~10s KiB), random I/O. This is to simulate mainly Home Assistant SQLite access, Immich small file generation, and Darktable sidecar file access.
photography ingestion: medium size (~10s MiB), random I/O. This box is where I’ll store photos from my SLR camera, both RAW and processed.
random usage: I’m not running a single service at a given time. It is very possible that an automated backup happens while I browse Immich while Home Assistant updates its state.

I have a ton of RAM in this machine, and it definitely alters the results of the tests. To reduce the chances of I/O being served from RAM instead of from the HDDs, I allocated 50 GiB of zeros in RAM:

# My /tmp is a tmpfs
dd if=/dev/zero of=/tmp/crap bs=1024 count=$((50 * 1024 * 1024)) status=progress
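One way to confirm that the filler file really shrank the memory left for caching is to read MemAvailable from /proc/meminfo before and after the dd. This is a minimal sketch, not part of the original procedure; the helper name kib_to_gib is made up for illustration, and the expectation that the value drops by roughly 50 GiB assumes /tmp is a tmpfs large enough to hold the file.

```shell
# Hypothetical helper: convert a KiB count (the unit /proc/meminfo uses) to GiB.
kib_to_gib() {
    awk -v kib="$1" 'BEGIN { printf "%.1f\n", kib / 1024 / 1024 }'
}

# Report the kernel's estimate of memory still available (Linux only).
if [ -r /proc/meminfo ]; then
    avail_kib=$(awk '/^MemAvailable:/ { print $2 }' /proc/meminfo)
    echo "MemAvailable: $(kib_to_gib "$avail_kib") GiB"
fi
```

Running it once before and once after the dd shows how much headroom is left for the page cache during the tests.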

For the ZFS tests, I capped ZFS ARC to 4 GiB:

echo $((4 * 1024 * 1024 * 1024)) | sudo tee /sys/module/zfs/parameters/zfs_arc_max

My OS is on a different ZFS pool, but there’s no way to configure ARC per pool, only system-wide.

I’m not sure it makes a difference, but I also turned off swap, which lives on my NVMe.

These changes are not bulletproof, but they help make the comparison a bit fairer.

Setting up storage

This is how I prepared the disks for these benchmarks. For ext4:

# Setup RAID1
mdadm --create --verbose --level=1 --raid-devices=2 /dev/md0 /dev/sda1 /dev/sdb1
# Wait until /proc/mdstat reports 100%

# Setup LUKS
cryptsetup luksFormat --type luks2 /dev/md0
cryptsetup open /dev/md0 nas_crypt

# Setup LVM
pvcreate /dev/mapper/nas_crypt
vgcreate nas_vg /dev/mapper/nas_crypt
lvcreate -l 100%FREE -n nas_data nas_vg
lvreduce -L -256M nas_vg/nas_data

# Setup ext4 fs
mkfs.ext4 /dev/nas_vg/nas_data

# Mount it
mkdir -p /mnt/nas
mount -o noatime /dev/nas_vg/nas_data /mnt/nas
chown --recursive h:users /mnt/nas

ZFS setup requires fewer steps, but has more knobs:

# Setup ZFS pool with encryption, compression, mirror
zpool create -O encryption=on -O keyformat=passphrase -O keylocation=prompt \
             -O compression=on \
             -O mountpoint=none \
             -O xattr=sa -O acltype=posixacl -O atime=off -o ashift=12 \
             main mirror /dev/disk/by-partlabel/mainTwoTB1 /dev/disk/by-partlabel/mainTwoTB2

# Create filesystem and mount it
zfs create -o mountpoint=legacy main/data
mount -t zfs main/data /mnt/nas

I first set up the ext4 system, then I ran the tests. After that, I formatted the HDDs, set up ZFS, and repeated the same tests.

When I say “ext4” I mean the complete stack: software RAID1, LUKS encryption, LVM, and ext4. Each one of these adds a layer to the I/O path.

Theoretical values

Western Digital claims this HDD model’s transfer rate is up to 175 MB/s (~167 MiB/s). This puts a theoretical max write speed at 167 MiB/s and 334 MiB/s for read (2x factor from RAID1). Let’s see if my system can get close to these numbers.
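The unit conversion behind those ceilings can be reproduced with a few lines of shell. This is a sketch of the arithmetic only; the helper name mbs_to_mibs is hypothetical, and the doubled read figure assumes both mirror members can be read in parallel at full speed.

```shell
# Hypothetical helper: convert decimal MB/s (10^6 bytes) to binary MiB/s (2^20 bytes).
mbs_to_mibs() {
    awk -v mbs="$1" 'BEGIN { printf "%.1f\n", mbs * 1000000 / 1048576 }'
}

single=$(mbs_to_mibs 175)            # one disk: write ceiling
mirror=$(mbs_to_mibs $((175 * 2)))   # two disks read in parallel: read ceiling
echo "write ceiling: $single MiB/s, read ceiling: $mirror MiB/s"
# prints "write ceiling: 166.9 MiB/s, read ceiling: 333.8 MiB/s"
```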

Backup benchmark

For these tests, a backup workload is a task that writes or reads large files, sequentially. There is only a single process at a given time. I used fio for this benchmark:

# Write test: simulate creating a backup
fio --name=backup-write --rw=write --bs=1M --size=20G --numjobs=1 --direct=1 \
    --filename=/mnt/nas/fio-test --ioengine=libaio

# Read test, drop caches first: simulate restoring a backup
echo 3 | sudo tee /proc/sys/vm/drop_caches
fio --name=backup-read --rw=read --bs=1M --size=20G --numjobs=1 --direct=1 \
    --filename=/mnt/nas/fio-test --ioengine=libaio

I ran each test twice for each filesystem. The first run was a warm-up round and the second was the actual benchmark. There’s no need to run this test N times and average them: fio already averages the results over the entire test. Both runs also gave consistent results.

As a baseline, the average CPU usage while idle is ~0.5 %.

	ZFS	ext4
Sequential write	172 MiB/s	121 MiB/s
Avg latency write	2.8 μs	8.3 ms
P99 latency write	7.5 μs	61 ms
Peak CPU usage (w)	14.5 %	4.7 %
Sequential read	182 MiB/s	196 MiB/s
Avg latency read	4.4 μs	5.1 ms
P99 latency read	7.9 μs	5.9 ms
Peak CPU usage (r)	29 %	41.1 %

As expected: reads are faster than writes. I’m happy that I can see RAID1 giving more read bandwidth than a single disk would. This shows all is working as expected, although the difference in ZFS is curiously not significant.

I’m surprised by the difference in latency in ext4: read latency is better behaved than write latency, as you can see from the P99 numbers. ZFS’s reported latency is three orders of magnitude smaller.
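A rough cross-check of the latency rows is possible from the bandwidth rows alone. Neither fio command above sets --iodepth, so fio uses its default queue depth of 1, and with a single request in flight the average time per request is roughly the block size divided by the bandwidth. The sketch below applies that to the table; the helper name is hypothetical and the estimate ignores any overhead between requests.

```shell
# Hypothetical helper: average time per request in ms, assuming one request
# in flight at a time. Arguments: block size in MiB, bandwidth in MiB/s.
implied_latency_ms() {
    awk -v bs="$1" -v bw="$2" 'BEGIN { printf "%.1f\n", bs / bw * 1000 }'
}

echo "ext4 write: $(implied_latency_ms 1 121) ms"   # table reports 8.3 ms
echo "ext4 read:  $(implied_latency_ms 1 196) ms"   # table reports 5.1 ms
echo "ZFS write:  $(implied_latency_ms 1 172) ms"   # table reports 2.8 μs
echo "ZFS read:   $(implied_latency_ms 1 182) ms"   # table reports 4.4 μs
```

The ext4 estimates match the table (8.3 ms and 5.1 ms). The ZFS estimates come out at 5.8 ms and 5.5 ms, far from the microsecond values reported, so the two columns may not be measuring the same portion of each request. fio reports submission and completion latency separately, which is one place to look.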

ZFS’s write performance against this ext4 stack is mind-blowing: it is faster than the manufacturer’s claimed maximum. Compression is magical: it reduces the amount of data written to disk, which increases the effective bandwidth. But I expected a better read speed; RAID1 here didn’t help as much as it did for ext4.

Metadata generation benchmark

Home Assistant writes to SQLite frequently, Immich generates small files, and Darktable reads and writes many sidecar files. These reads and writes are small (~10s KiB) and concurrent. I also used fio to simulate this scenario:

echo 3 | sudo tee /proc/sys/vm/drop_caches
fio --name=rand-rw --rw=randrw --rwmixread=50 --bs=4K --size=4G \
    --numjobs=4 --direct=1 --filename=/mnt/nas/fio-test \
    --ioengine=libaio --iodepth=32 --runtime=60 --time_based

	ZFS	ext4
Random write	473 KiB/s	782 KiB/s
Avg latency write	554 ms	348 ms
P99 latency write	1.2 s	2.0 s
Random read	452 KiB/s	754 KiB/s
Avg latency read	550 ms	321 ms
P99 latency read	1.2 s	2.1 s
Peak CPU usage	47.2 %	38 %

I knew random I/O was slower than sequential I/O, but I was not expecting such a huge difference. Compared with the backup benchmark, ext4 showed a ~200x reduction in bandwidth and a ~50x increase in average latency. The impact on ZFS was more extreme: ~390x for bandwidth and ~150000x for latency.
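The ratios can be recomputed from the two tables, taking the backup benchmark as the sequential baseline. This is a sketch of the arithmetic; the helper name ratio is hypothetical, and the inputs are the table values with units aligned (MiB/s converted to KiB/s, ms converted to μs where needed).

```shell
# Hypothetical helper: how many times larger a is than b, rounded to an integer.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", a / b }'
}

# Bandwidth drop: sequential MiB/s (converted to KiB/s) over random KiB/s.
echo "ext4 write bandwidth: $(ratio $((121 * 1024)) 782)x lower"
echo "ext4 read bandwidth:  $(ratio $((196 * 1024)) 754)x lower"
echo "ZFS write bandwidth:  $(ratio $((172 * 1024)) 473)x lower"
echo "ZFS read bandwidth:   $(ratio $((182 * 1024)) 452)x lower"

# Latency increase for ext4: both averages are in ms.
echo "ext4 write latency:   $(ratio 348 8.3)x higher"
echo "ext4 read latency:    $(ratio 321 5.1)x higher"

# Latency increase for ZFS: sequential averages are in μs, so 554 ms becomes 554000 μs.
echo "ZFS write latency:    $(ratio 554000 2.8)x higher"
echo "ZFS read latency:     $(ratio 550000 4.4)x higher"
```

The rounded figures in the text fall between the write and read ratios for each filesystem.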

Photography ingestion benchmark

I copied ~15 GiB of raw photos to my NVMe. Files range from 17 MiB to 28 MiB in size. This dataset has 717 photos, with around 70% of them in the range 20 MiB to 22 MiB.

This time I used rsync for the benchmark. I wanted to see how fast I can copy photos into my NAS’s HDDs.

I ran this test three times. I discarded the first result and took the worse of the other two. A more rigorous benchmark would run several times and average the results, but for my purposes this conservative approach is enough. I also got very similar results from all runs for each filesystem.

echo 3 | sudo tee /proc/sys/vm/drop_caches
rsync -a --stats /var/tmp/source/photos /mnt/nas/

	ZFS	ext4
Bandwidth	183 MiB/s	168 MiB/s
Peak CPU usage	29.8 %	41.2 %

ext4 surprised me here: the bandwidth is what the manufacturer claimed as maximum transfer rate.

ZFS’s bandwidth was even higher, due to compression being enabled. And its CPU usage was lower.
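To put the bandwidth gap in wall-clock terms, the copy time can be estimated from the dataset size. This is a sketch that takes 15 GiB as the dataset size (the text gives it as approximate); the helper name transfer_seconds is hypothetical.

```shell
# Hypothetical helper: seconds needed to move a dataset at a given rate.
# Arguments: dataset size in GiB, bandwidth in MiB/s.
transfer_seconds() {
    awk -v gib="$1" -v mibs="$2" 'BEGIN { printf "%.0f\n", gib * 1024 / mibs }'
}

echo "ZFS:  about $(transfer_seconds 15 183) s"
echo "ext4: about $(transfer_seconds 15 168) s"
```

With these inputs the estimate is 84 s against 91 s, a gap of about 7 seconds on a 15 GiB import.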

Random usage benchmark

In reality, this machine will be running different services simultaneously. It is very possible that an automated backup happens while Immich generates thumbnails while Home Assistant updates the database. I used fio to simulate this “closer to reality” scenario.

In this test, 60% of the I/O is reads.

echo 3 | sudo tee /proc/sys/vm/drop_caches
fio --name=mixed-load \
    --rw=randrw --rwmixread=60 --bs=128K \
    --size=8G --numjobs=4 --direct=1 \
    --filename=/mnt/nas/fio-test \
    --ioengine=libaio --iodepth=16 \
    --runtime=120 --time_based \
    --group_reporting

	ZFS	ext4
Random write	15.0 MiB/s	16.8 MiB/s
Avg latency write	204 ms	131 ms
P99 latency write	743 ms	1.9 s
Random read	21.7 MiB/s	24.6 MiB/s
Avg latency read	204 ms	234 ms
P99 latency read	743 ms	1.4 s
Peak CPU usage	43.6 %	10.8 %

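Bandwidth here is far higher than in the metadata benchmark: roughly 20x to 50x, depending on the filesystem and direction. Average latency is lower too, at between a third and three quarters of the metadata figure. P99 latency for ZFS is around 60% of the metadata figure; for ext4, write P99 barely moved (1.9 s versus 2.0 s) while read P99 fell to 1.4 s from 2.1 s.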
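Converting both benchmarks from bandwidth to requests per second makes them easier to compare, since the block sizes differ (4 KiB in the metadata test, 128 KiB here). This is a sketch of that conversion using the table values; the helper name iops is hypothetical.

```shell
# Hypothetical helper: requests per second, rounded to the nearest integer.
# Arguments: bandwidth value, size of the bandwidth unit in KiB
# (1 for KiB/s, 1024 for MiB/s), block size in KiB.
iops() {
    awk -v bw="$1" -v unit="$2" -v bs="$3" \
        'BEGIN { printf "%d\n", int(bw * unit / bs + 0.5) }'
}

# Metadata benchmark: 4 KiB blocks, bandwidth in KiB/s.
echo "ZFS  metadata: $(iops 473 1 4) write + $(iops 452 1 4) read"
echo "ext4 metadata: $(iops 782 1 4) write + $(iops 754 1 4) read"

# Mixed benchmark: 128 KiB blocks, bandwidth in MiB/s.
echo "ZFS  mixed:    $(iops 15.0 1024 128) write + $(iops 21.7 1024 128) read"
echo "ext4 mixed:    $(iops 16.8 1024 128) write + $(iops 24.6 1024 128) read"
```

Both benchmarks land at a few hundred requests per second in total, which suggests that the higher bandwidth here mostly reflects the 32x larger block size and not a higher request rate.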

Conclusion

It’s hard to answer the question of which filesystem is better here. For large sequential I/O, ZFS is faster with lower latency. For small and concurrent I/O, ext4 is faster and uses less CPU. For medium-sized sequential I/O, ZFS outperforms ext4 in bandwidth and CPU usage. For the random usage benchmark, ext4 is a bit faster and uses less CPU.

I suspect ZFS performance will be better on my real tasks, as I won’t limit ARC size and won’t have a hoard of zeros in RAM. Also, I can tune the record size (recordsize) and logbias for each ZFS filesystem, but that requires further tests.

These synthetic benchmarks give me an upper bound on performance, but in reality there are other factors that will impact performance. For example, my home network is 1 gigabit: I won’t be able to copy files to/from my NAS faster than ~120 MiB/s even if the disks and the machine can handle more. And even though fio simulates my workloads, it is only a simulation. A good synthetic result is not a guarantee of good production performance.
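The network ceiling can be worked out the same way as the disk ceiling, and compared against the sequential results from the backup benchmark. This is a sketch of the arithmetic; the helper name gbit_to_mibs is hypothetical, and the figure is the raw line rate, so real transfers will be somewhat slower once protocol overhead is included.

```shell
# Hypothetical helper: raw line rate of a link in MiB/s, ignoring protocol overhead.
gbit_to_mibs() {
    awk -v gbit="$1" 'BEGIN { printf "%.1f\n", gbit * 1000000000 / 8 / 1048576 }'
}

cap=$(gbit_to_mibs 1)
echo "1 Gbit/s link: $cap MiB/s before overhead"

# Compare the measured sequential results (MiB/s) against the link rate.
for result in "ZFS write:172" "ext4 write:121" "ZFS read:182" "ext4 read:196"; do
    name=${result%%:*}
    mibs=${result##*:}
    awk -v n="$name" -v v="$mibs" -v c="$cap" \
        'BEGIN { printf "%s: %s\n", n, (v > c ? "network-bound" : "disk-bound") }'
done
```

All four sequential results exceed the link rate, so over the network the two stacks would look about the same for large transfers.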

These tests are only for performance, and absolute performance is not all that matters. HDDs can silently fail and return corrupted data. mdadm can detect when a disk in the array has failed but not that it returned wrong data. ext4’s journal protects metadata after power loss and crashes, but doesn’t protect data against bit rot. ZFS checksums data and metadata and detects corruption during reads; if there’s a mirror or RAID-Z, it can also repair corrupted data via scrubs. For storing irreplaceable bits in my NAS, data integrity is more important than any performance difference I measured.

I had already made up my mind before running these benchmarks: I’m going with ZFS. I did this out of curiosity and for fun, and ended up learning a lot about the workloads I’ll have.