This page contains ext4 performance and scalability data collected from the upstream 3.2 kernel, along with reference data from 3.1 for comparison. To permit comparisons with other Linux filesystems, the same measurements are made on ext3, xfs, and btrfs. Each filesystem is generally tested with its default mkfs and mount options, with a few exceptions intended to make the comparison more useful; these are noted in the description of the test system configuration below. The primary goal of this work is to evaluate some of ext4's scalability properties as a function of core and thread count in both journaled and non-journaled configurations.
The benchmarks used are modified versions of the Boxacle ffsb workloads. Among other adjustments, the original ffsb profiles were changed to run 1 thread, 48 threads (one per CPU), and 192 threads (four per CPU) to better match them to the 48-core test system.
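As a rough illustration, the per-thread-count variants can be generated mechanically from one base profile. The sketch below assumes the thread count is expressed by a num_threads= line in the profile's threadgroup section, as in stock ffsb profiles; the file names are hypothetical.

    # Generate per-thread-count profile variants from one base profile.
    # Assumes the thread count is a "num_threads=" line in the profile's
    # threadgroup section; file names are hypothetical.
    import re
    from pathlib import Path

    base = Path("large_file_creates.ffsb").read_text()
    for nthreads in (1, 48, 192):     # 1, one per CPU, four per CPU
        variant = re.sub(r"(?m)^(\s*num_threads\s*=\s*)\d+", rf"\g<1>{nthreads}", base)
        Path(f"large_file_creates.{nthreads}t.ffsb").write_text(variant)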
The data shown in the graphs linked below are from lockstat runs. We collect two sets of lockstat data and use the one that most closely matches the results from three additional runs made without lock statistics. These additional runs help assess the impact of lock statistics collection and run-to-run variability. Lock statistics collection has a significant effect on most workloads at one thread, depressing throughput by 10 to 20%. It has a much smaller effect at 48 or 192 threads, where it is usually indistinguishable from run-to-run variability. The mail_server workload exhibits the most run-to-run variability; the other workloads tend to yield relatively similar results over multiple runs.
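The selection rule amounts to picking the lockstat dataset whose result lies closest to the average of the non-lockstat runs. A minimal sketch, with made-up throughput numbers, follows.

    # Pick the lockstat dataset closest to the mean of the plain runs.
    # All throughput numbers are made up for illustration.
    from statistics import mean

    lockstat_runs = {"lockstat-a": 405.0, "lockstat-b": 388.0}   # MB/s
    plain_runs = [421.0, 430.0, 415.0]                           # MB/s

    target = mean(plain_runs)
    chosen = min(lockstat_runs, key=lambda k: abs(lockstat_runs[k] - target))
    print(f"using {chosen}: {lockstat_runs[chosen]} MB/s (plain mean {target:.1f} MB/s)")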
There is a large improvement for this workload's results on journaled ext4. The improvement bisects cleanly to Wu Fengguang's IO-less balance_dirty_pages() patch (commit 143dfe8611a63030ce0c79419dc362f7838be557, merged during the 3.2 merge window). jbd2 tracepoint data (3.1, 3.2) suggest that the patch forced all journal I/O into the background in smaller but more frequently and regularly issued batches, grouped over more or less adjacent inodes. Previously, the inodes found in a single batch of journal I/Os tended to be scattered. Pressure on jbd2's j_state_lock, previously thought to be the limiting factor for this workload, was greatly reduced.
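For reference, the lock pressure figures come from the kernel's lock statistics facility. On a kernel built with CONFIG_LOCK_STAT, with collection toggled through /proc/sys/kernel/lock_stat, the j_state_lock rows can be pulled out of /proc/lock_stat with something as simple as the sketch below.

    # Filter the j_state_lock rows out of /proc/lock_stat. Requires a kernel
    # built with CONFIG_LOCK_STAT; collection is switched on and off via
    # /proc/sys/kernel/lock_stat.
    from pathlib import Path

    lines = Path("/proc/lock_stat").read_text().splitlines()
    banner = lines[:4]     # version banner and column headings, kept for context
    hits = [l for l in lines if "j_state_lock" in l]
    print("\n".join(banner + hits) if hits else "no j_state_lock entries found")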
1 thread - 3.1: ext4, ext4 nojournal, ext3, xfs, btrfs
1 thread - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.1: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.1: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs

1 thread - 3.1: ext4, ext4 nojournal, ext3, xfs, btrfs
1 thread - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.1: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.1: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
ffsb reports much higher throughput rates for this workload than for the others. We think this may be because the amount of I/O generated is relatively small compared to the amount of cache on the RAID controller. The test system also has a lot of memory available for page cache, which should tend to convert a significant portion of the generated random I/O into more sequential write patterns than we might want.
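As a back-of-the-envelope check, comparing the amount of I/O a run generates against the combined controller cache and page cache gives a feel for how much of the random write stream can be absorbed before it reaches the disks. The sizes below are placeholders, not the test system's actual values.

    # Placeholder sizes only -- not the actual test system values.
    controller_cache_gib = 0.5    # assumed RAID controller cache
    page_cache_gib = 64.0         # assumed memory available for page cache
    generated_io_gib = 20.0       # assumed I/O generated by one run

    ratio = generated_io_gib / (controller_cache_gib + page_cache_gib)
    print(f"generated I/O / available cache = {ratio:.2f}")
    # A ratio well below 1 means most "random" writes can be absorbed and
    # reordered in cache before reaching the disks, inflating throughput.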
1 thread - 3.1: ext4, ext4 nojournal, ext3, xfs, btrfs
1 thread - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.1: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.1: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs

1 thread - 3.1: ext4, ext4 nojournal, ext3, xfs, btrfs
1 thread - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.1: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.1: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
The strong throughput numbers btrfs delivers for this workload appear to be a function of an internally set 4 MB readahead value. If a similarly aggressive readahead value is set for the block devices backing the ext4 or xfs filesystems, overriding the 128 KB default, similar throughput levels can be obtained. Ext4 therefore does not appear to need improvement for this workload at present.
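A minimal sketch of that readahead override follows; it writes the per-device read_ahead_kb sysfs attribute (the device name is hypothetical, and the change needs root). The equivalent blockdev invocation is blockdev --setra 8192 <device>, since --setra is expressed in 512-byte sectors.

    # Raise the block-device readahead from the 128 KB default to 4 MB to
    # approximate btrfs's internal setting. "sdb" is a hypothetical device.
    from pathlib import Path

    ra_kb = Path("/sys/block/sdb/queue/read_ahead_kb")
    print("current readahead:", ra_kb.read_text().strip(), "KB")   # typically 128
    ra_kb.write_text("4096")     # 4 MB expressed in KB
    print("new readahead:", ra_kb.read_text().strip(), "KB")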
1 thread - 3.1: ext4, ext4 nojournal, ext3, xfs, btrfs
1 thread - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.1: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.1: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs

1 thread - 3.1: ext4, ext4 nojournal, ext3, xfs, btrfs
1 thread - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.1: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.1: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
This workload appears to be limited by the throughput of the storage system, so it doesn't reveal anything particularly interesting about ext4 right now.
1 thread - 3.1: ext4, ext4 nojournal, ext3, xfs, btrfs
1 thread - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.1: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.1: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs

1 thread - 3.1: ext4, ext4 nojournal, ext3, xfs, btrfs
1 thread - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.1: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.1: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
The mail server workload performs both reads and writes. The read and write throughput results are broken out into separate graphs. Because the CPU utilization of the read operations cannot be isolated from that of the write operations, aggregate utilization (cost) is reported in a single graph rather than in separate graphs for the read and write cases.
Unlike the other workloads, the mail server workload exhibits a lot of run-to-run variability. There may be something interesting in that variability, but more work needs to be done to adapt this workload to the test system.
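One simple way to quantify that variability is the coefficient of variation (standard deviation over mean) across repeated runs of a single configuration; the sketch below uses made-up throughput numbers purely for illustration.

    # Coefficient of variation across repeated runs of one configuration.
    # The throughput numbers are made up for illustration.
    from statistics import mean, stdev

    runs_mb_s = [412.0, 371.0, 455.0]
    cov = stdev(runs_mb_s) / mean(runs_mb_s)
    print(f"run-to-run coefficient of variation: {cov:.1%}")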
1 thread - 3.1: ext4, ext4 nojournal, ext3, xfs, btrfs
1 thread - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.1: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.1: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs

1 thread - 3.1: ext4, ext4 nojournal, ext3, xfs, btrfs
1 thread - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.1: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.1: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs