This page contains ext4 performance and scalability data collected from the upstream 3.6-rc3 kernel, along with a reference baseline from the 3.2 kernel for comparison. To permit comparisons with other Linux filesystems, the same measurements are made on ext3, xfs, and btrfs. Generally, each filesystem's default mkfs and mount options are used, with a few exceptions intended to make for a more useful comparison; these are noted in the description of the test system configuration below. The primary goal of this work is to evaluate some of ext4's scalability properties as a function of core and thread count in both journaled and non-journaled configurations.
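For reference, the non-journaled ("ext4 nojournal") configuration can be produced by disabling the has_journal feature at mkfs time. The sketch below is a minimal illustration of that, not the exact setup used for these tests; the device and mount point names are hypothetical, and the real mkfs and mount options are listed in the test system configuration.

```python
# Minimal sketch (not the exact test setup): create and mount ext4 with or
# without a journal.  The device and mount point are hypothetical; run as root.
import subprocess

DEVICE = "/dev/sdX1"       # hypothetical scratch device
MOUNTPOINT = "/mnt/test"   # hypothetical mount point

def make_ext4(journaled: bool) -> None:
    """mkfs and mount ext4, optionally with the has_journal feature disabled."""
    cmd = ["mkfs.ext4", "-F"]
    if not journaled:
        cmd += ["-O", "^has_journal"]   # the "ext4 nojournal" configuration
    subprocess.run(cmd + [DEVICE], check=True)
    subprocess.run(["mount", "-t", "ext4", DEVICE, MOUNTPOINT], check=True)

if __name__ == "__main__":
    make_ext4(journaled=False)
```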
The benchmarks are modified versions of the Boxacle ffsb workloads. Among other adjustments, the original ffsb profiles were changed to run 1 thread, 48 threads (one per core), and 192 threads (four per core) to better match them to the 48 core test system.
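As a rough illustration of how the three thread counts are exercised, the sketch below assumes one ffsb profile per thread count (the profile names are hypothetical) and assumes ffsb is invoked with the profile path as its only argument; it is not the harness actually used for these results.

```python
# Rough sketch of driving the same ffsb workload at each thread count.
# The profile file names, and the assumption that ffsb takes the profile
# path as its only argument, are illustrative only.
import subprocess

THREAD_COUNTS = [1, 48, 192]   # 1 thread, one per core, four per core

for threads in THREAD_COUNTS:
    profile = f"large_file_creates.{threads}.profile"   # hypothetical profile name
    with open(f"ffsb.{threads}.log", "w") as log:
        subprocess.run(["ffsb", profile], stdout=log, check=True)
```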
The data shown in the graphs linked below are from lock statistics (lockstat) runs. We collected two sets of lockstat data and used the one that most closely matched the results from three additional runs made without lock statistics enabled. Those additional runs help assess both the impact of lock statistics collection and run-to-run variability. Lock statistics collection has a significant effect on most workloads at one thread, depressing throughput by 10 to 20%. It has a much smaller effect at 48 or 192 threads, where it is usually indistinguishable from run-to-run variability.
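On kernels built with CONFIG_LOCK_STAT, lock statistics can be toggled through /proc/sys/kernel/lock_stat and read back from /proc/lock_stat. A minimal sketch of wrapping a benchmark run that way follows; the workload command is a placeholder.

```python
# Minimal sketch: clear, enable, and capture kernel lock statistics around a
# benchmark run.  Requires a kernel built with CONFIG_LOCK_STAT and root.
import subprocess

def write(path: str, value: str) -> None:
    with open(path, "w") as f:
        f.write(value)

write("/proc/lock_stat", "0")              # clear any existing statistics
write("/proc/sys/kernel/lock_stat", "1")   # enable collection
try:
    subprocess.run(["ffsb", "large_file_creates.48.profile"], check=True)  # placeholder workload
finally:
    write("/proc/sys/kernel/lock_stat", "0")            # disable collection
    with open("/proc/lock_stat") as src, open("lock_stat.txt", "w") as dst:
        dst.write(src.read())                           # save the per-lock report
```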
In the 3.2 release, there was a large scalability improvement for this workload on journaled ext4. The improvement bisected to Wu Fengguang's IO-less balance_dirty_pages() patch (commit 143dfe8611a63030ce0c79419dc362f7838be557, merged in the 3.2 merge window). jbd2 tracepoint data (3.1, 3.2) suggested that the patch forced all journal I/O into the background, issued more frequently and regularly in smaller batches covering groups of more or less adjacent inodes. Prior to 3.2, the inodes found in a single batch of journal I/O tended to be scattered. As a result, pressure on jbd2's j_state_lock, previously thought to be the limiting factor for this workload, was greatly reduced; the writeback code is now the biggest source of contention for ext4 both with and without a journal. Although the long worst-case wait time when write-locking j_state_lock in the journaled case may be worth investigating from a latency perspective, reducing it may not greatly improve scalability, given the modest gain observed when running without a journal at all.
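The jbd2 tracepoint data referenced above can be gathered through the kernel's tracing event interface, for example by enabling the jbd2 event group under /sys/kernel/debug/tracing and saving the trace buffer after the run. A minimal sketch, assuming debugfs is mounted at its usual location and a placeholder workload command:

```python
# Minimal sketch: capture jbd2 tracepoint data around a workload run.
# Assumes debugfs is mounted at /sys/kernel/debug and the script runs as root.
import subprocess

TRACING = "/sys/kernel/debug/tracing"

def write(path: str, value: str) -> None:
    with open(path, "w") as f:
        f.write(value)

write(f"{TRACING}/trace", "")                 # opening for write clears the ring buffer
write(f"{TRACING}/events/jbd2/enable", "1")   # enable all jbd2 tracepoints
try:
    subprocess.run(["ffsb", "large_file_creates.48.profile"], check=True)  # placeholder workload
finally:
    write(f"{TRACING}/events/jbd2/enable", "0")
    with open(f"{TRACING}/trace") as src, open("jbd2-trace.txt", "w") as dst:
        dst.write(src.read())   # note: the ring buffer may wrap on long runs
```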
1 thread - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
1 thread - 3.6-rc3: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.6-rc3: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.6-rc3: ext4, ext4 nojournal, ext3, xfs, btrfs
1 thread - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
1 thread - 3.6-rc3: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.6-rc3: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.6-rc3: ext4, ext4 nojournal, ext3, xfs, btrfs
ffsb reports much higher throughput rates for this workload than for the others. We think this may be because the amount of I/O generated is small relative to the amount of cache on the RAID controller. The test system also has a lot of memory available for page cache, which tends to convert a significant portion of the generated random I/O into more sequential write patterns than we might want.
1 thread - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
1 thread - 3.6-rc3: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.6-rc3: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.6-rc3: ext4, ext4 nojournal, ext3, xfs, btrfs
1 thread - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
1 thread - 3.6-rc3: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.6-rc3: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.6-rc3: ext4, ext4 nojournal, ext3, xfs, btrfs
The strong throughput numbers btrfs delivers for this workload appear to be a function of its internally set 4 MB read-ahead value. If a similarly aggressive read-ahead value is set on the block devices backing the ext4 or xfs filesystems, overriding the 128 KB default, similar throughput levels can be obtained.
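A comparable 4 MB read-ahead can be set on the underlying block device through sysfs, as in the minimal sketch below (the device name is hypothetical).

```python
# Minimal sketch: raise the block device read-ahead from the 128 KB default
# to 4 MB, matching btrfs's internal value.  Device name is hypothetical; run as root.
DEVICE = "sdX"   # hypothetical device backing the ext4 or xfs filesystem

with open(f"/sys/block/{DEVICE}/queue/read_ahead_kb", "w") as f:
    f.write("4096")   # read_ahead_kb is in KB; 4096 KB = 4 MB (default is 128)
```

The same change can be made with blockdev --setra 8192 /dev/sdX, since blockdev works in 512-byte sectors.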
1 thread - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
1 thread - 3.6-rc3: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.6-rc3: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.6-rc3: ext4, ext4 nojournal, ext3, xfs, btrfs
1 thread - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
1 thread - 3.6-rc3: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.6-rc3: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.6-rc3: ext4, ext4 nojournal, ext3, xfs, btrfs
This workload appears to be limited by the throughput of the storage system, so it doesn't reveal anything particularly interesting about ext4.
1 thread - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
1 thread - 3.6-rc3: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.6-rc3: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.6-rc3: ext4, ext4 nojournal, ext3, xfs, btrfs
1 thread - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
1 thread - 3.6-rc3: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.6-rc3: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.6-rc3: ext4, ext4 nojournal, ext3, xfs, btrfs
The mail server workload performs both reads and writes. The read and write throughput results are broken out in separate graphs. Because it's not possible to isolate the CPU utilization of the read operations from that of the write operations, aggregate utilization (cost) is reported in a single graph rather than in separate graphs for the read and write cases.
Unlike the other workloads, the mail server workload historically (3.2 and earlier) tended to exhibit a lot of run-to-run variability. However, the 3.6-rc3 data collected don't show this, and it's also not particularly evident when comparing the 3.6-rc3 data with the 3.2 results.
1 thread - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
1 thread - 3.6-rc3: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.6-rc3: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.6-rc3: ext4, ext4 nojournal, ext3, xfs, btrfs
1 thread - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
1 thread - 3.6-rc3: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
48 threads - 3.6-rc3: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.2: ext4, ext4 nojournal, ext3, xfs, btrfs
192 threads - 3.6-rc3: ext4, ext4 nojournal, ext3, xfs, btrfs