Optimizing the EXT3 file system on CentOS

Ext3 is a very capable file system with excellent fault tolerance and a long track record of stability. While it performs well, it's by no means the fastest file system out there. There are some things you can do to give ext3 a boost when you just want speed.

Some of the methods listed here reduce the information kept about your file system as a trade-off for speed. Not all users will see gains from these methods, as the benefit depends on the type of I/O access you have. Please take some time to identify your I/O requirements before trying these optimization methods.

1. Mount Options

1.1. noatime, nodiratime

This is one of the quickest and easiest performance gains. The noatime mount option tells the system not to update inode access times, which saves a write every time a file is read. This is a good option for web servers, news servers or other workloads that read files very frequently. Example:

/dev/VolGroup00/LogVol00 /                       ext3    defaults,noatime        1 1

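The mount man page lists an additional option, nodiratime, which disables access time updates for directories only. We have not performed time trials to see if adding it makes a material (or measurable) difference in performance. According to the same man page, noatime already applies to all inode types, directories included, so it is unlikely to add anything on top of noatime.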
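
To confirm that an option such as noatime actually took effect after editing /etc/fstab and remounting, the mount table can be inspected. The following is a minimal sketch, not an official tool: the function name has_mount_option and its argument order are made up for illustration, and it assumes the usual /proc/mounts layout (device, mount point, type, options).

```shell
# has_mount_option MOUNTPOINT OPTION [TABLE]
# Succeeds if the last entry for MOUNTPOINT in TABLE carries OPTION among
# its comma-separated mount options. TABLE defaults to /proc/mounts.
# (Hypothetical helper; the name and arguments are illustrative only.)
has_mount_option() {
    mountpoint=$1
    option=$2
    table=${3:-/proc/mounts}
    awk -v mp="$mountpoint" -v opt="$option" '
        $2 == mp {
            found = 0                      # a later entry overrides an earlier one
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == opt) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$table"
}

# Example: report whether / is currently mounted with noatime.
if has_mount_option / noatime; then
    echo "/ is mounted with noatime"
else
    echo "/ is not mounted with noatime"
fi
```

The same check works for the other options discussed in this section, for example has_mount_option /home data=writeback.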

1.2. commit

This file system option controls how often the file system is told to sync data and metadata. The default value is 5 seconds, but you can extend this for a performance gain. The downside is that if your system loses power or crashes without writing out data, you could lose up to that time value's worth of data. The value you set here is entirely up to you, based on the performance of your system and how much data you can afford to lose.

/dev/VolGroup00/LogVol00 /                       ext3    defaults,commit=120     1 1

1.3. data

This option has three modes for you to choose from. When other journaled file systems like XFS and JFS write metadata to the journal, they do just that. Ext3 goes the extra mile to protect your files, and by default also writes out the data associated with that metadata. This is basically the idea behind the 'data=ordered' mode, which writes file data to the main file system before committing the related metadata to the journal.

To make ext3 behave like XFS and other file systems, set 'data=writeback' in your mount options. The writeback mode does not preserve data ordering when writing to the disk, so metadata may be committed to the journal before the corresponding data reaches the file system. This mode is faster because only the metadata is journaled and data writes are not forced out ahead of each commit, but it is not quite as neurotic about protecting your data as the default: after a crash, recently written files may contain stale data.

The last data option, journal, is pretty much the polar opposite of the ordered option, forcing the data to write to the journal first, and then to the file system. This mode is usually the slowest, but can outperform the other options in limited cases where you need to read from AND write to the disk at the same time. As always, other people don't have exactly the same needs you do, so their benchmarks are a guide, not a rule. Play around and see which options work best for you.

/dev/VolGroup00/LogVol00 /home                   ext3    defaults,data=writeback  1 1

Attention: To use any mode other than 'data=ordered' on the root file system, you have to pass the mode to the kernel as a boot parameter by adding it to the kernel command line, for example: rootflags=data=writeback

2. Disk Elevators

CentOS 4 has four disk elevators (I/O schedulers): noop, anticipatory, deadline and cfq. They are there to minimize head seeks by re-ordering and merging requests to read or write data from common areas of the disk. These options offer performance increases, but speed boosts may not be as pronounced on systems using RAID, as the elevators do not take spindle striping into account.

A good explanation of the elevator options can be found in the June 2005 Red Hat Magazine.
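
On kernels that expose it, the file /sys/block/<device>/queue/scheduler lists the available elevators and marks the one in use with square brackets. The sketch below, which assumes that file format and uses a made-up helper name (active_elevator), prints the active elevator for each block device. Whether the file exists, and whether it is writable, depends on the kernel version you are running.

```shell
# active_elevator FILE
# Prints the name enclosed in square brackets in FILE, which is how the
# kernel marks the elevator currently in use.
# (Hypothetical helper; the name is illustrative only.)
active_elevator() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$1"
}

# Example: show the active elevator for every block device that exposes one.
for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue
    dev=${f#/sys/block/}
    dev=${dev%%/*}
    echo "$dev: $(active_elevator "$f")"
done
```

The elevator can be chosen for the whole system with the elevator= kernel boot parameter (for example elevator=deadline). Where runtime switching is supported, writing a scheduler name into the same sysfs file as root changes it for that one device.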

3. RAID Math

The biggest performance gain you can achieve on a RAID array is to make sure you format the volume aligned to your RAID stripe size. This is done by telling the file system the stride, which is the number of file system blocks in one RAID chunk. By setting up the file system in such a way that the writes match the RAID layout, you avoid overlap calculations and adjustments on the file system, and make it easier for the system to write out to the disk. The net result is that your system is able to write things faster, and you get better performance. To understand how the stride math actually works, you need to know a couple of things about the RAID setup you're using.

The type of RAID you're going to use (RAID 1, 5, 6, 10, etc.)
The number of data-bearing disks in the array
The chunk size of the RAID array
And lastly, you need to know the file system block size (4k blocks for ext3 for example).

The stride calculation works like this: You divide the chunk size by the block size for one spindle/drive only. This gives you your stride size. Then you take the stride size, and multiply it by the number of data-bearing disks in the RAID array. This gives you the stripe width to use when formatting the volume. This can be a little complex, so some examples are listed below.

For example, suppose you have 4 drives in RAID5 using 64K chunks, with a 4K file system block size. The stride size is calculated for one disk as (chunk size / block size), so (64K / 4K) gives 16. RAID5 gives up one disk's worth of capacity to parity, so there are 3 data-bearing disks out of the 4 in this RAID5 group. The stripe width is (number of data-bearing drives * stride size), so (3 * 16) gives a stripe width of 48.
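
The arithmetic above can be sketched in shell as follows. The function names stride and stripe_width are made up for this illustration, and sizes are assumed to be given in KB.

```shell
# stride CHUNK_KB BLOCK_KB
# Number of file system blocks in one RAID chunk.
stride() {
    echo $(( $1 / $2 ))
}

# stripe_width STRIDE DATA_DISKS
# Number of file system blocks in one full stripe across the data disks.
stripe_width() {
    echo $(( $1 * $2 ))
}

chunk_kb=64     # RAID chunk size
block_kb=4      # ext3 block size
data_disks=3    # 4-drive RAID5: one drive's worth of capacity holds parity

s=$(stride "$chunk_kb" "$block_kb")
w=$(stripe_width "$s" "$data_disks")
echo "stride=$s stripe-width=$w"    # prints: stride=16 stripe-width=48
```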

When you create an ext3 partition in this manner, you would format it like this:

mkfs.ext3 -b 4096 -E stride=16,stripe-width=48 -O dir_index /dev/XXXX

Sadly, the 'stripe-width=' extended option has disappeared from man mkfs.ext3 as of CentOS 5.3.

The dir_index feature listed above is the last tweak mentioned here. It allows ext3 to use hashed b-trees to speed up lookups in large directories. It's not a big gain, but it will help.
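
To see whether an existing file system already has dir_index enabled, the feature list printed by tune2fs -l can be checked. This is a minimal sketch: has_dir_index is a made-up helper name, and the sample line is illustrative input rather than output captured from a real system.

```shell
# has_dir_index
# Reads `tune2fs -l` style output on stdin and succeeds if dir_index
# appears on the "Filesystem features:" line.
# (Hypothetical helper; the name is illustrative only.)
has_dir_index() {
    awk -F: '
        $1 == "Filesystem features" {
            n = split($2, feats, " ")
            for (i = 1; i <= n; i++)
                if (feats[i] == "dir_index") found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# Illustrative input only; on a real system use: tune2fs -l /dev/XXXX | has_dir_index
sample='Filesystem features:      has_journal ext_attr resize_inode dir_index filetype sparse_super large_file'
if printf '%s\n' "$sample" | has_dir_index; then
    echo "dir_index is enabled"
else
    echo "dir_index is not enabled"
fi
```

If the feature is missing, the tune2fs man page describes turning it on with tune2fs -O dir_index /dev/XXXX, followed by e2fsck -fD /dev/XXXX on the unmounted file system to index the directories that already exist.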

If it were a 4 disk RAID10 array, then the stripe width would be (16 + 16) = 32. Each pair of disks is mirrored, so a pair is effectively 1 data-bearing disk with redundancy should either one fail, and the data is then striped across the two RAID1 sets to make RAID10.
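
The number of data-bearing disks is the part that changes between RAID levels. The sketch below captures the counts used in the examples above. The function name data_disks is made up for illustration, and the RAID10 case assumes two-way mirrors, as in the 4 disk example.

```shell
# data_disks LEVEL TOTAL_DISKS
# Prints how many disks' worth of capacity carry data for a RAID level.
# (Hypothetical helper; RAID10 is assumed to use two-way mirrors.)
data_disks() {
    case $1 in
        0)  echo "$2" ;;              # pure striping, no redundancy
        1)  echo 1 ;;                 # a plain mirror holds one disk's worth of data
        5)  echo $(( $2 - 1 )) ;;     # one disk's worth of parity
        6)  echo $(( $2 - 2 )) ;;     # two disks' worth of parity
        10) echo $(( $2 / 2 )) ;;     # striped across mirrored pairs
        *)  echo "unsupported RAID level: $1" >&2; return 1 ;;
    esac
}

stride=16   # 64K chunk / 4K block, as in the examples above
for level in 5 10; do
    d=$(data_disks "$level" 4)
    echo "RAID$level, 4 disks: data disks=$d stripe-width=$(( stride * d ))"
done
# prints:
#   RAID5, 4 disks: data disks=3 stripe-width=48
#   RAID10, 4 disks: data disks=2 stripe-width=32
```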