OCFS2 is a file system. It allows users to store and retrieve data. The data is stored in files that are organized in a hierarchical directory tree. It is a POSIX compliant file system that supports the standard interfaces and the behavioral semantics as spelled out by that specification.
It is also a shared disk cluster file system, one that allows multiple nodes to access the same disk at the same time. This is where the fun begins as allowing a file system to be accessible on multiple nodes opens a can of worms. What if the nodes are of different architectures? What if a node dies while writing to the file system? What data consistency can one expect if processes on two nodes are reading and writing concurrently? What if one node removes a file while it is still being used on another node?
Unlike most shared file systems where the answer is fuzzy, the answer in OCFS2 is very well defined. It behaves on all nodes exactly like a local file system. If a file is removed, the directory entry is removed but the inode is kept as long as it is in use across the cluster. When the last user closes the descriptor, the inode is marked for deletion.
The data consistency model follows the same principle. It works as if the two processes that are running on two different nodes are running on the same node. A read on a node gets the last write irrespective of the IO mode used. The modes can be buffered, direct, asynchronous, splice or memory mapped IOs. It is fully cache coherent.
Take for example the REFLINK feature that allows a user to create multiple writable snapshots of a file. This feature, like all others, is fully cluster-aware. A file being written to on multiple nodes can be safely reflinked on another. The snapshot created is a point-in-time image of the file that includes both the file data and all its attributes (including extended attributes).
It is a journaling file system. When a node dies, a surviving node transparently replays the journal of the dead node. This ensures that the file system metadata is always consistent. It also defaults to ordered data journaling to ensure the file data is flushed to disk before the journal commit, to remove the small possibility of stale data appearing in files after a crash.
It is architecture and endian neutral. It allows concurrent mounts on nodes with different processors like x86, x86_64, IA64 and PPC64. It handles little and big endian, 32-bit and 64-bit architectures.
It is feature rich. It supports indexed directories, metadata checksums, extended attributes, POSIX ACLs, quotas, REFLINKs, sparse files, unwritten extents and inline-data.
It is fully integrated with the mainline Linux kernel. The file system was merged into Linux kernel 2.6.16 in early 2006.
It is quickly installed. It is available with almost all Linux distributions. The file system is on-disk compatible across all of them.
It is modular. The file system can be configured to operate with other cluster stacks like Pacemaker and CMAN along with its own stack, O2CB.
It is easily configured. The O2CB cluster stack configuration involves editing two files, one for cluster layout and the other for cluster timeouts.
It is very efficient. The file system consumes very few resources. It is used to store virtual machine images in limited memory environments like Xen and KVM.
In summary, OCFS2 is an efficient, easily configured, modular, quickly installed, fully integrated and compatible, feature-rich, architecture and endian neutral, cache coherent, ordered data journaling, POSIX-compliant, shared disk cluster file system.
OCFS2 is a general-purpose shared-disk cluster file system for Linux capable of providing both high performance and high availability.
As it provides local file system semantics, it can be used with almost all applications. Cluster-aware applications can make use of cache-coherent parallel I/Os from multiple nodes to scale out easily. Other applications can make use of the clustering facilities to fail over running applications in the event of a node failure.
The notable features of the file system are:
The support for o2cb cluster stack is available in all releases.
The support for no cluster stack, or local mount, was added in Linux kernel 2.6.20.
The support for userspace cluster stack was added in Linux kernel 2.6.26.
With allocation reservation, the file system reserves a window in the bitmap for all extending files allowing each to grow as contiguously as possible. As this extra space is not actually allocated, it is available for use by other files if the need arises. This feature was added in Linux kernel 2.6.35 and can be tuned using the mount option resv_level.
Security attributes allow the file system to support other security regimes like SELinux, SMACK, AppArmor, etc.
Both these security extensions were added in Linux kernel 2.6.29 and require enabling the on-disk feature xattr.
The support for clustered flock(2) was added in Linux kernel 2.6.26. All flock(2) options are supported, including the kernel's ability to cancel a lock request when an appropriate kill signal is received by the user. This feature is supported with all cluster stacks including o2cb.
The support for clustered fcntl(2) was added in Linux kernel 2.6.28. But because it requires group communication to make the locks coherent, it is only supported with userspace cluster stacks, pcmk and cman and not with the default cluster stack o2cb.
The O2CB cluster stack has a global heartbeat mode. It allows users to specify heartbeat regions that are consistent across all nodes. The cluster stack also allows online addition and removal of both nodes and heartbeat regions.
o2cb(8) is the new cluster configuration utility. It is an easy to use utility that allows users to create the cluster configuration on a node that is not part of the cluster. It replaces the older utility o2cb_ctl(8) which has been deprecated.
ocfs2console(8) has been obsoleted.
o2info(8) is a new utility that can be used to provide file system information. It allows non-privileged users to see the enabled file system features, block and cluster sizes, extended file stat, free space fragmentation, etc.
o2hbmonitor(8) is an o2hb heartbeat monitor. It is an extremely lightweight utility that logs messages to the system logger once the heartbeat delay exceeds the warn threshold. This utility is useful in identifying volumes encountering I/O delays.
debugfs.ocfs2(8) has some new commands. net_stats shows the o2net message times between various nodes. This is useful in identifying nodes that are slowing down the cluster operations. stat_sysdir allows the user to dump the entire system directory that can be used to debug issues. grpextents dumps the complete free space fragmentation in the cluster group allocator.
mkfs.ocfs2(8) now enables xattr, indexed-dirs, discontig-bg, refcount, extended-slotmap and clusterinfo feature flags by default, in addition to the older defaults, sparse, unwritten and inline-data.
mount.ocfs2(8) allows users to specify the level of cache coherency between nodes. By default the file system operates in full coherency mode that also serializes the direct I/Os. While this mode is technically correct, it limits the I/O throughput in a clustered database. This mount option allows the user to limit the cache coherency to only the buffered I/Os to allow multiple nodes to do concurrent direct writes to the same file. This feature works with Linux kernel 2.6.37 and later.
The OCFS2 development team goes to great lengths to maintain compatibility. It attempts to maintain both on-disk and network protocol compatibility across all releases of the file system. It does so even while adding new features that entail on-disk format and network protocol changes. To do this successfully, it follows a few rules:
1. The on-disk format changes are managed by a set of feature flags that can be turned on and off. The file system in kernel detects these features during mount and continues only if it understands all the features. Users encountering this have the option of either disabling that feature or upgrading the file system to a newer release.
2. The latest release of ocfs2-tools is compatible with all versions of the file system. All utilities detect the features enabled on disk and continue only if they understand all the features. Users encountering this have to upgrade the tools to a newer release.
3. The network protocol version is negotiated by the nodes to ensure all nodes understand the active protocol version.
Compat, or compatible, is a feature that the file system does not need to fully understand to safely read/write to the volume. An example of this is the backup-super feature that added the capability to back up the super block in multiple locations in the file system. As the backup super blocks are typically neither read nor written to by the file system, an older file system can safely mount a volume with this feature enabled.
Incompat, or incompatible, is a feature that the file system needs to fully understand to read/write to the volume. Most features fall under this category.
RO Compat, or read-only compatible, is a feature that the file system needs to fully understand to write to the volume. Older software can safely read a volume with this feature enabled. An example of this would be user and group quotas. As quotas are manipulated only when the file system is written to, older software can safely mount such volumes in read-only mode.
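The three categories amount to a small decision rule at mount time. The sketch below is only an illustration of that rule, not the kernel's code: the function name `mount_mode` and the "older software" bit masks are hypothetical, and the bit values are taken from the Hex Value column of the feature table.

```python
from typing import Optional


def mount_mode(vol_incompat: int, vol_ro_compat: int,
               known_incompat: int, known_ro_compat: int) -> Optional[str]:
    """Return "rw", "ro", or None (mount refused) for a volume.

    Compat flags are deliberately absent: by definition they never
    block a mount, so they do not enter the decision.
    """
    # Any Incompat bit the software does not understand blocks every mount.
    if vol_incompat & ~known_incompat:
        return None
    # An unknown RO Compat bit only blocks writing.
    if vol_ro_compat & ~known_ro_compat:
        return "ro"
    return "rw"


# Hypothetical older software that understands only sparse (0x10)
# and no RO Compat features at all.
OLD_INCOMPAT = 0x10
OLD_RO_COMPAT = 0x0

print(mount_mode(0x10, 0x0, OLD_INCOMPAT, OLD_RO_COMPAT))          # rw
print(mount_mode(0x10, 0x1, OLD_INCOMPAT, OLD_RO_COMPAT))          # ro
print(mount_mode(0x10 | 0x200, 0x0, OLD_INCOMPAT, OLD_RO_COMPAT))  # None
```

The second call models a volume with unwritten enabled: it is readable but not writable by the older software. The third models a volume with xattr enabled, which cannot be mounted at all.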
The list of feature flags, the version of the kernel it was added in, the earliest version of the tools that understands it, etc., is as follows:
Feature Flags | Kernel Version | Tools Version | Category | Hex Value |
backup-super | All | ocfs2-tools 1.2 | Compat | 1 |
strict-journal-super | All | All | Compat | 2 |
local | Linux 2.6.20 | ocfs2-tools 1.2 | Incompat | 8 |
sparse | Linux 2.6.22 | ocfs2-tools 1.4 | Incompat | 10 |
inline-data | Linux 2.6.24 | ocfs2-tools 1.4 | Incompat | 40 |
extended-slotmap | Linux 2.6.27 | ocfs2-tools 1.6 | Incompat | 100 |
xattr | Linux 2.6.29 | ocfs2-tools 1.6 | Incompat | 200 |
indexed-dirs | Linux 2.6.30 | ocfs2-tools 1.6 | Incompat | 400 |
metaecc | Linux 2.6.29 | ocfs2-tools 1.6 | Incompat | 800 |
refcount | Linux 2.6.32 | ocfs2-tools 1.6 | Incompat | 1000 |
discontig-bg | Linux 2.6.35 | ocfs2-tools 1.6 | Incompat | 2000 |
clusterinfo | Linux 2.6.37 | ocfs2-tools 1.8 | Incompat | 4000 |
unwritten | Linux 2.6.23 | ocfs2-tools 1.4 | RO Compat | 1 |
grpquota | Linux 2.6.29 | ocfs2-tools 1.6 | RO Compat | 2 |
usrquota | Linux 2.6.29 | ocfs2-tools 1.6 | RO Compat | 4 |
To query the features enabled on a volume, do:
$ o2info --fs-features /dev/sdf1
backup-super strict-journal-super sparse extended-slotmap inline-data xattr indexed-dirs refcount discontig-bg clusterinfo unwritten
The format utility, mkfs.ocfs2(8), allows a user to enable and disable specific features using the fs-features option. The features are provided as a comma separated list. The enabled features are listed as is. The disabled features are prefixed with no. The example below shows the file system being formatted with sparse disabled and inline-data enabled.
# mkfs.ocfs2 --fs-features=nosparse,inline-data /dev/sda1
After formatting, users can toggle features using the tune utility, tunefs.ocfs2(8). This is an offline operation. The volume needs to be unmounted across the cluster. The example below shows the sparse feature being enabled and inline-data disabled.
# tunefs.ocfs2 --fs-features=sparse,noinline-data /dev/sda1
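The comma-separated syntax, with "no" prefixed to disabled features, is easy to generate from a script. The helper below is a minimal sketch; the function name `fs_features_option` is hypothetical and it only builds the option string, it does not run mkfs.ocfs2(8) or tunefs.ocfs2(8).

```python
from typing import Iterable


def fs_features_option(enable: Iterable[str] = (),
                       disable: Iterable[str] = ()) -> str:
    """Build an --fs-features=... option from two lists of feature names."""
    enable, disable = list(enable), list(disable)
    clash = set(enable) & set(disable)
    if clash:
        raise ValueError("both enabled and disabled: %s" % sorted(clash))
    # Enabled features are listed as is; disabled ones get the "no" prefix.
    parts = enable + ["no" + name for name in disable]
    if not parts:
        raise ValueError("no features given")
    return "--fs-features=" + ",".join(parts)


print(fs_features_option(enable=["sparse"], disable=["inline-data"]))
# --fs-features=sparse,noinline-data
```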
Care should be taken before enabling and disabling features. Users planning to use a volume with an older version of the file system will be better off not enabling newer features, as disabling them later may not succeed.
An example would be disabling the sparse feature; this requires filling every hole. The operation can only succeed if the file system has enough free space.
Say one tries to mount a volume with an incompatible feature. What happens then? How does one detect the problem? How does one know the name of that incompatible feature?
To begin with, one should look for error messages in dmesg(8). Mount failures that are due to an incompatible feature will always result in an error message like the following:
ERROR: couldn't mount because of unsupported optional features (200).
Here the file system is unable to mount the volume due to an unsupported optional feature. This means the feature is an Incompat feature. By referring to the table above, one can then deduce that the user failed to mount a volume with the xattr feature enabled. (The value in the error message is in hexadecimal.)
Another example of an error message due to incompatibility is as follows:
ERROR: couldn't mount RDWR because of unsupported optional features (1).
Here the file system is unable to mount the volume in the RW mode. This means the feature is an RO Compat feature. Another look at the table and it becomes apparent that the volume had the unwritten feature enabled.
In both cases, the user has the option of disabling the feature. In the second case, the user has the choice of mounting the volume in the RO mode.
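The table lookup described above can be automated. The sketch below assumes the two message formats shown in this section and copies the hex values from the feature table; the function name `decode_mount_error` is hypothetical. Because the value is a bit mask, a message may name more than one feature at once.

```python
import re

# Hex values copied from the feature flag table.
INCOMPAT = {
    0x8: "local", 0x10: "sparse", 0x40: "inline-data",
    0x100: "extended-slotmap", 0x200: "xattr", 0x400: "indexed-dirs",
    0x800: "metaecc", 0x1000: "refcount", 0x2000: "discontig-bg",
    0x4000: "clusterinfo",
}
RO_COMPAT = {0x1: "unwritten", 0x2: "grpquota", 0x4: "usrquota"}

PATTERN = re.compile(
    r"couldn't mount (RDWR )?because of unsupported optional features "
    r"\(([0-9a-fA-F]+)\)")


def decode_mount_error(message):
    """Return (category, [feature names]) or None if the line does not match."""
    match = PATTERN.search(message)
    if match is None:
        return None
    # "RDWR" in the message means only the read-write mount was refused.
    category, table = (("RO Compat", RO_COMPAT) if match.group(1)
                       else ("Incompat", INCOMPAT))
    value = int(match.group(2), 16)  # the value is printed in hexadecimal
    names = [name for bit, name in sorted(table.items()) if value & bit]
    leftover = value & ~sum(table)   # bits not present in the table
    if leftover:
        names.append("unknown(0x%x)" % leftover)
    return category, names


print(decode_mount_error(
    "ERROR: couldn't mount because of unsupported optional features (200)."))
# ('Incompat', ['xattr'])
print(decode_mount_error(
    "ERROR: couldn't mount RDWR because of unsupported optional features (1)."))
# ('RO Compat', ['unwritten'])
```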
The OCFS2 software is split into two components, namely, kernel and tools. The kernel component includes the core file system and the cluster stack, and is packaged along with the kernel. The tools component is packaged as ocfs2-tools and needs to be specifically installed. It provides utilities to format, tune, mount, debug and check the file system.
To install ocfs2-tools, refer to the package handling utility in your distribution.
The next step is selecting a cluster stack. The options include:
A. No cluster stack, or local mount.
B. In-kernel o2cb cluster stack with local or global heartbeat.
C. Userspace cluster stacks pcmk or cman.
The file system allows changing cluster stacks easily using tunefs.ocfs2(8). To list the cluster stacks stamped on the OCFS2 volumes, do:
# mounted.ocfs2 -d
Device     Stack  Cluster     F  UUID                              Label
/dev/sdb1  o2cb   webcluster  G  DCDA2845177F4D59A0F2DCD8DE507CC3  hbvol1
/dev/sdc1  None                  23878C320CF3478095D1318CB5C99EED  localmount
/dev/sdd1  o2cb   webcluster  G  8AB016CD59FC4327A2CDAB69F08518E3  webvol
/dev/sdg1  o2cb   webcluster  G  77D95EF51C0149D2823674FCC162CF8B  logsvol
/dev/sdh1  o2cb   webcluster  G  BBA1DBD0F73F449384CE75197D9B7098  scratch
To format an OCFS2 volume as a non-clustered (local) volume, do:
# mkfs.ocfs2 -L "mylabel" --fs-features=local /dev/sda1
To convert an existing clustered volume to a non-clustered volume, do:
# tunefs.ocfs2 --fs-features=local /dev/sda1
Non-clustered volumes do not interact with the cluster stack. One can have both clustered and non-clustered volumes mounted at the same time.
While formatting a non-clustered volume, users should consider the possibility of later converting that volume to a clustered one. If there is a possibility of that, then the user should add enough node-slots using the -N option. Adding node-slots during format creates journals with large extents. If created later, then the journals will be fragmented which is not good for performance.
Only one of the two heartbeat modes can be active at any one time. Changing heartbeat modes is an offline operation.
Both heartbeat modes require /etc/ocfs2/cluster.conf and /etc/sysconfig/o2cb to be populated as described in ocfs2.cluster.conf(5) and o2cb.sysconfig(5) respectively. The only difference in set up between the two modes is that global requires heartbeat devices to be configured whereas local does not.
Refer to o2cb(7) for more information.
Once configured, the o2cb cluster stack can be onlined and offlined as follows:
# service o2cb online
Setting cluster stack "o2cb": OK
Registering O2CB cluster "webcluster": OK
Setting O2CB cluster timeouts : OK
# service o2cb offline
Clean userdlm domains: OK
Stopping O2CB cluster webcluster: OK
Unregistering O2CB cluster "webcluster": OK
These heartbeat devices are OCFS2 formatted volumes with global heartbeat enabled on disk. These volumes can later be mounted and used as clustered file systems.
The steps to format a volume with global heartbeat enabled are listed in o2cb(7). Also described there is how to list all volumes with the cluster stack stamped on disk.
In this mode, the heartbeat is started when the cluster is onlined and stopped when the cluster is offlined.
# service o2cb online
Setting cluster stack "o2cb": OK
Registering O2CB cluster "webcluster": OK
Setting O2CB cluster timeouts : OK
Starting global heartbeat for cluster "webcluster": OK
# service o2cb offline
Clean userdlm domains: OK
Stopping global heartbeat on cluster "webcluster": OK
Stopping O2CB cluster webcluster: OK
Unregistering O2CB cluster "webcluster": OK
# service o2cb status
Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Stack glue driver: Loaded
Stack plugin "o2cb": Loaded
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster "webcluster": Online
  Heartbeat dead threshold: 31
  Network idle timeout: 30000
  Network keepalive delay: 2000
  Network reconnect delay: 2000
  Heartbeat mode: Global
Checking O2CB heartbeat: Active
  77D95EF51C0149D2823674FCC162CF8B /dev/sdg1
Nodes in O2CB cluster: 92 96
Configure and online the userspace stack pcmk or cman before using tunefs.ocfs2(8) to update the cluster stack on disk.
# tunefs.ocfs2 --update-cluster-stack /dev/sdd1
Updating on-disk cluster information to match the running cluster.
DANGER: YOU MUST BE ABSOLUTELY SURE THAT NO OTHER NODE IS USING THIS
FILESYSTEM BEFORE MODIFYING ITS CLUSTER CONFIGURATION.
Update the on-disk cluster information? y
Refer to the cluster stack documentation for information on starting and stopping the cluster stack.
This section lists the utilities that are used to manage OCFS2 file systems. This includes tools to format, tune, check, mount and debug the file system. Each utility has a man page that lists its capabilities in detail.
As a precaution, the utility will abort if the volume is locally mounted. It also detects use across the cluster if used by OCFS2. But these checks are not comprehensive and can be overridden. So use it with care.
While it is not always required, the cluster should be online.
This utility requires the cluster to be online.
This utility requires the cluster to be online to ensure the volume is not in use on another node and to prevent the volume from being mounted for the duration of the check.
This utility detects the cluster status and aborts if the cluster is offline or does not match the cluster stamped on disk.
This utility only updates the disk if the utility is reasonably assured that the file system is not in use on any node.
It can be used by both privileged and non-privileged users. Users having read permission on the device can provide the path to the device. Other users can provide the path to a file on a mounted file system.
This utility requires the user to have read permission on the device.
The image file created can be used in debugging on-disk corruptions.
This section lists the utilities that are used to manage the O2CB cluster stack. Each utility has a man page that lists its capabilities in detail.
This is a new utility and replaces o2cb_ctl(8) which has been deprecated.
This section includes some useful notes that may prove helpful to the user.
The standard recommendation for such clusters is to have identical hardware and software across all the nodes. However, that is not a hard and fast rule. After all, we have taken the effort to ensure that OCFS2 works in a mixed architecture environment.
If one uses OCFS2 in a mixed architecture environment, try to ensure that the nodes are equally powered and loaded. The use of a load balancer can assist with the latter. Power refers to the number of processors, speed, amount of memory, I/O throughput, network bandwidth, etc. In reality, having equally powered heterogeneous nodes is not always practical. In that case, make the lower node numbers more powerful than the higher node numbers. The O2CB cluster stack favors lower node numbers in all of its tiebreaking logic.
This is not to suggest you should add a single core node in a cluster of quad cores. No amount of node number juggling will help you there.
First is the hard link count, that indicates the number of directory entries pointing to that inode. As long as an inode has one or more directory entries pointing to it, it cannot be deleted. The file system has to wait for the removal of all those directory entries. In other words, wait for that count to drop to zero.
The second hurdle is the POSIX semantics allowing files to be unlinked even while they are in-use. In OCFS2, that translates to in-use across the cluster. The file system has to wait for all processes across the cluster to stop using the inode.
Once these conditions are met, the inode is deleted and the freed space is visible after the next sync.
Now the amount of space freed depends on the allocation. Only space that is actually allocated to that inode is freed. The example below shows a sparsely allocated file of size 51TB of which only 2.4GB is actually allocated.
$ ls -lsh largefile
2.4G -rw-r--r-- 1 mark mark 51T Sep 29 15:04 largefile
Furthermore, for reflinked files, only private extents are freed. Shared extents are freed when the last inode accessing them is deleted. The example below shows a 4GB file that shares 3GB with other reflinked files. Deleting it will increase the free space by 1GB. However, if it is the only remaining file accessing the shared extents, the full 4GB will be freed. (More information on the shared-du(1) utility is provided below.)
$ shared-du -m -c --shared-size reflinkedfile
4000    (3000)  reflinkedfile
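The arithmetic in this example is total size minus shared size. As a rough sketch, the helper below parses one line of shared-du(1) output and reports the private portion, which is the space that deleting the file is sure to free. The line format is assumed from the example above, and the function name `private_mb` is hypothetical.

```python
import re

# Assumed line format: "<total> (<shared>) <filename>", sizes in MB with -m.
LINE = re.compile(r"\s*(\d+)\s+\((\d+)\)\s+(.+)$")


def private_mb(line):
    """Return (filename, MB not shared with any other reflinked file)."""
    match = LINE.match(line)
    if match is None:
        raise ValueError("unrecognised shared-du line: %r" % line)
    total, shared, name = int(match.group(1)), int(match.group(2)), match.group(3)
    return name, total - shared


print(private_mb("4000 (3000) reflinkedfile"))
# ('reflinkedfile', 1000)
```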
The deletion itself is a multi-step process. Once the hard link count falls to zero, the inode is moved to the orphan_dir system directory where it remains until the last process, across the cluster, stops using the inode. Then the file system frees the extents and adds the freed space count to the truncate_log system file where it remains until the next sync. The freed space is made visible to the user only after that sync.
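The stages of that process can be written down as a small decision function. This is only a model of the description above, with hypothetical names; it does not inspect a real file system.

```python
def inode_stage(link_count, users_in_cluster, synced_since_free):
    """Return where an inode stands in the deletion process described above.

    link_count        -- number of directory entries pointing at the inode
    users_in_cluster  -- processes on any node that still have it in use
    synced_since_free -- True once a sync has run after the extents were freed
    """
    if link_count > 0:
        return "linked: cannot be deleted yet"
    if users_in_cluster > 0:
        return "in orphan_dir: waiting for the last user in the cluster"
    if not synced_since_free:
        return "extents freed: space held in truncate_log until the next sync"
    return "deleted: freed space visible to the user"


print(inode_stage(1, 0, False))
print(inode_stage(0, 2, False))
print(inode_stage(0, 0, False))
print(inode_stage(0, 0, True))
```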
A hot cache stat(2), on the other hand, has been shown to perform on OCFS2 like it does on EXT3.
In other words, the second ls(1) will be quicker than the first. However, it is not guaranteed. Say you have a million files in a file system and not enough kernel memory to cache all the inodes. In that case, each ls(1) will involve some cold cache stat(2)s.
$ for i in $(seq 0 99);
> do
>   for j in $(seq 4);
>   do
>     dd if=/dev/zero of=file$j bs=4K count=1 seek=$i;
>   done;
> done;
When run on a system running Linux kernel 2.6.34 or earlier, we end up with files with 100 extents each. That is full fragmentation. As the files are being extended one after another, the on-disk allocations are fully interleaved.
$ filefrag file1 file2 file3 file4
file1: 100 extents found
file2: 100 extents found
file3: 100 extents found
file4: 100 extents found
When run on a system running Linux kernel 2.6.35 or later, we see files with 7 extents each. That is a lot fewer than before. Fewer extents mean more on-disk contiguity and that always leads to better overall performance.
$ filefrag file1 file2 file3 file4
file1: 7 extents found
file2: 7 extents found
file3: 7 extents found
file4: 7 extents found
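The effect of a reservation window can be reproduced with a toy allocator. The sketch below is not the OCFS2 allocator: it hands out blocks from a single "next free" pointer, and the window size of 16 blocks is an arbitrary value chosen for this model. It only shows why interleaved appends fragment fully without a window and far less with one.

```python
def count_extents(blocks):
    """Number of runs of physically contiguous blocks."""
    return sum(1 for i, b in enumerate(blocks)
               if i == 0 or b != blocks[i - 1] + 1)


def simulate(files=4, appends=100, window=1):
    """Append one block to each file in turn, as the shell loop above does.

    window=1 models no reservation. A larger window models a per-file
    reservation. Unlike the real feature, this toy never hands unused
    reserved blocks back to other files.
    """
    next_free = 0
    layout = [[] for _ in range(files)]
    reserved = [(0, 0)] * files  # (next block, end of window) per file
    for _ in range(appends):
        for f in range(files):
            cur, end = reserved[f]
            if cur == end:  # window exhausted: reserve a new one
                cur, end = next_free, next_free + window
                next_free = end
            layout[f].append(cur)
            reserved[f] = (cur + 1, end)
    return [count_extents(blocks) for blocks in layout]


print(simulate(window=1))   # [100, 100, 100, 100]
print(simulate(window=16))  # [7, 7, 7, 7]
```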
$ ls -l
total 5120000
-rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:15 myfile
$ du -m myfile*
5000    myfile
$ df -h .
Filesystem  Size  Used  Avail  Use%  Mounted on
/dev/sdd1   50G   8.2G  42G    17%   /ocfs2
If we were to reflink it 4 times, we would expect the directory listing to report five 5GB files, but the df(1) to report no loss of available space. du(1), on the other hand, would report the disk usage to climb to 25GB.
$ reflink myfile myfile-ref1
$ reflink myfile myfile-ref2
$ reflink myfile myfile-ref3
$ reflink myfile myfile-ref4
$ ls -l
total 25600000
-rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:15 myfile
-rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:16 myfile-ref1
-rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:16 myfile-ref2
-rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:16 myfile-ref3
-rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:16 myfile-ref4
$ df -h .
Filesystem  Size  Used  Avail  Use%  Mounted on
/dev/sdd1   50G   8.2G  42G    17%   /ocfs2
$ du -m myfile*
5000    myfile
5000    myfile-ref1
5000    myfile-ref2
5000    myfile-ref3
5000    myfile-ref4
25000   total
Enter shared-du(1), a shared extent-aware du. This utility reports the shared extents per file in parentheses and the overall footprint. As expected, it lists the overall footprint at 5GB. One can view the details of the extents using shared-filefrag(1). Both these utilities are available at http://oss.oracle.com/~smushran/reflink-tools/. We are currently in the process of pushing the changes to the upstream maintainers of these utilities.
$ shared-du -m -c --shared-size myfile*
5000    (5000)  myfile
5000    (5000)  myfile-ref1
5000    (5000)  myfile-ref2
5000    (5000)  myfile-ref3
5000    (5000)  myfile-ref4
25000 total
5000 footprint
# shared-filefrag -v myfile
Filesystem type is: 7461636f
File size of myfile is 5242880000 (1280000 blocks, blocksize 4096)
ext  logical  physical  expected  length  flags
  0        0   2247937              8448
  1     8448   2257921   2256384   30720
  2    39168   2290177   2288640   30720
  3    69888   2322433   2320896   30720
  4   100608   2354689   2353152   30720
  7   192768   2451457   2449920   30720
 . . .
 37  1073408   2032129   2030592   30720  shared
 38  1104128   2064385   2062848   30720  shared
 39  1134848   2096641   2095104   30720  shared
 40  1165568   2128897   2127360   30720  shared
 41  1196288   2161153   2159616   30720  shared
 42  1227008   2193409   2191872   30720  shared
 43  1257728   2225665   2224128   22272  shared,eof
myfile: 44 extents found
A simple test to check the data coherency of a shared file system involves concurrently appending to the same file, for example by running "uname -a >>/dir/file" using a parallel distributed shell like dsh or pconsole. If coherent, the file will contain the results from all nodes.
# dsh -R ssh -w node32,node33,node34,node35 "uname -a >> /ocfs2/test"
# cat /ocfs2/test
Linux node32 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
Linux node35 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
Linux node33 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
Linux node34 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
OCFS2 is a fully cache coherent cluster file system.
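With more than a handful of nodes, checking the result of that append test by eye becomes tedious. The sketch below assumes every node appended one line of "uname -a" output, whose second field is the host name; the function name `missing_nodes` is hypothetical.

```python
def missing_nodes(expected, file_text):
    """Return the expected host names that left no line in the test file."""
    seen = set()
    for line in file_text.splitlines():
        fields = line.split()
        # "uname -a" on Linux prints: Linux <hostname> <kernel release> ...
        if len(fields) >= 2 and fields[0] == "Linux":
            seen.add(fields[1])
    return sorted(set(expected) - seen)


sample = """\
Linux node32 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 GNU/Linux
Linux node35 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 GNU/Linux
Linux node33 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 GNU/Linux
"""
print(missing_nodes(["node32", "node33", "node34", "node35"], sample))
# ['node34']
```

An empty list means every node's write is present in the file.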
However, this dynamic allocation has been problematic when the free space is very fragmented, because the file system required the inode and extent allocators to grow in contiguous fixed-size chunks.
The discontiguous block group feature takes care of this problem by allowing the allocators to grow in smaller, variable-sized chunks.
This feature was added in Linux kernel 2.6.35 and requires enabling on-disk feature discontig-bg.
Backup super blocks are copies of the super block. These blocks are dispersed in the volume to minimize the chances of being overwritten. On the small chance that the original gets corrupted, the backups are available to scan and fix the corruption.
mkfs.ocfs2(8) enables this feature by default. Users can disable this by specifying --fs-features=nobackup-super during format.
o2info(1) can be used to view whether the feature has been enabled on a device.
# o2info --fs-features /dev/sdb1
backup-super strict-journal-super sparse extended-slotmap inline-data xattr indexed-dirs refcount discontig-bg clusterinfo unwritten
In OCFS2, the super block is on the third block. The backups are located at the 1G, 4G, 16G, 64G, 256G and 1T byte offsets. The actual number of backup blocks depends on the size of the device. The super block is not backed up on devices smaller than 1GB.
fsck.ocfs2(8) refers to these six offsets by numbers, 1 to 6. Users can specify any backup with the -r option to recover the volume. The example below uses the second backup. If successful, fsck.ocfs2(8) overwrites the corrupted super block with the backup.
# fsck.ocfs2 -f -r 2 /dev/sdb1
fsck.ocfs2 1.8.0
[RECOVER_BACKUP_SUPERBLOCK] Recover superblock information from backup block#1048576? <n> y
Checking OCFS2 filesystem in /dev/sdb1:
  Label:              webhome
  UUID:               B3E021A2A12B4D0EB08E9E986CDC7947
  Number of blocks:   13107196
  Block size:         4096
  Number of clusters: 13107196
  Cluster size:       4096
  Number of slots:    8
/dev/sdb1 was run with -f, check forced.
Pass 0a: Checking cluster allocation chains
Pass 0b: Checking inode allocation chains
Pass 0c: Checking extent block allocation chains
Pass 1: Checking inodes and blocks.
Pass 2: Checking directory entries.
Pass 3: Checking directory connectivity.
Pass 4a: checking for orphaned inodes
Pass 4b: Checking inodes link counts.
All passes succeeded.
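The block number in the fsck.ocfs2(8) prompt follows from the byte offsets listed earlier: the second backup sits at 4G, and 4G divided by a 4096-byte block size is block 1048576. The sketch below computes the backup number and block for a given device. It assumes a backup exists only when its offset lies strictly inside the device, and the function name `backup_super_blocks` is hypothetical.

```python
GIB = 1 << 30

# Byte offsets of backups 1 to 6: 1G, 4G, 16G, 64G, 256G and 1T.
BACKUP_OFFSETS = [1 * GIB, 4 * GIB, 16 * GIB, 64 * GIB, 256 * GIB, 1024 * GIB]


def backup_super_blocks(device_bytes, block_size):
    """Return [(backup number, block number)] for the backups that fit."""
    result = []
    for number, offset in enumerate(BACKUP_OFFSETS, start=1):
        if offset >= device_bytes:  # assumption: offset must be on the device
            break
        result.append((number, offset // block_size))
    return result


# The volume checked above: 13107196 blocks of 4096 bytes, roughly 50GB.
print(backup_super_blocks(13107196 * 4096, 4096))
# [(1, 262144), (2, 1048576), (3, 4194304)]
```

On this volume only the first three backups exist, and backup 2 is the block that fsck.ocfs2(8) offered to recover from.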
A better model is one where all nodes manage a subset of the lock resources. Each node maintains enough information for all the lock resources it is interested in. On event of a node death, the remaining nodes pool in the information to reconstruct the lock state maintained by the dead node. In this scheme, the locking overhead is distributed amongst all the nodes. Hence, the term distributed lock manager.
O2DLM is a distributed lock manager. It is based on the specification titled "Programming Locking Application" written by Kristin Thomas and is available at the following link. http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf
# cat /sys/kernel/debug/o2dlm/45F81E3B6F2B48CCAAD1AE7945AB2001/dlm_state
Domain: 45F81E3B6F2B48CCAAD1AE7945AB2001  Key: 0x10748e61
Thread Pid: 24542  Node: 7  State: JOINED
Number of Joins: 1  Joining Node: 255
Domain Map: 7 31 33 34 40 50
Live Map: 7 31 33 34 40 50
Lock Resources: 48850 (439879)
MLEs: 0 (1428625)
  Blocking: 0 (1066000)
  Mastery: 0 (362625)
  Migration: 0 (0)
Lists: Dirty=Empty  Purge=Empty  PendingASTs=Empty  PendingBASTs=Empty
Purge Count: 0  Refs: 1
Dead Node: 12
Recovery Pid: 24543  Master: 7  State: ACTIVE
Recovery Map: 12 32 35
Recovery Node State:
        7 - DONE
        31 - DONE
        33 - DONE
        34 - DONE
        40 - DONE
        50 - DONE
The figure below shows the state of a dlm lock resource that is mastered (owned) by node 25, with 6 locks in the granted queue and node 26 holding the EX (writelock) lock on that resource.
# debugfs.ocfs2 -R "dlm_locks M000000000000000022d63c00000000" /dev/sda1
Lockres: M000000000000000022d63c00000000  Owner: 25  State: 0x0
Last Used: 0  ASTs Reserved: 0  Inflight: 0  Migration Pending: No
Refs: 8  Locks: 6  On Lists: None
Reference Map: 26 27 28 94 95
 Lock-Queue  Node  Level  Conv  Cookie      Refs  AST  BAST  Pending-Action
 Granted     94    NL     -1    94:3169409  2     No   No    None
 Granted     28    NL     -1    28:3213591  2     No   No    None
 Granted     27    NL     -1    27:3216832  2     No   No    None
 Granted     95    NL     -1    95:3178429  2     No   No    None
 Granted     25    NL     -1    25:3513994  2     No   No    None
 Granted     26    EX     -1    26:3512906  2     No   No    None
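When many lock resources have to be inspected, the owner and the current EX holder can be pulled out of this output with a short script. The sketch below assumes the column layout shown in the dlm_locks output above (queue, node, level, conv, cookie); the function name `summarize_lockres` is hypothetical, and the sample is an abbreviated copy of that output.

```python
import re

# queue name, node number, lock level, conv, cookie (node:sequence)
LOCK_RE = re.compile(r"([A-Za-z]+)\s+(\d+)\s+(NL|PR|EX)\s+(-?\d+)\s+(\d+:\d+)")
OWNER_RE = re.compile(r"Owner:\s*(\d+)")


def summarize_lockres(text):
    """Return the master node, the nodes granted EX, and the lock count."""
    owner = OWNER_RE.search(text)
    locks = [(queue, int(node), level)
             for queue, node, level, _conv, _cookie in LOCK_RE.findall(text)]
    return {
        "owner": int(owner.group(1)) if owner else None,
        "ex_holders": [node for queue, node, level in locks
                       if queue == "Granted" and level == "EX"],
        "lock_count": len(locks),
    }


sample = """\
Lockres: M000000000000000022d63c00000000  Owner: 25  State: 0x0
 Lock-Queue  Node  Level  Conv  Cookie      Refs  AST  BAST  Pending-Action
 Granted     25    NL     -1    25:3513994  2     No   No    None
 Granted     26    EX     -1    26:3512906  2     No   No    None
"""
print(summarize_lockres(sample))
# {'owner': 25, 'ex_holders': [26], 'lock_count': 2}
```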
The figure below shows a lock from the file system perspective. Specifically, it shows a lock that is in the process of being upconverted from NL to EX. Locks in this state are referred to in the file system as busy locks and can be listed using the debugfs.ocfs2 command, "fs_locks -B".
# debugfs.ocfs2 -R "fs_locks -B" /dev/sda1
Lockres: M000000000000000000000b9aba12ec  Mode: No Lock
Flags: Initialized Attached Busy
RO Holders: 0  EX Holders: 0
Pending Action: Convert  Pending Unlock Action: None
Requested Mode: Exclusive  Blocking Mode: No Lock
PR > Gets: 0  Fails: 0  Waits Total: 0us  Max: 0us  Avg: 0ns
EX > Gets: 1  Fails: 0  Waits Total: 544us  Max: 544us  Avg: 544185ns
Disk Refreshes: 1
With this debugging infrastructure in place, users can debug hang issues as follows:
* Dump the busy fs locks for all the OCFS2 volumes on the node with hanging processes. If no locks are found, then the problem is not related to O2DLM.
* Dump the corresponding dlm lock for all the busy fs locks. Note down the owner (master) of all the locks.
* Dump the dlm locks on the master node for each lock.
At this stage, one should note that the hanging node is waiting to get an AST from the master. The master, on the other hand, cannot send the AST until the current holder has down converted that lock, which it will do upon receiving a Blocking AST. However, a node can only down convert if all the lock holders have stopped using that lock. After dumping the dlm lock on the master node, identify the current lock holder and dump both the dlm and fs locks on that node.
The trick here is to see whether the Blocking AST message has been relayed to the file system. If not, the problem is in the dlm layer. If it has, then the most common reason is that a process on that node is still holding the lock. The holder counts (RO Holders and EX Holders) are maintained in the fs lock.
At this stage, printing the list of processes helps.
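Checking the holder counts by eye gets tedious when a node has many locks. The following is a minimal sketch that reads fs_locks output and prints only the locks that still have holders. The helper name held_locks is hypothetical, and the sketch assumes the "RO Holders:" and "EX Holders:" labels appear as in the sample fs_locks output above, with the RO count printed before the EX count.

```shell
#!/bin/sh
# Hypothetical helper: read "fs_locks" output on stdin and print each
# lock that still has holders, as: <lockres> <RO holders> <EX holders>.
held_locks() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i == "Lockres:") name = $(i + 1)
            if ($(i + 1) != "Holders:") continue
            if ($i == "RO") ro = $(i + 2)
            if ($i == "EX") {
                ex = $(i + 2)
                if (ro + ex > 0) print name, ro, ex
            }
        }
    }'
}

# Example use on the node holding the lock (device path is an example):
#   debugfs.ocfs2 -R "fs_locks" /dev/sda1 | held_locks
```

A lock listed here with a non-zero count cannot be down converted until those holders release it.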
$ ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN
Make a note of all D state processes. At least one of them is responsible for the hang on the first node.
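On a busy node the process list is long. The following is a minimal sketch that keeps only the header and the D state processes. The helper name d_state_only is hypothetical, and the sketch assumes the STAT value is the second column, as it is with the ps command above.

```shell
#!/bin/sh
# Hypothetical helper: read "ps" output on stdin and keep the header
# line plus every process whose STAT column starts with D
# (uninterruptible sleep).
d_state_only() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Example use:
#   ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN | d_state_only
```

The wchan column of the remaining lines shows where in the kernel each process is sleeping.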
The challenge then is to figure out why those processes are hanging. Failing that, at least get enough information (like alt-sysrq t output) for the kernel developers to review. What to do next depends on where the process is hanging. If it is waiting for the I/O to complete, the problem could be anywhere in the I/O subsystem, from the block device layer through the drivers to the disk array. If the hang concerns a user lock (flock(2)), the problem could be in the user’s application. A possible solution could be to kill the holder. If the hang is due to tight or fragmented memory, free up some memory by killing non-essential processes.
The thing to note is that the symptom of the problem appeared on one node but the cause lies on another. The issue can only be resolved on the node holding the lock. Sometimes, the best solution will be to reset that node. Once that node is reset, the O2DLM recovery process will clear all locks owned by the dead node and let the cluster continue to operate. As harsh as that sounds, at times it is the only solution. The good news is that, by following the trail, you now have enough information to file a bug and get the real issue resolved.
If the version of the Linux kernel on the system exporting the volume is older than 2.6.30, then the NFS clients must mount the volumes using the nordirplus mount option. This disables the READDIRPLUS RPC call to work around a bug in NFSD, detailed in the following link:
http://oss.oracle.com/pipermail/ocfs2-announce/2008-June/000025.html
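The version check can be scripted on the exporting server. The following is a minimal sketch: the helper name older_than_2_6_30 is hypothetical, it assumes a kernel release string of the form major.minor.patch with an optional suffix after a dash, and the server name and paths in the comments are examples only.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if the kernel release
# given as $1 is older than 2.6.30. Anything after the first "-" in
# the release string is ignored.
older_than_2_6_30() {
    awk -v rel="${1%%-*}" 'BEGIN {
        split(rel, v, ".")
        exit !((v[1] * 1000000 + v[2] * 1000 + v[3]) < 2006030)
    }'
}

# Example use on the NFS server:
#   older_than_2_6_30 "$(uname -r)" && echo "clients need nordirplus"
# Example mount on an NFS client:
#   mount -t nfs -o nordirplus server:/export /mnt
```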
Users running NFS version 2 can export the volume after having disabled subtree checking (export option no_subtree_check). Be warned, disabling the check has security implications (documented in the exports(5) man page) that users must evaluate on their own.
To list the system directory (referred to as double-slash), do:
# debugfs.ocfs2 -R "ls -l //" /dev/sde1
    66  drwxr-xr-x  10  0  0         3896  19-Jul-2011 13:36  .
    66  drwxr-xr-x  10  0  0         3896  19-Jul-2011 13:36  ..
    67  -rw-r--r--   1  0  0            0  19-Jul-2011 13:36  bad_blocks
    68  -rw-r--r--   1  0  0      1179648  19-Jul-2011 13:36  global_inode_alloc
    69  -rw-r--r--   1  0  0         4096  19-Jul-2011 14:35  slot_map
    70  -rw-r--r--   1  0  0      1048576  19-Jul-2011 13:36  heartbeat
    71  -rw-r--r--   1  0  0  53686960128  19-Jul-2011 13:36  global_bitmap
    72  drwxr-xr-x   2  0  0         3896  25-Jul-2011 15:05  orphan_dir:0000
    73  drwxr-xr-x   2  0  0         3896  19-Jul-2011 13:36  orphan_dir:0001
    74  -rw-r--r--   1  0  0      8388608  19-Jul-2011 13:36  extent_alloc:0000
    75  -rw-r--r--   1  0  0      8388608  19-Jul-2011 13:36  extent_alloc:0001
    76  -rw-r--r--   1  0  0    121634816  19-Jul-2011 13:36  inode_alloc:0000
    77  -rw-r--r--   1  0  0            0  19-Jul-2011 13:36  inode_alloc:0001
    78  -rw-r--r--   1  0  0    268435456  19-Jul-2011 13:36  journal:0000
    79  -rw-r--r--   1  0  0    268435456  19-Jul-2011 13:37  journal:0001
    80  -rw-r--r--   1  0  0            0  19-Jul-2011 13:36  local_alloc:0000
    81  -rw-r--r--   1  0  0            0  19-Jul-2011 13:36  local_alloc:0001
    82  -rw-r--r--   1  0  0            0  19-Jul-2011 13:36  truncate_log:0000
    83  -rw-r--r--   1  0  0            0  19-Jul-2011 13:36  truncate_log:0001
The file names that end with numbers are slot specific and are referred to as node-local system files. The set of node-local files used by a node can be determined from the slot map. To list the slot map, do:
# debugfs.ocfs2 -R "slotmap" /dev/sde1
    Slot#   Node#
        0      32
        1      35
        2      40
        3      31
        4      34
        5      33
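The lookup from node number to node-local file names can be sketched as follows. This is an illustration only: the helper names slot_for_node and node_local_files are hypothetical, the list of base names is taken from the system directory listing above, and the four-digit slot suffix is inferred from that same listing.

```shell
#!/bin/sh
# Hypothetical helper: print the slot number assigned to node $1,
# given "slotmap" output on stdin.
slot_for_node() {
    awk -v node="$1" '$1 ~ /^[0-9]+$/ && $2 == node { print $1; exit }'
}

# Hypothetical helper: print the node-local system file names for
# slot $1. The slot number is zero-padded to four digits, as in the
# listing of the system directory.
node_local_files() {
    for base in orphan_dir extent_alloc inode_alloc journal \
                local_alloc truncate_log; do
        printf '%s:%04d\n' "$base" "$1"
    done
}

# Example use (device path and node number are examples):
#   slot=$(debugfs.ocfs2 -R "slotmap" /dev/sde1 | slot_for_node 35)
#   node_local_files "$slot"
```

With the slot map above, node 35 occupies slot 1, so its journal is journal:0001.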
For more information, refer to the OCFS2 support guides available in the Documentation section at http://oss.oracle.com/projects/ocfs2.
o2hb is the disk heartbeat component of o2cb. It periodically updates a timestamp on disk, indicating to others that this node is alive. It also reads all the timestamps to identify other live nodes. Other cluster components, like o2dlm and o2net, use the o2hb service to get node up and down events.
The quorum is the group of nodes in a cluster that is allowed to operate on the shared storage. When there is a failure in the cluster, nodes may be split into groups that can communicate within their own group and with the shared storage but not with other groups. o2quo determines which group is allowed to continue and initiates fencing of the other group(s).
Fencing is the act of forcefully removing a node from a cluster. A node with OCFS2 mounted will fence itself when it realizes that it does not have quorum in a degraded cluster. It does this so that other nodes won’t be stuck trying to access its resources.
o2cb uses a machine reset to fence. This is the quickest route for the node to rejoin the cluster.
We are currently looking to add features like transparent compression, transparent encryption, delayed allocation, multi-device support, etc. as well as work on improving performance on newer generation machines.
If you are interested in contributing, email the development team at ocfs2-devel@oss.oracle.com.
The principal developers of the OCFS2 file system, its tools and the O2CB cluster stack, are Joel Becker, Zach Brown, Mark Fasheh, Jan Kara, Kurt Hackel, Tao Ma, Sunil Mushran, Tiger Yang and Tristan Ye.
Other developers who have contributed to the file system via bug fixes, testing, etc. are Wim Coekaerts, Srinivas Eeda, Coly Li, Jeff Mahoney, Marcos Matsunaga, Goldwyn Rodrigues, Manish Singh and Wengang Wang.
Also acknowledged are the members of the Linux Cluster community, including Andrew Beekhof, Lars Marowsky-Bree, Fabio Massimo Di Nitto and David Teigland; the members of the Linux File system community, including Christoph Hellwig and Chris Mason; and the corporations that have contributed resources for this project, including Oracle, SUSE Labs, EMC, Emulex, HP, IBM, Intel and Network Appliance.