A replicated filesystem proposal for FreeBSD

FreeBSD is a favourite operating system of mine. I’ve been using it for several years, starting with 5.0. One of the things that was new with 5.0 was GEOM – a disk infrastructure which is more flexible than the previous system.

The number of geom providers has expanded over the years, to about 20 today, providing striping, mirroring, encryption, multipathing and plenty of other things (some without man pages, so I don’t know what they do).

The ability to use different geoms together means that we can combine them to have (for example) encrypted striped mirrors.

Mirroring (RAID1) is commonly used to protect against disk failure, but only on a single machine. Gianpaolo Del Matto has been ambitious and combined gmirror and ggate (which allows you to access devices on another machine) to mirror a filesystem between two systems. However in the event of a device failure or a network problem, the affected device needs to be removed and reinserted into the mirror. During the rebuild process the mirror is useless. It is also only suitable for fast interconnecting networks. If you have a slow network connection between the two servers, and a very large amount of data, a network interruption will require a rebuild that may take a considerable amount of time.

So what if our two copies are on opposite sides of the world? What if the interconnecting network is slow? What if a large number of writes are made all in one go?

As far as I can see, there’s nothing for FreeBSD that provides the ability to have filesystem replication between geographically separate servers (rsync scripts and so on don’t provide instant updates). There is an option that performs this function for AIX, HP-UX, Solaris, Red Hat Enterprise Linux, SuSE Linux and even Windows: Symantec Veritas Volume Replicator (also known as VVR). Unsurprisingly it costs money. Quite a lot of it.

It seems that FreeBSD may have an advantage over other operating systems if it could replicate (at least the core functionality of) VVR. And given that the geom framework exists, along with gmirror, gjournal and ggate, it seems that (relatively) it wouldn’t be too hard to add that.

Terminology

Note that in this article, write-order fidelity means that writes are applied to the slave in the same order as they were on the master. The slave being consistent means that write-order fidelity has been maintained, so the slave represents the master as it was at some point in time. If write-order fidelity is not maintained, filesystem corruption and data loss may occur.

Synchronous/asynchronous refers to the replication mode between the master and the slave. It is also commonly used to describe how data is written to disk, but for this article I’ll only use it to describe the replication mode.

What’s so good about VVR?

One of VVR’s chief advantages over gmirror and DRBD (Distributed Replicated Block Device for Linux) is that it can replicate asynchronously, using a log so that network interruptions don’t require a mirror rebuild, and it can then maintain write-order fidelity. That’s not just asynchronous as in delayed by milliseconds, seconds or even minutes, but potentially hours or days. By allowing write-order fidelity to be maintained wherever possible, the slave can be used if the master is destroyed or lost. As soon as you lose write-order fidelity, there are no guarantees that the data on the slave is any use at all.

Why might we want this?

What applications might this be used for? There are plenty of examples. One is a web application which uses both a database and file uploads. If you were using a combination of hot standby for your database, and rsync for your uploaded files, then you might end up with a situation where your standby database references a file which doesn’t exist yet on your standby server, or you have an orphaned file which is not referenced by the database. By using replication which maintains write-order fidelity, the database files and uploaded files could be replicated to another webserver located in a different datacentre. If the main datacentre goes up in smoke, switch the web application to the second datacentre, and with a change of DNS you’ve got global High Availability.

For those who replicate data on a SAN using SAN-based replication, the ability to have the OS replicate data providers removes the need to purchase expensive licenses, and removes lock-in to hardware vendors. (By the time you’ve paid for two disk arrays at each end of a link, plus the replication license, costs can quickly mount up.)

VVR is typically used with clustering (such as with Veritas Cluster Server), and while HA clustering might work differently (using jails for example), VVR-like functionality would remove a potential obstacle.

How can we do this with FreeBSD (or another OS)?

If a new geom provider were to be created for this purpose (called for the sake of argument grmirror), then some of the functionality (and presumably code) from gjournal and ggate could be reused. ggate already has a network daemon (ggated), and a client application (ggatec), so the ability to send data over a network is already present in FreeBSD. gjournal already has the concept of data journalling and a separate journal which maintains data consistency.

Let’s have a look at how gjournal works. This section is based on the RELENG_7 source code (which I may or may not interpret correctly) and posts to the freebsd-geom mailing list by Pawel Jakub Dawidek, who wrote gjournal, ggate and gmirror.

gjournal dissection

Before gjournal can be used, a gjournal device must be created. If you label a single consumer (such as a bsdlabel slice), gjournal will create a journal data segment at the end of the consumer (1GB unless you explicitly specify the size), and use the rest of the consumer as the provider, with a geom label at the end. If you pass two consumers as arguments to gjournal, the entire second consumer is used for the journal data, and the entire first consumer is the data provider. Again, each consumer has a geom label at the end.

The geom label contains the usual geom data (magic value, version number, journal unique ID, provider type, provider (if hardcoded), provider’s size, MD5 hash), plus some gjournal specific metadata (the start and end offsets of the journal, the last known consistent journal offset, the last known consistent journal ID, journal flags). (From g_journal_metadata in sys/geom/journal/g_journal.h)

When gjournal is running, a journal is created, and a header is written which contains the journal’s ID, and the ID of the next journal. The IDs are randomly generated. For each write, a record header is added, which contains a record header ID, the number of entries, then each entry, with its offsets and length. The size of a journal is limited by how long it will remain open (10 seconds by default), how much of the journal data provider/segment it may fill (70% by default), and how many record entries we will allow in a single journal (20 by default). The journal keeps track of how much of the journal provider is in use, and if the provider overflows, it will panic the system. (From g_journal_check_overflow in sys/geom/journal/g_journal.c)
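The three limits above can be sketched as a single test. This is only a userland model in Python, not gjournal's code: the names ActiveJournal and should_switch are made up for illustration, and the defaults are simply the figures quoted in the paragraph.

```python
from dataclasses import dataclass

# Defaults quoted in the text; the constant names are illustrative only.
SWITCH_TIME_SECS = 10
MAX_FILL_PERCENT = 70
MAX_RECORD_ENTRIES = 20


@dataclass
class ActiveJournal:
    opened_at: float      # seconds, from any monotonic clock
    bytes_used: int       # bytes written to this journal so far
    record_entries: int   # record entries written so far


def should_switch(journal: ActiveJournal, now: float, journal_size: int) -> bool:
    """Return True as soon as any one of the three limits is reached."""
    too_old = now - journal.opened_at >= SWITCH_TIME_SECS
    too_full = journal.bytes_used * 100 >= journal_size * MAX_FILL_PERCENT
    too_many = journal.record_entries >= MAX_RECORD_ENTRIES
    return too_old or too_full or too_many
```

Whichever limit trips first closes the journal, so a quiet system switches on the timer and a busy one switches on fill level or entry count.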

When a journal is closed (because it’s been open long enough, filled enough of the journal provider or had enough writes), then its records are added to a queue to be flushed to disk. When this happens, metadata is updated to indicate copying has started. If optimisation is enabled, the journal data is optimised, then the journal data is sent to the data provider. When the copy has finished, the metadata is updated to indicate no copy is in progress, and the journal_id and offset of the successfully committed journal are stored in the metadata. A second journal is started after the end of the closed journal.

Writes are optimised by only writing the last write to a block if there are multiple writes to the same block, combining neighbouring writes into a single write, and reordering the writes within the journal into the correct order. (From g_journal_insert and g_journal_optimize in sys/geom/journal/g_journal.c)
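To make those three optimisations concrete, here is a simplified Python model. It is not a translation of g_journal_optimize: it assumes every write is aligned to a fixed block size (BLOCK is my assumption), and the function name optimise is hypothetical.

```python
BLOCK = 512  # assumed block size in bytes


def optimise(writes):
    """Collapse a journal's writes.

    writes is a list of (offset, data) tuples in arrival order, with
    offsets and lengths that are multiples of BLOCK.
    """
    latest = {}
    for offset, data in writes:
        assert offset % BLOCK == 0 and len(data) % BLOCK == 0
        for i in range(0, len(data), BLOCK):
            # A later write to the same block replaces the earlier one.
            latest[offset + i] = data[i:i + BLOCK]

    merged = []
    for offset in sorted(latest):            # reorder by offset
        if merged and merged[-1][0] + len(merged[-1][1]) == offset:
            # Neighbouring blocks are combined into a single write.
            prev_offset, prev_data = merged[-1]
            merged[-1] = (prev_offset, prev_data + latest[offset])
        else:
            merged.append((offset, latest[offset]))
    return merged
```

Because the journal is the unit of commit, reordering inside a single journal does not harm write-order fidelity between journals.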

When a journal device is started, if the metadata indicates that a copy was in progress, then the journal is reread and its records are added to the queue to be flushed to disk. If this cannot be done, the journal is reinitialised and marked as dirty.

When reads are requested from the journalled device, some cleverness is done to check first for the data in the cache, then the journal, then the disk.

How could gjournal be modified to suit our needs?

So now we have a better understanding of how gjournal works, how could we modify it to support replication? Let’s try and keep it as simple as we can.

The first thing is that we want to modify the gjournal geom as little as possible. We want the kernel to just have the stuff for reading and writing data to and from a device. All the tricky stuff should live in userland. Doing this not only keeps the kernel smaller and simpler, but also allows changes from gjournal to be merged back in more easily.

As gjournal uses the journal as the unit of commit (either all the writes in a journal are completed, or none of them are considered completed), it would make sense to use this as the unit of replication (either all the writes in a journal are replicated, or none of them are considered replicated). When gjournal has written all the records from one journal to disk, the metadata is updated to reflect the new last known consistent journal (md_joffset and md_jid). We could copy this so that when the records in a journal have been successfully replicated, we update the metadata items for the replicated journal’s ID and offset on the master (md_rjoffset and md_rjid). This information relates to the filesystem itself, so it can be added to the metadata. The two pieces of metadata will track the replication of journals in the same way gjournal uses jid and joffset to track the writing of data to the data provider.

Let’s modify the journal device creation process slightly, so that as well as creating an area for the journal and storing the starting and ending offsets in the metadata, we also reserve a small space (say 1MB by default) in the journal provider for use as a Data Change Map. We also need to store its start and end offsets in the metadata. We will also add some extra fields to the metadata to track replication.

gjournal already monitors the usage of the journal provider, both to know when to switch journals, and to panic the system if the journal overflows. It checks whether to panic by calculating if the active journal is overwriting the inactive journal. This check is basically a test of whether we are writing over the md_joffset, the offset of the last journal. When gjournal checks for journal overflow, it could also check whether the active journal is overwriting the first unreplicated journal, and if so, we perform the action for a replication overflow. A journal overflow is a big deal, so it panics the kernel. A replication overflow is not such a big deal, so instead we can track whether or not the replica has overflowed in the metadata. This flag could possibly be stored in md_flags.
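The two checks could be modelled like this. To keep the arithmetic obvious, this sketch uses logical byte counters that only ever grow rather than offsets that wrap around the journal area, which is a simplification of mine and not how gjournal stores them. The names Overflow and check_overflow are hypothetical.

```python
from enum import Enum


class Overflow(Enum):
    NONE = 0
    REPLICATION = 1   # set a flag in the metadata and carry on
    JOURNAL = 2       # gjournal panics the system here


def check_overflow(head, length, flushed, replicated, size):
    """Classify a write of `length` bytes at logical position `head`.

    flushed    -- start of the oldest journal not yet written to disk
    replicated -- start of the first unreplicated journal
    size       -- capacity of the journal area in bytes
    All positions are logical counters that never wrap.
    """
    end = head + length
    if end - flushed > size:
        return Overflow.JOURNAL
    if end - replicated > size:
        return Overflow.REPLICATION
    return Overflow.NONE
```

The journal check comes first because it is the fatal one; a replication overflow only means the slave can no longer catch up from the journal.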

This leaves us with the following metadata to be added to struct g_journal_metadata, with the necessary modifications to be applied to journal_metadata_encode, journal_metadata_decode_v1 (a new function based on journal_metadata_decode_v0) and journal_metadata_dump. We should also bump the metadata version number (md_version) up to 1.

Name	Data type	Description
md_dcmstart	uint64_t	The starting offset of the Data Change Map
md_dcmend	uint64_t	The ending offset of the Data Change Map
md_rjid	uint32_t	The ID of the journal last replicated to the slave
md_rjoffset	uint64_t	The offset of the journal last replicated to the slave
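As an illustration of what encoding and decoding the four new fields involves, here is a round trip using Python's struct module. The layout is an assumption for the sake of the example (the four fields on their own, little-endian, no padding); it is not the real gjournal on-disk format, and the class name ReplicationMetadata is made up.

```python
import struct
from dataclasses import dataclass

# Hypothetical layout: uint64, uint64, uint32, uint64, little-endian, unpadded.
_NEW_FIELDS = struct.Struct("<QQIQ")


@dataclass
class ReplicationMetadata:
    md_dcmstart: int   # starting offset of the Data Change Map
    md_dcmend: int     # ending offset of the Data Change Map
    md_rjid: int       # ID of the journal last replicated to the slave
    md_rjoffset: int   # offset of the journal last replicated to the slave

    def encode(self) -> bytes:
        return _NEW_FIELDS.pack(
            self.md_dcmstart, self.md_dcmend, self.md_rjid, self.md_rjoffset
        )

    @classmethod
    def decode(cls, raw: bytes) -> "ReplicationMetadata":
        return cls(*_NEW_FIELDS.unpack(raw))
```

The new fields add 28 bytes in this layout, which is the sort of detail the version bump exists to protect against.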

At this point we are tracking some additional metadata, but not doing much with it. We could add additional functions for the following:

Read next unreplicated journal. This reads the md_rjoffset and md_rjid metadata, finds the last-replicated journal, and reads its header to find the next journal. It then reads the next journal, returns it and retains its offset and id in memory.
Mark next journal as replicated. This updates the md_rjid and md_rjoffset metadata to the values held in memory.
Import journal. This allows an entire journal to be written to the journal provider in one go.
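The three operations above might fit together as follows. This is an in-memory Python model, with a dictionary standing in for the journal provider; the class and method names are hypothetical, and a real implementation would live in the kernel and work on disk offsets.

```python
from dataclasses import dataclass, field


@dataclass
class Journal:
    jid: int        # this journal's ID
    next_jid: int   # ID of the following journal, from the journal header
    offset: int     # where this journal starts in the journal provider
    records: list = field(default_factory=list)


class ReplicatedJournalProvider:
    def __init__(self, md_rjid, md_rjoffset):
        self.journals = {}               # jid -> Journal (stand-in for disk)
        self.md_rjid = md_rjid
        self.md_rjoffset = md_rjoffset
        self._pending = None             # (jid, offset) retained in memory

    def import_journal(self, journal):
        """Write a whole journal to the journal provider in one go."""
        self.journals[journal.jid] = journal

    def read_next_unreplicated(self):
        """Follow the last-replicated journal's header to the next journal."""
        last = self.journals[self.md_rjid]
        nxt = self.journals.get(last.next_jid)
        if nxt is None:
            return None                  # nothing new to replicate yet
        self._pending = (nxt.jid, nxt.offset)
        return nxt

    def mark_replicated(self):
        """Commit the retained ID and offset to the metadata."""
        if self._pending is None:
            raise RuntimeError("no journal has been read for replication")
        self.md_rjid, self.md_rjoffset = self._pending
        self._pending = None
```

Keeping the read and the mark as separate steps means the metadata only moves forward once the slave has acknowledged the journal.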

So if we have essentially copied gjournal, added some attributes for tracking replication, and created some functions to get data in to and out of the system, what happens with these?

The daemon

This is where (some of) the functionality of ggate is replicated. We could have a daemon (which we’ll call grjournald). When grjournald starts up, it reads a configuration file, telling it which geom IDs it should be replicating, and which IP addresses it should be replicating with. It then attempts to handshake with each of these peers.

In the event of having a single slave and a single master, the master then requests the next unreplicated journal, and sends it to the slave. The slave grjournald receives the data, and writes it to the journal provider (we don’t write directly to the data provider, otherwise the slave won’t be journalled). When all the records from that journal have been sent to the slave, the slave sends an acknowledgement, the last replicated journal ID is updated in the metadata on the master, and the data provider is updated on the slave, using the normal “update metadata to mark as dirty, write records to disk, update metadata to mark as clean” process.
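A toy version of that exchange, with both ends in one Python process and no real networking, might look like this. Master, Slave and their methods are hypothetical names; the point is only the ordering: journal provider first, acknowledgement, then the master's metadata, with the slave's data provider updated separately.

```python
class Slave:
    def __init__(self):
        self.journal_provider = []   # received journals awaiting flush
        self.data_provider = {}      # offset -> data
        self.dirty = False

    def receive(self, jid, records):
        # Land the journal in the journal provider, never directly in the
        # data provider, so the slave stays journalled.
        self.journal_provider.append((jid, records))
        return jid                   # the acknowledgement

    def flush(self):
        self.dirty = True            # mark metadata dirty
        for _, records in self.journal_provider:
            for offset, data in records:
                self.data_provider[offset] = data
        self.journal_provider.clear()
        self.dirty = False           # mark metadata clean


class Master:
    def __init__(self, journals):
        self.journals = list(journals)   # unreplicated (jid, records), oldest first
        self.md_rjid = None

    def replicate_one(self, slave):
        if not self.journals:
            return False
        jid, records = self.journals[0]
        if slave.receive(jid, records) != jid:
            return False                 # no acknowledgement: retry later
        self.md_rjid = jid               # only updated after the ack
        self.journals.pop(0)
        return True
```

If the acknowledgement never arrives, the journal stays at the head of the master's list and is simply sent again.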

So now we have a system where journals from the master are replicated to the slave, and provided there is one master and one slave, and the journal never overflows (i.e. replication is fast enough, and writes are slow enough) everything will work perfectly.

But what happens if the journal overflows? How did we decide who was the master and who was the slave? How do we make the master the slave and vice versa?

The daemon’s workflow

All the settings below will err on the side of caution. If we don’t have a setting telling us what to do, we will do nothing and wait for an administrator to tell us what to do. If it doesn’t make sense to do something, we will let you do it if you force us, but not otherwise.

When the daemon starts, it will read a configuration file (say /etc/grjournald.conf by default). This will contain the geom ID, which IP addresses should be involved in replication, and any settings that are appropriate. We check that the daemon, kernel and data versions are all in sync.

Provided we have a valid configuration file, we try to connect to the partner IP address, and talk to a daemon there. If we fail, we do nothing but carry on listening and optionally trying to reconnect periodically.

If we find another grjournald on the partner IP address, we initially handshake, checking the grmirror versions are the same at both ends.

So how do the daemons initially decide who is the master, and who is the slave? The default setting is that they will do nothing, but enter an administrative wait state, waiting for an administrator to issue a command to say which should be the master.

If they both claim to be the master, it’s likely that the slave has forcefully taken over the master role while the original master has been down or disconnected. In this case, we can go to an administrative wait state, unless the configuration file says to recover automatically.

A map for the autorecovery actions is:

Last known state ID	0	1	2	3	4	5	6
0	WAIT	WAIT**	WAIT**	WAIT**	WAIT**	WAIT**	WAIT**
1		Lastonline becomes master – normal replication	Resume normal replication	Lastonline becomes master*	Lastonline becomes master*	Lastonline becomes master*	Resume resync replication
2			WAIT**	Lastonline becomes master*	WAIT**	Resume resync replication	WAIT**
3				Lastonline becomes master – DCM replication	Resume DCM replication	Resume resync replication	Resume resync replication
4					WAIT**	Resume resync replication	WAIT**
5						Lastonline becomes master – resync replication	Resume resync replication
6							WAIT**

* = the replication mode chosen is the replication mode of the partner chosen as master (with the most recent last known master timestamp)
** = this state should not occur unless something has gone horribly wrong or things have been forced into inappropriate modes

If we need to automatically recover, and we are in a state where it is likely that the original slave has taken over the master role, we check the “last known master” metadata attribute, and the one with the most recent timestamp becomes the master, with the replication mode being determined by the new master. Note that this autorecovery will only work correctly if the original master has gone offline and not had any data modified. If the original master has had a failure and been restored from a backup, or been rebuilt, then a resync needs to be forced manually.

There is also a requirement that the systems will need synchronised clocks for the timestamps to be accurate, so that the autorecovery will correctly select the most recent master.
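The timestamp comparison itself is simple. In this Python sketch the Peer type and choose_master function are hypothetical, and returning None on a tie (meaning "wait for an administrator") is my assumption, in keeping with erring on the side of caution.

```python
from typing import NamedTuple, Optional


class Peer(NamedTuple):
    name: str
    last_known_master: float   # timestamp from the metadata
    replication_mode: str


def choose_master(a: Peer, b: Peer) -> Optional[Peer]:
    """Pick the peer that was most recently the master, or None on a tie."""
    if a.last_known_master == b.last_known_master:
        return None            # ambiguous: enter the administrative wait state
    return max(a, b, key=lambda peer: peer.last_known_master)
```

The replication mode then comes from whichever peer is returned, which is why clock skew between the two machines matters so much.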

Normal replication

As described above, the normal flow of data in normal replication is:

Journal entries are written to a second queue on the master
Journal entries are replicated from the master to the slave
When the slave has written all the entries in a journal to its journal provider, the metadata on the slave is updated with the ID and offset (in the master’s journal) of the next journal. The master is notified that the previous journal is complete.

At startup in normal replication, the last replicated journal ID and offset on the master are set to a NULL value. The master requests the ID and offset of the last replicated journal from the slave, and finds this journal and all subsequent journals, up to the inactive (last written to disk) journal. It adds the record entries for these journals to the replication queue.
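Building the replication queue at startup could be sketched like this in Python. The function name is hypothetical, and treating "the slave's journal is no longer on the master" as a separate None result is my assumption about how the overflow case would show up here.

```python
def journals_to_replicate(closed_journal_ids, slave_last_jid):
    """Return the journal IDs the slave still needs, oldest first.

    closed_journal_ids -- IDs of the closed journals still held on the
                          master, oldest first, ending with the inactive
                          (last written to disk) journal
    slave_last_jid     -- ID of the last journal the slave says it has
    """
    if slave_last_jid not in closed_journal_ids:
        # The journal the slave needs next has already been overwritten,
        # so normal replication cannot resume from the journal alone.
        return None
    start = closed_journal_ids.index(slave_last_jid) + 1
    return closed_journal_ids[start:]
```

An empty list means the slave is already up to date; None means a different replication mode is needed.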

If at any point the replication journal overflows on the master, the replication overflow action is called. By default this will do nothing (although the slave will then be out of date), though the configuration file may set the action to fall back to DCM mode.

DCM replication

Before we delve into DCM replication, let’s define the DCM itself. The DCM (data change map) is a bitmap where bits in the map correspond to areas on the data provider. Suppose we have a 1MB (1024^2 * 8 bits) DCM, and a 1TB (1024^4 bytes) data provider. In this case, each bit in the bitmap corresponds to a region of 128KB in the data provider (this may not be a good example, or a suitable size to use, so should be tunable).
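The arithmetic, and the bitmap itself, can be checked with a few lines of Python. DataChangeMap is a hypothetical name and the sizes are parameters, since the text suggests they should be tunable.

```python
class DataChangeMap:
    def __init__(self, dcm_bytes, provider_bytes):
        self.bits = bytearray(dcm_bytes)
        nbits = dcm_bytes * 8
        # Bytes of data provider covered by each bit, rounded up.
        self.region_size = -(-provider_bytes // nbits)

    def mark(self, offset, length):
        """Set the bit for every region touched by a write."""
        first = offset // self.region_size
        last = (offset + length - 1) // self.region_size
        for region in range(first, last + 1):
            self.bits[region // 8] |= 1 << (region % 8)

    def is_set(self, region):
        return bool(self.bits[region // 8] & (1 << (region % 8)))

    def clear(self, region):
        self.bits[region // 8] &= ~(1 << (region % 8)) & 0xFF

    def dirty_regions(self):
        for region in range(len(self.bits) * 8):
            if self.is_set(region):
                yield region
```

With a 1MB map and a 1TB provider, region_size works out to 131072 bytes, which is the 128KB figure above. A write that straddles a region boundary sets two bits.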

When we enter DCM mode, the master first sets the DCM up-to-date bit in the metadata off. It then reads all the bios waiting in the replication queue and, for each one, sets the corresponding bit in the DCM to 1. When finished, it sets the DCM up-to-date bit on. As each journal entry is added to the flush-to-disk queue, the corresponding bit in the DCM is set on.

In DCM replication mode, each bit is checked in the DCM, and if the bit is set, the master reads the data in the corresponding region on the data provider. It then optionally compresses it and sends it to the slave, which acknowledges and writes the data to disk.

When there are no bits left on in the DCM, the slave and master are in sync, and normal replication can resume.
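A minimal model of that loop in Python follows. A set of region numbers stands in for the DCM bits, and read_region and send are hypothetical callbacks. Clearing the bit before reading the region, so that a write arriving mid-copy would set it again, is one possible ordering and an assumption of mine rather than part of the proposal. Compression uses zlib purely as an example.

```python
import zlib


def dcm_replicate(dirty, read_region, send, compress=True):
    """Send every dirty region to the slave.

    dirty       -- set of dirty region numbers (stand-in for the DCM)
    read_region -- read_region(region) -> bytes from the data provider
    send        -- send(region, payload, compressed) -> True if acknowledged
    Returns True when no dirty regions remain.
    """
    for region in sorted(dirty):
        dirty.discard(region)        # clear the bit before reading
        data = read_region(region)
        payload = zlib.compress(data) if compress else data
        if not send(region, payload, compress):
            dirty.add(region)        # not acknowledged, so still dirty
            return False
    return not dirty
```

Note that regions are sent in offset order, not write order, which is exactly why the slave is not guaranteed to be consistent until the loop completes.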

Note that during DCM replication, write-order fidelity is not maintained, so the slave may not be consistent – there is no guarantee that data is recoverable.

Also there’s still some working out to do here to make sure we don’t miss any writes where the DCM bit has already been reset.

Resync replication

When resync replication is started (usually manually – it can be used to initially sync the master and slave), all the bits in the DCM on the master are set on. For each corresponding region on the data provider, the slave and master each read the region, and create a checksum. If the checksums match, they move on to the next region (and reset the bit in the DCM). If not, then the master sends the data to the slave.

Note that during the resync, the DCM needs to be updated as new writes are made to the master.
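The compare-then-send step can be sketched as below, with two bytearrays standing in for the master's and slave's data providers. The function name resync and the choice of SHA-256 are assumptions; the proposal only says "checksum". This model also ignores new writes arriving during the resync.

```python
import hashlib


def resync(master, slave, region_size):
    """Bring `slave` in line with `master`, region by region.

    master and slave are bytearrays of equal length. Returns the number
    of regions whose data actually had to be sent.
    """
    sent = 0
    for start in range(0, len(master), region_size):
        end = start + region_size
        master_sum = hashlib.sha256(master[start:end]).digest()
        slave_sum = hashlib.sha256(slave[start:end]).digest()
        if master_sum != slave_sum:
            slave[start:end] = master[start:end]   # send the data
            sent += 1
        # Matching checksums: nothing to send, just move on.
    return sent
```

Only the checksums cross the network for regions that already match, which is what makes a resync over a slow link bearable when most of the data is unchanged.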

So what about swapping the slave/master roles?

If both systems are online, a clean transfer can be initiated by the grjournald daemons sending a request to each other. Writes to the master must be suspended, and the replication queue allowed to drain. Once the queue has drained, the roles are transferred, and writes are enabled on the new master.
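The ordering of a clean transfer matters, so here it is as a small Python model. Node, clean_swap and the send callback are hypothetical, and aborting the swap (re-enabling writes on the old master) when a journal cannot be sent is my assumption.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    role: str
    writes_enabled: bool
    queue: deque = field(default_factory=deque)   # unreplicated journals


def clean_swap(master, slave, send):
    """Swap roles; send(journal) returns True when the slave acknowledges."""
    master.writes_enabled = False         # 1. suspend writes
    while master.queue:                   # 2. drain the replication queue
        if not send(master.queue[0]):
            master.writes_enabled = True  # abort: nothing has changed roles
            return False
        master.queue.popleft()
    master.role, slave.role = "slave", "master"   # 3. transfer roles
    slave.writes_enabled = True           # 4. enable writes on the new master
    return True
```

At no point are writes enabled on both nodes, and the roles only change once the slave has everything the master had.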

If the master and slave are unable to communicate, the command may be run on the slave with the -f switch, which takes over the master role and enables writes.

Scope

Data link encryption

Rate limiting for the data link

Ability to pause the data

Ability to allow access to slave snapshots (for backups etc)

Other things which would probably be less easy, but not impossible, include:

Allowing multiple slaves
Bunker replication