Talk about responsive…

I’ve been using Nagios for a while to monitor several servers. When I set it up, one of the things I left until later was monitoring the PostgreSQL databases, as it requires a plugin, and I wasn’t sure which one to try. Then I noticed an announcement on the PostgreSQL website that a new version (2.9.2) of the check_postgres.pl plugin was available, so I thought I’d give it a try.

The plugin turned out to be easy to install and configure, though a little clarification in places in the documentation wouldn’t go amiss. Among the checks the plugin performs are two meta-checks: one to see whether a new version of the plugin itself is available, and one to see whether a new version of PostgreSQL is available. Neither of these worked for me.

I had a dig through the source code, and saw that there were five commands used to retrieve the latest version information from the internet. FreeBSD doesn’t have any of these commands installed by default, but it does come with fetch, which does the same job. So I added an additional line for fetch to the plugin, and tested it. It worked just fine, so the next step was to feed that change back to the check_postgres.pl project. There’s no bug tracker on the site, and I didn’t feel like signing up to the mailing list, so I emailed the developer.
I then told the plugin to manually recheck for a new version (I’d set it to check automatically only once per day), and it reported that a new version (2.9.3) was available. At first I thought I’d done something wrong, until I downloaded the new version and found my patch incorporated in it (along with a few other fixes).
Time from emailing the developer to the new version being released: 38 minutes
Gotta love open source and responsive developers…

Posted in FreeBSD, PostgreSQL | Leave a comment

FreeBSD 7.2 dmesg on the Dell PowerEdge R300

While looking at the stats for this site, I noticed that plenty of people seem to be searching for information on running FreeBSD on the Dell PowerEdge R300. All I can tell you is that since FreeBSD 7.1 it works just fine, without modification, using the GENERIC kernel.

Below is the dmesg from FreeBSD 7.2 (amd64). This is for a standard R300, with no additional options other than a second hard drive. It also has a USB memory stick connected to the internal USB port, so that the server can still be booted even if both hard drives fail (theoretically – not tested).

Copyright (c) 1992-2009 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 7.2-RELEASE-p1 #5: Tue Jun 16 21:42:57 BST 2009
xxxxx@xxx.xxxxxxx.xx.xx:/usr/obj/usr/src/sys/GENERIC
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) Xeon(R) CPU X3323 @ 2.50GHz (2500.01-MHz K8-class CPU)
Origin = "GenuineIntel" Id = 0x10676 Stepping = 6
Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
Features2=0xce3bd<SSE3,RSVD2,MON,DS_CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,DCA,<b19>>
AMD Features=0x20100800<SYSCALL,NX,LM>
AMD Features2=0x1<LAHF>
Cores per package: 4
usable memory = 2128711680 (2030 MB)
avail memory = 2052505600 (1957 MB)
ACPI APIC Table: <DELL PE_SC3 >
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
cpu0 (BSP): APIC ID: 0
cpu1 (AP): APIC ID: 1
cpu2 (AP): APIC ID: 2
cpu3 (AP): APIC ID: 3
ioapic0: Changing APIC ID to 4
ioapic0 <Version 2.0> irqs 0-23 on motherboard
kbd1 at kbdmux0
acpi0: <DELL PE_SC3> on motherboard
acpi0: [ITHREAD]
acpi0: Power Button (fixed)
Timecounter "ACPI-safe" frequency 3579545 Hz quality 850
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x808-0x80b on acpi0
acpi_hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff on acpi0
Timecounter "HPET" frequency 14318180 Hz quality 900
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
pcib1: <PCI-PCI bridge> at device 2.0 on pci0
pci3: <PCI bus> on pcib1
pcib2: <PCI-PCI bridge> at device 3.0 on pci0
pci4: <PCI bus> on pcib2
pcib3: <ACPI PCI-PCI bridge> at device 4.0 on pci0
pci5: <ACPI PCI bus> on pcib3
pcib4: <PCI-PCI bridge> at device 5.0 on pci0
pci6: <PCI bus> on pcib4
pcib5: <ACPI PCI-PCI bridge> at device 6.0 on pci0
pci7: <ACPI PCI bus> on pcib5
pcib6: <ACPI PCI-PCI bridge> at device 7.0 on pci0
pci8: <ACPI PCI bus> on pcib6
pci0: <base peripheral> at device 8.0 (no driver attached)
pcib7: <PCI-PCI bridge> irq 16 at device 28.0 on pci0
pci9: <PCI bus> on pcib7
pcib8: <ACPI PCI-PCI bridge> irq 16 at device 28.4 on pci0
pci1: <ACPI PCI bus> on pcib8
pci0:1:0:0: failed to read VPD data.
bge0: <Broadcom BCM5722 A0, ASIC rev. 0xa200> mem 0xdfdf0000-0xdfdfffff irq 16 at device 0.0 on pci1
miibus0: <MII bus> on bge0
brgphy0: <BCM5722 10/100/1000baseTX PHY> PHY 1 on miibus0
brgphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
bge0: Ethernet address: 00:1e:4f:33:d6:d0
bge0: [ITHREAD]
pcib9: <ACPI PCI-PCI bridge> irq 17 at device 28.5 on pci0
pci2: <ACPI PCI bus> on pcib9
bge1: <Broadcom NetXtreme Gigabit Ethernet Controller, ASIC rev. 0xa200> mem 0xdfef0000-0xdfefffff irq 17 at device 0.0 on pci2
miibus1: <MII bus> on bge1
brgphy1: <BCM5722 10/100/1000baseTX PHY> PHY 1 on miibus1
brgphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
bge1: Ethernet address: 00:1e:4f:33:d6:d1
bge1: [ITHREAD]
uhci0: <UHCI (generic) USB controller> port 0xdc80-0xdc9f irq 21 at device 29.0 on pci0
uhci0: [GIANT-LOCKED]
uhci0: [ITHREAD]
usb0: <UHCI (generic) USB controller> on uhci0
usb0: USB revision 1.0
uhub0: <Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1> on usb0
uhub0: 2 ports with 2 removable, self powered
uhci1: <UHCI (generic) USB controller> port 0xdca0-0xdcbf irq 20 at device 29.1 on pci0
uhci1: [GIANT-LOCKED]
uhci1: [ITHREAD]
usb1: <UHCI (generic) USB controller> on uhci1
usb1: USB revision 1.0
uhub1: <Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1> on usb1
uhub1: 2 ports with 2 removable, self powered
uhci2: <UHCI (generic) USB controller> port 0xdcc0-0xdcdf irq 21 at device 29.2 on pci0
uhci2: [GIANT-LOCKED]
uhci2: [ITHREAD]
usb2: <UHCI (generic) USB controller> on uhci2
usb2: USB revision 1.0
uhub2: <Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1> on usb2
uhub2: 2 ports with 2 removable, self powered
ehci0: <EHCI (generic) USB 2.0 controller> mem 0xdfcffc00-0xdfcfffff irq 21 at device 29.7 on pci0
ehci0: [GIANT-LOCKED]
ehci0: [ITHREAD]
usb3: EHCI version 1.0
usb3: companion controllers, 2 ports each: usb0 usb1 usb2
usb3: <EHCI (generic) USB 2.0 controller> on ehci0
usb3: USB revision 2.0
uhub3: <Intel EHCI root hub, class 9/0, rev 2.00/1.00, addr 1> on usb3
uhub3: 6 ports with 6 removable, self powered
umass0: <LEXAR MEDIA JUMPDRIVE SPORT, class 0/0, rev 2.00/30.00, addr 2> on uhub3
uhub4: <vendor 0x04b4 product 0x6560, class 9/0, rev 2.00/90.15, addr 3> on uhub3
uhub4: single transaction translator
uhub4: 2 ports with 2 removable, self powered
pcib10: <ACPI PCI-PCI bridge> at device 30.0 on pci0
pci10: <ACPI PCI bus> on pcib10
vgapci0: <VGA-compatible display> port 0xec00-0xecff mem 0xd0000000-0xd7ffffff,0xdfff0000-0xdfffffff irq 19 at device 7.0 on pci10
isab0: <PCI-ISA bridge> at device 31.0 on pci0
isa0: <ISA bus> on isab0
atapci0: <Intel ICH9 SATA300 controller> port 0xdc20-0xdc27,0xdc10-0xdc13,0xdc28-0xdc2f,0xdc14-0xdc17,0xdc40-0xdc4f,0xdc50-0xdc5f irq 23 at device 31.2 on pci0
atapci0: [ITHREAD]
ata2: <ATA channel 0> on atapci0
ata2: [ITHREAD]
ata3: <ATA channel 1> on atapci0
ata3: [ITHREAD]
atapci1: <Intel ICH9 SATA300 controller> port 0xdc30-0xdc37,0xdc18-0xdc1b,0xdc38-0xdc3f,0xdc1c-0xdc1f,0xdc60-0xdc6f,0xdc70-0xdc7f irq 22 at device 31.5 on pci0
atapci1: [ITHREAD]
ata4: <ATA channel 0> on atapci1
ata4: [ITHREAD]
ata5: <ATA channel 1> on atapci1
ata5: [ITHREAD]
fdc0: <floppy drive controller> port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on acpi0
fdc0: does not respond
device_attach: fdc0 attach returned 6
sio0: configured irq 4 not in bitmap of probed irqs 0
sio0: port may not be enabled
sio0: configured irq 4 not in bitmap of probed irqs 0
sio0: port may not be enabled
sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
sio0: type 16550A
sio0: [FILTER]
sio1: configured irq 3 not in bitmap of probed irqs 0
sio1: port may not be enabled
sio1: configured irq 3 not in bitmap of probed irqs 0
sio1: port may not be enabled
sio1: <16550A-compatible COM port> port 0x2f8-0x2ff irq 3 on acpi0
sio1: type 16550A
sio1: [FILTER]
cpu0: <ACPI CPU> on acpi0
est0: <Enhanced SpeedStep Frequency Control> on cpu0
p4tcc0: <CPU Frequency Thermal Control> on cpu0
cpu1: <ACPI CPU> on acpi0
est1: <Enhanced SpeedStep Frequency Control> on cpu1
p4tcc1: <CPU Frequency Thermal Control> on cpu1
cpu2: <ACPI CPU> on acpi0
est2: <Enhanced SpeedStep Frequency Control> on cpu2
p4tcc2: <CPU Frequency Thermal Control> on cpu2
cpu3: <ACPI CPU> on acpi0
est3: <Enhanced SpeedStep Frequency Control> on cpu3
p4tcc3: <CPU Frequency Thermal Control> on cpu3
fdc0: <floppy drive controller> port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on acpi0
fdc0: does not respond
device_attach: fdc0 attach returned 6
orm0: <ISA Option ROMs> at iomem 0xc0000-0xc8fff,0xc9000-0xc9fff,0xec000-0xeffff on isa0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
atkbd0: [ITHREAD]
ppc0: cannot reserve I/O port range
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
Timecounters tick every 1.000 msec
ad4: 152587MB <WDC WD1601ABYS-18C0A0 06.06H05> at ata2-master SATA300
acd0: DVDROM <TEAC DVD-ROM DV28SV/D.0E> at ata2-slave SATA150
ad6: 152587MB <WDC WD1601ABYS-18C0A0 06.06H05> at ata3-master SATA300
GEOM_MIRROR: Device mirror/gm0 launched (2/2).
GEOM_LABEL: Label for provider mirror/gm0s1 is msdosfs/DellUtility.
GEOM_LABEL: Label for provider mirror/gm0s2 is msdosfs/OS.
GEOM_LABEL: Label for provider mirror/gm0s3a is ufsid/481ca24bb929589f.
GEOM_LABEL: Label for provider mirror/gm0s3d is ufsid/481ca2504609c3b4.
GEOM_LABEL: Label for provider mirror/gm0s3e is ufsid/481ca24b7ec903d6.
GEOM_LABEL: Label for provider mirror/gm0s3f is ufsid/481ca24bcf21dd1b.
SMP: AP CPU #1 Launched!
SMP: AP CPU #2 Launched!
SMP: AP CPU #3 Launched!
da0 at umass-sim0 bus 0 target 0 lun 0
da0: <LEXAR JUMPDRIVE SPORT 1000> Removable Direct Access SCSI-0 device
da0: 40.000MB/s transfers
da0: 495MB (1014784 512 byte sectors: 64H 32S/T 495C)
GEOM_LABEL: Label for provider da0s1a is ufsid/4828bbc089fb6d4e.
Trying to mount root from ufs:/dev/mirror/gm0s3a
GEOM_LABEL: Label ufsid/481ca24bb929589f removed.
GEOM_LABEL: Label for provider mirror/gm0s3a is ufsid/481ca24bb929589f.
GEOM_LABEL: Label ufsid/481ca24b7ec903d6 removed.
GEOM_LABEL: Label for provider mirror/gm0s3e is ufsid/481ca24b7ec903d6.
GEOM_LABEL: Label ufsid/481ca24bcf21dd1b removed.
GEOM_LABEL: Label for provider mirror/gm0s3f is ufsid/481ca24bcf21dd1b.
GEOM_LABEL: Label ufsid/481ca2504609c3b4 removed.
GEOM_LABEL: Label for provider mirror/gm0s3d is ufsid/481ca2504609c3b4.
GEOM_LABEL: Label ufsid/481ca24bb929589f removed.
GEOM_LABEL: Label ufsid/481ca24b7ec903d6 removed.
GEOM_LABEL: Label ufsid/481ca24bcf21dd1b removed.
GEOM_LABEL: Label ufsid/481ca2504609c3b4 removed.
bge0: link state changed to UP

Posted in FreeBSD | Leave a comment

The Whitchurch Branch

The Whitchurch Branch of the Ellesmere Canal ran close to Whitchurch town centre. Today only a short section remains, used mainly for moorings.

The overlay below was based on information gained from the 1902 Ordnance Survey map. Because there are few distinguishing features remaining, I may have got the path a few metres out, but that’s all.

There are plans to reinstate the Whitchurch Branch, but looking at the overlay, I’m less optimistic than I was.

Download the Whitchurch Branch overlay (Google Earth required)

Posted in Canals | Leave a comment

The Prees Branch

The Prees Branch on the Llangollen Canal is today a short branch that leads to Whixall Marina. It was originally longer, reaching Quina Brook, though like many canal schemes it never reached its intended goal of Prees.

The bridge names come from the 1902 Ordnance Survey map. Interestingly, this map also refers to the branch as the Edstaston Branch. I’d never heard it called that before, though it makes sense, as the branch passed Edstaston but never reached Prees.

Some of this section of canal is now a designated Site of Special Scientific Interest.

Download the Prees Branch overlay (Google Earth required)

Posted in Canals | Leave a comment

The Woodhouse diversion

The Montgomery Canal has a long straight section, just south of the Perry aqueduct. It then heads towards Rednal Basin. It wasn’t always this way though.

When the canal was originally under construction, one of the members of the original Ellesmere Canal Committee persuaded the company to divert the course of the canal to run closer to his estate (the Woodhouse estate), with a short branch towards his house. Later the canal was moved back to its original planned line. As with so many things about the canal, the original documents are either vague or missing. Apparently the now disused section continued to hold water for some time, and was marked on old maps as “old Canal”. I can’t find any of these maps though!

So, the Google Earth overlay below is based on what I can see, making a best guess from ground markings, lines of trees, and a sketch of the old line. I can’t see the branch to the house, though. If you know more, you can always email me.

Download the Google Earth overlay

Information from “Montgomeryshire Canal and the Llanymynech Branch of the Ellesmere Canal” by John Horsley Denton (1982).

Posted in Canals | Leave a comment

No-one mentions the Weston Arm…

The Weston Arm of the Montgomery Canal ran from Welsh Frankton to Weston Lullingfields, and was built as part of the Ellesmere Canal. It was originally going to be the main line, taking traffic from the River Mersey at (what is now) Ellesmere Port to the River Severn at Shrewsbury; however, like so many canal schemes, it was never completed as intended. It ended up being closed in 1919 after a breach. Canal books make many references to the Weston Arm, but few give it any real detail. After seeing the tiny stub of the canal that remains, I wondered where it went.

I’d already spent a little time looking at Google Earth and cross-referencing with old maps. The resulting overlay sat on my computer for a few months, doing nothing, until a friend mentioned that he’d been wondering the same. So this is for Martin! The file below should open in Google Earth (which you can get from http://earth.google.com/).

weston branch.kmz

Posted in Canals | 6 Comments

My First UAC Encounter

I’ve been lucky enough to stay away from Windows Vista, having moved to the Mac at home. I’ve had several chances to play with Windows Vista, but I still prefer the look and feel of Windows XP. A friend who is a Vista user recently had to reinstall from scratch (due to a hardware issue, not a Windows issue), and I needed to use his PC (since reading NTFS filesystems on an old hard disk is something Windows is pretty good at).

The first time I used this PC, before the reinstall, my friend had disabled UAC, as he found it too annoying. However, after the reinstall he’d left it enabled to see if it was any less annoying. I’d not used Vista with UAC enabled before, so I didn’t know just how intrusive it was. I found out quite quickly.

I clicked Start, typed mmc, and hit Enter. I was immediately confronted with a dialog box stating that mmc was trying to run, and asking whether I should allow it. I found it unbelievable that I was being asked to confirm the start of a program I had explicitly asked to start. I’d thought UAC was there to stop programs being started without your knowledge, not to ask you to confirm every single action!

I was left feeling like Vista’s interface was a step backwards from XP.

This post was brought to you by the “I like to blog about it two years after everyone else” department.

Posted in Windows | Leave a comment

Installing FreeBSD 7.0 on the Dell PowerEdge R300

The Dell PowerEdge R300 has a Broadcom 5722 network card, which isn’t supported by the latest release of FreeBSD (7.0). Patches for the Broadcom 5722 are in the development versions of FreeBSD, but getting the development versions is easiest if you have net access, which is hard without a working network card.

The simplest way to get FreeBSD working is to download FreeBSD disc 1 (you probably want the amd64 version if you have plenty of memory), then install FreeBSD in the usual way (as described in the handbook).

During installation you will see the first disk as device ad4. If you have a second disk it will be ad6. sysinstall will tell you that you have two fdisk partitions already (if you opted for Dell not to install an operating system). ad4s1 is Dell’s utilities partition. I chose not to remove these partitions, as I didn’t know what ad4s2 was, and there was no easy way of finding out from sysinstall. Once FreeBSD was installed I could mount ad4s2 (an msdosfs filesystem) and see that it contained… nothing! I could therefore have removed it during the install. It would also have been fine to remove ad4s1, as the diagnostic utilities are also available from Dell on a CD, so provided the CD drive is functional (or present) it’s easy to run the latest version of the diagnostics.

During the installation, you can install whichever packages you wish (I always choose a minimal install), but you also need to install the kernel source code. Once you’ve installed, you then need to apply the changes found at:

These changes can be made with a text editor such as vi. Once the changes have been made, recompile the kernel (the GENERIC kernel is fine, or roll your own) as described in the handbook.

When you reboot, dmesg should show extra lines such as:

bge0: <Broadcom BCM5722 A0, ASIC rev. 0xa200> mem 0xdfdf0000-0xdfdfffff irq 16 at device 0.0 on pci1
miibus0: <MII bus> on bge0
brgphy0: <BCM5722 10/100/1000baseTX PHY> PHY 1 on miibus0
brgphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
bge0: Ethernet address: 00:1e:4f:00:00:00
bge0: [ITHREAD]
pcib9: <ACPI PCI-PCI bridge> irq 17 at device 28.5 on pci0
pci2: <ACPI PCI bus> on pcib9
bge1: <Broadcom NetXtreme Gigabit Ethernet Controller, ASIC rev. 0xa200> mem 0xdfef0000-0xdfefffff irq 17 at device 0.0 on pci2
miibus1: <MII bus> on bge1
brgphy1: <BCM5722 10/100/1000baseTX PHY> PHY 1 on miibus1
brgphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
bge1: Ethernet address: 00:1e:4f:00:00:01
bge1: [ITHREAD]

After doing this you may wish to consider building world and the kernel, and installing them onto a USB key inserted into the internal USB socket. Set your BIOS to boot from the hard disks first, then USB, and voila: you have a recovery environment (with a working network card) if your disks are rendered unbootable.

Note that with FreeBSD 7.1 and later the network interfaces work “out of the box”.

Posted in FreeBSD | Leave a comment

What makes banana chips healthy?

Seen on BBC Breakfast this morning in an interview about the government’s 5-a-day campaign:

Q: “What’s in banana chips that makes them healthy?”
A: “Banana”

Posted in Uncategorized | Leave a comment

A replicated filesystem proposal for FreeBSD

FreeBSD is a favourite operating system of mine. I’ve been using it for several years, starting with 5.0. One of the things that was new with 5.0 was GEOM – a disk infrastructure which is more flexible than the previous system.

The number of geom providers has expanded over the years, to about 20 today, providing striping, mirroring, encryption, multipathing and plenty of other things (some without man pages, so I don’t know what they do).

The ability to use different geoms together means that we can combine them to have (for example) encrypted striped mirrors.

Mirroring (RAID1) is commonly used to protect against disk failure, but only on a single machine. Gianpaolo Del Matto has been ambitious and combined gmirror and ggate (which allows you to access devices on another machine) to mirror a filesystem between two systems. However, in the event of a device failure or a network problem, the affected device needs to be removed and reinserted into the mirror, and during the rebuild process the mirror is useless. It is also only suitable for fast interconnecting networks: if you have a slow network connection between the two servers and a very large amount of data, a network interruption will require a rebuild that may take a considerable amount of time.

So what if our two copies are on opposite sides of the world? What if the interconnecting network is slow? What if a large number of writes are made all in one go?

As far as I can see, there’s nothing for FreeBSD that provides filesystem replication between geographically separate servers (rsync scripts and so on don’t provide instant updates). There is a product that performs this function for AIX, HP-UX, Solaris, Red Hat Enterprise Linux, SuSE Linux and even Windows: Symantec’s Veritas Volume Replicator (also known as VVR). Unsurprisingly it costs money. Quite a lot of it.

It seems that FreeBSD could gain an advantage over other operating systems if it could replicate (at least the core functionality of) VVR. And given that the GEOM framework exists, along with gmirror, gjournal and ggate, it seems that, relatively speaking, it wouldn’t be too hard to add.

Terminology

Note that in this article, write-order fidelity means that writes are applied to the slave in the same order as they were applied on the master. The slave being consistent means that write-order fidelity has been maintained, so the slave represents the master as it was at some point in time. If write-order fidelity is not maintained, filesystem corruption and data loss may occur.

Synchronous/asynchronous refers to the replication mode between the master and the slave. The terms are also commonly used to describe how data is written to disk, but in this article I’ll only use them to describe the replication mode.

What’s so good about VVR?

One of VVR’s chief advantages over gmirror and DRBD (Distributed Replicated Block Device, for Linux) is that it can replicate asynchronously, using a log, so that network interruptions don’t require a mirror rebuild, and it can still maintain write-order fidelity. That’s not just asynchronous as in delayed by milliseconds, seconds or even minutes, but potentially hours or days. By maintaining write-order fidelity wherever possible, the slave can be used if the master is destroyed or lost. As soon as you lose write-order fidelity, there are no guarantees that the data on the slave is any use at all.

Why might we want this?

What applications might this be used for? There are plenty. One example is a web application which uses both a database and file uploads. If you were using a combination of hot standby for your database and rsync for your uploaded files, you might end up with a situation where your standby database references a file which doesn’t exist yet on your standby server, or where you have an orphaned file which is not referenced by the database. By using replication which maintains write-order fidelity, the database files and uploaded files could be replicated together to another webserver located in a different datacentre. If the main datacentre goes up in smoke, switch the web application to the second datacentre, and with a change of DNS you’ve got global high availability.

For those who replicate data on a SAN using SAN-based replication, the ability to have the OS replicate the data removes the need to purchase expensive licenses, and removes lock-in to hardware vendors. (By the time you’ve paid for two disk arrays, one at each end of a link, plus the replication license, costs can quickly mount up.)

VVR is typically used with clustering (such as with Veritas Cluster Server), and while HA clustering on FreeBSD might work differently (using jails, for example), VVR-like functionality would remove a potential obstacle.

How can we do this with FreeBSD (or another OS)?

If a new geom provider were to be created for this purpose (called, for the sake of argument, grjournal), then some of the functionality (and presumably code) from gjournal and ggate could be reused. ggate already has a network daemon (ggated) and a client application (ggatec), so the ability to send data over a network is already present in FreeBSD. gjournal already has the concept of data journalling and a separate journal which maintains data consistency.

Let’s have a look at how gjournal works. This section is based on the RELENG_7 source code (which I may or may not have interpreted correctly) and posts to the freebsd-geom mailing list by Pawel Jakub Dawidek, who wrote gjournal, ggate and gmirror.

gjournal dissection

Before gjournal can be used, a gjournal device must be created. If you label a single consumer (such as a bsdlabel slice), gjournal will create a journal data segment at the end of the consumer (1GB unless you explicitly specify the size) and use the rest of the consumer as the provider, with a geom label at the end. If you pass two consumers as attributes to gjournal, the entire second consumer is used for the journal data, and the entire first consumer is the data provider. Again, each consumer has a geom label at the end.

The geom label contains the usual geom data (magic value, version number, journal unique ID, provider type, provider (if hardcoded), provider’s size, MD5 hash), plus some gjournal specific metadata (the start and end offsets of the journal, the last known consistent journal offset, the last known consistent journal ID, journal flags). (From g_journal_metadata in sys/geom/journal/g_journal.h)

When gjournal is running, a journal is created, and a header is written which contains the journal’s ID and the ID of the next journal. The IDs are randomly generated. For each batch of writes, a record header is added, which contains a record header ID, the number of entries, and then each entry, with its offsets and length. The size of a journal is limited by how long it may remain open (10 seconds by default), how much of the journal data provider/segment it may fill (70% by default), and how many record entries are allowed in a single journal (20 by default). The journal keeps track of how much of the journal provider is in use, and if the provider overflows, it will panic the system. (From g_journal_check_overflow in sys/geom/journal/g_journal.h)
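For orientation, the on-disk records just described look roughly like this (an approximation for illustration only; the authoritative definitions live in sys/geom/journal/g_journal.h):

    #include <stdint.h>

    /* Written when a new journal is opened. */
    struct journal_header {
        uint32_t jh_journal_id;       /* this journal's (random) ID */
        uint32_t jh_journal_next_id;  /* ID of the journal that follows it */
    };

    /* One entry per write covered by a record. */
    struct record_entry {
        uint64_t je_joffset;          /* where the data sits in the journal */
        uint64_t je_offset;           /* where it belongs on the data provider */
        uint64_t je_length;           /* length of the write */
    };

    /* Prepended to each batch of writes added to the journal. */
    struct record_header {
        uint32_t jrh_id;              /* record header ID */
        uint16_t jrh_nentries;        /* number of entries that follow */
        /* struct record_entry entries[] follow */
    };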

When a journal is closed (because it has been open long enough, filled enough of the journal provider, or had enough writes), its records are added to a queue to be flushed to disk. When this happens, metadata is updated to indicate that copying has started. If optimisation is enabled, the journal data is optimised, then sent to the data provider. When the copy has finished, the metadata is updated to indicate that no copy is in progress, and the journal ID and offset of the successfully committed journal are stored in the metadata. A second journal is started after the end of the closed journal.

Writes are optimised by keeping only the last write to a block if there are multiple writes to the same block, combining neighbouring writes into a single write, and reordering the writes within the journal into the correct order. (From g_journal_insert and g_journal_optimize in sys/geom/journal/g_journal.h)
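To make the “combining neighbouring writes” case concrete, here is a minimal userland illustration of the idea (the real g_journal_optimize() works on queued struct bio chains inside the kernel):

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    struct write_ent {
        uint64_t offset;   /* byte offset on the data provider */
        size_t   length;   /* length of the write */
        char    *data;     /* payload */
    };

    /* Coalesce b into a when a ends exactly where b begins; 1 on success. */
    static int
    try_merge(struct write_ent *a, const struct write_ent *b)
    {
        char *p;

        if (a->offset + a->length != b->offset)
            return (0);
        p = realloc(a->data, a->length + b->length);
        if (p == NULL)
            return (0);
        memcpy(p + a->length, b->data, b->length);
        a->data = p;
        a->length += b->length;
        return (1);
    }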

When a journal device is started, if the metadata indicates that a copy was in progress, then the journal is reread and its records are added to the queue to be flushed to disk. If this cannot be done, the journal is reinitialised and marked as dirty.

When reads are requested from the journalled device, some cleverness is done to check first for the data in the cache, then the journal, then the disk.

How could gjournal be modified to suit our needs?

So now we have a better understanding of how gjournal works, how could we modify it to support replication? Let’s try and keep it as simple as we can.

The first thing is that we want to modify the gjournal geom as little as possible. We want the kernel to have just the code for reading and writing data to and from a device; all the tricky stuff should live in userland. Doing this not only keeps the kernel smaller and simpler, but also allows changes from gjournal to be merged back in more easily.

As gjournal uses the journal as the unit of commit (either all the writes in a journal are completed, or none of them are considered completed), it would make sense to use it as the unit of replication too (either all the writes in a journal are replicated, or none of them are considered replicated). When gjournal has written all the records from one journal to disk, the metadata is updated to reflect the new last known consistent journal (md_joffset and md_jid). We could copy this, so that when the records in a journal have been successfully replicated, we update the metadata items for the replicated journal’s ID and offset on the master (md_rjoffset and md_rjid). This information relates to the filesystem itself, so it can be added to the metadata. These two pieces of metadata will track the replication of journals in the same way gjournal uses md_jid and md_joffset to track the writing of data to the data provider.

Let’s modify the journal device creation process slightly, so that as well as creating an area for the journal and storing its starting and ending offsets in the metadata, we also reserve a small space (say 1MB by default) in the journal provider for use as a Data Change Map, storing its start and end offsets in the metadata too. We will also add some extra bits to the metadata, described below.

gjournal already monitors the usage of the journal provider, both to know when to switch journals and to panic the system if the journal overflows. It checks whether to panic by calculating whether the active journal is overwriting the inactive journal – basically a test of whether we are writing over md_joffset, the offset of the last journal. When gjournal checks for journal overflow, it could also check whether the active journal is overwriting the first unreplicated journal, and if so, perform the action for a replication overflow. A journal overflow is a big deal, so it panics the kernel. A replication overflow is not such a big deal, so instead we can track whether or not the replica has overflowed; this could possibly be stored in the md_flags metadata.

This leaves us with the following metadata to be added to struct g_journal_metadata, with the necessary modifications applied to journal_metadata_encode, journal_metadata_decode_v1 (a new function based on journal_metadata_decode_v0) and journal_metadata_dump. We should also bump the metadata version number (md_version) up to 1.

Name          Data type   Description
md_dcmstart   uint64_t    The starting offset of the Data Change Map
md_dcmend     uint64_t    The ending offset of the Data Change Map
md_rjid       uint32_t    The ID of the journal last replicated to the slave
md_rjoffset   uint64_t    The offset of the journal last replicated to the slave
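In C, this amounts to something like the following sketch (the existing field names are recalled from sys/geom/journal/g_journal.h and may not be exact; the new fields follow the table above):

    #include <stdint.h>

    struct g_journal_metadata {
        /* ... existing fields: md_magic, md_version, md_id, md_type,
         * md_provider, md_provsize, md_jstart, md_jend, md_joffset,
         * md_jid, md_flags, md_hash ... */
        uint64_t md_dcmstart;   /* starting offset of the Data Change Map */
        uint64_t md_dcmend;     /* ending offset of the Data Change Map */
        uint32_t md_rjid;       /* ID of the journal last replicated */
        uint64_t md_rjoffset;   /* offset of the journal last replicated */
    };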

At this point we are tracking some additional metadata, but not doing much with it. We could add functions for the following:

  • Read next unreplicated journal. This reads the md_rjoffset and md_rjid metadata, finds the last-replicated journal, and reads its header to find the next journal. It then reads that journal, returns it, and retains its offset and ID in memory.
  • Mark next journal as replicated. This updates the md_rjid and md_rjoffset metadata to the values retained in memory.
  • Import journal. This allows an entire journal to be written to the journal provider in one go.

So if we have essentially copied gjournal, added some attributes for tracking replication, and created some functions to get data into and out of the system (sketched below), what happens with these?
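As a rough sketch, the three operations above might have kernel interfaces along these lines (all names and signatures here are invented for illustration; struct g_journal_softc is gjournal’s existing per-device state):

    #include <sys/types.h>

    struct g_journal_softc;     /* from sys/geom/journal/g_journal.h */

    /* Read the next unreplicated journal into *journal, retaining its
     * ID and offset so it can later be marked as replicated. */
    int g_journal_read_next_unreplicated(struct g_journal_softc *sc,
        void **journal, size_t *size, uint32_t *jid, uint64_t *joffset);

    /* Update md_rjid/md_rjoffset once the slave has acknowledged. */
    int g_journal_mark_replicated(struct g_journal_softc *sc, uint32_t jid,
        uint64_t joffset);

    /* Write an entire received journal to the journal provider in one go
     * (used on the slave). */
    int g_journal_import(struct g_journal_softc *sc, const void *journal,
        size_t size);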

The daemon

This is where (some of) the functionality of ggate is mirrored. We could have a daemon (which we’ll call grjournald). When grjournald starts up, it reads a configuration file telling it which geom IDs it should be replicating and which IP addresses it should be replicating with, then attempts to handshake with each of these peers.

In the event of having a single slave and a single master, the master then requests the next unreplicated journal and sends it to the slave. The slave grjournald receives the data and writes it to its journal provider (we don’t write directly to the data provider, otherwise the slave wouldn’t be journalled). When all the records from that journal have been sent to the slave, the slave sends an acknowledgement; the last replicated journal ID is updated in the metadata on the master, and the data provider is updated on the slave, using the normal “update metadata to mark as dirty, write records to disk, update metadata to mark as clean” process.

So now we have a system where journals from the master are replicated to the slave, and provided there is one master and one slave, and the journal never overflows (i.e. replication is fast enough, and writes are slow enough) everything will work perfectly.

But what happens if the journal overflows? How did we decide who was the master and who was the slave? How do we make the master the slave and vice versa?

The daemon’s workflow

All the settings below err on the side of caution. If we don’t have a setting telling us what to do, we do nothing and wait for an administrator to tell us what to do. If it doesn’t make sense to do something, we will let you do it if you force us, but not otherwise.

When the daemon starts, it will read a configuration file (say /etc/grjournald.conf by default). This will contain the geom ID, the IP addresses that should be involved in replication, and any other appropriate settings. We check that the daemon, kernel and data versions are all in sync.

Provided we have a valid configuration file, we try to connect to the partner IP address, and talk to a daemon there. If we fail, we do nothing but carry on listening and optionally trying to reconnect periodically.

If we find another grjournald on the partner IP address, we perform an initial handshake, checking that the grjournal versions are the same at both ends.

So how do the daemons initially decide who is the master, and who is the slave? By default they will do nothing, entering an administrative wait state until an administrator issues a command to say which should be the master.

If they both claim to be the master, it’s likely that the slave has forcefully taken over the master role while the original master has been down or disconnected. In this case we go to an administrative wait state, unless the configuration file says to recover automatically.

A map for the autorecovery actions, with the two partners’ last known state IDs across the top and down the side, is below. Abbreviations: W = WAIT, LM-N/LM-D/LM-R = Lastonline becomes master with normal/DCM/resync replication, LM* = Lastonline becomes master*, RN/RD/RR = Resume normal/DCM/resync replication.

      0    1      2      3      4      5      6
  0   W    W**    W**    W**    W**    W**    W**
  1        LM-N   RN     LM*    LM*    LM*    RR
  2               W**    LM*    W**    RR     W**
  3                      LM-D   RD     RR     RR
  4                             W**    RR     W**
  5                                    LM-R   RR
  6                                           W**

* = the replication mode chosen is the replication mode of the partner chosen as master (with the most recent last known master timestamp)
** = this state should not occur unless something has gone horribly wrong or things have been forced into inappropriate modes

If we need to automatically recover, and we are in a state where it is likely that the original slave has taken over the master role, we check the “last known master” metadata attribute, and the partner with the most recent timestamp becomes the master, with the replication mode determined by the new master. Note that this autorecovery will only work correctly if the original master has gone offline and not had any data modified. If the original master has had a failure and been restored from a backup, or been rebuilt, then a resync needs to be forced manually.

There is also a requirement for the systems to have synchronised clocks, so that the timestamps are accurate and autorecovery correctly selects the most recent master.

Normal replication

As described above, the flow of data in normal replication is:

  1. Journal entries are written to a second queue on the master
  2. Journal entries are replicated from the master to the slave
  3. When the slave has written all the entries in a journal to its journal provider, the metadata on the slave is updated with the ID and offset (in the master’s journal) of the next journal. The master is notified that the previous journal is complete.

At startup in normal replication, the last-replicated journal ID and offset on the master are set to a NULL value. The master requests the ID and offset of the last replicated journal from the slave, finds that journal and all subsequent journals, up to the inactive (last written to disk) journal, and adds the record entries for these journals to the replication queue.

If at any point the replication journal overflows on the master, the replication overflow action is called. By default this will be to do nothing (except that the slave will be out of date), though the configuration file may set the action to fall back to DCM mode.
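To make the master’s side concrete, below is a very rough userland sketch of the grjournald replication loop. Everything in it is invented for illustration – the ioctl numbers, the request structure, the journal size limit and the one-byte acknowledgement are not real interfaces:

    #include <sys/ioccom.h>
    #include <sys/ioctl.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define MAX_JOURNAL (64 * 1024 * 1024)  /* assumed upper bound on a journal */

    struct gj_journal_req {
        uint32_t jid;      /* journal ID, filled in by the kernel */
        uint64_t joffset;  /* journal offset, filled in by the kernel */
        uint64_t size;     /* size of the journal returned */
        void    *buf;      /* caller-supplied buffer */
    };

    /* Invented ioctl numbers, purely for illustration. */
    #define GJ_READ_NEXT        _IOWR('g', 100, struct gj_journal_req)
    #define GJ_MARK_REPLICATED  _IOW('g', 101, struct gj_journal_req)

    static void
    replicate_loop(int devfd, int sock)
    {
        struct gj_journal_req req;
        struct { uint32_t jid; uint64_t joffset, size; } hdr;
        char ack;

        req.buf = malloc(MAX_JOURNAL);
        if (req.buf == NULL)
            return;
        for (;;) {
            if (ioctl(devfd, GJ_READ_NEXT, &req) == -1)
                break;                  /* nothing unreplicated, or error */
            /* The journal is the unit of replication: ship it whole. */
            hdr.jid = req.jid;
            hdr.joffset = req.joffset;
            hdr.size = req.size;
            write(sock, &hdr, sizeof(hdr));
            write(sock, req.buf, req.size);
            if (read(sock, &ack, 1) != 1 || ack != 1)
                break;                  /* slave went away; reconnect later */
            /* Only now record md_rjid/md_rjoffset on the master. */
            ioctl(devfd, GJ_MARK_REPLICATED, &req);
        }
        free(req.buf);
    }

The slave side would do the converse: receive a journal, hand it to the kernel with the import function, and only acknowledge once it has been written to the slave’s journal provider.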

DCM replication

Before we delve into DCM replication, let’s define the DCM itself. The DCM (Data Change Map) is a bitmap where each bit corresponds to a region of the data provider. Suppose we have a 1MB (1024^2 * 8 bits) DCM and a 1TB (1024^4 bytes) data provider: each bit in the bitmap then corresponds to a region of 128KB on the data provider (this may not be a good example, or a suitable size to use, so it should be tunable).
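As a sketch of the arithmetic (sizes as in the example above, which are assumptions rather than recommendations):

    #include <stdint.h>

    #define DCM_SIZE_BYTES  (1024 * 1024)                  /* 1MB DCM */
    #define DCM_BITS        ((uint64_t)DCM_SIZE_BYTES * 8) /* 2^23 bits */

    /* Bytes of data provider covered by one DCM bit; for a 1TB provider
     * this is 2^40 / 2^23 = 128KB, as in the example above. */
    static uint64_t
    dcm_region_size(uint64_t provider_size)
    {
        return (provider_size / DCM_BITS);
    }

    /* Set the bit covering a given byte offset on the data provider. */
    static void
    dcm_set_bit(uint8_t *dcm, uint64_t offset, uint64_t region_size)
    {
        uint64_t bit = offset / region_size;

        dcm[bit / 8] |= (uint8_t)(1 << (bit % 8));
    }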

When we enter DCM mode, the master first sets the DCM up-to-date bit in the metadata off. It then reads all the bios waiting in the replication queue and sets the corresponding bit in the DCM for each. When finished, it sets the DCM up-to-date bit on. From then on, as each journal entry is added to the flush-to-disk queue, the corresponding bit in the DCM is set on.

In DCM replication mode, each bit in the DCM is checked, and if the bit is set, the master reads the data in the corresponding region of the data provider. It then optionally compresses it and sends it to the slave, which acknowledges it and writes the data to disk.

When there are no bits left on in the DCM, the slave and master are in sync, and normal replication can resume.

Note that during DCM replication, write-order fidelity is not maintained, so the slave may not be consistent – there is no guarantee that data is recoverable.

There’s also still some working out to do here, to make sure we don’t miss any writes where the DCM bit has already been reset.

Resync replication

When resync replication is started (usually manually – it can be used to sync the master and slave initially), all the bits in the DCM on the master are set on. For each corresponding region of the data provider, the slave and master each read the region and create a checksum. If the checksums match, they move on to the next region (and reset the bit in the DCM). If not, the master sends the data to the slave.

Note that during the resync, the DCM needs to be updated as new writes are made to the master.
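As a sketch of the per-region comparison (FNV-1a stands in here for whatever checksum would really be used; MD5 would be a natural fit, matching the existing label hash):

    #include <stddef.h>
    #include <stdint.h>

    /* Master and slave each checksum the same region; the master only
     * transmits regions whose checksums differ. */
    static uint64_t
    region_checksum(const uint8_t *buf, size_t len)
    {
        uint64_t h = 14695981039346656037ULL;   /* FNV-1a offset basis */

        for (size_t i = 0; i < len; i++) {
            h ^= buf[i];
            h *= 1099511628211ULL;              /* FNV-1a prime */
        }
        return (h);
    }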

So what about swapping the slave/master roles?

If both systems are online, a clean transfer can be initiated by the grjournald daemons sending a request to each other. Writes to the master are suspended, and the replication queue is drained. Once the queue has drained, the roles are transferred, and writes are enabled on the new master.

If the master and slave are unable to communicate, the command may be run on the slave with the -f switch, which takes over the master role and enables writes.

Scope

Features which should be relatively easy to add on top of this design include:

  • Data link encryption
  • Rate limiting for the data link
  • Ability to pause the data flow
  • Ability to allow access to slave snapshots (for backups etc.)
Other things which would probably be less easy, but not impossible, include:

  • Allowing multiple slaves
  • Bunker replication

Posted in Computing, FreeBSD | Leave a comment