Did you know you can create your own Linux AWS EC2 AMI that runs 100% ZFS for all filesystems (/, /boot - everything)? You can, and it’s not too hard as long as you are experienced with installing Linux without an installer. Here are rough instructions for setting this up with a modern Debian-based system (I’ve tested with Debian and Ubuntu). As far as I know, this is the first published account of how to set this up. There aren’t any prebuilt AMIs available that I know of, but I might just build one unless someone else beats me to it.
Why run ZFS for the root filesystem? Not only is ZFS a high-performing filesystem, but using native ZFS for everything makes storage management a cinch. For example, want to keep your root EBS volumes small? No problem - keep your AMI on a 1GB volume (yes, it’s possible to be that small), and extend the ZFS pool dynamically at runtime by attaching additional EBS volumes as needed. ZFS handles this extremely well.
Why build your own AMI instead of using a prebuilt one? There are a couple of good reasons, but the primary one is that you get a minimal AMI with the least cruft and bloat possible. Many of the prebuilt cloud AMIs have a bunch of packages installed that you might not need or want. By building from scratch, your AMI contains just the things you want, not only lowering EBS costs, but potentially reducing security risks.
Note that we don’t do anything special with ephemeral drives here - those are best kept in their own ZFS pool anyway, since mixing EBS and ephemeral drives will have some interesting performance consequences. You can use ephemerals on an instance, of course (in fact, it works great to stripe across all ephemeral drives, or you could use SSD ephemerals for ZFS L2ARC) - that’s just not the purpose of this article.
These instructions will only create an AMI that boots on an HVM instance type. Although it’s easy enough to create a snapshot that can be registered separately as either an HVM or a PV AMI, all new AWS instance types support HVM. Because of this, I’ve decided to only support newer instance types, hence HVM-only.
I’ve tested this with the upcoming Debian Stretch (“testing” as of this writing), as well as Ubuntu Yakkety. It should work with Ubuntu Xenial as well, but I wouldn’t try anything earlier, since ZFS support is relatively new and maturing rapidly (last time I tried Debian Jessie with 100% ZFS I found that grub was too old to support booting into ZFS, although a separate EXT4 /boot works fine. This may have changed since then).
Again, these instructions assume you are pretty familiar with installing Debian via debootstrap, which means manually provisioning volumes, partitioning them, creating filesystems, bootstrapping, and chrooting in for final setup. If you don’t know what all these things mean, you might find this a difficult undertaking. Unlike installing to your own hardware, there’s very little instrumentation if things go wrong, and only a read-only console (if you are lucky - if networking does not initialize properly, you might not even get that). Expect this to take a few iterations and some frustration - this is a general guide, not step-by-step instructions!
If you’ve ever installed a Debian based system from scratch, you’ll note that most of these steps are no different than you’d do on physical hardware. There are only a few things that are AWS specific, but the vast majority is exactly how you’d install on bare metal.
Fire up a host instance to build out the AMI. This doesn’t need to be the same distribution or version as the AMI to be built, but it has to be recent enough to have ZFS. Debian Jessie (with jessie-backports) or Ubuntu Xenial will work.
We’ll use this instance (the “host”) to build out the target AMI, and if things don’t go well we can come back to it and try again (so don’t terminate it until you are ready, or have a working target AMI).
Once the host instance is up, provision a GP2 EBS volume via the AWS console and attach it to the host. We use a 10GB volume, but you could make this as small as 1GB if you really want to (be aware GP2 doesn’t perform well with small volumes).
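If you prefer scripting this step, the AWS CLI equivalent looks roughly like the following (the availability zone, volume ID, and instance ID are placeholders - substitute your own):

$ aws ec2 create-volume --availability-zone us-east-1a --size 10 --volume-type gp2
$ aws ec2 attach-volume --volume-id vol-0123456789abcdef0 --instance-id i-0123456789abcdef0 --device /dev/xvdf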
We’ll assume the newly provisioned volume is attached at /dev/xvdf. The actual device might vary, use “dmesg” if you aren’t sure.
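For example:

$ dmesg | tail   # a newly attached volume shows up in the kernel log
$ lsblk          # or just list all block devices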
Next, update /etc/apt/sources.list with the full sources list for your host distribution. For Debian, use “main contrib non-free” - and you’ll need jessie-backports if the host is Jessie. For Ubuntu, use “main restricted universe multiverse”.
Next, install ZFS and debootstrap:
$ apt update
$ apt install zfs-zed zfsutils-linux zfs-initramfs zfs-dkms debootstrap gdisk
Now it’s time to set up ZFS on the new EBS volume. Assuming the target volume device is /dev/xvdf, we’ll create a GPT partition table with a small GRUB EFI partition and leave the rest of the disk for ZFS.
Be careful - many instructions out on the net for ZFS munge the sector geometry, or fake sgdisk into using an unnatural alignment. The following is correct (per AWS documentation) not only for EBS, but is the exact same geometry I use when installing Linux with ZFS root on physical hardware.
$ sgdisk -Zg -n1:0:4095 -t1:EF02 -c1:GRUB -n2:0:0 -t2:BF01 -c2:ZFS /dev/xvdf
This will create a small partition labelled GRUB (type EF02, 4096 sectors), and use the rest of the disk for a partition labelled ZFS (type BF01). The GRUB partition doesn’t technically need to be as big as 4096 sectors, but this ensures everything is aligned properly.
It’s worth noting that I never give ZFS a full disk; I always use partitions for ZFS pools. If you give ZFS the entire disk, it will create its own partition table, but waste 8MB in a Solaris partition that Linux has no use for.
OK, great, next up let’s create our ZFS pool and set up some filesystems. This will set the target up in /mnt. You can choose any mount point you want, just remember to use it consistently if you choose a different one.
I use the ZFS pool name “rpool”, but you can choose a different one, just be careful to substitute yours everywhere.
You may want different options - these globally enable lz4 compression and disable atime for the pool. You could instead leave compression off at the pool level and enable it only for specific filesystems; the choice is up to you. We also allow an overlay mount on /var. This is an obscure but important bit - when the system first boots, it logs to /var/log before the /var ZFS filesystem is mounted. Because the mount point is then non-empty, ZFS won’t mount /var unless the overlay flag is set. Note that /dev/xvdf2 is the second GPT partition we created above.
$ zpool create -o ashift=12 -O compression=lz4 -O atime=off -m none -R /mnt rpool /dev/xvdf2
$ zfs create -o mountpoint=/ rpool/root
$ zfs create -o mountpoint=/home rpool/home
$ zfs create -o mountpoint=/tmp rpool/tmp
$ zfs create -o overlay=on -o mountpoint=/var rpool/var
You may wish to have different ZFS filesystems, of course. And note we don’t set any quotas - we let all our filesystems share the entire storage pool.
At this point, the usual storage commands should show everything mounted up and ready for bootstrap (“zpool status”, “zfs list”, “df”, etc).
Now we’ll install our target distribution on the newly provisioned volume. There’s not much to do in this step:
$ debootstrap --arch amd64 stretch /mnt
Or, for Ubuntu Yakkety:
$ debootstrap --arch amd64 yakkety /mnt
Note that we can do this cross-distribution. We can bootstrap Ubuntu from a Debian host, or a Debian target from an Ubuntu host.
Next up we need to chroot into the target before doing final configuration.
$ mount --rbind /dev /mnt/dev
$ mount --rbind /proc /mnt/proc
$ mount --rbind /sys /mnt/sys
$ chroot /mnt
At this point, you should have a root shell into the target system.
Now we’ll do some final configuration. Some of the steps here are different between Debian and Ubuntu, but the general theme is the same.
Update /etc/apt/sources.list with the full sources list for your target distribution. For Debian, use “main contrib non-free”. For Ubuntu, use “main restricted universe multiverse”. Be sure you are setting up sources.list for your target distribution, not the host like we did before!
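For example, a minimal sources.list for a Stretch target might contain just (the mirror URL is one of several options):

deb http://deb.debian.org/debian stretch main contrib non-free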
Install packages, but when grub asks which devices to install to, be sure NOT to select any - you’ll have to acknowledge that this will result in a broken system (for now, anyway).
$ apt update
# Debian
$ apt install linux-image-amd64 linux-headers-amd64 grub-pc zfs-zed zfsutils-linux zfs-initramfs zfs-dkms cloud-init gdisk locales
$ ln -s /proc/mounts /etc/mtab
# Ubuntu
$ apt install linux-image-generic linux-headers-generic grub-pc zfs-zed zfsutils-linux zfs-initramfs zfs-dkms cloud-init gdisk
# All
$ dpkg-reconfigure locales # Choose en_US.UTF-8 or as appropriate
$ apt install --no-install-recommends openssh-server
Note the /etc/mtab symlink for Debian - there was a bug in ZFS that relied on /etc/mtab being present. Canonical fixed that bug in Ubuntu for us, but as of a couple of months ago Stretch didn’t yet have the fix - it’s probably fixed in Debian as well by now.
On Debian, I found I needed to modify GRUB_CMDLINE_LINUX in /etc/default/grub with the following. Note escaping ‘$’:
GRUB_CMDLINE_LINUX="boot=zfs \$bootfs"
This additional step might go away (or already be resolved) with a newer version of ZFS and grub in stretch. You could (should) probably add this to the grub.d configuration we add later, rather than here.
Verify grub and ZFS are happy. This is very important. If this step doesn’t work, there’s no point in continuing - the target will not boot.
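The check I use is something like this (it should print “zfs” with no errors):

$ grub-probe /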
This verifies that grub is able to probe filesystems and devices and has ZFS support. If this returns an error, the target system isn’t going to boot.
Everything is good, so let’s install grub:
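Assuming the target volume is still attached as /dev/xvdf:

$ grub-install /dev/xvdf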
Note we give grub the entire EBS volume of xvdf, not just xvdf1. This is important (installing to just the GRUB partition will result in a non-booting system).
Again, if this fails, you’ll need to diagnose why and potentially start over, as you won’t have a bootable target system.
Now we need to add a configuration file for grub to set a few things. To do this, create the file “/etc/default/grub.d/50-aws-settings.cfg” containing:
GRUB_RECORDFAIL_TIMEOUT=0
GRUB_TIMEOUT=0
GRUB_CMDLINE_LINUX_DEFAULT="console=tty1 console=ttyS0 ip=dhcp tsc=reliable net.ifnames=0"
GRUB_TERMINAL=console
This will configure grub to log as much as possible to the AWS console, get an IP address as early as possible, and force TSC (time source) to be reliable (an obscure boot parameter required for some AWS instance classes). net.ifnames is set so ethernet adapters are enumerated as ethX instead of ensXX.
Now, let’s update grub:
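$ update-grub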
You might want to check “/boot/grub/grub.cfg” at this stage to make sure the zfs module gets loaded and the kernel boot line looks right (vague advice, I know).
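One crude way to check, for example:

$ grep -E 'insmod zfs|boot=zfs' /boot/grub/grub.cfg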
Finally, set the ZFS cache and reconfigure - these might be unnecessary, but since this works, I superstitiously don’t skip it :-).
$ zpool set cachefile=/etc/zfs/zpool.cache rpool
$ dpkg-reconfigure zfs-dkms
Now, just a few sundry things left to do.
Update “/etc/network/interfaces” with:
auto eth0
iface eth0 inet dhcp
Again note that we’ve altered the boot commandline so network devices will be enumerated as ethX, instead of ensXX.
Don’t drop this config into “/etc/network/interfaces.d/eth0.cfg” - cloud-init will blacklist that configuration.
Finally, you may wish to provision and configure a user (cloud-init will set up a “debian” or “ubuntu” user by default). You may also want to give the root user a secure password and set PermitRootLogin in /etc/ssh/sshd_config, if that is appropriate for your environment and security policies.
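For example (the user name here is just a placeholder - adapt to your own policies):

$ adduser admin
$ passwd root
$ sed -i 's/^#\?PermitRootLogin.*/PermitRootLogin yes/' /etc/ssh/sshd_config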
Before creating an AMI, we need to exit the chroot, unmount everything, and export the pool - basically quiesce the target so the volume can be snapshotted.
Exit the chroot:
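$ exit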
Now, you should be back in the host instance.
Unmount the bind mounts (we use the lazy option, otherwise unmounts can fail):
$ umount -l /mnt/dev
$ umount -l /mnt/proc
$ umount -l /mnt/sys
And finally, export the ZFS pool.
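$ zpool export rpool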
Now, “zpool status”, “df”, etc should show that our target filesystems are unmounted, and /dev/xvdf is free to be safely cloned. If anything here fails (unmounting, exporting), the target will not be in a good state and won’t boot.
Now we are all set to create an AMI from our target EBS volume.
In the AWS console, take a snapshot of the target EBS volume - this should take a minute or two.
Next, also in the AWS console, select the snapshot and register a new AMI. Be sure to register as HVM and set up ephemeral mappings as you wish. Don’t mess with kernel ID and other parameters.
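The same thing can be scripted with the AWS CLI - roughly something like this, where the IDs and the AMI name are placeholders:

$ aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "ZFS root"
$ aws ec2 register-image --name zfs-root-hvm --architecture x86_64 --virtualization-type hvm --root-device-name /dev/xvda --block-device-mappings '[{"DeviceName":"/dev/xvda","Ebs":{"SnapshotId":"snap-0123456789abcdef0"}}]'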
Once registered, launch your shiny new AMI on an instance and enjoy ZFS root filesystem goodness.
If your instance never comes up, take a look at the console logging available in the AWS console. This is the only real avenue you have to debug a failed launch, and it’s very limited. If grub fails, the log might be empty. If networking fails, the log should have some details, but the instance will not be reachable.
A very useful debugging technique for AMIs is to terminate the instance, but don’t destroy the EBS volume - instead, attach the volume to another instance and import the ZFS pool there. This will allow you to look at logs so hopefully you can figure out why the boot failed.
If the instance doesn’t come up, you can re-import the ZFS pool on the host used to stage the target and try to fix it (remember above, I suggested leaving the host and target EBS volume around so you can iterate on it). Do the bind mounts before your chroot, and don’t forget to unmount everything and export the pool before taking another snapshot.
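Roughly, that recovery loop looks like this (assuming the target volume is re-attached to the host as /dev/xvdf):

$ zpool import -f -R /mnt rpool
$ mount --rbind /dev /mnt/dev
$ mount --rbind /proc /mnt/proc
$ mount --rbind /sys /mnt/sys
$ chroot /mnt
# ... fix whatever went wrong, then exit the chroot ...
$ umount -l /mnt/dev /mnt/proc /mnt/sys
$ zpool export rpool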
Log in as the “debian” or “ubuntu” user if provisioned by the default cloud-init configuration (cloud-init installs the SSH keypair you selected at launch) - or however users are provisioned if you’ve customized cloud-init. Or log in as root if you set the root password and modified the ssh configuration to allow root login.
Did it work? If so great! If not, give it another try, paying careful attention to any errors, as well as scouring output of dkms builds, etc. This isn’t completely straightforward, and it took me a few tries to get things figured out.
Now, let’s show the power of ZFS by adding 100GB, which will be available across the entire rpool, without having to fracture filesystems, mount new storage at its own directory, or move files around to the new device.
Assuming we used a 10GB EBS volume for the AMI, our pool probably looks something like:
$ zpool list
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
rpool 9.94G 784M 9.22G - 0% 0% 1.00x ONLINE -
In the AWS console, create a new 100GB GP2 EBS volume and attach it to your running instance.
Assuming the volume is attached as /dev/xvdf, let’s extend rpool into this new volume:
$ sgdisk -Z -n1:0:0 -t1:BF01 -c1:ZFS /dev/xvdf
$ zpool add rpool /dev/xvdf1
This partitions the volume with a new GPT table, using everything for ZFS (again, I don’t like giving ZFS the raw volume, as it will waste a bit of space when it partitions the volume for Solaris compatibility). Finally, we extend rpool onto the new volume.
That’s it! Now we see:
$ zpool list -v
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
rpool 109G 784M 109G - 0% 0% 1.00x ONLINE -
xvda1 9.94G 735M 9.22G - 7% 7%
xvdf1 99.5G 48.7M 99.5G - 0% 0%
We’ve added 100GB of storage completely transparently, and unlike creating a traditional EXT or XFS volume we don’t have to mount it into a new directory - with ZFS the storage is just there, and available to all our ZFS filesystems.
Hope that helps anyone else looking to run ZFS exclusively in AWS. While not as easy as taking an off-the-shelf prebuilt AMI, you end up with an AMI that has only a minimal Debian or Ubuntu install - you know exactly what went into it, and the process for doing so.
If you run into any issues trying this, you can indirectly contact me by commenting on this blog entry, or try in ##aws on Freenode.