Self-Hosted Kubernetes - Solving the Storage Problem

Kubernetes has a reputation for being hard to self-host in non-trivial deployments.

This remains true to a certain extent, but the emergence of slimmed-down Kubernetes distributions has made managing small to medium cluster deployments much easier.

The most popular solutions are k3s, developed by Rancher (which is in turn owned by SUSE), and microk8s, developed by Canonical.

You still need an operator that is familiar with hardware, operating system maintenance and the Kubernetes stack, but you will not necessarily need a highly trained Kubernetes specialist with years of experience.

While hosting a basic Kubernetes cluster has become relatively straightforward, one critical aspect remains complicated and requires careful consideration: providing persistent disk storage to Kubernetes workloads in multi-node clusters.

In the cloud this is handled automatically and transparently for you, backed by networked storage like AWS EBS, Google Cloud Persistent Disk or Azure Disk Storage.

Disk storage is especially relevant in a self-hosted environment, where you will often run your own databases or object storage solutions that are otherwise available as managed services from cloud providers.

The Kubernetes ecosystem offers a wide variety of options with specific tradeoffs, ranging from extremely simple and bare-bones all the way to complex software and even hardware stacks that equal the power and flexibility of cloud providers.

Are you fine with pinning workloads to a specific node (server)? Do you need to (automatically) migrate storage volumes between different nodes? Is distributed access on multiple nodes required? What level of performance is required? What about backups?

Choosing the right technology can be difficult and confusing. This post will explore the problem space and evaluate a few of the most popular solutions.

Kubernetes Storage Internals - A Brief Introduction

Talking about storage requires a basic understanding of the storage primitives used by Kubernetes. This section will give a brief introduction covering the core concepts.

Note: if you are familiar with (persistent) volumes, snapshots and CSI drivers you might want to skip ahead to the next section.

The basic storage abstraction is a volume. It represents a distinct unit of storage that will be exposed to containers as a regular directory by binding it to a specific path.

Volumes are very flexible and can be backed by a wide variety of underlying mechanisms, from simple directories on the host filesystem, through object storage like S3, all the way to distributed network storage. They can be read-only or writable, restricted to a single pod or available to multiple pods concurrently, and exposed as regular file systems or as raw block devices. Different backends provide different semantics and might be limited to a subset of this functionality.

Volumes can have a fixed identity managed outside of Kubernetes, or be created on-demand and only available during the lifetime of a pod.
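As a minimal sketch of the on-demand case, the manifest below declares an emptyDir volume that exists only for the lifetime of the pod and binds it to a path inside the container. The pod name, image and paths are placeholders chosen for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: scratch-demo            # hypothetical name
spec:
  containers:
  - name: app
    image: busybox:1.36         # placeholder image
    command: ["sh", "-c", "echo hello > /scratch/hello.txt && sleep 3600"]
    volumeMounts:
    - name: scratch             # must match the volume name below
      mountPath: /scratch       # path where the volume appears in the container
  volumes:
  - name: scratch
    emptyDir: {}                # created with the pod, deleted with the pod
```

Anything written to /scratch is lost once the pod is deleted, which is exactly what distinguishes this from the persistent volumes described next.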

The next abstraction layer is the persistent volume. It allows managing the lifecycle of persistent storage via Kubernetes-native resources like the PersistentVolumeClaim.

Kubernetes monitors these claims and provisions concrete PersistentVolume instances, which are in turn backed by concrete volumes. Usually those volumes are automatically formatted with a file system rather than being exposed as raw block devices.

Each persistent volume has a storage class, which determines the backing storage system. Many different storage classes can be available inside a single cluster, allowing for storage tailored to specific workloads.
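To make this concrete, here is a sketch of a claim and a pod that consumes it. The names and the storage class `example-class` are hypothetical; substitute whatever class your cluster actually provides (`kubectl get storageclass` lists them).

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim                  # hypothetical name
spec:
  accessModes:
  - ReadWriteOnce                   # mountable read-write by a single node
  storageClassName: example-class   # assumption: replace with a real class
  resources:
    requests:
      storage: 5Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: data-consumer               # hypothetical name
spec:
  containers:
  - name: app
    image: busybox:1.36             # placeholder image
    command: ["sh", "-c", "sleep 3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: data-claim         # refers to the claim above
```

The pod only references the claim by name. Which backend ends up providing the storage is decided entirely by the storage class.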

Volume snapshots allow for creating copies of a volume. This can be very useful as a backup mechanism and for migrations. Whether a storage class supports snapshots, and how they are implemented, depends entirely on the underlying storage provider, so some of the simpler providers do not allow for snapshots.
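Requesting a snapshot looks roughly like the sketch below. It assumes that the snapshot CRDs and snapshot controller are installed in the cluster, that the storage driver supports snapshots, and that a claim named `data-claim` and a snapshot class named `example-snapclass` exist; both names are hypothetical.

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: data-claim-snap                        # hypothetical name
spec:
  volumeSnapshotClassName: example-snapclass   # assumption: provided by your driver
  source:
    persistentVolumeClaimName: data-claim      # the claim to snapshot
```

A new claim can later be created from the snapshot by referencing it in the claim's `dataSource` field.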

The final necessary step is actually making storage available to containers, which happens through the Container Storage Interface (CSI).

CSI drivers are independent executables that manage the lifecycle of a volume. They handle creation, deletion, resizing, and actually providing the volume to the operating system so that it can be mounted into containers.

A driver receives instructions from the Kubernetes runtime (usually from the kubelet). Drivers must run on each Kubernetes node, so they are often either deployed directly via Kubernetes as a DaemonSet or managed independently.

This should cover the basics. Consult the Kubernetes documentation for a more in-depth understanding.

Sidestepping the Problem: Object Storage

A lot of cloud-enabled application development has moved from using regular file system storage to using object/blob storage like AWS S3. All of the other major cloud providers offer a comparable service.

In a self-hosted environment you can take a similar approach, either by depending on a cloud service, or by self-hosting object storage.

Pairing With the Cloud

Pairing a self-hosted Kubernetes cluster with cloud object storage can be a viable solution, depending on your use case and motivations. You get the practically unlimited scalability and the reliability of cloud storage, while still hosting your own compute workloads locally.

This might be a good choice for some setups, but also comes with problems.

One potential downside is speed: accessing remote object storage will be a lot slower than accessing storage over a fast local network. This might be fine for some workloads, but problematic for others.

Another considerable issue is egress cost. Most providers put a heavy tax on network egress, “encouraging” customers to keep most compute workloads inside the cloud. Depending on your usage patterns you can rack up a lot of egress charges quickly, negating potential cost savings of self-hosting.

This particular problem can now be mitigated thanks to R2, a recently launched object storage service by Cloudflare. The distinguishing feature of R2 is that network egress is free (!), so you can transfer data without worrying about extra cost.

Self-Hosted Object Storage

The above may be fine if you are willing to depend on a remote service. But often the whole point of self-hosting is not depending on cloud providers!

Luckily there are multiple options for self-hosted object storage.

The most popular one is probably MinIO, an open-source but commercially supported product written in Go. It provides S3 API compatibility, AWS IAM policies for access management, and can scale from a simple single-node deployment to a distributed system with redundancy.

There are also other choices like SeaweedFS.

You probably already noticed the obvious chicken-and-egg problem here: self-hosting object storage requires disk storage.

While it can sometimes be a viable solution to deploy MinIO et al. independently from your Kubernetes cluster, you will probably want to manage these with the same Kubernetes tooling.

Solutions

Now we can start looking at some available solutions. There are a lot of possibilities out there, so the following is just a selection of popular choices.

They can roughly be divided into two groups: local volumes restricted to a single node, or distributed systems that provide redundancy and split/mirror data across multiple servers.

Bare Bones: Hostpath Volumes

By far the simplest solution is hostpath volumes. They just mount a directory from the host into a container, and don’t require any additional tooling.

Declaring them is as simple as:

  volumes:
  - name: test-volume
    hostPath:
      # directory location on host
      path: /data

This is simple and works everywhere, but is problematic for a lot of reasons:

  • Node specific: a hostpath will obviously be tied to a specific server, so your pods have to be pinned to a specific machine via settings like nodeSelector. They can only be migrated between nodes manually.
  • No size limits: a single volume mount/container can happily fill up the entire host partition.
  • Full host access: by default there are no limits on which host paths can be mounted, allowing containers to access or overwrite host data. Note: this can be limited via access policies.
  • No isolation: all host paths live on a shared file system, with all the concurrency bottlenecks and file system issues that implies. A single directory can also be mounted into multiple containers without any safeguards around concurrent access, which allows different pods to potentially overwrite each other's data.
  • No Kubernetes-level management: hostpath volumes are just declared inline, and do not show up as resources like PersistentVolume or PersistentVolumeClaim.
  • No snapshot support
  • Helm charts and other deployment configs from third-party services are often built with the assumption of PersistentVolume support, making them hard to use.
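The node pinning mentioned in the first point can be sketched as follows. The node name `node-1`, the pod name and the image are placeholders; `kubernetes.io/hostname` is the standard label that Kubernetes sets on every node.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hostpath-demo                  # hypothetical name
spec:
  nodeSelector:
    kubernetes.io/hostname: node-1     # assumption: the node that holds /data
  containers:
  - name: app
    image: busybox:1.36                # placeholder image
    command: ["sh", "-c", "sleep 3600"]
    volumeMounts:
    - name: test-volume
      mountPath: /data
  volumes:
  - name: test-volume
    hostPath:
      path: /data                      # directory location on host
      type: DirectoryOrCreate          # create the directory if it is missing
```

Without the nodeSelector the scheduler is free to place the pod on any node, where it would silently start with a different (probably empty) /data directory.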

They are essentially a very primitive, free-form solution that is simple to use and might work for certain scenarios, but leaves all the plumbing, safeguarding and backups up to the user.

Thus they are not really recommended for any kind of serious usage.

“Managed” Hostpath Volumes: local-path-provisioner

A big problem with the simple hostpath solution above is that it does not provide “Kubernetes-native” volumes with PersistentVolumeClaim support.

To make this possible (while still retaining a very simple system) Rancher Labs has developed local-path-provisioner. It is installed and enabled by default on k3s; microk8s offers a comparable hostpath storage add-on.

local-path-provisioner acts as a very simple dynamic volume provisioner (filling the role that a CSI driver plays for other backends). It supports creation and deletion of PersistentVolumes, and maps each created volume to a simple directory on the host.

This is a considerable step up, but volumes are still just stored in a host directory, so most of the downsides from above still apply (node specific, no size limits, lack of isolation, no migration, no snapshots, …).

This is a viable solution for very simple clusters where you bind workloads to specific nodes anyway.
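Using it only requires a regular claim. The sketch below assumes the storage class is named `local-path`, which is the name k3s uses by default; check `kubectl get storageclass` on your cluster. The claim name is hypothetical.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: local-data              # hypothetical name
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: local-path  # assumption: default class name on k3s
  resources:
    requests:
      storage: 2Gi              # recorded on the claim, but a plain host directory does not enforce it
```

Note that such a claim typically stays in the Pending state until a pod actually uses it, because the directory can only be created once the target node is known.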

I still do not recommend it for any serious usage. There are better solutions available.

OpenEBS

OpenEBS is a much more powerful solution, and the first one I would actually recommend for production deployments.

OpenEBS has two different modes, local and distributed.

OpenEBS - Local Volumes

In local mode, Kubernetes volumes still only exist on a single node, but there are multiple OpenEBS backends that can provide additional functionality.

The simplest one uses hostpaths again, and is pretty much equivalent to the local-path-provisioner introduced above. There is also a device backend that can use local block devices. These are not very interesting.

Significantly more so are the ZFS backend and the LVM backend.

Note: take care not to mix up Kubernetes volumes with LVM or ZFS volumes, which are separate concepts.

LVM (Logical Volume Manager) is an abstraction over disk devices. You can combine multiple disks into a single “virtual” disk, but also split a single disk into an arbitrary number of volumes.

Each such volume appears as a separate block device to the OS, and can be formatted with a separate file system.

LVM also supports snapshots of volumes with copy-on-write semantics, which means that a snapshot doesn’t require a full copy of the data. A block is only copied when the original data block is overwritten.

LVM is not tied to any particular file system, so you can layer ext4, Btrfs or any other file system of your choice on top.

ZFS offers similar functionality, but is also a full-fledged file system itself.

So why is this interesting in the context of Kubernetes?

OpenEBS can use LVM or ZFS to automatically create a separate volume for each Kubernetes volume. This provides:

  • No/minimal performance overhead: Compared to the distributed setups explored below, this setup comes with no or very little performance overhead, since the disks are still just mounted into containers directly. (LVM is a virtualization on top of disks, with a very small amount of block mapping overhead, and ZFS is a complete CoW file system)
  • Isolation: each volume has a specific size limit and an isolated file system (with LVM), so no chance of a single container filling up the entire host disk, and no potential performance bottlenecks due to a single FS
  • Snapshots: OpenEBS has integration with Kubernetes VolumeSnapshots, which makes backups much easier and provides integration with Kubernetes backup solutions like Velero.
  • Disk flexibility: you can always add new disks to your nodes if additional storage is required, and also remove disks if set up appropriately (mirroring, RAID, …)
  • Encryption: you can encrypt your disks, layer LVM on top and get a flexible solution while still keeping your data encrypted
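As a rough sketch of what this looks like in practice, an LVM-backed storage class might be declared as below. The provisioner name and parameters follow the OpenEBS LVM LocalPV documentation as I understand it, so verify them against the version you install. The volume group `lvmvg` is an assumption and must already exist on the nodes (for example created via `vgcreate`).

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-lvmpv                     # hypothetical name
provisioner: local.csi.openebs.io         # OpenEBS LVM LocalPV driver (verify for your version)
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer   # provision only once the pod's node is known
parameters:
  storage: "lvm"
  volgroup: "lvmvg"                       # assumption: existing LVM volume group on the node
```

Every claim that references this class then gets its own logical volume carved out of the volume group, with the requested size acting as a hard limit.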

This is not directly related to LVM or ZFS, but OpenEBS also supports migrating volumes between nodes.

So with OpenEBS + LVM/ZFS you get snapshots, isolation and migration.

OpenEBS also only has a small control plane that requires minimal additional resources.

These aspects make OpenEBS + LVM my preferred solution for many scenarios.

Of course there are still some downsides:

  • Requires familiarity with administering LVM or ZFS
  • Volumes and snapshots are still restricted to a single node, so switching workloads between nodes can require expensive migrations. (migrations do work automatically, though)
  • Off-site backups are pretty much mandatory (unless the deployed service is natively distributed)
  • LVM usage is pretty straightforward, but ZFS comes with a good amount of peculiarities.

OpenEBS - Replicated

OpenEBS also supports a distributed mode, where volumes are replicated across multiple nodes.

This requires an additional data engine layer provided by either Mayastor, Jiva or cStor. These each come with their own tradeoffs and complexities. Going into the details is out of scope for this article.

The major upside is redundancy and improved reliability. The downsides are increased complexity and increased storage requirements (because volumes are duplicated).

Another considerable downside for lightweight clusters is a noticeable performance overhead.

Deciding between local and replicated mode is only possible with concrete cluster requirements in mind.

You can learn more in the OpenEBS documentation.

Longhorn

Longhorn is a distributed block storage engine built specifically for Kubernetes.

A detailed architecture overview can be found in the Longhorn documentation.

It works by running a service for each volume, which can then distribute read and write operations between multiple replicas stored on disks on multiple nodes.

Write operations create snapshots, which are then synchronized to each replica. It can be seen as an implementation of a simple distributed file system.
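The number of replicas is configured per storage class. The following is a sketch based on the parameters documented by Longhorn; the class name is hypothetical, and the values should be checked against the Longhorn version you deploy.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-2-replicas          # hypothetical name
provisioner: driver.longhorn.io      # Longhorn CSI driver
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "2"              # copies kept on different nodes
  staleReplicaTimeout: "2880"        # minutes before a failed replica is cleaned up
```

More replicas improve redundancy, but every write has to reach each of them, so storage usage and write latency grow accordingly.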

Longhorn is not dissimilar to replicated OpenEBS, but the exact semantics depend a lot on the specific storage engine.

Similar tradeoffs apply: more overhead, more complexity.

An additional problem (in common with replicated OpenEBS) is that these Kubernetes-specific solutions are not as mature as other distributed solutions that emerged outside of Kubernetes.

Distributed File Systems: Ceph and Gluster

There are multiple “serious” network-enabled distributed file systems that are not tied to Kubernetes and run on general-purpose hardware.

The most popular ones are GlusterFS and CephFS.

Both provide easy integration with Kubernetes.

These are more mature and powerful than the Kubernetes-specific distributed options above, but are non-trivial to maintain and have a considerable amount of operational complexity and overhead.

They can be viable for larger clusters with serious data storage requirements.

Networked/Distributed Disk Arrays

The final option is for companies with bigger budgets.

There are multiple vendors that provide networked disk arrays, often also with decent Kubernetes integration.

All other presented options work on general purpose hardware, so expect a significant price difference if you go this route.

Conclusion

Persistent storage in Kubernetes clusters is probably the most complicated aspect of self-hosting.

There is a wide array of options, ranging from basic but limited, all the way to expensive custom hardware.

If you are looking for a quick recommendation: use OpenEBS + LVM when you don’t need distributed disks. Otherwise use Ceph or GlusterFS (or a hardware solution if you can afford it).

I hope this introduction has given you an idea of the available options. It is by no means comprehensive; the Kubernetes ecosystem is vast and sprawling.

If you think I’ve missed any important solutions, let me know!