They say bad things always come in threes and get worse as you go along. If the first one is an eye-watering invoice from your cloud provider, and the second is finding out your system isn't as resilient as you thought, then things aren't looking good. Sadly, for the growing number of organizations using Kubernetes as the key platform for running containerized workloads in the cloud, this is an increasingly common start to the day.
It wasn’t until 2016 that the concept of state was reliably implemented in Kubernetes. Before this, Kube was largely intended as a container orchestration platform for running, scaling and optimizing stateless compute workloads. Today, it is perfectly possible to run stateful applications within Kubernetes. A recent survey by the Data on Kubernetes Community (DoKC) suggests that 70 percent of organizations using Kubernetes are now happily running data-centric, stateful applications.
However, many organizations are still clinging to the memory of where Kubernetes and containers started: they remain fearful of storing data within an apparently ephemeral infrastructure. As a result, they increasingly rely on networked block storage, hosted file systems and database-as-a-service (DBaaS) offerings from their cloud provider, assuming these will keep their data safe.
Sitting outside of Kubernetes, these solutions do offer highly reliable storage, but they also come with significant drawbacks. Firstly, the fact that they sit outside of Kubernetes: they are not ‘Kube-Native’, not built to work with Kubernetes’ architecture, and this impacts both application resilience and performance. The biggest drawback of cloud provider storage services, however, is their impact on operating costs, potentially increasing storage costs by as much as 200x; we will explain this with a trip to the AWS pricing calculator at the end of the article.
Looking at resilience: within Kubernetes’ architecture the scheduler plays a core role in placing and sharing workloads between nodes. This can be done to co-locate workloads more efficiently and reduce compute costs. More importantly, the scheduler’s ability to reassign workloads is an essential part of Kubernetes’ legendary resilience: a node fails, and the scheduler seamlessly redistributes its workloads across the remaining nodes in the cluster.
There is a big ‘however’, though: if you are running a stateful workload on Kubernetes and storing data in an external hosted service from your cloud provider, these services don’t always have a great relationship with the scheduler. When a cloud instance disappears, the Kubernetes scheduler has to ask the underlying cloud for a replacement; once that instance is created, the scheduler then needs to reschedule workloads and ask for any storage to be re-mounted to the new node. Creating new instances, installing the kube components and joining the cluster all take time, and that's before you get to re-mounting storage. These steps add up, opening up the very real potential for up to fifteen minutes of application downtime. The same issue even delays simple rescheduling of workloads onto existing nodes, as unmounting and re-mounting cloud storage creates bottlenecks.
Moreover, if your cloud provider suffers an availability zone (AZ) failure, you may discover that in many clouds networked storage services are locked to a specific AZ. This means that even where you have Kubernetes nodes spread across multiple AZs for high availability, stateful applications will grind to a halt indefinitely: Kubernetes will reschedule the application, but its storage remains locked to the failed AZ.
Then there is application performance. If you are using networked storage of any kind, your I/O is restricted by network speeds and individual disk bandwidth. Even if you pay your cloud provider to crank this up (and, oh, how you pay), there is an inherent limit that doesn’t sit well with most business-critical systems. Stack multiple latency-sensitive applications on the same node and the problem gets worse.
But, as you may have suspected, there is another way. By using a Kube-Native data layer, organizations can eliminate these issues. The best of these solutions are open, allowing users to leverage any storage they choose on the back end. If organizations are wedded to the idea of using networked storage services, the data layer ensures that these work in unison with the Kubernetes scheduler. No more application downtime as rescheduled workloads move around. The scheduler is free to shift, stack and restack workloads to reduce compute costs, and failover becomes near-instantaneous.
Looking at performance, an effective data layer can also parallelize the I/O of connected storage: users can bond networked storage connections to deliver compound I/O performance to a single host. Organizations can ‘play the pricing’, utilizing the same cloud provider services more effectively, reducing costs and getting a more resilient service into the bargain.
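The bonding idea above is simple enough to sketch with back-of-the-envelope arithmetic. The per-volume throughput figure below is illustrative, not a published cloud provider limit:

```python
# Sketch: ideal aggregate throughput when I/O is striped evenly across
# several networked volumes attached to one host. The per-volume cap
# here is an assumed, illustrative number.

PER_VOLUME_MBPS = 250.0  # assumed throughput cap of a single networked volume

def aggregate_throughput(num_volumes: int,
                         per_volume_mbps: float = PER_VOLUME_MBPS) -> float:
    """Best-case aggregate bandwidth when reads/writes are spread evenly."""
    return num_volumes * per_volume_mbps

for n in (1, 2, 4, 8):
    print(f"{n} volume(s): up to {aggregate_throughput(n):.0f} MB/s to a single host")
```

In practice, instance-level network and EBS-bandwidth caps will bound the total before the per-volume limits do, but the principle is the same: the data layer, not a single volume, sets the ceiling.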
And so to those eye-watering invoices we started with. The gem we have saved until last is that networked block storage, hosted file systems and DBaaS (generally in that order) rise in price from simply eye-watering right up to soul-crushing. And you don’t even need them. You can store data resiliently and securely within Kube, using local disk space on the compute nodes.
At first glance, this may seem like mild insanity. Compute nodes are ephemeral in the cloud. They can, and do, disappear in a flash. Not great for storing business-critical data, you might think. But Kube-Native data layers use proven methods from the storage industry (similar to RAID) to stripe data across multiple nodes. These methods, such as synchronous replication and erasure coding, have been tried and tested protecting users against disk failure in both traditional data center SAN solutions and more modern Software-Defined Storage (SDS) solutions.
As a brief aside, it is worth noting that not all of the methods used to make data highly available are ideally suited to a cloud native Kubernetes environment. While erasure coding is generally more efficient in its use of disk space, it also places high demands on both bandwidth and compute cycles whenever it has to rebalance after a node failure. More straightforward 1-to-X synchronous replication is preferable where application performance, recovery times and limiting the demands on the bandwidth and compute cycles of Kubernetes nodes are key considerations. But whatever the merits of the specific technologies used within a Kube-Native data platform, the ability to leverage local disk on your compute nodes to store application data is a true game-changer in terms of cost.
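The trade-off between the two methods can be made concrete with a little arithmetic. The sketch below compares the raw-capacity overhead of 3-way synchronous replication with a common 4+2 erasure-coding scheme (generic parameters, not tied to any particular product):

```python
def replication_overhead(copies: int) -> float:
    """Raw bytes stored per byte of usable data with N-way replication."""
    return float(copies)

def erasure_coding_overhead(data_shards: int, parity_shards: int) -> float:
    """Raw bytes stored per byte of usable data with k+m erasure coding."""
    return (data_shards + parity_shards) / data_shards

# 3-way replication: 3.0x raw capacity, but a rebuild reads one full
# copy from a single surviving replica.
print(replication_overhead(3))        # 3.0

# 4+2 erasure coding: only 1.5x raw capacity, but a rebuild must read
# from 4 surviving shards, consuming bandwidth and CPU on 4 nodes at once.
print(erasure_coding_overhead(4, 2))  # 1.5
```

Erasure coding halves the capacity bill here, but spreads every rebuild across four nodes: exactly the bandwidth-and-compute pressure the paragraph above describes.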
Firstly, organizations that have been using networked storage, DBaaS and other services are likely to have a swathe of underutilized disk space already available on existing compute nodes. So, effectively, free storage. Cloud provider pricing models often mandate larger amounts of storage as part of the package where users want higher-performing nodes. This is especially true of file system services such as AWS EFS. The result is that this existing free storage pool can often be quite substantial.
Even ignoring this existing underutilized local storage, we come to the comparative cost against cloud provider storage services. Here the easiest option is a trip to the AWS pricing calculator (we are not picking on AWS here; other cloud providers will show a very similar pattern):
100GB of the highest-performing EBS networked storage will cost $9,663.70 per month, and that is for a relatively poorly performing storage medium. By comparison, an m6gd.large instance comes with a comparatively performant 118GB NVMe SSD for $41.61 per month, with the added advantage of being a compute node. Yes, you read that correctly: EBS is more than 200x the price per GB compared to higher-performing local storage (with a free compute node thrown in).
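The per-GB arithmetic behind that claim is easy to check for yourself, using the two monthly figures quoted above:

```python
# Figures from the AWS pricing calculator example above (monthly, USD).
ebs_cost, ebs_gb = 9_663.70, 100    # highest-performing EBS volume
m6gd_cost, m6gd_gb = 41.61, 118     # m6gd.large local NVMe (instance included)

ebs_per_gb = ebs_cost / ebs_gb      # roughly $96.64 per GB
local_per_gb = m6gd_cost / m6gd_gb  # roughly $0.35 per GB

print(f"EBS:   ${ebs_per_gb:.2f}/GB per month")
print(f"Local: ${local_per_gb:.2f}/GB per month")
print(f"Ratio: {ebs_per_gb / local_per_gb:.0f}x")  # comfortably above 200x
```

On these figures the ratio actually lands well above 200x, and the local-NVMe price includes an entire compute node.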
So what does the pricing calculator tell us if we are looking at optimizing performance rather than cost alone?
For the most demanding I/O workloads, i3 instances offer 8x 1,900GB NVMe disks. These are an order of magnitude faster than the best-performing EBS volume, and cost $2,501.71 per month (1-year commitment, no-upfront RI). Three instances would offer a total of 45TB of high-speed storage across 24 NVMe drives, plus 1.5TB of RAM, for less than the cost of 100GB of standard EBS storage. And this time you have 192 virtual CPUs thrown in.
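Again, the totals are worth checking with simple arithmetic. The sketch below uses the figures quoted above; the per-instance spec (8x 1,900GB NVMe) matches what AWS lists for the largest i3 size:

```python
# Totals for three i3 instances, each with 8 x 1,900GB NVMe disks
# (figures as quoted above; monthly price is the 1-yr no-upfront RI).
instances = 3
disks_per_instance, disk_gb = 8, 1_900
monthly_cost_per_instance = 2_501.71

total_disks = instances * disks_per_instance        # 24 NVMe drives
total_tb = total_disks * disk_gb / 1_000            # 45.6 TB of local flash
total_cost = instances * monthly_cost_per_instance  # about $7,505 / month

print(f"{total_disks} drives, {total_tb:.1f} TB, ${total_cost:,.2f}/month")
print(f"vs $9,663.70/month for 100GB of top-tier EBS")
```

Three whole instances of high-speed local flash come in around $2,150 a month cheaper than a single 100GB volume of the top-tier EBS from the earlier example.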
At this point we should perhaps pause to let these figures sink in. But there is more. As we commented above, networked storage services can be eye-wateringly expensive, but managed file systems and DBaaS services take costs to a new level. The best Kube-Native data platforms can now be integrated with Kubernetes Operators for all the most popular open source databases and other stateful frameworks.
Integration with the Operator Framework makes things simple for Kubernetes admins and platform engineers, and lets them provide developers with simple self-service provisioning for Postgres, MySQL/MariaDB, MongoDB, Redis, Cassandra: in short, all the favorites. These solutions can be built safely, resiliently, securely and with higher performance than hosted alternatives, using local storage on the node. In effect, some of the most expensive cloud provider services can be replaced, at a fraction of the cost, by more cost-effective offerings from the same cloud provider.
Running stateful applications in the public cloud no longer ties organizations to their cloud providers’ data services. By installing a couple of new containers on a node, Kube-Native data layers give users new levels of control over the services they consume. They can enable new architectural possibilities for improved resilience, performance and security. But perhaps most importantly they allow organizations to leverage local disk/SSD storage at a tiny fraction of the cost of hosted storage and database solutions.