Kubernetes is an ideal platform for stateless workloads, but the growth of stateful workloads presents a challenge for disaster recovery. Here’s a new way to tackle it.
By Alex Tesch, Senior Consultant, Cloud Native Computing Practice, HPE Advisory & Professional Services
With the advent of Kubernetes and its growing adoption in the enterprise, countless operations teams are left in limbo when it comes to established practices for backup, restore and disaster recovery (DR) on container platforms.
Kubernetes is an ideal platform for cloud-native applications that are stateless in nature and can easily be redeployed, upon failure, on other machines in the same cluster. Its adoption has grown further with the onboarding of stateful applications, whose ever-changing data needs to survive the death and rebirth of the pod hosting the data service.
These stateful applications rely on a persistent storage backend that keeps the data intact after a pod failure and presents the same data to the newly scheduled pod, providing service continuity. This happens in the background through two Kubernetes API resources: Persistent Volumes (PVs), which represent a logical volume or storage LUN, and Persistent Volume Claims (PVCs), which are requests for storage that a pod references in order to consume the storage provided by the persistent volumes.
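The relationship between a claim and the pod that consumes it can be sketched as follows. This is a minimal illustration that builds the two manifests as Python dictionaries and prints them as JSON, which `kubectl apply -f` accepts alongside YAML. The names `db-data`, `db`, `example-storage-class`, the image and the mount path are all hypothetical placeholders, not values from any real deployment.

```python
import json


def make_pvc(name: str, storage_class: str, size: str) -> dict:
    """Build a PersistentVolumeClaim requesting `size` from a storage class."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


def make_pod(name: str, image: str, claim_name: str, mount_path: str) -> dict:
    """Build a Pod that mounts the volume bound to the named claim."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "volumeMounts": [{"name": "data", "mountPath": mount_path}],
                }
            ],
            # The pod does not name a volume directly; it refers to the claim.
            "volumes": [
                {"name": "data", "persistentVolumeClaim": {"claimName": claim_name}}
            ],
        },
    }


# Hypothetical example values, for illustration only.
pvc = make_pvc("db-data", "example-storage-class", "10Gi")
pod = make_pod("db", "registry.example.com/db:1.0", "db-data", "/var/lib/data")

# Wrap both objects in a List so they can be applied as one document.
print(json.dumps({"apiVersion": "v1", "kind": "List", "items": [pvc, pod]}, indent=2))
```

If the pod is deleted and a replacement is scheduled with the same `claimName`, the replacement mounts the same underlying volume, which is the service continuity described above.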
Those APIs need to be storage-agnostic, as many vendors compete to provide drivers that allow their arrays to be consumed by Kubernetes. These drivers implement the Container Storage Interface (CSI), and a driver is exposed to the cluster through a storage class that names it as the provisioner, so that volumes can be dynamically requested by stateful workloads at scheduling time. HPE has its own CSI driver, which is capable of integrating HPE 3PAR, HPE Primera and HPE Nimble storage arrays with Kubernetes; the driver is open source and can be found at https://github.com/hpe-storage/csi-driver.
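As a rough sketch of how a storage class ties a CSI driver to dynamic provisioning, the snippet below builds a StorageClass manifest in Python and prints it as JSON. The class name `example-storage-class` is hypothetical, and the provisioner string `csi.hpe.com` is an assumption that should be checked against the driver's documentation. Driver-specific parameters, such as the secret references a real array backend needs, are deliberately left out.

```python
import json
from typing import Dict, Optional


def make_storage_class(
    name: str,
    provisioner: str,
    parameters: Optional[Dict[str, str]] = None,
    reclaim_policy: str = "Delete",
) -> dict:
    """Build a StorageClass that delegates provisioning to a CSI driver."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        # The provisioner field is what links the class to a CSI driver.
        "provisioner": provisioner,
        # Parameters are opaque strings passed through to the driver.
        "parameters": dict(parameters or {}),
        "reclaimPolicy": reclaim_policy,
    }


# Assumed provisioner name; verify it against the installed driver.
storage_class = make_storage_class("example-storage-class", "csi.hpe.com")
print(json.dumps(storage_class, indent=2))
```

A PVC that sets `storageClassName` to this class name would then have its volume created on demand by the driver, without an administrator pre-provisioning a PV.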
Now that we understand that Kubernetes is capable of running stateful workloads, the next challenge is how to back up those workloads in case of a total cluster loss. This was never a challenge with stateless applications, because the lack of changing data makes it possible to simply reschedule the applications in a new cluster and continue providing the service. With stateful applications, however, the data must be backed up in case of a disaster.
With the release of Kubernetes 1.17, three new API resources named VolumeSnapshot, VolumeSnapshotContent and VolumeSnapshotClass were provided as a Beta feature. These APIs are provided as custom resource definitions (CRDs) and are not part of the core Kubernetes APIs; the snapshot operations themselves are carried out by the CSI drivers. By using those APIs, Kubernetes is capable of initiating snapshots of existing PVs. The snapshots reside in the same storage backend as the data being snapshotted, and each snapshot represents a point-in-time copy of the data at the moment it was triggered.
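To make the three resources more concrete, here is a minimal sketch that builds a VolumeSnapshotClass, a VolumeSnapshot of an existing PVC, and a new PVC restored from that snapshot, again as Python dictionaries printed as JSON. All object names are hypothetical, the driver string `csi.hpe.com` is an assumption, and driver-specific snapshot class parameters are omitted. The API version defaults to `snapshot.storage.k8s.io/v1`; clusters still running the Beta API from the 1.17 release would use `v1beta1` instead. VolumeSnapshotContent is not shown because it is normally created by the snapshot machinery rather than by hand.

```python
import json

SNAPSHOT_API_GROUP = "snapshot.storage.k8s.io"
DEFAULT_API_VERSION = f"{SNAPSHOT_API_GROUP}/v1"  # use ".../v1beta1" on the Beta API


def make_snapshot_class(name, driver, api_version=DEFAULT_API_VERSION,
                        deletion_policy="Delete"):
    """Build a VolumeSnapshotClass bound to a CSI driver."""
    return {
        "apiVersion": api_version,
        "kind": "VolumeSnapshotClass",
        "metadata": {"name": name},
        "driver": driver,
        "deletionPolicy": deletion_policy,
    }


def make_snapshot(name, pvc_name, snapshot_class, api_version=DEFAULT_API_VERSION):
    """Build a VolumeSnapshot requesting a point-in-time copy of a PVC."""
    return {
        "apiVersion": api_version,
        "kind": "VolumeSnapshot",
        "metadata": {"name": name},
        "spec": {
            "volumeSnapshotClassName": snapshot_class,
            "source": {"persistentVolumeClaimName": pvc_name},
        },
    }


def make_restore_pvc(name, snapshot_name, storage_class, size):
    """Build a PVC whose initial contents come from an existing snapshot."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "storageClassName": storage_class,
            "dataSource": {
                "name": snapshot_name,
                "kind": "VolumeSnapshot",
                "apiGroup": SNAPSHOT_API_GROUP,
            },
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        },
    }


# Hypothetical names; the driver string is an assumption.
snap_class = make_snapshot_class("example-snapclass", "csi.hpe.com")
snapshot = make_snapshot("db-data-snap-1", "db-data", "example-snapclass")
restored = make_restore_pvc("db-data-restored", "db-data-snap-1",
                            "example-storage-class", "10Gi")

for manifest in (snap_class, snapshot, restored):
    print(json.dumps(manifest, indent=2))
```

Note that a PVC restored this way still lives in the same cluster and on the same storage backend as the snapshot, which is why the snapshot alone does not protect against a total cluster or site loss.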
A full-fledged disaster recovery solution must be able to store those snapshots in a backup target external to the Kubernetes cluster, so that they can be restored in a separate cluster in case of a total loss. The platform should be able to schedule snapshot backups at specific times (known as backup windows), and it should be able to export those snapshots to S3-compatible storage such as HPE Ezmeral Data Fabric or Scality object storage.
HPE has partnered with Kasten, the company behind K10, a tool dedicated to performing backup and restore of stateful workloads running on Kubernetes clusters. (Read more about the HPE-Kasten partnership.) Kasten K10 integrates with major CSI drivers, including HPE CSI, to trigger snapshots of persistent volumes by leveraging the APIs available since Kubernetes 1.17. It also integrates with S3-compatible storage and exports the snapshots it takes, so that they can be restored later in a separate cluster in case of disaster.
Once we have a Kubernetes cluster available in the DR location, we can deploy Kasten K10 there and point it to the same S3 storage backend where the backups are stored. From Kasten K10 in the remote location, we can configure restore policies that are triggered automatically at preferred times, so that the application is available in the remote location and ready to provide services with the latest data from the snapshot export taken on the original production Kubernetes cluster.
Once we have this infrastructure in place, standard Recovery Point Objective (RPO) and Recovery Time Objective (RTO) SLAs can be achieved by tuning the backup/restore windows according to the amount of data in the Kubernetes stateful workload that needs to be protected. Periodic disaster recovery drills are highly recommended to certify that mission-critical data will be available in case of a site loss.
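The tuning exercise can be illustrated with some back-of-the-envelope arithmetic. The sketch below is a simplified model, not a Kasten K10 feature: it assumes exports run at a fixed interval, that a snapshot only becomes usable for DR once its export has finished, and that restore time is a fixed overhead plus data size divided by a steady throughput. Every figure in the example is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProtectionPlan:
    backup_interval_h: float            # time between scheduled snapshot exports
    export_duration_h: float            # time to export one snapshot off-cluster
    data_size_gb: float                 # amount of data to restore
    restore_throughput_gb_per_h: float  # assumed steady restore rate
    restore_overhead_h: float           # fixed cost: deploy app, bind volumes, checks


def worst_case_rpo_h(plan: ProtectionPlan) -> float:
    """Worst case data loss window in hours.

    If the site fails just before an export completes, the newest usable
    copy is the previous snapshot, taken one interval plus one export
    duration earlier.
    """
    return plan.backup_interval_h + plan.export_duration_h


def estimated_rto_h(plan: ProtectionPlan) -> float:
    """Estimated time in hours to bring the workload back in the DR cluster."""
    transfer_h = plan.data_size_gb / plan.restore_throughput_gb_per_h
    return plan.restore_overhead_h + transfer_h


def meets_sla(plan: ProtectionPlan, rpo_target_h: float, rto_target_h: float) -> bool:
    """Check both estimates against the agreed targets."""
    return (worst_case_rpo_h(plan) <= rpo_target_h
            and estimated_rto_h(plan) <= rto_target_h)


# Hypothetical figures for illustration only.
plan = ProtectionPlan(
    backup_interval_h=4.0,
    export_duration_h=0.5,
    data_size_gb=500.0,
    restore_throughput_gb_per_h=250.0,
    restore_overhead_h=0.5,
)

print("worst case RPO (h):", worst_case_rpo_h(plan))
print("estimated RTO (h):", estimated_rto_h(plan))
print("meets 6h RPO / 4h RTO:", meets_sla(plan, rpo_target_h=6.0, rto_target_h=4.0))
```

With these inputs the model gives a worst-case RPO of 4.5 hours and an estimated RTO of 2.5 hours. Measured values from a real drill should replace the assumed throughput and overhead, since they depend on the storage backend and network between sites.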
HPE Pointnext Services can help you get the most out of your Kubernetes disaster recovery strategy. We understand that once cloud native workloads reach production maturity, care must be taken to achieve business continuity when live data is hosted in Kubernetes. HPE Advisory and Professional Container Adoption Services can help your team to design the right disaster recovery infrastructure that will allow you to achieve the most aggressive RTO and RPO goals.
To learn more, see our HPE Container Adoption Service solution brief.
Learn more about technology consulting services from HPE.
Alex Tesch has been working with open source enterprise technologies for most of his 21-year IT career at companies including Red Hat, IBM, Sun Microsystems, and now Hewlett Packard Enterprise. Alex is currently an APJ lead consultant in the Hybrid IT Center of Excellence at HPE, where he designs and evangelizes cloud-native solutions that help customers modernize their infrastructure and adopt new best practices to leverage next-generation IT.
Hewlett Packard Enterprise