August 18, 2022

VMblog 2022 Mega Series Q&A: StormForge CTO Discusses Kubernetes Management and Optimization

Welcome to the VMblog 2022 Mega Series, where we'll be covering a number of important topics throughout the coming months. In this series, you'll be hearing from industry leaders and experts to help you make important decisions within your own organization. Follow along to better understand a range of topics and find out more about some of the best technologies available in the industry.

In today's Q&A, we're speaking with industry expert Patrick Bergstrom, CTO at StormForge, and diving into the topics of Kubernetes and DevOps.

VMblog:  Tell us a little about your company. Where does it fit within the Kubernetes space, including containers as a technology and DevOps as an IT discipline? What are your key focus areas for 2022 and beyond? 

Patrick Bergstrom:  At its core, StormForge is an organization leveraging Machine Learning to solve a challenge we've faced in our own history - automating the configuration and provisioning of Kubernetes to ensure availability and reliability. Our goal for the past 7 years has been to build products that drive operational improvements through better software performance and more effective resource utilization, at scale, while working to reduce cloud waste.

Our team has direct experience with the complexity of Kubernetes applications in enterprise environments and the impact of that complexity on manageability, operating costs, and carbon emissions. Much of that complexity and inefficiency stems from the inherently difficult nature of Kubernetes infrastructure resource management. We've found that most organizations don't really address the problem for two reasons. First, it's incredibly hard to solve at scale. You can throw humans at the challenge, asking them to check the clusters and change a number of variables in the deployment configuration, but a lot of those variables often have an unexpected impact. Second, even if you do have engineers you can dedicate to the issue, we've found they tend to favor configurations for increased availability and reliability, which results in overprovisioning CPU, memory, and other resources as a quick and easy way to avoid performance or scalability issues. Ultimately, this often results in a significant amount of wasted resources and unnecessarily high cloud bills, contributing to the global problem of cloud waste.

We're certainly not unique in our mission to reduce your cloud bill these days. We are, however, differentiated in our approach. StormForge is the only solution that leverages machine learning to drive not just intelligence, but actual automation and intelligent actions to solve these complex challenges for you. We don't tell you where to look, we tell you what we found and the configuration we recommend to optimize your infrastructure usage and reduce energy consumption, in a smart way, without sacrificing reliability. We follow an experimentation-based approach to optimization in pre-production environments and an observation-based approach in production environments. Both approaches leverage AI and ML, enabling automated actions that integrate with your CI/CD pipeline to drive efficiencies across your entire cluster. We deliver practical products that empower engineers to solve their most complex and stubborn operational problems in user-friendly ways, at any scale.

VMblog:  We are here to talk about Kubernetes management for DevOps and production workloads. How does your company define and approach Kubernetes management?

Bergstrom:  I think it's helpful to think of Kubernetes management in terms of a hierarchy of capabilities. At its core, Kubernetes looks to provide an environmentally agnostic, highly scalable, highly available platform to host your application workloads. It supports configuration management, deployment, and scaling of those workloads in an automated fashion. However, most people who have spent some time with Kubernetes quickly realize that there's a lot of work that still needs to be done if you want to make it all run effectively on a self-managed cluster. That's where Kubernetes management solutions come in, whether they're cloud-based services from Amazon, Microsoft, and Google or software-based management platforms for on-prem deployments like Tanzu, Rancher, OpenShift, and others. They give you the ability to manage your environments without having to get into the weeds of maintaining your platform. No matter which route you take, you're going to get some helpful tools that simplify installation, integration, monitoring, security, and more.

But what happens when you start scaling up your cluster, with hundreds of unique workloads operating in one or many different environments? Discovering, tracking, evaluating, and troubleshooting problems becomes a real challenge. Organizations often address this complexity through observability platforms - for example, Prometheus, Datadog, Dynatrace, New Relic, or Honeycomb. These tools do an excellent job of monitoring metrics, events, and application traces across the entire infrastructure and software stack, and of highlighting problem areas that need attention to ensure your application is running smoothly.

As you continue to grow, configuration management becomes increasingly problematic, very quickly, for organizations. Because Kubernetes resource management is so complex, most organizations simply don't consider trying to maintain efficiency and optimize performance for their individually containerized apps or workloads. They're more concerned with ensuring every workload is reliable by providing more resources than are required to run it. It's not uncommon for a team to deploy a standard resource configuration uniformly across every workload in the cluster. We've found this is often 60% or more above what a particular workload actually needs. Very quickly this results in a significant waste of cloud resources and increased spend on hosting, or a waste of physical resources in a private data center.
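To put a number like that in context, here's a minimal sketch of how one might measure the gap between what each container requests and what it's actually using, with the Kubernetes Python client and the metrics.k8s.io API. It assumes a local kubeconfig and a running metrics-server, and it's purely illustrative - it isn't StormForge's product or algorithm.

```python
# Illustrative only: compare each container's CPU request against its live
# usage, assuming kubeconfig access and metrics-server (metrics.k8s.io).
from kubernetes import client, config

def cpu_to_millicores(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity ("500m", "2", "250000000n") to millicores."""
    if quantity.endswith("n"):
        return float(quantity[:-1]) / 1_000_000
    if quantity.endswith("u"):
        return float(quantity[:-1]) / 1_000
    if quantity.endswith("m"):
        return float(quantity[:-1])
    return float(quantity) * 1000

def main():
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    core = client.CoreV1Api()
    metrics = client.CustomObjectsApi()

    # CPU requested per (namespace, pod, container), from the pod specs.
    requested = {}
    for pod in core.list_pod_for_all_namespaces().items:
        for c in pod.spec.containers:
            req = (c.resources.requests or {}).get("cpu")
            if req:
                requested[(pod.metadata.namespace, pod.metadata.name, c.name)] = cpu_to_millicores(req)

    # CPU actually in use right now, from the metrics API.
    usage = metrics.list_cluster_custom_object("metrics.k8s.io", "v1beta1", "pods")
    for item in usage["items"]:
        ns, pod = item["metadata"]["namespace"], item["metadata"]["name"]
        for c in item["containers"]:
            key = (ns, pod, c["name"])
            if key in requested:
                used = cpu_to_millicores(c["usage"]["cpu"])
                idle = 100 * (1 - used / requested[key]) if requested[key] else 0
                print(f"{ns}/{pod}/{c['name']}: requested {requested[key]:.0f}m, "
                      f"using {used:.0f}m ({idle:.0f}% idle)")

if __name__ == "__main__":
    main()
```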

To truly maximize your efficiency, there are many interrelated configuration variables that need to be tuned on each individual workload, at every deployment. Today, too much of that tuning is left to ineffective trial-and-error processes that are slow and cumbersome for software engineers to execute manually, and, for obvious reasons, impossible to scale across massive clusters. Rather than asking engineers to spend cycles properly tuning individual applications with Kubernetes configuration variables they're not experts in, most teams take the easier path and continue to throw money at the problem, over-provisioning resources as a best attempt to meet their SLAs and SLOs.

This is the exact scenario our ML-based optimization was designed for. We're able to take that valuable observability data from each individual workload and transform it into intelligent action, in an automated fashion. We give organizations the ability to take specific action and optimize their resources and find the right balance that meets their performance needs, at scale. We don't just point out workloads that are consuming too much, or find the waste. We tell you exactly how to handle it, and we can even handle it for you.

VMblog:  What are the top challenges with DevOps implementations - particularly for container-based apps at scale? Which of these challenges is your company's technology addressing?

Bergstrom:  A DevOps Digest article I read recently highlights that 95% of organizations using cloud-native technology have run into challenges in one form or another. Many of them point to resource management challenges in Kubernetes environments, and those are the ones we're actually focused on solving for users.

  • Teams don't always have sufficient information to take intelligent action around resource optimization. Developers need to translate what they're observing into actionable insights. Too often engineers are expected to successfully configure a deployment with a focus that's exclusively on reliability. Without the right information, they have no choice but to over-provision in order to meet their SLA goals.
  • Managing disparate container configurations effectively is hard. Container configuration plays a huge role in determining how a Kubernetes workload is going to perform. It requires looking at each individual application, then adjusting the container-level parameters Kubernetes exposes, such as CPU and memory requests and limits - a complex task and one of the biggest hurdles an engineer can face (see the sketch after this list). Plus, your existing container implementations also need to be taken into consideration when assigning values to a new container - and that process can become quite lengthy and complicated. Even a slight error can affect all containers and result in a non-functional application. Add in the need to perform this function across thousands of containers and applications, and it quickly grows to the point that you can't solve it through people alone. As I mentioned, we've found many organizations solve this by simply deploying a standard configuration for their containers, which is wildly overprovisioned for nearly all of their workloads.
  • Beyond the technical decisions, there are also business challenges. When engineers don't have the information to make resource allocation decisions based on business value, they err on the side of over-provisioning applications. We know this results in wasted resources and excessive cloud costs. To improve ROI, engineers need help making optimal business trade-offs between performance, cost, and quality - at scale. To be fair, most engineers aren't really focused on cost-related issues, so IT leaders need to bring that awareness to them and other stakeholders.
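For illustration, here's what adjusting those container-level parameters can look like with the Kubernetes Python client. The deployment name, namespace, and resource values are hypothetical; choosing the right values for each workload is exactly the hard part described above.

```python
# Illustrative only: patch a hypothetical deployment so its container requests
# roughly what it needs, with limits as a safety ceiling rather than a blanket
# overprovisioned default applied to every workload.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [{
                    "name": "web",  # hypothetical container name
                    "resources": {
                        "requests": {"cpu": "300m", "memory": "384Mi"},
                        "limits":   {"cpu": "600m", "memory": "768Mi"},
                    },
                }]
            }
        }
    }
}

# "checkout" and "shop" are hypothetical deployment and namespace names.
apps.patch_namespaced_deployment(name="checkout", namespace="shop", body=patch)
```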

VMblog:  Why has Kubernetes become the standard for cloud-native infrastructure - and how is it transformative? 

Bergstrom:  We talked about how a Kubernetes platform is complex and requires a great deal of knowledge and experience to master. The primary reason is that one of the objectives of Kubernetes is to make deploying the workloads that run on the platform as simple as possible, which means a lot of activities that were previously managed by applications are now handled by the Kubernetes framework. This includes security, redundancy, logging, networking, error handling, scaling, and more. Now apps are lightweight and containerized, which greatly reduces and simplifies deploying application code. This is transformative because apps can be created, deployed, and modified easily for faster development and release cycles - especially in the cloud, where it's easy to spin up dev resources. Products can be introduced more quickly, and innovation can move faster than before.

VMblog:  Are there certain management capabilities that aren't covered by native Kubernetes features?  What are they?

Bergstrom:  Native Kubernetes is certainly a great platform, especially if you have uniformly sized workloads that you want to rebuild or deploy at a tremendous pace. It allows you to pre-configure your workloads to swiftly scale vertically or horizontally, and to easily manage the underlying hardware running each node to ensure high availability. By design, however, as you scale up and out it expects you to leverage various automated tools to expand the platform, allowing you to deploy your workloads quickly in an automated fashion, with the proper configuration to ensure availability and reliability. That can be a daunting task for an engineer who would rather spend cycles thinking about the next feature they're required to build, not about how that feature might change the resource requirements for their application. This is a big reason why "DevOps" is now considered a title in many organizations, and more than just a culture or mindset. These are the engineers who build the tools and capabilities that enable application teams to focus on writing code. They're typically agnostic to what the workload contains, and specialize in enabling engineers to get their code to production as quickly as possible.
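As a concrete (and hypothetical) example of that pre-configured scaling, here's a minimal sketch that attaches a CPU-based HorizontalPodAutoscaler to a deployment using the Kubernetes Python client; the names and thresholds are assumptions, not recommendations.

```python
# Illustrative only: create a HorizontalPodAutoscaler for a hypothetical
# deployment. Note that the target utilization is measured against the CPU
# requests, which is one reason getting requests right matters so much.
from kubernetes import client, config

config.load_kube_config()
autoscaling = client.AutoscalingV1Api()

hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="checkout-hpa", namespace="shop"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="checkout"),
        min_replicas=2,
        max_replicas=10,
        target_cpu_utilization_percentage=70,  # scale out when average CPU exceeds 70% of requests
    ),
)

autoscaling.create_namespaced_horizontal_pod_autoscaler(namespace="shop", body=hpa)
```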

As with anything in today's environments, the underlying capability tends to be pretty straightforward, but it certainly introduces added complexity as you scale your environments up. By layering on the right enablement tools - whether it's CI/CD with ArgoCD, monitoring with Prometheus, or cluster-wide resource optimization with StormForge - there are plenty of ways to help organizations scale in a smart and cost-effective manner.

VMblog:  What are some of the trends around Kubernetes automation and ML-based resource optimization that people should be aware of?

Bergstrom:  The biggest trend we talk about right now is the overuse of the term "machine learning," especially as it relates to resource optimization for containers. Many solutions out there promising "ML-based resource optimization" in reality use fairly basic statistical analysis that is a far cry from true ML models, and they're not telling you how to optimize your cluster, let alone automating the entire process for you from start to finish. They're simply pointing out where you might be spending most of your resources, or which namespaces are costing you the most. They don't tell you what impact changing those resources will have on your application's performance. You still have to make the critical decisions about what you're going to do with that information.

StormForge is the only organization today that has custom-built an ML algorithm to analyze a workload's history, learn its patterns, and recommend specific configuration actions - both within Kubernetes and within the application itself - that improve both resource efficiency and application performance. Our recommendations can be implemented automatically through our platform as part of a CI/CD pipeline, or with a single-click approval in our UI, to minimize resource consumption and maximize application performance at the same time.

VMblog:  When should organizations start thinking about resource optimization in their Kubernetes adoption journey?

Bergstrom:  Effective, ongoing resource optimization in Kubernetes environments is critical to long-term success in efficiency, application performance, and reducing cloud waste. It's important to consider application performance from day one. The challenge, however, is that you won't have production data to work with in development or staging environments. It quickly becomes a lot of professional guesswork, and often wasted resources, when you're trying to make predictions that allow you to configure your container resources for production. Fortunately, that's where an experimentation-based approach works really well. Add to that an ML-driven model that can run hundreds of experiments on configuration combinations quickly and accurately, and you'll save a lot of time, money, and headaches going forward. Machine learning is perfect for this, as we've found with our patented ML models. In a single run we're able to effectively test a number of configuration combinations that would take a team years to recreate manually.
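As a toy illustration of what experimentation-based tuning means in practice, here's a brute-force sketch that tries candidate CPU, memory, and replica configurations against a load test and keeps the cheapest one that still meets a latency objective. The run_load_test and hourly_cost functions are hypothetical stand-ins for a real benchmark harness and pricing model, and this is emphatically not StormForge's ML - the point of the ML is to search this space far more efficiently than exhaustive trials.

```python
# Toy illustration of experimentation-based tuning: exhaustively trial candidate
# configurations and keep the cheapest one that still meets the latency SLO.
import itertools
import random

CPU_MILLICORES = [250, 500, 1000, 2000]
MEMORY_MIB     = [256, 512, 1024, 2048]
REPLICAS       = [2, 3, 4]
LATENCY_SLO_MS = 200

def run_load_test(cpu_m, mem_mib, replicas):
    """Hypothetical hook: deploy with this config, drive load, return p95 latency (ms)."""
    random.seed(hash((cpu_m, mem_mib, replicas)))  # deterministic fake results for the sketch
    return 400 - 0.1 * cpu_m - 0.05 * mem_mib - 20 * replicas + random.uniform(-10, 10)

def hourly_cost(cpu_m, mem_mib, replicas):
    """Hypothetical pricing: cost scales with provisioned CPU and memory."""
    return replicas * (cpu_m / 1000 * 0.04 + mem_mib / 1024 * 0.01)

best = None
for cpu_m, mem_mib, replicas in itertools.product(CPU_MILLICORES, MEMORY_MIB, REPLICAS):
    p95 = run_load_test(cpu_m, mem_mib, replicas)
    if p95 > LATENCY_SLO_MS:
        continue  # fails the SLO, discard this candidate
    cost = hourly_cost(cpu_m, mem_mib, replicas)
    if best is None or cost < best[0]:
        best = (cost, cpu_m, mem_mib, replicas, p95)

if best:
    cost, cpu_m, mem_mib, replicas, p95 = best
    print(f"cheapest passing config: {cpu_m}m CPU, {mem_mib}Mi, {replicas} replicas "
          f"(~${cost:.2f}/hr, p95 {p95:.0f} ms)")
```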

Organizations are, of course, also running Kubernetes in production and have the same resource management issues that exist in pre-production. That's where they can still benefit from a solution that takes an observation-based approach - one that analyzes data and recommends changes based on what's happening right now in a live environment, as opposed to running experiments, which are more valuable in pre-production. Again, one of the benefits of a solution like StormForge is that we're able to help customers regardless of where they are in the product development lifecycle, including when they're already in production.
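A minimal sketch of the observation-based idea, assuming you've exported recent per-container CPU usage samples (for example, from Prometheus): derive a right-sized request from a high percentile of observed usage plus headroom. The sample data and 20% headroom are hypothetical, and this is not StormForge's model.

```python
# Illustrative only: suggest a CPU request from observed usage samples,
# using the 95th percentile plus a headroom factor.
import statistics

def recommend_cpu_request(usage_millicores, headroom=1.2):
    """Return a suggested CPU request: p95 of observed usage times a headroom factor."""
    p95 = statistics.quantiles(usage_millicores, n=20)[-1]  # 95th percentile cut point
    return round(p95 * headroom)

# Hypothetical: per-minute CPU usage samples (millicores) for one container.
samples = [180, 210, 190, 260, 240, 205, 300, 275, 220, 198, 310, 230]
current_request = 1000  # millicores - the blanket default this container was given

suggested = recommend_cpu_request(samples)
print(f"current request: {current_request}m, suggested: {suggested}m "
      f"(~{100 * (current_request - suggested) / current_request:.0f}% reduction)")
```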

VMblog:  What are some interesting challenges that your customers are addressing with your product? What specific problems are reduced or eliminated by your solutions?

Bergstrom:  The piece that's most interesting to me is the fact that our platform is built on a true ML algorithm that's custom-built for this. As a result, we find some really interesting suggested configurations that people wouldn't normally consider. More than once, our platform has suggested a configuration change to an application and the customer has asked, "Why would we change that? That will have a negative impact on production," based on their previous experience with specific configuration sets. They decide to approve the change, and are blown away when it not only saves them resources but actually improves the performance of their application. As engineers, we often forget that all our collective experiences shape what we believe to be the best path forward to build, maintain, and configure our applications. Sometimes when we give an ML model carte blanche to experiment with settings, it learns something we didn't expect.

Ultimately, we now have customers who rely on our system to determine resource configurations for all of their workloads, across very large clusters in production, as part of an automated pipeline. They're reporting millions of dollars in reductions on their hosting costs, along with increased application performance and availability. That's the piece most folks are astonished by. The operating assumption is that in order to save money, you have to sacrifice performance or availability, but it turns out with the right configuration improvements you can make gains on both sides of the equation.

VMblog:  How are you different from your competitors? Why would someone prefer your offerings to others?

Bergstrom:  I think part of the answer has been woven into my earlier responses, but let me summarize:

  • StormForge is really the only holistic solution that closes the complexity gap of resource configuration, in both pre-production and production, to support automated optimization of each workload across your entire Kubernetes cluster, using true machine learning. We don't just tell you where to look, we offer to take specific actions on your behalf to meet your goals, whatever those might be.
  • We're the only ones with a patent-pending machine learning engine. This is what enables us to provide a level of sophistication that goes above and beyond the basic statistical modeling used by other solutions. Our insights, recommendations, and actions are precise, and the results can be measured and displayed graphically against actual resource usage.
  • And again...observability platforms and performance monitoring tools are really important for understanding what issues you've got and where they're located. But our technologies take the next important step toward resolution. StormForge is a proactive technology that lets you automate the change or remedy that comes from observing and analyzing data. In that way, it's the perfect partnership - marrying observability and APM tools with StormForge ML-driven optimization. This is a big differentiator for StormForge technology and a game-changer for our customers. 
David Marshall

David Marshall has been involved in the technology industry for over 19 years, and he's been working with virtualization software since 1999. He was able to become an industry expert in virtualization by becoming a pioneer in that field - one of the few people in the industry allowed to work with Alpha stage server virtualization software from industry leaders: VMware (ESX Server), Connectix and Microsoft (Virtual Server).

Through the years, he has invented, marketed, and helped launch a number of successful virtualization software companies and products. David holds a BS degree in Finance, an Information Technology Certification, and a number of vendor certifications from Microsoft, CompTIA, and others. He's also co-authored two published books, "VMware ESX Essentials in the Virtual Data Center" and "Advanced Server Virtualization: VMware and Microsoft Platforms in the Virtual Data Center," and acted as technical editor for two popular Virtualization "For Dummies" books. With his remaining spare time, David founded and operates one of the oldest independent virtualization news blogs, VMblog.com, and co-founded CloudCow.com, a publication dedicated to cloud computing. From 2009 through 2016, David was honored with the vExpert distinction by VMware for his virtualization evangelism.