Blog – AI solution for auto-scaling on Kubernetes with Datadog

With Datadog, our enterprise customers are now able to monitor their application workloads and gain visibility into Kubernetes clusters of any scale. The new integration between our solution and Datadog makes it easier to deploy our service and to collect and monitor metrics (e.g., CPU and memory utilization, application latency) for applications running on Kubernetes. Like our Operator, Datadog has received Red Hat OpenShift Operator Certification, meaning that it has been tested to work with OpenShift and screened for security risks.

Many of our enterprise customers run hundreds or even thousands of containerized applications concurrently, sharing a common pool of cloud resources on premises or in the public clouds. These application workloads are dynamic in nature and can spike drastically (10-100x) during specific periods; resources should be increased during such periods and decreased afterwards. However, enterprises typically do not know how many resources each application workload actually needs, so to maintain service levels they resort to over-provisioning, under-utilizing and wasting their cloud resources. With the workload metrics collected by Datadog, our enterprise customers want to turn this information into actionable insight: determining the right amount of cloud resources at the right time for each of their many applications, each with different workload and service-level requirements. This is not a task that an enterprise IT team or application developers can manage manually.

How our solution works with Datadog

For any given application, the Datadog Agent monitors and collects many different types of metrics (e.g., CPU and memory utilization), some of which may be application-specific (e.g., the length of a message queue) or user-defined (custom metrics). After our solution is installed, its data adapter queries the Datadog service to collect the relevant metrics for the application, and its AI analytics engine then performs time-series and correlation analysis on those metrics and sends its prediction results back to Datadog, which can display the information on the Datadog dashboard.
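As a rough illustration of the kind of query the data adapter might issue, the sketch below pulls a metric series from Datadog's v1 metrics query API. The metric name, tag, and time window here are hypothetical placeholders; a real deployment would use the application's own metrics and credentials.

```python
import time

DD_SITE = "https://api.datadoghq.com"

def build_query(metric: str, tag: str, window_s: int, now: int = None) -> dict:
    """Build query parameters for Datadog's GET /api/v1/query endpoint."""
    now = int(time.time()) if now is None else now
    return {"from": now - window_s, "to": now, "query": f"avg:{metric}{{{tag}}}"}

def fetch_series(api_key: str, app_key: str, params: dict) -> list:
    import requests  # third-party HTTP client: pip install requests
    resp = requests.get(
        f"{DD_SITE}/api/v1/query",
        headers={"DD-API-KEY": api_key, "DD-APPLICATION-KEY": app_key},
        params=params,
        timeout=10,
    )
    resp.raise_for_status()
    # Each returned series carries [timestamp, value] points that a
    # downstream analytics engine can consume.
    return resp.json().get("series", [])
```

For example, `build_query("kubernetes.cpu.usage.total", "app:my-app", 3600)` requests the last hour of average pod CPU usage for a hypothetical `my-app` workload.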

The prediction results from our solution include, for example, CPU and memory resource recommendations for running a specific application or project over a future period of time. The prediction time scale can vary from minutes to hours, or even days and weeks, depending on the application characteristics and the use case.

Auto-scaling: A good use case for our solution's prediction results is auto-scaling pods in Kubernetes. The Horizontal Pod Autoscaler (HPA) has been a key feature in allowing Datadog to scale its own platform to keep up with its growing user base, and the Watermark Pod Autoscaler (WPA), an open-source project created by Datadog, extends the HPA to give you more control over autoscaling your clusters. The HPA employs an algorithm that computes a metric from the current resource (e.g., CPU) utilization level and compares it to a pre-specified threshold to decide how many pods to add (or remove) in the next time interval. The WPA extends this algorithm by letting you define a range of acceptable metric values (an upper and a lower bound), rather than the single threshold the HPA uses to trigger scaling actions.
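The scaling arithmetic described above can be sketched as follows. This is a simplified model of the HPA's proportional formula and of the WPA's watermark band, not Datadog's actual implementation:

```python
import math

def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float) -> int:
    # Standard HPA formula: scale replicas proportionally to how far the
    # observed metric is from the target.
    return math.ceil(current_replicas * current_metric / target_metric)

def wpa_desired_replicas(current_replicas: int, current_metric: float,
                         low_watermark: float, high_watermark: float) -> int:
    # WPA-style behaviour (sketch): inside the [low, high] band no scaling
    # happens; outside it, scale proportionally toward the crossed bound.
    if current_metric > high_watermark:
        return math.ceil(current_replicas * current_metric / high_watermark)
    if current_metric < low_watermark:
        return max(1, math.floor(current_replicas * current_metric / low_watermark))
    return current_replicas
```

With a band of 40-80, a current value of 50 triggers no action, while the single-threshold HPA would react to any deviation from its target.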

Instead of deriving the metric from the current resource utilization level, our solution uses machine learning algorithms to predict the right metric for the next time interval, learning the application's characteristics and workload patterns from the metrics Datadog collected over previous intervals. By feeding this predicted metric to the WPA, our AI solution enables you to scale the right amount of pod resources at the right time to support the application's future workloads.
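To make the idea concrete, here is a deliberately simple stand-in for the prediction step: a one-step linear extrapolation over recent samples. The actual AI engine performs much richer time-series and correlation analysis; this toy only illustrates feeding a *predicted* rather than a *current* value to the autoscaler.

```python
def predict_next(samples: list) -> float:
    # Toy predictor: least-squares linear fit over the recent samples,
    # extrapolated one step ahead.
    n = len(samples)
    if n < 2:
        return samples[-1] if samples else 0.0
    xs = range(n)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    # Predicted value at the next step, x = n.
    return mean_y + slope * (n - mean_x)
```

A rising series such as `[1, 2, 3, 4]` extrapolates to 5, so the autoscaler can add pods before utilization actually reaches that level.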

While our solution can predict workloads and metrics for any containerized application on Kubernetes, it can also take advantage of application-specific metrics for common applications, further improving its auto-scaling efficiency and prediction accuracy. Many of these application-specific metrics come from Datadog's 400+ integrations.

The figure below shows how our solution works together with the WPA to deliver an effective auto-scaling solution, using Kafka as an example application.

In this example, the Datadog Agent installed in the cluster sends standard metrics, including CPU and memory utilization, as well as Kafka-specific metrics (e.g., producer rate, consumer lag) to the Datadog service. These metrics are continuously collected by our solution's data adapter, which sends the prediction results and recommendations back to the Datadog service after the metrics have been analyzed by our AI engine. The Datadog Cluster Agent retrieves the prediction results and recommendations from the Datadog service, and the WPA queries and pulls this information to execute the pod scaling in the cluster. This process continues for as long as auto-scaling is needed for the application. Our solution can be installed in the same Kubernetes cluster as the application, or it can run outside the cluster and deliver the prediction results as SaaS.

The following screenshot of the Datadog dashboard shows the results of autoscaling the consumer replica set of an example Kafka application with our solution and Datadog, including the actual and predicted producer and consumer rates, as well as the consumer lag, latency, and CPU and memory utilization.
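For readers who want a sense of how the WPA consumes such a predicted metric, here is a hypothetical WatermarkPodAutoscaler manifest. The metric name, labels, and watermark values are illustrative only, and exact field names should be checked against Datadog's WPA documentation:

```yaml
apiVersion: datadoghq.com/v1alpha1
kind: WatermarkPodAutoscaler
metadata:
  name: kafka-consumer-wpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: kafka-consumer          # the consumer replica set being scaled
  minReplicas: 1
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metricName: predicted.consumer.workload   # hypothetical predicted metric
        metricSelector:
          matchLabels:
            app: kafka-consumer
        highWatermark: "400"      # scale up above this value
        lowWatermark: "150"       # scale down below this value
```

Between the two watermarks the replica count is left untouched, which avoids the oscillation a single threshold can cause.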

In addition to auto-scaling generic containerized applications, our solution will continue to take advantage of application-specific metrics to improve auto-scaling efficiency for common applications on Kubernetes, such as NGINX and PostgreSQL. Moreover, there are other use cases for our solution on the Datadog platform, including capacity planning and resource optimization, which we will discuss in a future blog.

In conclusion, using our advanced AI solution and the rich set of metrics collected by Datadog, our enterprise customers can manage, optimize, and auto-scale their cloud resources for any application on Kubernetes. For further information on the setup and integration with Datadog, please visit our website.