A production-grade Kubernetes cluster has many requirements. For example, applications and services should be able to communicate across nodes and be served to external users. Traffic from these users should be properly routed to underlying microservices according to the specified rules. Kubernetes applications should also be monitored to optimize performance and fix failures. These are just some of the requirements for running Kubernetes clusters in production, and the built-in components of Kubernetes cannot meet all of them on their own.
Fortunately, Kubernetes was designed as a pluggable and extensible environment for cloud-native applications, so with proper planning, you can deploy third-party tools and applications to meet your production requirements.
In this blog post, we’ll review key third-party tools that help enable important services in your Kubernetes cluster: CNI-compliant networking; production-grade service discovery; microservices communication; observability (monitoring, logging, metrics visualization); and more.
By the end of this article, you’ll understand how the production Kubernetes cluster should look and how different open-source, cloud-native solutions can work together to ensure multi-host communication, observability, and good performance for your Kubernetes cluster.
The Production-Grade Kubernetes Cluster: Main Requirements
Kubernetes provides many useful features: deployment primitives (such as DaemonSets, Deployments, or StatefulSets), services, and volumes that let you deploy and orchestrate containers out of the box. However, many critical components of the production-grade cluster, like security, load balancing, networking, and high availability, should be implemented or configured by cluster engineers.
This design is intentional. It constitutes the core of the Kubernetes pluggable, cloud-native architecture, which aims to provide flexible interfaces, primitives, and specifications for developers to use. The downside of this approach is that a vanilla Kubernetes cluster lacks features for production-grade applications.
Most importantly, Kubernetes lacks multi-host networking, a full monitoring pipeline, and a logging pipeline for applications. In addition, cluster engineers and application developers need tools for network security, ingress/egress, traffic management, and service discovery.
In what follows, we discuss third-party apps and tools you’ll need to install to address these requirements. We’ll omit the details of other important production cluster requirements, such as high availability, authentication, and role-based access control (RBAC), which can be implemented without third-party tools.
Enabling Multi-Host Networking
Kubernetes implements container-to-container and pod-to-pod networking within a single host, but it provides only general implementation guidelines for inter-host communication between pods. To enable multi-host networking, you’ll need a networking tool (often an overlay network) that implements the Container Network Interface (CNI), a specification that defines how network interfaces are set up for containers so that they can communicate with each other according to well-defined rules.
CNI plugins can also help implement important K8s features such as network policies, and additional functionality such as traffic routing and IP filtering. There are many CNI-compliant plugins to select from, such as Calico, Cilium, and Flannel. The Kubernetes documentation maintains a fuller list of networking add-ons.
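To make the network policy feature more concrete, here is a minimal sketch in Python that builds a NetworkPolicy manifest as a plain dictionary and prints it as JSON, a format kubectl accepts alongside YAML. The policy name, namespace, labels, and port are all hypothetical placeholders, and the policy only takes effect if the installed CNI plugin enforces network policies.

```python
import json


def allow_frontend_to_backend(namespace: str = "demo") -> dict:
    """Build a NetworkPolicy that lets only app=frontend pods reach
    app=backend pods on TCP 8080.

    The name, namespace, labels, and port are hypothetical placeholders.
    """
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-frontend-to-backend", "namespace": namespace},
        "spec": {
            # Pods that this policy protects.
            "podSelector": {"matchLabels": {"app": "backend"}},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    # Only pods carrying this label may connect...
                    "from": [{"podSelector": {"matchLabels": {"app": "frontend"}}}],
                    # ...and only on this port.
                    "ports": [{"protocol": "TCP", "port": 8080}],
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(allow_frontend_to_backend(), indent=2))
```

Saved to a file, the printed manifest could then be applied with kubectl apply -f.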
One of the best options for a production-grade K8s cluster is Weave Net, a CNI-compliant tool that creates a virtual network connecting Docker containers across multiple nodes and enables their automatic discovery. Also, Weave Net provides K8s network policy implementation, multi-cloud networking, dynamic topology (adding new nodes without reconfiguring existing ones), address allocation (IPAM), DNS, and much more.
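Address allocation is easier to picture with a small example. The sketch below is a generic illustration of IPAM, not Weave Net’s actual allocator: it carves one non-overlapping pod subnet per node out of a cluster-wide address range. The CIDR, the node names, and the /24 subnet size are assumptions chosen for the example.

```python
import ipaddress


def allocate_node_subnets(cluster_cidr: str, node_names: list, prefix_len: int = 24) -> dict:
    """Assign each node its own pod subnet carved from the cluster CIDR.

    A simplified, static scheme for illustration only; real CNI plugins
    use their own (often dynamic) allocation strategies.
    """
    network = ipaddress.ip_network(cluster_cidr)
    # Generator of consecutive, non-overlapping subnets of the requested size.
    subnets = network.subnets(new_prefix=prefix_len)
    allocation = {}
    for name in node_names:
        try:
            allocation[name] = next(subnets)
        except StopIteration:
            raise ValueError("cluster CIDR has no free subnets left") from None
    return allocation


if __name__ == "__main__":
    # Hypothetical cluster range and node names.
    for node, subnet in allocate_node_subnets("10.32.0.0/12", ["node-a", "node-b"]).items():
        print(node, subnet)
```

Because every node draws from a disjoint range, a pod IP identifies the node that hosts it, which is what lets the network layer route pod traffic between hosts.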
Production-Grade Service Discovery
Vanilla Kubernetes provides many useful features for service discovery and load balancing, including Services and cluster DNS, a LoadBalancer primitive, and more. Kubernetes Services can also be exposed to the outside world using the NodePort service type. However, Kubernetes lacks a comprehensive system for routing traffic inside the cluster, defining domains and subdomains, configuring complex ingress and redirection rules, splitting traffic, limiting request rates, and other useful traffic management features.
To implement these features, you’ll need an ingress controller or edge router with reverse proxy functionality. One of the best solutions for Kubernetes is Traefik, which calls itself “a leading modern reverse proxy and load balancer that makes deploying microservices easy.”
Traefik can be configured to route traffic from entry points (the network ports on which it listens) to specific services in your cluster, based on rules that match hosts, paths, or headers. In contrast to a traditional reverse proxy, Traefik employs service discovery to dynamically configure itself from the Kubernetes services.
When traffic reaches the cluster, Traefik can apply middleware to transform requests. You can use Traefik middleware to modify headers, set up authentication, redirect requests, or configure a request rate limit. Middleware can also be used to configure circuit breakers and automatic retries.
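To show what a request rate limit does conceptually, here is a minimal token bucket sketch in Python. It illustrates a common algorithm for this kind of limit and makes no claim about how Traefik implements its middleware; the class name and parameters are hypothetical.

```python
import time


class TokenBucket:
    """Allow `rate_per_sec` requests on average, with bursts up to `burst`."""

    def __init__(self, rate_per_sec: float, burst: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)  # start with a full bucket
        self.clock = clock          # injectable so the logic can be tested
        self.last = clock()

    def allow(self) -> bool:
        """Return True if the request may pass, False if it should be rejected."""
        now = self.clock()
        # Refill tokens for the time elapsed, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A proxy applying this idea would call allow() once per incoming request and answer with an HTTP 429 status when it returns False.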
As a simpler alternative to Traefik, you can use the NGINX Ingress Controller together with the Kubernetes Ingress resource, an API object that lets you define rules for external access to services in a Kubernetes cluster. The controller configures Nginx as a reverse proxy and routes requests to services according to the rules you define in Ingress objects.
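As a rough illustration of such rules, the following Python sketch builds an Ingress manifest that sends requests for one host to one service. The host, service name, and port are hypothetical, and the ingress class name "nginx" is an assumption that depends on how the controller was installed.

```python
import json


def ingress_for(host: str, service_name: str, service_port: int, path: str = "/") -> dict:
    """Build an Ingress manifest routing `host` + `path` to a single Service."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": f"{service_name}-ingress"},
        "spec": {
            # Assumed class name; it must match the installed controller.
            "ingressClassName": "nginx",
            "rules": [
                {
                    "host": host,
                    "http": {
                        "paths": [
                            {
                                "path": path,
                                # Match every request path starting with `path`.
                                "pathType": "Prefix",
                                "backend": {
                                    "service": {
                                        "name": service_name,
                                        "port": {"number": service_port},
                                    }
                                },
                            }
                        ]
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical host and service.
    print(json.dumps(ingress_for("shop.example.com", "storefront", 80), indent=2))
```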
Service Meshes for Microservices
Kubernetes and containers are great for a microservices architecture, an approach to application design where an application consists of functional components that can be developed, deployed, and updated independently of one another. To exchange information inside the app, microservices may form a network, often referred to as a “service mesh.”
As inter-service communication in a service mesh becomes more complex and grows in size, it becomes harder to manage and understand. Therefore, service meshes often require efficient mechanisms for service discovery, load balancing, failure recovery, monitoring, and network tracing. These services can be provided by an abstract layer for service-to-service communication, which allows you to control data traffic between microservices and proxy requests and to optimize networking performance without affecting the application code.
If you’re building microservices on Kubernetes, you’ll most likely benefit from a service mesh solution once the number of services grows. Istio, one of the most popular such solutions, makes it easy to design a service mesh with load balancing, authentication, monitoring, and distributed tracing without changes to source code. Istio deploys a sidecar proxy (Envoy) alongside each service; the proxy intercepts all network requests between microservices and processes them. It supports automatic load balancing; control of traffic behavior (failovers, retries, routing rules, fault injection); a policy layer with support for access controls, rate limits, and quotas; automatic metrics, logs, and traces for traffic; and secure service-to-service communication and authorization.
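To see what a mesh takes off the application’s plate, consider retries. Without a sidecar proxy, each service would need logic along the lines of the Python sketch below; with a mesh, the proxy can apply comparable behavior outside the application code. This is an illustration of the general technique, not Istio’s implementation, and the function name and defaults are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Invoke `call`, retrying on ConnectionError with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Wait base_delay, then 2x, then 4x, ... before the next try.
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Moving this logic into a proxy means retry counts and delays can be changed through mesh configuration rather than by redeploying every service.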
Enabling Observability for Kubernetes Clusters
Observability is an umbrella term for real-time understanding of the dynamics of a complex system. Its main goal is to enable efficient, secure, and fault-tolerant operation of a system and allow for fast intervention to fix critical errors, security issues, performance problems, and the like. The main components of observability include monitoring, logging, alerting, and metrics visualization. Observability is especially important for production-grade K8s, where multiple K8s components, user applications, and services interact in a complex way across nodes and even clouds.
As production workloads scale up to hundreds of pods, a lack of effective monitoring can result in an inability to diagnose hard failures that cause service interruption. This means that monitoring is a very important component of a production-grade Kubernetes cluster.
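Some of the most popular monitoring tools for Kubernetes include Prometheus and proprietary application performance management (APM) tools like Sysdig, Datadog, or Dynatrace. Heapster was long a common choice as well, but it has since been deprecated and retired in favor of metrics-server.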
Monitoring Kubernetes clusters with Prometheus is a popular choice because it is deeply integrated into Kubernetes and the cloud-native ecosystem. For example, many Kubernetes components ship Prometheus-format metrics and can be automatically discovered by Prometheus by default.
Prometheus employs a pull model of metrics collection combined with service discovery. Your app exposes a metrics endpoint over HTTP, and once Prometheus is configured to discover it as a scrape target (for example, through Kubernetes service discovery), Prometheus pulls metrics from it at regular intervals. Prometheus stores these metrics as time series identified by a metric name and a set of labels, which gives a high-dimensional representation of the data. Users can query this data using the PromQL query language, which works with instant vectors, range vectors, and scalars. Prometheus has built-in service discovery integrations for platforms such as GCE, Kubernetes, AWS EC2, OpenStack, and ZooKeeper Serverset.
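The application side of the pull model can be sketched with nothing but the Python standard library. The example below renders a counter in the Prometheus text exposition format and serves it at /metrics. The metric name, labels, and port are hypothetical, and in practice you would more likely use an official client library such as prometheus_client than hand-write the format.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters that the application would update.
REQUEST_COUNTS = {"GET": 0, "POST": 0}


def render_metrics(counts: dict) -> str:
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP app_requests_total Total requests handled, by HTTP method.",
        "# TYPE app_requests_total counter",
    ]
    for method, value in sorted(counts.items()):
        lines.append(f'app_requests_total{{method="{method}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block and serve /metrics; call this from the application's entry point."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Prometheus would then scrape this endpoint on its own schedule; the application never pushes anything.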
Finally, the Prometheus ecosystem includes Alertmanager, which handles alerts sent by client applications such as the Prometheus server, where alerting rules are evaluated. Alertmanager can deduplicate, group, and route alerts to the correct receiver integrations, such as OpsGenie, PagerDuty, or email.
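The three steps of deduplicating, grouping, and routing can be pictured with a short Python sketch. It is a conceptual model only, not the alert manager’s code or configuration format, and the alert labels, route matchers, and receiver names are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts: list, group_by: tuple) -> dict:
    """Drop duplicate alerts (identical label sets), then group the rest
    by the values of the `group_by` labels."""
    groups = defaultdict(list)
    seen = set()
    for alert in alerts:
        # Two alerts with exactly the same labels count as one.
        fingerprint = tuple(sorted(alert["labels"].items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


def pick_receiver(labels: dict, routes: list, default: str) -> str:
    """Return the receiver of the first route whose matchers all equal the
    alert's labels, or `default` when no route matches."""
    for matchers, receiver in routes:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return receiver
    return default
```

Grouping keeps a single outage from producing dozens of separate notifications, while routing makes sure a paging-level alert reaches the on-call engineer and a low-priority one lands in email.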
Kubernetes applications and components can produce hundreds of log files that should be aggregated, processed, and analyzed to be able to detect issues and deal with application failures, security events, and the like. This means that you need to deploy a log shipper connected to the remote log storage location for the production-grade Kubernetes cluster.
Fluentd is one of the most popular self-hosted log shippers for Kubernetes. It provides a unified logging layer for all logs produced in the cluster. Not only does it collect logs in raw form from log files; it can filter, add fields and labels, and enrich logs with new data. Fluentd has support for over 1,000 community-contributed plugins that connect multiple log sources and output destinations. It also has good performance due to in-memory and in-file buffering of logs.
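The filter-and-enrich step can be illustrated in a few lines of Python. This sketch shows the idea rather than Fluentd’s plugin API: it parses a raw log line, drops noisy levels, attaches Kubernetes metadata, and emits JSON for a downstream store. The log line format and the metadata fields are assumptions made for the example.

```python
import json
import re
from typing import Optional

# Assumed raw format: "<timestamp> <LEVEL> <message>"
LOG_PATTERN = re.compile(r"^(?P<time>\S+) (?P<level>[A-Z]+) (?P<message>.*)$")


def enrich(raw_line: str, metadata: dict, drop_levels=frozenset({"DEBUG"})) -> Optional[str]:
    """Parse a raw log line, filter it, and attach pod metadata.

    Returns a JSON string, or None when the line is filtered out.
    """
    line = raw_line.strip()
    match = LOG_PATTERN.match(line)
    # Keep lines that do not fit the pattern instead of losing them.
    record = match.groupdict() if match else {"message": line}
    if record.get("level") in drop_levels:
        return None
    record["kubernetes"] = metadata
    return json.dumps(record, sort_keys=True)
```

Attaching the namespace and pod name at collection time is what later makes it possible to search the aggregated logs by workload rather than by file.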
You can also use Fluent Bit instead of Fluentd. Fluent Bit, like Fluentd, can work as a log aggregator and forwarder. Fluent Bit is lighter and typically faster because it is written in C and has no external dependencies (unlike Fluentd, which requires Ruby), and it has around 70 input/output plugins available.
Data Analysis and Visualization
The goal of observability is to get insights from logs and metrics. You can achieve this with powerful metrics and log visualization tools. Grafana is usually regarded as one of the best visualization tools for Kubernetes. It has default integration with Prometheus and supports PromQL. Using Grafana, you can apply metrics aggregation and statistics to data from multiple data sources and create dashboards with great diagrams and charts.
Kibana is an excellent solution for data visualization if you’re using Elasticsearch as the output destination for your metrics and logs. It has comprehensive support for the Elasticsearch API, including metrics aggregations, statistics, powerful visualization tools, and language analysis.
Ensuring Efficient Serving, Observability, and Performance in Kubernetes
Which third-party components you should have in your Kubernetes cluster will always depend on your specific business use case. Microservices applications with many interdependent services will usually benefit from a service mesh tool for managing inter-service communication. A simple monolithic app can work well with a basic Ingress controller, whereas a larger and more complex application will likely call for a cloud-native reverse proxy and edge router such as Traefik.
Networking, monitoring, and logging tools are probably the most basic things you’ll need in your production-grade Kubernetes cluster. No matter what application you develop, they are central to ensuring efficient serving, observability, and performance of applications in your Kubernetes cluster.