This section contains tutorials covering various use cases from development to operations.
Tutorials
- 1: Application Operations
- 2: Application Readiness
- 3: Application Observability
- 4: Distroless Container Images
1 - Application Operations
Netic handles operations and management of Kubernetes and of cluster-wide components; this is referred to as technical operations and management. Monitoring and reacting to events from the deployed applications is not covered by this and is referred to as application operations.
Getting Started
Application operations is set up by following these steps:
- Define incident scenarios that require human interaction and write the standard operating procedures to be followed
- Identify and ensure metrics are available to detect the incident scenarios
- Develop alerting rules based on the metrics
- Handover alerting rules and standard operating procedures to Netic for verification and activation
Often new scenarios are discovered along the way extending the set of alerting rules over time.
Incident Scenarios
The failure scenarios may vary a lot from application to application. Start out by identifying the most critical points in the application and how to detect failure in each of them. It is important to write a standard operating procedure for every failure scenario describing the steps to be taken, even if the only step is to wake up the developer on duty.
Metrics
Metrics can be application-specific, such as the rate of caught exceptions or HTTP response error rates, or they can be core Kubernetes metrics such as memory or CPU consumption. The monitoring must be based on metrics rather than log patterns, since metrics are much more robust to change. If the application provides application-specific metrics, the application must expose them; see Observability.
Alerting Rules
Alerting rules are written in PromQL, as the rules will be evaluated by the Prometheus rule engine. The rule expression can be tested against the real metrics using the Grafana Explore mode. Note that it is also important to consider how robust a rule is with respect to pod restarts, flapping etc.
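Prometheus alerting rules support a `for` duration, which keeps an alert pending until the expression has held continuously for that long. The Python sketch below only simulates that idea to illustrate why it dampens flapping. The function name, the sample values and the threshold are hypothetical; the real evaluation happens in the Prometheus rule engine.

```python
def firing_states(samples, threshold, hold):
    """Return, per evaluation, whether the alert would fire.

    The alert fires only once `value > threshold` has held for `hold`
    consecutive evaluations, loosely mimicking a `for` duration.
    """
    states = []
    consecutive = 0
    for value in samples:
        # Any evaluation below the threshold resets the pending period.
        consecutive = consecutive + 1 if value > threshold else 0
        states.append(consecutive >= hold)
    return states


# Hypothetical error rates per evaluation: one short spike, then a sustained problem.
error_rates = [0.01, 0.20, 0.01, 0.15, 0.18, 0.22, 0.02]
print(firing_states(error_rates, threshold=0.05, hold=3))
```

With `hold=1` the short spike in the second sample would already raise an alert; with `hold=3` only the sustained period does.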
Handover
Once the standard operating procedure and the alerting rule expression are in place, they can be handed over to Netic. Both will be validated and reviewed so that the 24x7 operation center is able to follow the operational procedure and the concrete alerting rule is capable of triggering a correct alert. This may take a few iterations the first few times.
2 - Application Readiness
Netic has a number of recommendations for running workloads inside Kubernetes. Some are enforced by the policies of the secure cloud stack and some are best practice for running Kubernetes. These recommendations are always valid, but especially so if Netic is to provide application operations.
Security
The containers must be able to run under the following security constraints, which are also enforced through the pod and container security context (see also Security Context).
- Running without Linux capabilities
- Running as unprivileged
- Impossible to do privilege escalation
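As a rough illustration, the three constraints can be checked against a container `securityContext` represented as a Python dictionary. The field names (`capabilities`, `privileged`, `allowPrivilegeEscalation`) are the Kubernetes ones. The mapping of each bullet to a field is an assumption, and the function name is hypothetical; the platform policies remain the authoritative check.

```python
def security_violations(ctx):
    """Return the names of securityContext fields that break the constraints."""
    violations = []
    capabilities = ctx.get("capabilities", {})
    # Assumption: "without Linux capabilities" means drop ALL and add none.
    if "ALL" not in capabilities.get("drop", []) or capabilities.get("add"):
        violations.append("capabilities")
    # Assumption: "unprivileged" maps to privileged not being true.
    if ctx.get("privileged", False):
        violations.append("privileged")
    # Treated as a violation unless explicitly set to false.
    if ctx.get("allowPrivilegeEscalation", True):
        violations.append("allowPrivilegeEscalation")
    return violations


compliant = {
    "capabilities": {"drop": ["ALL"]},
    "privileged": False,
    "allowPrivilegeEscalation": False,
}
print(security_violations(compliant))
print(security_violations({"privileged": True}))
```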
Stability
The following concerns the ability to run an application stably on Kubernetes, even when the cluster is undergoing maintenance.
- Number of replicas must be >1
- Readiness and liveness probes should be properly set
- Resource requests and limits must be set according to expected load
- Pod disruption budget should be present and allow for maintenance to be carried out (min available < replicas)
- If applicable persistent volumes should be annotated for backup
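The reasoning behind the replica and pod disruption budget rules can be sketched with a little arithmetic: a budget with `minAvailable` only permits a voluntary eviction while the number of healthy pods exceeds `minAvailable`. The function names below are hypothetical and the sketch ignores percentages and `maxUnavailable`.

```python
def allowed_disruptions(healthy_pods, min_available):
    """Pods that may be evicted voluntarily while honouring minAvailable."""
    return max(healthy_pods - min_available, 0)


def stability_issues(replicas, min_available):
    """Return human-readable findings for the replica and budget rules."""
    issues = []
    if replicas <= 1:
        issues.append("replicas must be > 1")
    if min_available >= replicas:
        issues.append("minAvailable must be < replicas, otherwise no pod can be evicted")
    return issues


print(allowed_disruptions(healthy_pods=3, min_available=2))
print(stability_issues(replicas=2, min_available=2))
```

With three healthy pods and `minAvailable` of 2, one pod at a time can be moved during a node drain; with `minAvailable` equal to the replica count, the drain cannot evict any pod.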
Documentation
The following concerns recommended (and required) documentation.
- Define requirements for backup; retention and restore
- Well-defined restore procedure, including any acceptable data loss
- Alerting rules and associated standard operating procedures must be defined
Resilience and Robustness
The following concerns the resilience, robustness and compliance.
- Cross-Origin Resource Sharing (CORS) headers are not automatically set - remember to set them if applicable
- Observe correct use of the HTTP protocol with respect to idempotency etc. to allow for retries and other outage mitigation
- Utilize fault injection to make sure clients are resilient to unexpected conditions
- Take care to keep sensitive information out of the logs (GDPR and otherwise)
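The point about idempotency and retries can be illustrated with a small client-side sketch: only requests using methods that HTTP defines as idempotent are retried automatically, since repeating them has the same effect as sending them once. The function name, the `send` callable and the choice of `ConnectionError` as the retryable failure are assumptions made for illustration.

```python
# Methods defined as idempotent by the HTTP specification.
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS", "TRACE"}


def call_with_retries(method, send, attempts=3):
    """Call `send()`; retry on ConnectionError only for idempotent methods."""
    tries = attempts if method.upper() in IDEMPOTENT_METHODS else 1
    for attempt in range(1, tries + 1):
        try:
            return send()
        except ConnectionError:
            if attempt == tries:
                raise  # out of attempts, surface the failure to the caller


class FlakyBackend:
    """Hypothetical backend that fails a fixed number of times, then answers."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("simulated outage")
        return "ok"


backend = FlakyBackend(failures=2)
print(call_with_retries("GET", backend), backend.calls)
```

A `POST` against the same flaky backend would fail on the first error, because the sketch cannot know whether repeating it is safe.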
Testing Application Operational Readiness
Prior to engaging in application operations Netic offers a workshop to assess the operational readiness of the application based on the outlined points.
3 - Application Observability
The secure cloud stack comes with a ready-made observability setup to collect logs, metrics and traces and to gain insights into application health and performance. While the platform as such is polyglot and works independently of specific programming languages, there are some recommendations with respect to development.
Before you begin
This guide assumes some familiarity with the concepts of cloud native observability, i.e., logs, metrics, and traces, as well as with the chosen programming language.
Logs
There are no hard requirements on logging. Everything that is output to stdout/stderr will be forwarded to the log indexing solution. However, it is recommended to use a logging framework to make sure the log output is consistent and easier to parse afterwards. Below are examples of common logging frameworks for a few languages.
It is worth mentioning that OpenTelemetry is also working on standardizing logging across languages; however, only alpha support currently exists for a few languages.
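As a minimal sketch of what consistent, parseable log output can look like, the following uses Python's standard `logging` module to write one JSON object per line to stdout. The field names and the `build_logger` helper are assumptions chosen for illustration, not a format required by the platform.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record):
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(name, stream=sys.stdout):
    """Return a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]  # replace handlers so output is not duplicated
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = build_logger("orders")
log.info("order %s accepted", 42)
```

Because every line has the same structure, fields such as `level` can be parsed without fragile pattern matching on free text.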
.NET
The .NET framework comes with logging interfaces built in, and a number of 3rd party solutions can be hooked in to control the log output. Examples are:
Go
The standard Go libraries for logging are very seldom sufficient, and a number of logging frameworks exist. Popular ones are:
Java
Java also comes with built-in logging support in the java.util.logging (jul) package, though a number of 3rd party frameworks are also very popular. Interoperability bridges exist between these and also between these and the built-in Java support.
Metrics and Traces
While metrics and traces are different concepts, there is some overlap. Metrics are a quantitative measure aggregating data, e.g., a counter of requests or a histogram of latencies. Distributed traces are a qualitative measure recording the exact execution path of a specific transaction through the system. However, it is often desirable to record metrics in almost the same places where a span is added to a trace. This creates a natural coupling between traces and metrics. Support is also coming for enriching the aggregated metrics with trace ids representing examples; for instance, a histogram bucket for high latency may be reported along with the trace id of a transaction with high latency.
While the platform does not put any constraints on trace or metrics frameworks, it is recommended by default to use and follow the OpenTelemetry recommendations. The OpenTelemetry project both provides libraries for multiple languages and standardizes recommendations on naming, labels etc. This allows for easier reuse of dashboards, alerts, and more across applications. The instrumentation libraries implement the standard metrics.
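The idea of attaching an example trace id to a histogram bucket can be sketched in plain Python. This is a conceptual illustration only, not the OpenTelemetry or Prometheus API; the class name, bucket bounds and trace ids are hypothetical.

```python
import bisect


class LatencyHistogram:
    """Per-bucket counts with one example trace id kept for each bucket."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One extra bucket collects everything above the largest bound.
        self.counts = [0] * (len(self.upper_bounds) + 1)
        self.exemplars = [None] * (len(self.upper_bounds) + 1)

    def observe(self, seconds, trace_id):
        # Index of the first bucket whose upper bound is >= the observation.
        index = bisect.bisect_left(self.upper_bounds, seconds)
        self.counts[index] += 1
        self.exemplars[index] = trace_id  # keep the most recent example


histogram = LatencyHistogram([0.1, 0.5, 2.0])
histogram.observe(0.05, "trace-a")
histogram.observe(0.30, "trace-b")
histogram.observe(4.20, "trace-c")
print(histogram.counts)
print(histogram.exemplars)
```

The aggregated counts answer how often high latency occurs, while the stored trace id points to one concrete transaction that can be inspected in the tracing system.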
.NET
Go
- OpenTelemetry
- Prometheus official client_golang
Java
- OpenTelemetry
- Prometheus official client_java
What’s next
- Activate telemetry collection - see Observability
4 - Distroless Container Images
Usually source code is compiled and added as a new layer on some existing container base image. Some programming languages require a runtime, such as a Python interpreter or a virtual machine running Java bytecode.
It is convenient to use a base image populated with normal *nix tooling and maybe even based on a known Linux distribution such as Ubuntu. This allows for easy debugging by executing commands inside of the running container. However, this also expands the attack surface, both with respect to the number of tools and services that might contain vulnerabilities and with respect to the tools available should someone be able to execute arbitrary commands within the running container.
At the same time, the more utilities and libraries that exist in the image, the bigger the image becomes. The size in itself is not a problem as such; however, size does matter when it comes to startup times and the amount of storage required both on the Kubernetes worker nodes and in the container registry.
To reduce both attack surface and size, it is recommended that production images are built on distroless base images, if at all possible. Google provides distroless base images for a number of interpreted and compiled languages; see distroless.