Docs
- 1: Getting Started
- 1.1: Getting Access
- 1.1.1: Netic on Premise
- 1.1.2: Azure Kubernetes Service (AKS)
- 1.2: Security Context
- 1.3: Ingress
- 1.4: Network Policies
- 1.5: Observability
- 1.6: Vault and Secrets
- 1.7: Stateful Deployments
- 1.8: Cluster Workload Determinism
- 1.9: Image Automation
- 1.10: Relevant URLs
- 2: Tutorials
1 - Getting Started
This section introduces some of the core concepts of utilizing Kubernetes with the secure cloud stack on top.
Before you begin
This guide expects the following prerequisites:
- Access to a user authorized for the namespace - see Getting Access
- Familiarity with the core concepts of GitOps
Verifying Access
Deployment is based on Git and GitOps - specifically Flux. A namespace must already have been set up. It is possible to find the specific reconciliation setup for a namespace using kubectl.
Getting the gitrepo resource will display the repository associated with the namespace as well as the status of pulling in changes.
kubectl get -n <namespace> gitrepo
Getting the Kustomization resource will display the status of applying resources in the cluster. The specific path within the git repo used for reconciliation can also be found in the Kustomization resource.
kubectl get -n <namespace> kustomization
You are now ready to deploy by pushing commits to the git repository.
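For example, assuming you have cloned the repository configured for the namespace and added your manifests under the path reconciled by the Kustomization (the branch name is an assumption):
git add .
git commit -m "Add deployment manifests"
git push origin main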
What’s next
1.1 - Getting Access
1.1.1 - Netic on Premise
Getting access to Netic managed and operated Kubernetes cluster on-prem requires a few steps.
Before you begin
This guide expects the following prerequisites:
- A namespace has been created associated with a git repository for gitops based reconciliation
- Access to a user authorized for the namespace/cluster
- kubectl has been installed
- The kubelogin plugin has been installed
Access to Cluster
Access to a Kubernetes cluster requires a kubeconfig. Authentication and authorization are based on OIDC, and it is possible to download a kubeconfig file from your observability dashboard at https://<provider_name>.dashboard.netic.dk. The downloaded configuration depends on the kubelogin plugin being installed. The plugin is capable of requesting and caching an OAuth 2.0 access token.
When you sign in to Grafana, the first page you are met with shows the kubeconfig file for the clusters and namespaces you have access to.
It is possible to check access using kubectl
kubectl auth can-i --list -n <namespace>
Create kubeconfig manually
If you prefer, you can create the kubeconfig file manually.
Replacing the <>-tokens with their corresponding values, create the following kubeconfig.yaml file:
apiVersion: v1
kind: Config
preferences: {}
clusters:
- name: default
cluster:
certificate-authority: <api-server>.crt
server: https://<api-server:port>
users:
- name: keycloak
user:
exec:
apiVersion: client.authentication.k8s.io/v1beta1
command: kubectl
args:
- oidc-login
- get-token
# This allows for authentication on, e.g., bastion host. Disabled on
# local workstations.
# - --grant-type=authcode-keyboard
- --oidc-use-pkce
- --oidc-issuer-url=https://keycloak.netic.dk/auth/realms/mcs
- --oidc-client-id=<cluster_name>.<provider>.<cluster_type>.k8s.netic.dk
contexts:
- context:
cluster: default
user: keycloak
name: default
current-context: default
Then, get the certificate from the API server. Again, replace the <>-tokens with the proper values.
true | openssl s_client -connect <api-server:port> -showcerts 2>/dev/null \
| sed --quiet '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' \
> <api-server>.crt
Using the configuration you can start using kubectl:
kubectl --kubeconfig kubeconfig.yaml get nodes
What’s next
1.1.2 - Azure Kubernetes Service (AKS)
Getting access to Netic managed and operated Kubernetes cluster in Azure requires a few steps.
Before you begin
This guide expects the following prerequisites:
- A namespace has been created associated with a git repository for gitops based reconciliation
- Access to a user authorized for the namespace/cluster
- kubectl has been installed
- The azure-kubelogin plugin (from k8s 1.24 onwards) has been installed
Access to Cluster
Access to a Kubernetes cluster requires a kubeconfig. Authentication and authorization are based on OIDC. The configuration depends on the Azure kubelogin plugin being installed. The plugin is capable of requesting and caching an OAuth 2.0 access token.
For Azure you can get the kubeconfig file for the clusters you have access to using the following commands:
az login
az account set --subscription <subscription id>
az aks get-credentials --resource-group <resource group name> --name <aks service name> -f <output file name>
It is possible to check access using kubectl
kubectl --kubeconfig <output file name> auth can-i --list -n <namespace>
What’s next
1.2 - Security Context
By default a namespace is set up to adhere to the Restricted Pod Security Standard. Your deployment must be configured to adhere to this to be accepted for deployment, otherwise the pods won't be created.
Before you begin
The manifests for deploying the workload inside of the cluster are available.
Adjusting Deployment
Given a deployment like the following:
apiVersion: apps/v1
kind: Deployment
metadata:
name: verify-deployment
labels:
app.kubernetes.io/name: verify-app
spec:
replicas: 1
selector:
matchLabels:
app.kubernetes.io/name: verify-app
template:
metadata:
labels:
app.kubernetes.io/name: verify-app
spec:
containers:
- image: nginxinc/nginx-unprivileged:1.20
name: verify-app
ports:
- containerPort: 8080
name: http
You need to add a security context to the pod:
securityContext:
runAsUser: 1000
runAsGroup: 3000
fsGroup: 2000
And to the container:
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- all
Thus the deployment becomes:
apiVersion: apps/v1
kind: Deployment
metadata:
name: verify-deployment
labels:
app.kubernetes.io/name: verify-app
app.kubernetes.io/instance: verify-app
spec:
replicas: 1
selector:
matchLabels:
app.kubernetes.io/name: verify-app
app.kubernetes.io/instance: verify-app
template:
metadata:
labels:
app.kubernetes.io/name: verify-app
app.kubernetes.io/instance: verify-app
spec:
securityContext:
runAsUser: 1000
runAsGroup: 3000
fsGroup: 2000
containers:
- image: nginxinc/nginx-unprivileged:1.20
name: verify-app
ports:
- containerPort: 8080
name: http
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- all
Patching Helm Output
If you are using standard Helm charts you may find that not all of them run in a non-privileged way. The cluster is reconciled using the GitOps toolkit, so charts need to be patched prior to the actual deployment, i.e., the rendered manifests must be secured before they are applied. There are many ways to do this. A simple way, which allows you to keep working with the standard charts from the standard repositories, is to use post-rendering, where the Helm chart output is patched with Kustomize prior to deployment.
Through the HelmRelease resource it is possible to add patches to be run as a post renderer, e.g.:
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
name: redis
spec:
chart:
spec:
chart: redis
version: 1.2.3
sourceRef:
kind: HelmRepository
name: bitnami
namespace: netic-gitops-system
postRenderers:
- kustomize:
patchesStrategicMerge:
- apiVersion: apps/v1
kind: StatefulSet
metadata:
name: redis-master
namespace: pb-k8s-app
spec:
selector:
matchLabels:
app.kubernetes.io/name: redis
app.kubernetes.io/instance: redis
app.kubernetes.io/component: master
template:
metadata:
labels:
netic.dk/network-rules-egress: redis
netic.dk/network-rules-ingress: redis
netic.dk/network-component: redis
spec:
securityContext:
runAsUser: 1001
runAsGroup: 3000
fsGroup: 2000
containers:
- name: redis
securityContext:
runAsUser: 1001
allowPrivilegeEscalation: false
capabilities:
drop:
- all
- apiVersion: apps/v1
kind: StatefulSet
metadata:
name: redis-replicas
namespace: pb-k8s-app
spec:
selector:
matchLabels:
app.kubernetes.io/name: redis
app.kubernetes.io/instance: redis
app.kubernetes.io/component: replica
template:
metadata:
labels:
netic.dk/network-rules-egress: redis
netic.dk/network-rules-ingress: redis
netic.dk/network-component: redis
spec:
securityContext:
runAsUser: 1001
runAsGroup: 3000
fsGroup: 2000
containers:
- name: redis
securityContext:
runAsUser: 1001
allowPrivilegeEscalation: false
capabilities:
drop:
- all
What’s next
1.3 - Ingress
Ingress is normally handled by Contour, so ingress can be defined either by standard Kubernetes Ingress resources or by the Contour custom resource definition HTTPProxy.
Before you begin
Automation is set up for both TLS certificates and DNS entries. Beforehand you need to agree on which DNS domains the setup should be enabled for.
Configuring Ingress
The most portable way to configure ingress is using the Kubernetes Ingress
resource as below.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: verify-ingress
spec:
tls:
- secretName: pb-sample-netic-dev-tls
hosts:
- pb.sample.netic.dev
rules:
- host: pb.sample.netic.dev
http:
paths:
- path: /verify
pathType: Prefix
backend:
service:
name: verify-service
port:
name: http
Note that the pods backing the service must carry the label netic.dk/network-ingress: "contour", as this activates the network policy allowing ingress to the port named http.
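For example, on a Deployment the label is added to the pod template (an abbreviated sketch reusing the verify-app names from above):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: verify-deployment
spec:
  ...
  template:
    metadata:
      labels:
        app.kubernetes.io/name: verify-app
        netic.dk/network-ingress: "contour"
    spec:
      ...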
TLS Termination
It is possible to issue certificates based on Let’s Encrypt by annotating the
ingress resource. Certificates are also automatically renewed. Note the Let’s Encrypt limits if doing
a lot of deployments.
The annotation cert-manager.io/cluster-issuer: letsencrypt means that it will use a cluster-issuer called letsencrypt,
which is configured to use the ACME DNS Challenge to issue the certificate.
This cluster-issuer requires that Netic manages DNS for the domain to be issued.
If it is not possible to have Netic manage DNS, it is also possible to use the ACME HTTP Challenge;
this does require the cluster to be publicly available for Let's Encrypt to validate.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: verify-ingress
annotations:
cert-manager.io/cluster-issuer: letsencrypt
kubernetes.io/tls-acme: "true"
spec:
tls:
- secretName: pb-sample-netic-dev-tls
hosts:
- pb.sample.netic.dev
rules:
- host: pb.sample.netic.dev
http:
paths:
- path: /verify
pathType: Prefix
backend:
service:
name: verify-service
port:
name: http
Ingress DNS
When an ingress resource is created, a DNS A record is created that points the host to the public IP of the cluster, but only if the host in the ingress resource is on the configured allow list. For this feature to work, Netic must manage the DNS for the host/domain.
It is possible to have Netic manage domains/subdomains; contact Netic for more information.
1.4 - Network Policies
The network policies restrict communication within the cluster to mitigate effects should a pod get compromised. A number of network policies will be deployed into a namespace by default.
Default policies
A default policy is in place denying all communication.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
Besides this, a default egress policy is normally also applied.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-egress
spec:
egress:
- ports:
- port: 53
protocol: TCP
- port: 53
protocol: UDP
- port: 443
protocol: TCP
- port: 4317
protocol: TCP
podSelector: {}
policyTypes:
- Egress
Ingress policies
A few opt-in policies exist and can be activated on a pod by pod basis. Allowing ingress into a pod requires specifying the label netic.dk/network-ingress: contour, which activates the policy below.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: contour-ingress
spec:
ingress:
- from:
- namespaceSelector:
matchLabels:
name: netic-ingress-system
ports:
- port: http
protocol: TCP
podSelector:
matchLabels:
netic.dk/network-ingress: contour
policyTypes:
- Ingress
The policy allows ingress traffic to the port named http no matter what the numeric port assignment is.
If metrics are exposed and observability is set up, there is a label netic.dk/allow-prometheus-scraping: "true" to allow Prometheus scraping, activating the below policy.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: prometheus-scrape-ingress
spec:
ingress:
- from:
- namespaceSelector: {}
podSelector:
matchLabels:
app.kubernetes.io/name: prometheus
ports:
- port: metrics
protocol: TCP
- port: http
protocol: TCP
podSelector:
matchLabels:
netic.dk/allow-prometheus-scraping: "true"
policyTypes:
- Ingress
The policy allows ingress traffic to the ports named http and metrics no matter what the numeric port assignment is.
Additional network policies
Components inside of a namespace may also need to communicate with each other. Defining these policies is requested as a service definition, and they will then be applied by Netic.
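As an illustration only - the actual policies are defined and applied by Netic - an intra-namespace policy allowing an application to reach a Redis pod on its named port could look like the sketch below (the label values and the port name are assumptions):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: app-to-redis
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: redis
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: verify-app
      ports:
        - port: redis
          protocol: TCP
  policyTypes:
    - Ingress
Note that because of the default deny policy, a matching egress policy for the application pods would also be needed.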
1.5 - Observability
A lot of observability information is collected at the cluster level. All cluster level observability is accessible through the relevant dashboards. Cluster level data includes data such as pod memory and cpu consumption etc. However, it is possible to subscribe to application level observability consisting of the collection of metrics, traces and logs.
It is recommended that application metrics and traces are created using the libraries from the OpenTelemetry project. This ensures a uniform application instrumentation even across programming languages.
Before you begin
The application is capable of providing telemetry data:
- The application should be logging to stdout
- The application should expose Prometheus-style metrics (OpenMetrics); using OpenTelemetry is recommended
- If collection of traces is desired, the application should be able to push traces in Jaeger or OpenTelemetry format
See also Application Observability.
Log collection
By default all output from stdout will be captured and indexed.
Metric collection
Enabling metrics collection is done by deploying a ServiceMonitor
resource with instructions on
how Prometheus should scrape metrics off the application. Typically as below.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
labels:
app.kubernetes.io/instance: hello-service
app.kubernetes.io/managed-by: Kustomize
app.kubernetes.io/name: hello-service
app.kubernetes.io/version: latest
netic.dk/monitoring: <scope>
name: hello-service
spec:
endpoints:
- interval: 15s
port: http
selector:
matchLabels:
app.kubernetes.io/instance: hello-service
app.kubernetes.io/name: hello-service
Trace collection
An OpenTelemetry Collector sidecar can be injected for trace collection by annotating the pod with sidecar.opentelemetry.io/inject: "true". This will allow the application to push traces to localhost in either OpenTelemetry or Jaeger format.
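For example, the annotation is added to the pod template of the workload (an abbreviated sketch; the deployment name is illustrative):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-service
spec:
  ...
  template:
    metadata:
      annotations:
        sidecar.opentelemetry.io/inject: "true"
      labels:
        app.kubernetes.io/name: hello-service
    spec:
      ...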
1.6 - Vault and Secrets
The secure cloud stack includes a secrets management service to store sensitive key/value pairs to be used in the cluster. Secrets, such as credentials, usually have a lifecycle different from the lifecycle of the source code. Therefore it makes sense to handle credentials and the like through another channel.
Before you begin
There is a requirement for some sensitive data to be provided to the workloads running inside of the cluster.
Access Data
If you want to access sensitive data from the cluster, go to the correct namespace area in the vault and create
a new secret in key-value-format. Using external-secrets
, you can synchronize this data into a secret resource
in the cluster. In the following example, the secret is called ‘vault-secret’, and contains the key ‘pb-secret-key’:
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: vault-secret
spec:
dataFrom:
- extract:
key: k8s/prod1/<namespace>/vault-secret
refreshInterval: 60s
secretStoreRef:
kind: SecretStore
name: vault
target:
name: "vault-secret"
Using dataFrom, all key-value pairs are synced onto the secret called “vault-secret”. Assuming the secret contains only one key, the result should be as seen below.
apiVersion: v1
kind: Secret
metadata:
name: vault-secret
type: Opaque
data:
pb-secret-key: dmVyeS1zZWNyZXQ=
You can check the secret and the value inside your namespace with:
kubectl get secrets vault-secret -n <namespace> -o jsonpath='{.data.pb-secret-key}' | base64 --decode
and you should get the result: very-secret
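The synchronized secret can be consumed like any other Kubernetes secret, for example as environment variables (an abbreviated sketch reusing the verify-app example):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: verify-deployment
spec:
  ...
  template:
    ...
    spec:
      containers:
        - name: verify-app
          image: nginxinc/nginx-unprivileged:1.20
          envFrom:
            - secretRef:
                name: vault-secret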
Vault Policies
Vault is set up with an area reserved for each namespace following a structure like k8s/<cluster>/<namespace>. As described above,
external-secrets will be able to read these secrets. By default users are only able to create or overwrite secrets in the Vault
to reduce the risk of secret data being leaked. This is the same principle as not being able to retrieve your old password, only set a new
one. Furthermore, Vault is not authoritative for any secrets, therefore it should be possible to re-create the secrets inside of Vault.
However, on rare occasions larger configuration structures may need to be stored inside of Vault and it can be tedious to maintain such structures when not able to read back the original data. To support this use case it is possible to indicate classification of the secret contents by using the folder structure defined below.
Path | Example | Purpose |
---|---|---|
k8s/<cluster>/<namespace>/restricted | k8s/prod1/my-awesomne-app/restricted/my-password | Purpose: This folder contains secrets that are semi-automatically maintained and can be listed, created, updated, and deleted, but cannot be read by humans. Example: Examples of secrets that could be stored here include temporary access tokens, session keys, and other data that is generated by machines and should not be accessible by humans. |
k8s/<cluster>/<namespace>/automated | k8s/prod1/my-awesomne-app/automated/ssh-key | Purpose: This folder contains secrets that are automatically maintained and can only be listed, but cannot be read, created, updated, or deleted by humans. Example: Examples of secrets that could be stored here include machine-generated encryption keys, service account credentials, and other data that is automatically managed by machines and should not be accessible by humans. |
k8s/<cluster>/<namespace>/unrestricted | k8s/prod1/my-awesomne-app/unrestricted/my-config | Purpose: This folder contains secrets that are manually maintained and can be listed, created, updated, deleted, and read by humans. Example: Examples of secrets that could be stored here include passwords, API keys, and other sensitive data that humans need to access. |
k8s/<cluster>/<namespace>/<app>/restricted | k8s/prod1/my-awesomne-app/svc1/restricted/my-password | Purpose: Same as with the general restricted folder but allows for a sub-division into application “spaces”. |
k8s/<cluster>/<namespace>/<app>/automated | k8s/prod1/my-awesomne-app/svc1/automated/ssh-key | Purpose: Same as with the general automated folder but allows for a sub-division into application “spaces”. |
k8s/<cluster>/<namespace>/<app>/unrestricted | k8s/prod1/my-awesomne-app/svc1/unrestricted/my-config | Purpose: Same as with the general unrestricted folder but allows for a sub-division into application “spaces”. |
All secrets located in the path k8s/<cluster>/<namespace> will be considered “restricted”, following the description of the restricted sub-folder.
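As a sketch of how a secret could be written to the unrestricted area using the Vault CLI (assuming k8s/ is the KV secrets engine mount, that you are logged in to Vault, and that the key/value names are illustrative):
vault kv put k8s/prod1/<namespace>/unrestricted/my-config username=app-user password=very-secret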
1.7 - Stateful Deployments
If you need a stateful deployment, you can use a StatefulSet:
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: verify-deployment
labels:
app.kubernetes.io/name: verify-app
app.kubernetes.io/instance: verify-app
spec:
serviceName: verify-service
replicas: 1
selector:
matchLabels:
app.kubernetes.io/name: verify-app
template:
metadata:
labels:
app.kubernetes.io/name: verify-app
app.kubernetes.io/instance: verify-app
netic.dk/network-ingress: "contour"
annotations:
backup.velero.io/backup-volumes: verify-volume
spec:
securityContext:
runAsUser: 1000
runAsGroup: 3000
fsGroup: 2000
containers:
- name: verify-app
image: registry.netic.dk/dockerhub/nginxinc/nginx-unprivileged:1.20
imagePullPolicy: IfNotPresent
ports:
- name: http
containerPort: 8080
volumeMounts:
- name: verify-volume
mountPath: /etc/nginx
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- all
volumeClaimTemplates:
- metadata:
name: verify-volume
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 1Gi
This creates one pod and a PVC with 1Gi of storage, which is mounted automatically at the specified mount path.
Annotations
Backup of volumes is not enabled by default; however, it can be enabled by adding the following annotation to the pods that use the PVCs you want to have backed up (the above example utilises this):
annotations:
backup.velero.io/backup-volumes: verify-volume
The backup is done to local S3 storage and is retained for 5 days. If you want longer retention, this needs to be specified.
1.8 - Cluster Workload Determinism
Before you begin
It is possible to specify which workloads need to have priority over other workloads, e.g. in a situation where a central back-end service serves all the front-end services. The back-end service may be more important than the front-end services, and thus it is necessary to tell Kubernetes this in order for it to make the right decision when preempting pods. Kubernetes has an object type called PriorityClass for exactly that purpose. Kubernetes itself uses these PriorityClasses internally to ensure its own ability to run node and system workloads, and the Secure Cloud Platform uses the same mechanism to ensure that technical operations etc. keep running and we can deliver the promised services.
Applications deployed on the Secure Cloud Stack may have the same need for this, as seen from the example above with the front-end and back-end service, and a number of PriorityClasses have been created for that purpose:
secure-cloud-stack-tenant-namespace-application-critical
secure-cloud-stack-tenant-namespace-application-less-critical
secure-cloud-stack-tenant-namespace-application-lesser-critical
secure-cloud-stack-tenant-namespace-application-non-critical
Configuring an Application to use PriorityClasses
An application enables the use of a PriorityClass by setting priorityClassName in the pod specification. Below this is exemplified for a burstable deployment, based on the CPU request being set and the limit not set, which may lead to CPU overcommit seen from a node and cluster perspective:
apiVersion: apps/v1
kind: Deployment
metadata:
name: a-customer-critical-deployment
labels:
app.kubernetes.io/name: back-end-deployment
spec:
replicas: 2
selector:
matchLabels:
app.kubernetes.io/name: back-end-deployment
template:
metadata:
labels:
app.kubernetes.io/name: back-end-deployment
spec:
terminationGracePeriodSeconds: 10 # short grace period - default is 30 seconds
priorityClassName: "secure-cloud-stack-tenant-namespace-application-critical"
containers:
- image: nginxinc/nginx-unprivileged:1.20
name: back-end-deployment
resources:
requests:
memory: 990M
cpu: 5m
limits:
memory: 990M
ports:
- containerPort: 8080
name: http
If nothing is specified for the application pods, the default assigned PriorityClassName is secure-cloud-stack-tenant-namespace-application-non-critical. Default priority classes are supported by Kubernetes itself.
The default grace period for a pod is 30 seconds, which means the pod gets preempted at that point - ready or not. If you want to ensure that lower priority pods are preempted faster, you may adjust terminationGracePeriodSeconds to a feasible number of seconds lower than the default.
Please note that in some situations there may be derived resources, e.g. where an operator or a sidecar is used, which also need to have the priorityClassName set in order not to be assigned the default priority.
1.9 - Image Automation
Flux is able to scan image registries for new versions of images, such that upgrades can automatically be committed directly to your Git repository. An ImageRepository is used to scan the registry for updates, an ImagePolicy is used to sort the tags to find the latest version, and an ImageUpdateAutomation commits it to Git:
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImageRepository
metadata:
name: pb-k8s-app
spec:
image: registry.netic.dk/dockerhub/nginxinc/nginx-unprivileged
interval: 1m0s
secretRef:
name: registry-secret
---
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImagePolicy
metadata:
name: pb-k8s-app
spec:
imageRepositoryRef:
name: pb-k8s-app
policy:
semver:
range: 1.x
---
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImageUpdateAutomation
metadata:
name: pb-k8s-app
spec:
interval: 1m0s
sourceRef:
kind: GitRepository
name: sync
git:
checkout:
ref:
branch: main
commit:
author:
email: fluxcdbot@users.noreply.github.com
name: fluxcdbot
messageTemplate: '{{range .Updated.Images}}{{println .}}{{end}}'
push:
branch: main
In order for Flux to know where to make the change to your manifests, a comment is required in the deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
name: verify-deployment
spec:
replicas: 1
selector:
...
template:
metadata:
labels:
...
spec:
containers:
- image: registry.netic.dk/dockerhub/nginxinc/nginx-unprivileged:1.20 # {"$imagepolicy": "pb-k8s-app:pb-k8s-app"}
name: verify-app
...
See here for documentation.
1.10 - Relevant URLs
Grafana, Kubeconfig and Vault
This information only pertains to OnPrem clusters.
Links to relevant services and information such as Grafana, Kubeconfig and Vault are available through the provider frontpage of Grafana. The URL depends on the provider name and has the form:
https://<provider>.dashboard.netic.dk/
… where <provider>
is the name of the provider.
Other URLs
2 - Tutorials
This section contains tutorials covering various use cases from development to operations.
2.1 - Application Operations
Netic does operations and management of Kubernetes as well as cluster wide components, referred to as technical operations and management. However, monitoring and reacting to events from the deployed applications is not covered by this - this is referred to as application operations.
Getting Started
Application operations is basically set up following the steps below:
- Define incident scenarios which requires human interaction and write the standard operating procedures to be followed
- Identify and ensure metrics are available to detect the incident scenarios
- Develop alerting rules based on the metrics
- Handover alerting rules and standard operating procedures to Netic for verification and activation
Often new scenarios are discovered along the way extending the set of alerting rules over time.
Incident Scenarios
The failure scenarios may vary a lot from application to application. Start out by trying to identify the most critical points in the application and how to detect failure in these. It is important to write a standard operating procedure to go together with every failure scenario, describing the steps to be taken - even if this is to wake up the developer on duty.
Metrics
Metrics can be application specific metrics such as the rate of caught exceptions, HTTP response error rates, or similar, but they can also be core Kubernetes metrics such as memory or CPU consumption. The monitoring must be based on metrics rather than log patterns since metrics are much more robust to change. If the application provides application specific metrics, the application must expose these - see Observability.
Alerting Rules
Alerting rules are written in PromQL as the rules will be evaluated by the Prometheus rule engine. The rule expression can be tested against the real metrics using the Grafana Explore mode. Note that it is also important to consider the robustness of a rule with respect to pod restarts, flapping etc.
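As a hedged sketch of what an alerting rule could look like, expressed here as a PrometheusRule resource - the exact handover format may differ, and the metric name, labels and threshold are illustrative:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hello-service-alerts
spec:
  groups:
    - name: hello-service
      rules:
        - alert: HelloServiceHighErrorRate
          expr: |
            sum(rate(http_requests_total{job="hello-service",code=~"5.."}[5m]))
              / sum(rate(http_requests_total{job="hello-service"}[5m])) > 0.05
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: More than 5% of requests to hello-service are failing
            runbook_url: "<link to the standard operating procedure>"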
Handover
Once the standard operating procedure and alerting rule expression are in place, they can be handed over to Netic. Both will be validated and reviewed, such that the 24x7 operations center is able to follow the operational procedure and such that the concrete alerting rule is capable of triggering a correct alert. This may require a few iterations the first times.
2.2 - Application Readiness
Netic has a number of recommendations for running workloads inside of Kubernetes, some of which are enforced by the policies of the secure cloud stack and some of which are best practice when running Kubernetes. These recommendations are always valid but especially so if Netic is to provide application operations.
Security
The containers must be able to run under the following security constraints also enforced by the pod and container security context (see also Security Context).
- Running without Linux capabilities
- Running as unprivileged
- Impossible to do privilege escalation
Stability
The following concerns the ability to run an application stably on Kubernetes even when the cluster is undergoing maintenance.
- The number of replicas must be >1
- Readiness and liveness probes should be properly set
- Resource requests and limits must be set according to expected load
- A pod disruption budget should be present and allow for maintenance to be carried out (min available < replicas) - see the sketch after this list
- If applicable persistent volumes should be annotated for backup
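A minimal sketch of a pod disruption budget matching the verify-app example used earlier (assuming at least two replicas):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: verify-app
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: verify-app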
Documentation
The following concerns recommended (and required) documentation.
- Define requirements for backup; retention and restore
- Well-defined restore procedure including possibly acceptable data loss
- Alerting rules and associated standard operating procedures must be defined
Resilience and Robustness
The following concerns the resilience, robustness and compliance.
- Cross-Origin Resource Sharing (CORS) headers are not automatically set - remember if applicable
- Observe correct use of the HTTP protocol with respect to idempotency etc. to allow for retries and other outage mitigation
- Utilize fault injection to make sure clients are resilient to unexpected conditions
- Be careful to avoid logging sensitive information (GDPR and otherwise)
Testing Application Operational Readiness
Prior to engaging in application operations Netic offers a workshop to assess the operational readiness of the application based on the outlined points.
2.3 - Application Observability
The secure cloud stack comes with a ready-made observability setup to collect logs, metrics and traces and gain insights into application health and performance. While the platform as such is polyglot and works independently of specific programming languages, there are some recommendations with respect to development.
Before you begin
This guide assumes some familiarity with the concepts of cloud native observability, i.e., logs, metrics, and traces, as well as with the chosen programming language.
Logs
Basically there are no requirements on logging. Everything output to stdout/stderr will be forwarded to the log indexing solution. However, it is recommended to use a logging framework to make sure the log output is consistent, allowing for easier log parsing afterwards. Below are examples of common logging frameworks for a few languages.
It is worth mentioning that OpenTelemetry is also working on standardizing logging across languages; however, only alpha support currently exists for a few languages.
.NET
The .NET framework comes with logging interfaces built in, and a number of 3rd party solutions can be hooked in to control the log output. Examples are:
Go
The standard Go libraries for logging are seldom sufficient, and a number of logging frameworks exist. Popular ones are:
Java
Java also comes with built-in logging support in the java.util.logging (jul) package, though a number of 3rd party frameworks are also very popular. Interoperability bridges exist between these frameworks and also between them and the built-in Java support.
Metrics and Traces
While metrics and traces are different concepts, there is some overlap. Metrics are a quantitative measure aggregating data, i.e., a counter of requests or a histogram of latencies. Distributed traces are a qualitative measure recording the exact execution path of a specific transaction through the system. However, it is often desired to record metrics in almost the same places as a span is added to a trace. This makes a natural coupling between traces and metrics. Support is also coming for enriching the aggregated metrics with trace ids serving as examples (exemplars), e.g. a high-latency histogram bucket may be reported along with the trace id of a transaction with high latency.
While the platform does not put any constraints on trace or metrics frameworks, it is by default recommended to use and follow the OpenTelemetry recommendations. The OpenTelemetry project both supports libraries for multiple languages and standardizes recommendations on naming, labels etc. This allows for easier reuse of dashboards, alerts, and more across applications. The instrumentation libraries implement the standard metrics.
.NET
Go
- OpenTelemetry
- Prometheus official client_golang
Java
- OpenTelemetry
- Prometheus official client_java
What’s next
- Activate telemetry collection - see Observability
2.4 - Distroless Container Images
Usually source code is compiled and added as a new layer on some existing container base image. Some programming languages require an interpreter to run, like a Python interpreter or a virtual machine running Java bytecode.
It is convenient to use a base image populated with normal *nix tooling and maybe even based on a well-known Linux distribution such as Ubuntu. This allows for easy debugging by executing commands inside of the running container. However, this also expands the attack surface, both with respect to the number of tools and services that might contain vulnerabilities and the tools available should someone be able to execute arbitrary commands within the running container.
At the same time, the more utilities and libraries that exist in the image, the bigger the image becomes. The size in itself is not a problem as such; however, size does matter when it comes to startup times and the amount of storage required, both on the Kubernetes worker nodes and in the container registry.
To reduce both attack surface and size it is recommended that production images are built based on distroless base images - if at all possible. Google provides distroless base images for a number of interpreted and compiled languages - see distroless.
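As an illustrative sketch only - not part of the platform setup - a multi-stage Dockerfile building a Go binary on top of Google's distroless static base image could look as follows (assuming a single main package at the module root):
# Build stage: compile a static Go binary
FROM golang:1.21 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /app .

# Runtime stage: distroless image with no shell or package manager, running as non-root
FROM gcr.io/distroless/static-debian12:nonroot
COPY --from=build /app /app
ENTRYPOINT ["/app"]
The :nonroot tag runs the container as an unprivileged user, which also aligns with the restricted security context described earlier.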