1 - Getting Started

This section introduces the core concepts of using Kubernetes with the secure cloud stack on top.

Before you begin

This guide expects the following prerequisites:

  • Access to a user authorized for the namespace - see Getting Access
  • Familiarity with the core concepts of gitops

Verifying Access

Deployment is based on git and gitops, specifically Flux. A namespace must already have been set up. The specific reconciliation setup for a namespace can be found using kubectl.

Getting the gitrepo resource will display the repository associated with the namespace as well as the status for pulling in changes.

kubectl get -n <namespace> gitrepo

Getting the Kustomization resource will display the status of applying resources in the cluster. The specific path within the git repo used for reconciliation can also be found in the Kustomization resource.

kubectl get -n <namespace> kustomization
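For orientation, the two resources queried above usually look roughly like the sketch below. This is an illustration only: the resource names, repository URL, branch, path, intervals and credentials secret are hypothetical, and the apiVersion values depend on the Flux version installed in the cluster (older installations use beta versions).

```yaml
# Hypothetical sketch of the Flux resources behind a namespace.
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: sync                      # assumed name
  namespace: <namespace>
spec:
  interval: 1m0s                  # how often the repository is polled
  url: ssh://git@example.com/my-team/my-app-deploy.git   # placeholder URL
  ref:
    branch: main
  secretRef:
    name: sync-credentials        # hypothetical secret with git credentials
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: sync                      # assumed name
  namespace: <namespace>
spec:
  interval: 10m0s                 # how often the cluster state is reconciled
  path: ./deploy/prod             # hypothetical path within the repository
  prune: true                     # remove resources that were deleted from git
  sourceRef:
    kind: GitRepository
    name: sync
```

The spec.path field of the Kustomization is where the reconciliation path mentioned above is found.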

You are now ready to deploy by pushing commits to the git repository.

1.1 - Getting Access

1.1.1 - Netic on Premise

Getting access to a Netic managed and operated Kubernetes cluster on-prem requires a few steps.

Before you begin

This guide expects the following prerequisites:

  • A namespace has been created associated with a git repository for gitops based reconciliation
  • Access to a user authorized for the namespace/cluster
  • kubectl has been installed
  • The kubelogin plugin has been installed

Access to Cluster

Access to a Kubernetes cluster requires a kubeconfig. Authentication and authorization are based on OIDC, and a kubeconfig file can be downloaded from your observability dashboard at https://<provider_name>.dashboard.netic.dk. The downloaded configuration requires the kubelogin plugin to be installed. The plugin is capable of requesting and caching an OAuth 2.0 access token.

When you sign in to Grafana, the first page shows the kubeconfig file for the clusters and namespaces you have access to.

It is possible to check access using kubectl:

kubectl auth can-i --list -n <namespace>

Create kubeconfig manually

If you prefer, you can create the kubeconfig file manually.

Replacing the <>-tokens with their corresponding values, create the following kubeconfig.yaml file:

apiVersion: v1
kind: Config
preferences: {}
clusters:
  - name: default
    cluster:
      certificate-authority: <api-server>.crt
      server: https://<api-server:port>
users:
  - name: keycloak
    user:
      exec:
        apiVersion: client.authentication.k8s.io/v1beta1
        command: kubectl
        args:
          - oidc-login
          - get-token
          # This allows for authentication on, e.g., bastion host. Disabled on
          # local workstations.
          # - --grant-type=authcode-keyboard
          - --oidc-use-pkce
          - --oidc-issuer-url=https://keycloak.netic.dk/auth/realms/mcs
          - --oidc-client-id=<cluster_name>.<provider>.<cluster_type>.k8s.netic.dk
contexts:
  - context:
      cluster: default
      user: keycloak
    name: default
current-context: default

Then, get the certificate from the api server.

Again, replace <>-tokens with the proper values.

true | openssl s_client -connect <api-server:port> -showcerts 2>/dev/null \
  | sed -n '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' \
  > <api-server>.crt

Using the configuration you can start using kubectl:

kubectl --kubeconfig kubeconfig.yaml get nodes

1.1.2 - Azure Kubernetes Service (AKS)

Getting access to a Netic managed and operated Kubernetes cluster in Azure requires a few steps.

Before you begin

This guide expects the following prerequisites:

  • A namespace has been created associated with a git repository for gitops based reconciliation
  • Access to a user authorized for the namespace/cluster
  • kubectl has been installed
  • The Azure kubelogin plugin has been installed (required from Kubernetes 1.24 onwards)

Access to Cluster

Access to a Kubernetes cluster requires a kubeconfig. Authentication and authorization are based on OIDC. The configuration requires the Azure kubelogin plugin to be installed. The plugin is capable of requesting and caching an OAuth 2.0 access token.

For Azure you can get the kubeconfig file for the clusters you have access to using the following commands:

az login

az account set --subscription <subscription id>

az aks get-credentials --resource-group <resource group name> --name <aks service name> -f <output file name>

It is possible to check access using kubectl:

kubectl --kubeconfig <output file name> auth can-i --list -n <namespace>

1.2 - Security Context

By default a namespace is set up to adhere to the Restricted Pod Security Standard. Your deployment must be configured to comply with this standard; otherwise it is rejected and the pods won't be created.

Before you begin

The manifests for deploying the workload inside the cluster are available.

Adjusting Deployment

Given a deployment like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: verify-deployment
  labels:
    app.kubernetes.io/name: verify-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: verify-app
  template:
    metadata:
      labels:
        app.kubernetes.io/name: verify-app
    spec:
      containers:
      - image: nginxinc/nginx-unprivileged:1.20
        name: verify-app
        ports:
        - containerPort: 8080
          name: http

You need to add a security context to the pod:

      securityContext:
        runAsUser: 1000
        runAsGroup: 3000
        fsGroup: 2000

And to the container:

        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
              - all

Thus the deployment becomes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: verify-deployment
  labels:
    app.kubernetes.io/name: verify-app
    app.kubernetes.io/instance: verify-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: verify-app
      app.kubernetes.io/instance: verify-app
  template:
    metadata:
      labels:
        app.kubernetes.io/name: verify-app
        app.kubernetes.io/instance: verify-app
    spec:
      securityContext:
        runAsUser: 1000
        runAsGroup: 3000
        fsGroup: 2000
      containers:
      - image: nginxinc/nginx-unprivileged:1.20
        name: verify-app
        ports:
        - containerPort: 8080
          name: http
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
              - all
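The upstream Kubernetes Restricted profile also asks for runAsNonRoot: true, a seccomp profile of RuntimeDefault (or Localhost), and spells the dropped capability as ALL in upper case. Whether the admission policy in this cluster checks these fields is an assumption here, so the fragment below is a sketch of what a pod template satisfying the upstream profile could look like, reusing the names from the deployment above.

```yaml
# Fragment of a Deployment: goes under spec.template.spec
securityContext:
  runAsNonRoot: true          # upstream Restricted profile: must not run as root
  runAsUser: 1000
  runAsGroup: 3000
  fsGroup: 2000
  seccompProfile:
    type: RuntimeDefault      # upstream Restricted profile: RuntimeDefault or Localhost
containers:
  - name: verify-app
    image: nginxinc/nginx-unprivileged:1.20
    ports:
      - name: http
        containerPort: 8080
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
          - ALL               # upper case, as written in the upstream standard
```

If a pod is rejected, the admission error normally names the offending field, which is a quick way to find out which of these settings the cluster actually enforces.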

Patching Helm Output

If you are using standard Helm charts you may find that not all of them run in a non-privileged way. The cluster uses the GitOps toolkit for reconciliation, so charts cannot be patched after they have been deployed; they must be secured before the actual deployment. There are probably many ways to do this. A simple way, which allows you to keep working with the standard charts from the standard repos, is to use post rendering, where the rendered output of the Helm chart is patched using Kustomize before it is deployed.

Through the HelmRelease resource it is possible to add patches that are applied by a post renderer. For example:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: redis
spec:
  chart:
    spec:
      chart: redis
      version: 1.2.3
      sourceRef:
        kind: HelmRepository
        name: bitnami
        namespace: netic-gitops-system
  postRenderers:
    - kustomize:
        patchesStrategicMerge:
          - apiVersion: apps/v1
            kind: StatefulSet
            metadata:
              name: redis-master
              namespace: pb-k8s-app
            spec:
              selector:
                matchLabels:
                  app.kubernetes.io/name: redis
                  app.kubernetes.io/instance: redis
                  app.kubernetes.io/component: master
              template:
                metadata:
                  labels:
                    netic.dk/network-rules-egress: redis
                    netic.dk/network-rules-ingress: redis
                    netic.dk/network-component: redis
                spec:
                  securityContext:
                    runAsUser: 1001
                    runAsGroup: 3000
                    fsGroup: 2000
                  containers:
                    - name: redis
                      securityContext:
                        runAsUser: 1001
                        allowPrivilegeEscalation: false
                        capabilities:
                          drop:
                            - all
          - apiVersion: apps/v1
            kind: StatefulSet
            metadata:
              name: redis-replicas
              namespace: pb-k8s-app
            spec:
              selector:
                matchLabels:
                  app.kubernetes.io/name: redis
                  app.kubernetes.io/instance: redis
                  app.kubernetes.io/component: replica
              template:
                metadata:
                  labels:
                    netic.dk/network-rules-egress: redis
                    netic.dk/network-rules-ingress: redis
                    netic.dk/network-component: redis
                spec:
                  securityContext:
                    runAsUser: 1001
                    runAsGroup: 3000
                    fsGroup: 2000
                  containers:
                    - name: redis
                      securityContext:
                        runAsUser: 1001
                        allowPrivilegeEscalation: false
                        capabilities:
                          drop:
                            - all

1.3 - Ingress

Ingress is normally handled by Contour, so ingress can be defined using either standard Kubernetes Ingress resources or the Contour custom resource HTTPProxy.

Before you begin

Automation is set up for both TLS certificates and DNS entries. Beforehand you need to agree with Netic on which DNS domains the setup should be enabled for.

Configuring Ingress

The most portable way to configure ingress is using the Kubernetes Ingress resource as below.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: verify-ingress
spec:
  tls:
    - secretName: pb-sample-netic-dev-tls
      hosts:
        - pb.sample.netic.dev
  rules:
    - host: pb.sample.netic.dev
      http:
        paths:
          - path: /verify
            pathType: Prefix
            backend:
              service:
                name: verify-service
                port:
                  name: http
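For comparison, roughly the same routing expressed as a Contour HTTPProxy could look like the sketch below. The resource name and the numeric service port are assumptions; HTTPProxy refers to the service port by number, so 80 stands in for whatever port verify-service actually exposes.

```yaml
apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: verify-proxy              # hypothetical name
spec:
  virtualhost:
    fqdn: pb.sample.netic.dev
    tls:
      secretName: pb-sample-netic-dev-tls
  routes:
    - conditions:
        - prefix: /verify         # same prefix match as the Ingress path
      services:
        - name: verify-service
          port: 80                # assumed service port number
```

HTTPProxy is specific to Contour, which is why the Ingress resource remains the more portable choice.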

TLS Termination

It is possible to issue certificates from Let’s Encrypt by annotating the ingress resource. Certificates are also renewed automatically. Be aware of the Let’s Encrypt rate limits if you do a lot of deployments. The annotation cert-manager.io/cluster-issuer: letsencrypt means that a cluster-issuer called letsencrypt is used, which is configured to use the ACME DNS challenge to issue the certificate. This cluster-issuer requires that Netic manages DNS for the domain in question. If it is not possible to have Netic manage DNS, the ACME HTTP challenge can be used instead; this requires the cluster to be publicly reachable so that Let’s Encrypt can validate the domain.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: verify-ingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt
    kubernetes.io/tls-acme: "true"
spec:
  tls:
    - secretName: pb-sample-netic-dev-tls
      hosts:
        - pb.sample.netic.dev
  rules:
    - host: pb.sample.netic.dev
      http:
        paths:
          - path: /verify
            pathType: Prefix
            backend:
              service:
                name: verify-service
                port:
                  name: http

Ingress DNS

When an ingress resource is created, a DNS A record is created that points the host to the public IP of the cluster, but only if the host in the ingress resource is on the configured allow list. For this feature to work, Netic must manage the DNS for the host/domain.

It is possible to have Netic manage domains and subdomains. Contact Netic for more information.

1.4 - Network Policies

Network policies restrict communication within the cluster to limit the impact should a pod get compromised. A number of network policies are deployed into a namespace by default.

Default policies

A default policy is in place denying all communication.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

In addition, a default egress policy is normally applied. It allows DNS lookups (port 53 over TCP and UDP) and outbound TCP traffic on ports 443 and 4317.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-egress
spec:
  egress:
  - ports:
    - port: 53
      protocol: TCP
    - port: 53
      protocol: UDP
    - port: 443
      protocol: TCP
    - port: 4317
      protocol: TCP
  podSelector: {}
  policyTypes:
  - Egress

Ingress policies

A few opt-in policies exist that can be activated on a pod-by-pod basis. Allowing ingress into a pod requires specifying the label netic.dk/network-ingress: contour, which activates the policy below.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: contour-ingress
spec:
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: netic-ingress-system
    ports:
    - port: http
      protocol: TCP
  podSelector:
    matchLabels:
      netic.dk/network-ingress: contour
  policyTypes:
  - Ingress

If metrics are exposed and observability is set up, the label netic.dk/allow-prometheus-scraping: "true" allows Prometheus to scrape the pod by activating the policy below.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: prometheus-scrape-ingress
spec:
  ingress:
  - from:
    - namespaceSelector: {}
      podSelector:
        matchLabels:
          app.kubernetes.io/name: prometheus
    ports:
    - port: metrics
      protocol: TCP
    - port: http
      protocol: TCP
  podSelector:
    matchLabels:
      netic.dk/allow-prometheus-scraping: "true"
  policyTypes:
  - Ingress
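To opt a pod in to both policies above, the labels go on the pod template, and the container port must carry the name the policies refer to. A minimal sketch, assuming the verify-app deployment used elsewhere in this guide:

```yaml
# Fragment of a Deployment: spec.template
template:
  metadata:
    labels:
      app.kubernetes.io/name: verify-app
      netic.dk/network-ingress: contour            # activates contour-ingress
      netic.dk/allow-prometheus-scraping: "true"   # activates prometheus-scrape-ingress
  spec:
    containers:
      - name: verify-app
        image: nginxinc/nginx-unprivileged:1.20
        ports:
          - name: http          # matched by "port: http" in both policies
            containerPort: 8080
```

Named ports in a NetworkPolicy are resolved against the container port names of the selected pods, so a port without the name http (or metrics) would not be opened by these policies.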

Additional network policies

Components inside a namespace may also need to communicate with each other. Policies for this are requested from Netic as a service request and are then applied by Netic.
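As an illustration of what such a request could describe, the sketch below allows pods labelled as a front end to reach a back end on its http port. The policy name and the label values are hypothetical; the actual policy is defined together with Netic.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: front-end-to-back-end     # hypothetical name
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: back-end     # hypothetical label value
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: front-end   # hypothetical label value
      ports:
        - port: http              # named container port on the back-end pods
          protocol: TCP
  policyTypes:
    - Ingress
```

Because the default policy also denies egress, the front-end pods would need a corresponding egress rule unless the back-end port is already covered by the default egress policy.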

1.5 - Observability

A lot of observability information is collected at the cluster level. All cluster level observability is accessible through the relevant dashboards. Cluster level data includes pod memory and cpu consumption, among other things. In addition, it is possible to subscribe to application level observability, consisting of the collection of metrics, traces and logs.

It is recommended that application metrics and traces are created using the libraries from the OpenTelemetry project. This ensures uniform application instrumentation even across programming languages.

Before you begin

The application is capable of providing telemetry data:

  • The application should be logging to stdout
  • The application should expose Prometheus style metrics (OpenMetrics); using OpenTelemetry is recommended
  • If collection of traces is desired the application should be able to push traces in Jaeger or OpenTelemetry format

See also Application Observability.

Log collection

By default all output from stdout will be captured and indexed.

Metric collection

Enabling metrics collection is done by deploying a ServiceMonitor resource with instructions on how Prometheus should scrape metrics off the application. Typically as below.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/instance: hello-service
    app.kubernetes.io/managed-by: Kustomize
    app.kubernetes.io/name: hello-service
    app.kubernetes.io/version: latest
    netic.dk/monitoring: <scope>
  name: hello-service
spec:
  endpoints:
  - interval: 15s
    port: http
  selector:
    matchLabels:
      app.kubernetes.io/instance: hello-service
      app.kubernetes.io/name: hello-service
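A ServiceMonitor selects Services by label and refers to a Service port by name, so a matching Service is needed. The sketch below is one that the example above would pick up; the port number and the assumption that the container port is also named http are illustrative.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: hello-service
  labels:
    # Must match spec.selector.matchLabels of the ServiceMonitor
    app.kubernetes.io/instance: hello-service
    app.kubernetes.io/name: hello-service
spec:
  selector:
    app.kubernetes.io/instance: hello-service
    app.kubernetes.io/name: hello-service
  ports:
    - name: http          # referenced by "port: http" in the ServiceMonitor endpoint
      port: 8080          # assumed port number
      targetPort: http    # assumes the container port is named http
```

Unless a path is set on the endpoint, Prometheus scrapes /metrics on that port.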

Trace collection

An OpenTelemetry Collector sidecar can be injected for trace collection by annotating the pod with sidecar.opentelemetry.io/inject: "true". This allows the application to push traces to localhost in either OpenTelemetry or Jaeger format.
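The annotation belongs on the pod template, not on the Deployment itself. In the sketch below the container name and image are placeholders, and the endpoint assumes the injected collector listens on the default OTLP gRPC port 4317; check the collector configuration in your cluster before relying on it.

```yaml
# Fragment of a Deployment: spec.template
template:
  metadata:
    annotations:
      sidecar.opentelemetry.io/inject: "true"
  spec:
    containers:
      - name: hello-service                 # placeholder name
        image: <image>
        env:
          # Standard OpenTelemetry SDK variable; the port is an assumption
          - name: OTEL_EXPORTER_OTLP_ENDPOINT
            value: http://localhost:4317
```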

1.6 - Vault and Secrets

The secure cloud stack includes a secrets management service to store sensitive key/value pairs to be used in the cluster. Secrets, such as credentials, usually have a lifecycle different from the lifecycle of the source code. Therefore it makes sense to handle credentials and the like through another channel.

Before you begin

There is a requirement for some sensitive data to be provided to the workloads running inside of the cluster.

Access Data

If you want to access sensitive data from the cluster, go to the correct namespace area in the vault and create a new secret in key-value-format. Using external-secrets, you can synchronize this data into a secret resource in the cluster. In the following example, the secret is called ‘vault-secret’, and contains the key ‘pb-secret-key’:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: vault-secret
spec:
  dataFrom:
    - extract:
        key: k8s/prod1/<namespace>/vault-secret
  refreshInterval: 60s
  secretStoreRef:
    kind: SecretStore
    name: vault
  target:
    name: "vault-secret"

Using dataFrom, all key-value pairs are synced onto the secret called “vault-secret”. Assuming the secret contains only one key, the result should be as seen below.

apiVersion: v1
kind: Secret
metadata:
  name: vault-secret
type: Opaque
data:
  pb-secret-key: dmVyeS1zZWNyZXQ=

You can check the secret and the value inside your namespace with:

kubectl get secrets vault-secret -n <namespace> -o jsonpath='{.data.pb-secret-key}' | base64 --decode

and you should get the result: very-secret
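Once synchronized, the secret can be consumed like any other Kubernetes secret. A minimal sketch exposing the key as an environment variable, where the variable name PB_SECRET is hypothetical:

```yaml
# Fragment of a pod spec
containers:
  - name: verify-app
    image: nginxinc/nginx-unprivileged:1.20
    env:
      - name: PB_SECRET             # hypothetical variable name
        valueFrom:
          secretKeyRef:
            name: vault-secret      # the Secret created by the ExternalSecret
            key: pb-secret-key
```

Environment variables are only read at container start, so a value refreshed by external-secrets reaches the application after the pod is restarted.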

Vault Policies

Vault is set up with an area reserved for each namespace following a structure like k8s/<cluster>/<namespace>. As described above, external-secrets will be able to read these secrets. By default users are only able to create or overwrite secrets in the Vault, to reduce the risk of secret data being leaked. This is the same principle by which you cannot retrieve your old password, only set a new one. Furthermore, Vault is not the authoritative source of any secrets, so it should be possible to re-create the secrets inside of Vault.

However, on rare occasions larger configuration structures may need to be stored inside of Vault and it can be tedious to maintain such structures when not able to read back the original data. To support this use case it is possible to indicate classification of the secret contents by using the folder structure defined below.

Path: k8s/<cluster>/<namespace>/restricted
Example: k8s/prod1/my-awesome-app/restricted/my-password
Purpose: This folder contains secrets that are semi-automatically maintained and can be listed, created, updated, and deleted, but cannot be read by humans. Examples of secrets that could be stored here include temporary access tokens, session keys, and other data that is generated by machines and should not be accessible by humans.

Path: k8s/<cluster>/<namespace>/automated
Example: k8s/prod1/my-awesome-app/automated/ssh-key
Purpose: This folder contains secrets that are automatically maintained and can only be listed, but cannot be read, created, updated, or deleted by humans. Examples of secrets that could be stored here include machine-generated encryption keys, service account credentials, and other data that is automatically managed by machines and should not be accessible by humans.

Path: k8s/<cluster>/<namespace>/unrestricted
Example: k8s/prod1/my-awesome-app/unrestricted/my-config
Purpose: This folder contains secrets that are manually maintained and can be listed, created, updated, deleted, and read by humans. Examples of secrets that could be stored here include passwords, API keys, and other sensitive data that humans need to access.

Path: k8s/<cluster>/<namespace>/<app>/restricted
Example: k8s/prod1/my-awesome-app/svc1/restricted/my-password
Purpose: Same as the general restricted folder, but allows for a sub-division into application “spaces”.

Path: k8s/<cluster>/<namespace>/<app>/automated
Example: k8s/prod1/my-awesome-app/svc1/automated/ssh-key
Purpose: Same as the general automated folder, but allows for a sub-division into application “spaces”.

Path: k8s/<cluster>/<namespace>/<app>/unrestricted
Example: k8s/prod1/my-awesome-app/svc1/unrestricted/my-config
Purpose: Same as the general unrestricted folder, but allows for a sub-division into application “spaces”.

Secrets located directly in the path k8s/<cluster>/<namespace> are treated as “restricted”, as described for the restricted sub-folder.

1.7 - Stateful Deployments

If you need a stateful deployment, you can use a StatefulSet:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: verify-deployment
  labels:
    app.kubernetes.io/name: verify-app
    app.kubernetes.io/instance: verify-app
spec:
  serviceName: verify-service
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: verify-app
  template:
    metadata:
      labels:
        app.kubernetes.io/name: verify-app
        app.kubernetes.io/instance: verify-app
        netic.dk/network-ingress: "contour"
      annotations:
        backup.velero.io/backup-volumes: verify-volume
    spec:
      securityContext:
        runAsUser: 1000
        runAsGroup: 3000
        fsGroup: 2000
      containers:
        - name: verify-app
          image: registry.netic.dk/dockerhub/nginxinc/nginx-unprivileged:1.20
          imagePullPolicy: IfNotPresent
          ports:
            - name: http
              containerPort: 8080
          volumeMounts:
            - name: verify-volume
              mountPath: /etc/nginx
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - all
  volumeClaimTemplates:
    - metadata:
        name: verify-volume
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 1Gi

This creates one pod and a PVC with 1Gi of storage, which is mounted automatically at the specified mount path.
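The serviceName field of a StatefulSet refers to a Service that gives the pods their network identity, conventionally a headless one. The sketch below is an assumption about what verify-service could look like; the port number is illustrative and the target port relies on the container port being named http as in the StatefulSet above.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: verify-service
  labels:
    app.kubernetes.io/name: verify-app
spec:
  clusterIP: None                 # headless: DNS resolves to the individual pods
  selector:
    app.kubernetes.io/name: verify-app
  ports:
    - name: http
      port: 8080                  # assumed port number
      targetPort: http
```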

Annotations

Backup of volumes is not enabled by default. It can be enabled by adding the following annotation to the pods that use the PVCs you want backed up (the example above does this):

      annotations:
        backup.velero.io/backup-volumes: verify-volume

The backup is written to local S3 storage and is retained for 5 days. If you need longer retention, this must be specified.

1.8 - Cluster Workload Determinism

Before you begin

It is possible to specify which workloads have priority over other workloads, for example where a central back-end service serves all the front-end services. In that case the back-end service may be more important than the front-end services, and Kubernetes needs to be told so in order to make the right decision when preempting pods. Kubernetes has an object type called PriorityClass for exactly that purpose. Kubernetes itself uses PriorityClasses internally to ensure its own ability to run node and system workloads, and the Secure Cloud Platform uses the same mechanism to ensure that Technical Operations etc. keeps running and that the promised services can be delivered.

Applications deployed on the Secure Cloud Stack may have the same need, as in the example above with the front-end and back-end services, and a number of PriorityClasses have been created for that purpose:

  secure-cloud-stack-tenant-namespace-application-critical
  secure-cloud-stack-tenant-namespace-application-less-critical
  secure-cloud-stack-tenant-namespace-application-lesser-critical
  secure-cloud-stack-tenant-namespace-application-non-critical

Configuring an Application to use PriorityClasses

An application uses a PriorityClass by setting priorityClassName in the pod specification. The example below shows this for a burstable deployment, where a cpu request is set but no cpu limit. Note that this may lead to cpu overcommit from a node and cluster perspective:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: a-customer-critical-deployment
  labels:
    app.kubernetes.io/name: back-end-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: back-end-deployment
  template:
    metadata:
      labels:
        app.kubernetes.io/name: back-end-deployment
    spec:
      terminationGracePeriodSeconds: 10 # short grace period - default is 30 seconds
      priorityClassName: "secure-cloud-stack-tenant-namespace-application-critical"
      containers:
      - image: nginxinc/nginx-unprivileged:1.20
        name: back-end-deployment
        resources:
            requests:
              memory: 990M
              cpu: 5m
            limits:
              memory: 990M
        ports:
        - containerPort: 8080
          name: http

If nothing is specified for the application pods, the default assigned priorityClassName is secure-cloud-stack-tenant-namespace-application-non-critical. Assigning a default class is supported by Kubernetes itself.
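Kubernetes implements such a default through the globalDefault field of a PriorityClass. The classes are cluster-scoped and managed by the platform, so the sketch below is only meant to show the mechanism; the numeric value and the description are placeholders, not the values used in the Secure Cloud Stack.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: secure-cloud-stack-tenant-namespace-application-non-critical
value: 1000                               # placeholder; higher values mean higher priority
globalDefault: true                       # used for pods without a priorityClassName
preemptionPolicy: PreemptLowerPriority    # the Kubernetes default
description: Default priority for tenant application pods (placeholder text).
```

The classes actually installed in a cluster can be listed with kubectl get priorityclasses.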

The default grace period for a pod is 30 seconds, which means that a preempted pod is terminated at that point, ready or not. If you want to ensure that lower priority pods are preempted faster, you may set terminationGracePeriodSeconds to a feasible number of seconds lower than the default.

Please note that in some situations there may be derived workloads, e.g. where an operator or a sidecar is used, which also need to have priorityClassName set so that they are not assigned the default priority.

1.9 - Image Automation

Flux is able to scan image registries for new versions of images, so that upgrades can be committed automatically to your Git repository. An ImageRepository is used to scan the registry for updates, an ImagePolicy is used to sort the tags and select the latest version, and an ImageUpdateAutomation commits the change to Git:

apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImageRepository
metadata:
  name: pb-k8s-app
spec:
  image: registry.netic.dk/dockerhub/nginxinc/nginx-unprivileged
  interval: 1m0s
  secretRef:
    name: registry-secret
---
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImagePolicy
metadata:
  name: pb-k8s-app
spec:
  imageRepositoryRef:
    name: pb-k8s-app
  policy:
    semver:
      range: 1.x
---
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImageUpdateAutomation
metadata:
  name: pb-k8s-app
spec:
  interval: 1m0s
  sourceRef:
    kind: GitRepository
    name: sync
  git:
    checkout:
      ref:
        branch: main
    commit:
      author:
        email: fluxcdbot@users.noreply.github.com
        name: fluxcdbot
      messageTemplate: '{{range .Updated.Images}}{{println .}}{{end}}'
    push:
      branch: main

In order for Flux to know where to make the change to your manifests, a comment is required in the deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: verify-deployment
spec:
  replicas: 1
  selector:
    ...
  template:
    metadata:
      labels:
        ...
    spec:
      containers:
      - image: registry.netic.dk/dockerhub/nginxinc/nginx-unprivileged:1.20 # {"$imagepolicy": "pb-k8s-app:pb-k8s-app"}
        name: verify-app
...
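The marker above replaces the whole image reference. Flux also accepts markers ending in :tag or :name for fields that hold only one part of the reference. A sketch for a Kustomize images override, assuming such a kustomization.yaml exists in your repository:

```yaml
# kustomization.yaml fragment (assumed to exist in the repository)
images:
  - name: registry.netic.dk/dockerhub/nginxinc/nginx-unprivileged
    # Quoted so that YAML keeps the tag as a string instead of the number 1.2
    newTag: "1.20" # {"$imagepolicy": "pb-k8s-app:pb-k8s-app:tag"}
```

In both forms the marker has the shape <namespace>:<policy name>, so pb-k8s-app:pb-k8s-app refers to the ImagePolicy pb-k8s-app in the namespace pb-k8s-app.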

See the Flux documentation on image update automation for details.

1.10 - Relevant URLs

Grafana, Kubeconfig and Vault

This information only pertains to OnPrem clusters.

Links to relevant services and information such as Grafana, Kubeconfig and Vault are available through the provider front page of Grafana. The URL depends on the provider name and has the form:

https://<provider>.dashboard.netic.dk/

… where <provider> is the name of the provider.

Other URLs

2 - Tutorials

This section contains tutorials covering various use cases from development to operations.

2.1 - Application Operations

Netic will do operations and management of Kubernetes as well as cluster wide components referred to as technical operations and management. However, monitoring and reacting to events from the deployed applications are not covered by this - this is referred to as application operations.

Getting Started

Application operations is set up by following the steps below:

  1. Define incident scenarios that require human interaction and write the standard operating procedures to be followed
  2. Identify and ensure metrics are available to detect the incident scenarios
  3. Develop alerting rules based on the metrics
  4. Handover alerting rules and standard operating procedures to Netic for verification and activation

Often new scenarios are discovered along the way extending the set of alerting rules over time.

Incident Scenarios

The failure scenarios may vary a lot from application to application. Start out by identifying the most critical points in the application and how to detect failure in these. It is important to write a standard operating procedure for every failure scenario describing the steps to be taken, even if this is just to wake up the developer on duty.

Metrics

Metrics can be specific application metrics, such as the rate of caught exceptions or HTTP response error rates, but they can also be core Kubernetes metrics such as memory or cpu consumption. Monitoring must be based on metrics rather than log patterns, since metrics are much more robust to change. If the application provides application-specific metrics, it must expose them; see Observability.

Alerting Rules

Alerting rules are written in PromQL, as the rules will be evaluated by the Prometheus rule engine. The rule expression can be tested against the real metrics using the Grafana Explore mode. Note that it is also important to consider the robustness of a rule with respect to pod restarts, flapping etc.
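As an illustration, an alerting rule packaged as a PrometheusRule resource could look like the sketch below. The rule name, thresholds, severity label and annotation keys are hypothetical, the netic.dk/monitoring label is assumed to be needed in the same way as on a ServiceMonitor, and the format in which Netic expects rules to be handed over is not specified here. The metric kube_pod_container_status_restarts_total comes from kube-state-metrics.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hello-service-alerts            # hypothetical name
  labels:
    netic.dk/monitoring: <scope>        # assumption, mirrors the ServiceMonitor label
spec:
  groups:
    - name: hello-service
      rules:
        - alert: HelloServiceContainerRestarting
          # More than 3 container restarts within 15 minutes in the namespace
          expr: |
            increase(kube_pod_container_status_restarts_total{namespace="<namespace>"}[15m]) > 3
          for: 10m                      # must hold for 10 minutes, which dampens flapping
          labels:
            severity: warning           # hypothetical label
          annotations:
            summary: Containers in <namespace> are restarting repeatedly
            runbook: <link to the standard operating procedure>
```

The expression itself can be pasted into Grafana Explore to check it against real data before handing it over.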

Handover

Once the standard operating procedure and the alerting rule expression are in place, they can be handed over to Netic. Both will be validated and reviewed so that the 24x7 operation center is able to follow the operational procedure and so that the concrete alerting rule is capable of triggering a correct alert. This may take a few iterations the first few times.

2.2 - Application Readiness

Netic has a number of recommendations for running workloads inside of Kubernetes, some of which are enforced by the policies of the secure cloud stack and some of which are best practice for running Kubernetes. These recommendations are always valid, but especially so if Netic is to provide application operations.

Security

The containers must be able to run under the following security constraints also enforced by the pod and container security context (see also Security Context).

  • Running without Linux capabilities
  • Running as unprivileged
  • Impossible to do privilege escalation

Stability

The following concerns the ability to run an application stably on Kubernetes even when the cluster is undergoing maintenance.

  • Number of replicas must be > 1
  • Readiness and liveness probes should be properly set
  • Resource requests and limits must be set according to expected load
  • Pod disruption budget should be present and allow for maintenance to be carried out (min available < replicas)
  • If applicable persistent volumes should be annotated for backup
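Several of the points above translate directly into manifests. The sketch below assumes a deployment of verify-app with two replicas; the probe path, timings and resource figures are placeholders that must be adapted to the application.

```yaml
# Allows one pod at a time to be evicted during maintenance (min available < replicas)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: verify-app
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: verify-app
---
# Fragment of a pod spec: probes and resources for the container
containers:
  - name: verify-app
    image: nginxinc/nginx-unprivileged:1.20
    ports:
      - name: http
        containerPort: 8080
    readinessProbe:             # traffic is only routed to the pod when this succeeds
      httpGet:
        path: /                 # placeholder path
        port: http
      periodSeconds: 10
    livenessProbe:              # the container is restarted when this keeps failing
      httpGet:
        path: /                 # placeholder path
        port: http
      initialDelaySeconds: 10
      periodSeconds: 20
    resources:
      requests:
        cpu: 50m                # placeholder figures
        memory: 128Mi
      limits:
        memory: 128Mi
```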

Documentation

The following concerns recommended (and required) documentation.

  • Define requirements for backup; retention and restore
  • Well-defined restore procedure including possibly acceptable dataloss
  • Alerting rules and associated standard operating procedures must be defined

Resilience and Robustness

The following concerns resilience, robustness and compliance.

  • Cross-Origin Resource Sharing (CORS) headers are not automatically set - remember to set them if applicable
  • Observe correct use of the HTTP protocol with respect to idempotency etc. to allow for retries and other outage mitigation
  • Utilize fault injection to make sure clients are resilient to unexpected conditions
  • Take care not to log sensitive information (GDPR and otherwise)

Testing Application Operational Readiness

Prior to engaging in application operations Netic offers a workshop to assess the operational readiness of the application based on the outlined points.

2.3 - Application Observability

The secure cloud stack comes with a ready-made observability setup to collect logs, metrics and traces and gain insights into application health and performance. While the platform as such is polyglot and works independently of specific programming languages, there are some recommendations with respect to development.

Before you begin

This guide assumes some familiarity with the concepts of cloud native observability, i.e., logs, metrics, and traces, as well as with the chosen programming language.

Logs

There are no requirements on logging. Everything that is written to stdout/stderr will be forwarded to the log indexing solution. However, it is recommended to use a logging framework to make sure the log output is consistent and to allow for easier log parsing afterwards. Below are examples of common logging frameworks for a few languages.

It is worth mentioning that OpenTelemetry is also working on standardizing logging across languages; however, only alpha support currently exists for a few languages.

.NET

The .NET framework comes with logging interfaces built in, and a number of third-party solutions can be hooked in to control the log output. Examples are:

Go

The standard Go libraries for logging are very seldom sufficient, and a number of logging frameworks exist. Popular ones are:

Java

Java also comes with built-in logging support in the java.util.logging (jul) package, though a number of third-party frameworks are also very popular. Interoperability bridges exist between these frameworks and also between them and the built-in Java support.

Metrics and Traces

While metrics and traces are different concepts, there is some overlap. Metrics are a quantitative measure aggregating data, e.g., a counter of requests or a histogram of latencies. Distributed traces are a qualitative measure recording the exact execution path of a specific transaction through the system. However, it is often desirable to record metrics in almost the same places as a span is added to a trace, which creates a natural coupling between traces and metrics. Support is also coming for enriching the aggregated metrics with trace ids representing examples; for instance, a histogram bucket for high latency may be reported along with a trace id of a transaction with high latency.

While the platform does not put any constraints on trace or metrics frameworks, it is recommended to use OpenTelemetry and follow its recommendations. The OpenTelemetry project both supports libraries for multiple languages and standardizes recommendations on naming, labels etc. This allows for easier reuse of dashboards, alerts, and more across applications. The instrumentation libraries implement the standard metrics.

.NET

Go

Java

What’s next

2.4 - Distroless Container Images

Usually source code is compiled and added as a new layer on some existing container base image. Some programming languages also require a runtime, such as a Python interpreter or a virtual machine running Java bytecode.

It is convenient to use a base image populated with normal *nix tooling and maybe even based on a known Linux distribution such as Ubuntu. This allows for easy debugging by executing commands inside the running container. However, it also expands the attack surface, both with respect to the number of tools and services that might contain vulnerabilities and with respect to the tools available should someone be able to execute arbitrary commands within the running container.

At the same time, the more utilities and libraries that exist in the image, the bigger the image becomes. The size in itself is not a problem as such; however, size does matter when it comes to startup times and to the amount of storage required both on the Kubernetes worker nodes and in the container registry.

To reduce both attack surface and size, it is recommended that production images are built on distroless base images - if at all possible. Google provides distroless base images for a number of interpreted and compiled languages; see distroless.