The Open Application Model from Alibaba’s Perspective

In this article, we’re going to hear from Alibaba Cloud engineers on why they wanted to partner with Microsoft’s teams to create the Open Application Model (OAM), which is a new industry-wide, open standard for developing and operating applications on Kubernetes and other platforms.

For some background, you can check out the original announcement.

Key Takeaways

  • OAM is an open specification for defining cloud native applications, with the goal of establishing application-centric infrastructure in a discoverable, manageable, and platform-agnostic way. In addition, there is an OAM implementation (Rudr) designed specifically for Kubernetes.
  • Alibaba is putting its experience of running both internal clusters and public cloud offerings into OAM, specifically the move from defining in-house application CRDs to a standard application model.
  • A major goal is to leave the inherent complexity of infrastructure to infrastructure engineers only, and to improve accuracy and efficiency in the cooperation between the various participants in the pipeline.

What is the Open Application Model (OAM)?

In OAM, an Application is made from three core concepts. The first is the Components that make up an application, which might comprise a collection of microservices, a database and a cloud load balancer.

The second concept is a collection of Traits which describe the operational characteristics of the application, such as auto-scaling and ingress, which are important to the operation of applications but may be implemented in different ways in different environments.

Finally, to transform these descriptions into a concrete application, operators use a configuration file to assemble components with corresponding traits to form a specific instance of an application that should be deployed.

We are putting our experience of running both internal clusters and public cloud offerings into OAM, specifically the move from defining in-house application CRDs to a standard application model. As engineers, we thrive on innovation and on learning from past failures and mistakes.

In this article, we share our motivations and the driving force behind this project, in the hope of helping the wider community better understand the OAM.


Who We Are

We Manage All Kinds of Kubernetes Clusters

  • Scale up to 10,000 nodes;
  • Run over 10,000 applications;
  • Handle 100,000 deployments per day at peak time.

At the same time, we support the Alibaba Cloud Kubernetes service for external customers, which is similar to other public cloud Kubernetes offerings: the number of clusters is huge (~10,000), but the size of each cluster is typically small or moderate. Our customers, both internal and external, have very diverse requirements and use cases in terms of workload management.

We Serve Application Operators, Who Serve Developers

Application Developers — Deliver business value in the form of code. Most are not aware of the infrastructure or of K8s; they interact with a PaaS and a CI pipeline to manage their applications. The productivity of developers is highly valuable.

Application Operators — Serve developers with expertise in the capacity, stability, and performance of the clusters, so as to help developers configure, deploy, and operate applications at scale (e.g. updating, scaling, recovery). Note that although application operators understand the APIs and capabilities of K8s, they do not work on K8s directly. In most cases, they leverage the PaaS system to serve developers with the underlying K8s capabilities, so many application operators are in fact PaaS engineers as well.

In a word, infra operators, like us, serve application operators, who in turn serve developers.

The Problems of Cooperation

We’ll go through the pain points of the various players in the following sections, but in a nutshell, the fundamental issue we found is the lack of a structured way to build efficient and accurate interactions among the different parties. This leads to inefficient application management processes, or even operational failures.

A standard application model is our approach to solve this problem.

Interactions Between Infra Operators and Application Operators

One example of such an issue is that at Alibaba, we developed the CronHPA CRD to scale applications based on CRON expressions. It’s useful when an application’s scaling policy differs between day and night. CronHPA is an optional capability, deployed on demand only in some of our clusters.

A sample CronHPA specification yaml looks like this:

apiVersion: ""
kind: CronHPA
name: cron-scaler
timezone: America/Los_Angeles
- cron: '0 0 6 * * ?'
minReplicas: 20
maxReplicas: 25
- cron: '0 0 19 * * ?'
minReplicas: 1
maxReplicas: 9
apiVersion: apps/v1
name: php-apache
- type: Resource
name: cpu
type: Utilization
averageUtilization: 50

This is a typical Kubernetes Custom Resource and should be straightforward to use.

However, we quickly got notified by application operators about several problems when they used customized plugins like CronHPA:

1. Discovering the specification of a new capability is difficult.

Application operators often complained that the specification of a capability can be anywhere: sometimes in its CRD, sometimes in a ConfigMap, and sometimes in a configuration file in a random place. They were also confused: why is not every extension in K8s described by a CRD (e.g., CNI and CSI plugins), so that it could be learned and used easily?

2. Confirming the existence of a specific capability in a particular cluster is difficult.

Application operators are unsure if an operational capability is ready in a given cluster, especially when this capability is provided by a newly developed plugin. Multiple rounds of communication between infra operators and application operators are needed to bring clarity to the concerns.

Besides the discoverability problems above, there is an additional challenge with regards to manageability.

3. Conflicts in capabilities could be troublesome

Usually there are many extended capabilities in a K8s cluster. The relationships between those capabilities could be summarized into the following three categories:

  • Orthogonal — Capabilities are independent from each other. For example, Ingress for traffic management and Persistent Storage for storage management.
  • Composable — Capabilities can be applied to the same application cooperatively. For example, Ingress and Rollout: Rollout upgrades the application and controls Ingress for progressive traffic shifting.
  • Conflicting — Capabilities should not be applied to the same application. For example, HPA and CronHPA; they conflict with each other if applied to the same application.

Orthogonal and composable capabilities are less troublesome. However, conflicting capabilities can lead to unexpected/unpredictable behaviors.
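To make the conflicting case concrete: the CronHPA sample earlier targets the php-apache Deployment; a standard HPA pointed at the same Deployment would fight with it over the replica count. The HPA object below is our own illustration (using the standard autoscaling/v1 API), not part of the original example:

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache          # same target as the CronHPA above: a conflict
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50
```

Both controllers would keep rewriting the Deployment's replica count according to their own policy, producing the unpredictable behavior described above.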

The problem is that it’s difficult for application operators to be warned of conflicts beforehand, so they may apply conflicting capabilities to the same application. When a conflict actually happens, resolving it comes at a cost, and in extreme cases conflicts can result in catastrophic application failures. Naturally, application operators don’t want to feel as if the Sword of Damocles is hanging over their heads when managing platform capabilities; they want a methodology that avoids conflict scenarios beforehand.

How can application operators discover and manage capabilities that could potentially be in conflict with each other? In other words, as infra operators, can we build discoverable and manageable capabilities for application operators?

OAM’s Traits

These platform capabilities are essentially operational characteristics of the application, and this is where the name “Trait” in OAM comes from.

Discoverable Capabilities

Note that traits are not equivalent to K8s plugins: one cluster could have multiple networking-related traits, such as a “dynamic QoS trait”, a “bandwidth control trait” and a “traffic mirror trait”, all provided by one CNI plugin.

In practice, traits are installed in the K8s cluster and used by application operators. When capabilities are presented as traits, an application operator can discover the supported capabilities by a simple kubectl get command:

$ kubectl get traits
NAME          AGE
cron-scaler   19m
auto-scaler   19m

The above example shows that this cluster supports two kinds of “scaler” capabilities. One could deploy an application that requires a CRON-based scaling policy to this cluster.

A Trait Provides A Structured Description for A Given Capability.

For example, kubectl describe trait cron-scaler:

kind: Trait
metadata:
  name: cron-scaler
spec:
  properties: |
    {
      "properties": {
        "timezone": {
          "type": "string",
          "description": "Timezone for the CRON expressions of this scaler."
        },
        "cron": {
          "type": "string",
          "description": "CRON expression for this scaling rule."
        },
        "minReplicas": {
          "type": "integer",
          "description": "Lower limit for the number of replicas."
        },
        "maxReplicas": {
          "type": "integer",
          "description": "Upper limit for the number of replicas."
        }
      }
    }

Note that in OAM, the properties of a trait spec can be described with JSON schema.

The Trait spec is decoupled from its implementation by design. This is helpful considering there could be dozens of implementations for a specific capability in K8s. A trait provides a unified description that helps application operators understand and use the capability accurately.

Manageable Capabilities

Take this sample ApplicationConfiguration as an example:

kind: ApplicationConfiguration
metadata:
  name: failed-example
spec:
  components:
    - name: nginx-replicated-v1
      instanceName: example-app
      traits:
        - name: auto-scaler
          properties:
            minimum: 1
            maximum: 9
        - name: cron-scaler
          properties:
            timezone: "America/Los_Angeles"
            schedule: "0 0 6 * * ?"
            cpu: 50

In OAM, the ApplicationConfiguration controller is required to determine trait compatibility and to fail the operation if the combination cannot be satisfied. Upon submitting the above YAML to Kubernetes, the controller will report a failure due to “conflicts between traits.” Application operators are thus notified of the conflict beforehand and will not face any surprises caused by conflicting traits afterward.

Overall, instead of providing lengthy maintenance specifications and operating guidelines, which still cannot prevent application operators from making mistakes, we use OAM traits to expose discoverable and manageable capabilities on top of Kubernetes. This allows our application operators to “fail fast” and have the confidence to assemble capabilities into conflict-free operational solutions, as simply as playing with Lego bricks.

Interactions Between Application Operators and Application Developers

Let’s first look at a simple deployment YAML file:

kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: nginx-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      deploy: example
  template:
    metadata:
      labels:
        deploy: example
    spec:
      containers:
        - name: nginx
          image: nginx:1.7.9
          securityContext:
            allowPrivilegeEscalation: false

In our clusters, it’s the application operator who works cooperatively with the developer to prepare this YAML. This cooperation is time-consuming and not easy, but necessary. Why?

Sorry, Not My Concern

For example, how many developers know allowPrivilegeEscalation?

While not known by many developers, it is utterly important to set this field to false, to ensure the application has only the proper privileges on the host. Typically, application operators configure this field. But in practice, fields like this end up becoming “guessing games,” or they may even be completely ignored by developers. As a result, this can cause trouble if application operators do not validate those fields.

Who is the Real Owner?

Another source of confusion is ownership of fields that are controlled by multiple parties. For example, once an autoscaler takes over an application, the replicas value in the spec no longer reflects the actual replica count. In this case, the workload spec cannot represent the workload’s final state, which can be very confusing from the developer’s perspective. We once attempted to use fieldManager to deal with this issue, but the process of resolving such conflicts is still challenging, because it’s hard to figure out the intention of the other modifier.

Is “Clear Cut” the Solution?

A straightforward solution is to draw a “clear boundary” between the developers and operators. For example, we can allow developers to set only part of the deployment YAML (this is exactly what our PaaS was once doing). But before applying such “clear cut” solutions, we may want to consider other scenarios.

Developers’ Voices Should be Heard

In fact, a “clear cut” application management process makes it even harder to express developers’ operational opinions. There are many examples where a developer might want to convey that their application:

  • Cannot be scaled (i.e. singleton)
  • Is a batch job, not a long running service
  • Requires highest level security, etc.

All these requests are valid, because the developer, the author of the application, best understands his or her application. This raises a fundamental problem which we seek to resolve: Is it possible to provide separated API subsets for application developers and operators, while allowing developers to claim operational requirements efficiently?

OAM’s Component and ApplicationConfiguration

Define the Application; Don’t Just Describe It.

Here’s a sample of the Component defined by developer for an Nginx deployment:
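The sample below is a minimal sketch, assuming the shape of the OAM v1alpha1 ComponentSchematic; the CONN environment variable, the connections parameter, and its 1024 default follow the fields discussed further down, while the exact apiVersion and metadata are illustrative:

```yaml
apiVersion: core.oam.dev/v1alpha1
kind: ComponentSchematic
metadata:
  name: nginx-replicated-v1
spec:
  workloadType: core.oam.dev/v1alpha1.Server   # "how to run": replicable, long-running service
  containers:
    - name: server
      image: nginx:1.7.9                       # "what to run"
      env:
        - name: CONN
          fromParam: connections               # value comes from the overwritable parameter below
  parameters:
    - name: connections
      type: number
      default: 1024                            # current default, overwritable by operators
```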

A component in OAM is composed of three parts:

  1. Workload description — how to run this component
  2. Component description — what to run, e.g., container image, workload type
  3. A list of overwritable parameters which are expressed as schemas

First of all, in the Component spec the description of “how to run” is fully decoupled from “what to run.” This decoupling makes the workloadType field a straightforward way for developers to convey to operators how their application should be run. Among these types, core workload types are pre-defined in OAM to cover the typical patterns of cloud native applications:

  1. Does this component expose service endpoint or not?
  2. Is this component replicable or not?
  3. Is this component long-running or one-time (is daemonized or not)?

That being said, an implementation of OAM is free to define its own workload types as Extended Workloads. At Alibaba, we rely heavily on extended workloads to enable developers to define cloud-service-based components such as Functions.

Secondly, let’s go into some details about “overwritable parameters”:

  • parameters: the parameters in this list are allowed to be overwritten by operators (or by the system); schemas are defined for operators to follow when overwriting.
  • fromParam: indicates that the value of CONN actually comes from the parameter named connections in the parameters list, i.e. this value can be overwritten by operators.
  • The current default value of connections is 1024.

Overwritable parameters in a Component are another way for the developer to declare to operators (or to the system) “which parts of my app definition are overridable.”

Note that in the above example, the developer does not need to set replicas anymore; it's essentially not his concern, and he will let HPA or application operator fully control the replica number.

Overall, a Component allows developers to define an application specification with their own API set while conveying opinions and information to operators accurately. This information includes both operational opinions, such as “how to run this application,” and overwritable parameters, as shown above.

In addition to these operational hints, a developer could have many more types of requirements to claim in an application definition, and an operational capability, a.k.a. a trait, should have a corresponding way to claim that it matches the given requirements. Hence we are actively working on “policies” in Component, so a developer can say “my component requires some traits that satisfy this policy,” and a trait can list all of the policies it supports.
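Since this is still work in progress, any concrete syntax is speculative; the following is a purely hypothetical sketch of the idea, with all field and policy names invented for illustration:

```yaml
# In the Component: the developer claims a requirement.
spec:
  policies:
    - name: high-security      # "my component requires traits satisfying this policy"

---
# In a Trait definition: the trait lists the policies it supports.
spec:
  supportedPolicies:
    - high-security
```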

The ApplicationConfiguration

The usage of Component and ApplicationConfiguration forms a practice of cooperative workflow:

  1. The platform provides various workload types.
  2. Developer defines component.yaml with selected workload type.
  3. Application operator (or the CI/CD system) then runs kubectl apply -f component.yaml to install this component.
  4. Application operator then defines ApplicationConfiguration with app-config.yaml to instantiate the application.
  5. Lastly, the application operator runs kubectl apply -f app-config.yaml to trigger the deployment of the whole application.

A sample of app-config.yaml is below:

kind: ApplicationConfiguration
metadata:
  name: my-awesome-app
spec:
  components:
    - componentName: nginx
      instanceName: web-front-end
      parameterValues:
        - name: connections
          value: 4096
      traits:
        - name: auto-scaler
          properties:
            minimum: 3
            maximum: 10
        - name: security-policy
          properties:
            allowPrivilegeEscalation: false

Let’s highlight several details from the above ApplicationConfiguration YAML:

  1. parameterValues - Used by the operator to overwrite the connections value to 4096, which was 1024 initially in the component.
  2. Note that the operator has to fill in the integer 4096 instead of the string "4096", because the schema of this field is clearly defined in the Component.
  3. Trait auto-scaler - Used by the operator to apply autoscaler trait (e.g. HPA) to the component. Hence, its replica number will be fully controlled by autoscaler.
  4. Trait security-policy - Used by the operator to apply the security policy rules to the component.

Note that the operator could also append more traits to the traits list as long as they are available. For example, a “Canary Deployment Trait” would make the application follow the canary rollout strategy during a later upgrade.
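As a hypothetical illustration (the canary trait name and its properties are our invention, not defined in the sample above), the component’s traits list could grow like this:

```yaml
traits:
  - name: auto-scaler
    properties:
      minimum: 3
      maximum: 10
  - name: canary              # hypothetical canary rollout trait
    properties:
      stepWeight: 10          # shift 10% of traffic per step (illustrative)
      interval: 1m
```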

Essentially, an ApplicationConfiguration is how the application operator (or the system) consumes the information conveyed by the developer and then assembles operational capabilities to achieve the final operational intention.

Beyond Application Management

Looking back, the two fundamental problems we have been solving are:

  1. How to build discoverable, composable, and manageable platform capabilities;
  2. How to enable multiple parties to work accurately and efficiently on the same platform, using the same API set.

In this context, OAM is the CRD-based specification with which Alibaba’s Kubernetes team defines applications, as well as their operational capabilities, in a standard and structured way.

Another strong motivation for us to develop OAM is software distribution across hybrid cloud and multiple environments. With the emergence of Google Anthos and Microsoft Azure Arc, we see a trend of Kubernetes becoming the new Android, with the value of the cloud native ecosystem moving to the application layer. We will talk about this in a future article.

Real world use cases in this article are contributed by Alibaba Cloud Native Team and Ant Financial.

The Future of OAM

We look forward to working with the community on the OAM spec as well as its K8s implementation. OAM is a neutral open source project, and all its contributors operate under a CLA from a non-profit foundation.

About the Authors

Xiang Li is Senior Staff Engineer of Alibaba. He works on Alibaba’s cluster management system and helps with Kubernetes adoption for the entire Alibaba group. Prior to Alibaba, Xiang led the Kubernetes upstream team at CoreOS. He is also the creator of etcd and Kubernetes operator pattern.

Lei Zhang is a Staff Engineer at Alibaba and a co-maintainer of the Kubernetes project. Lei is now working on engineering efforts at Alibaba, including its Kubernetes and cloud native application management systems.
