The Sound and the Fury of Serverless
By Xu Xiaobin, Senior Middleware Expert at Alibaba and the author of Maven in Practice. He worked on the evolution of the AliExpress microservice architecture. Currently, he is responsible for the development and implementation of serverless technology in the Alibaba Group.
Since the release of Lambda on Amazon Web Services (AWS) in 2014, developers and cloud vendors around the world have been enthusiastic about serverless computing. If you do not want to spend much effort developing an application and deploying it on a server, a simpler architecture model may meet your needs. This is the promise of the serverless architecture, one of the most popular ideas in today’s software architecture industry. In this article, the author describes the development of the serverless industry in detail based on his years of research and development experience.
The Sound and the Fury is a novel by William Faulkner. It describes the tragedy of a family over three generations through the stream of consciousness of different family members. The interesting part of this novel is that different people see the same thing in different ways.
This is also true for the serverless industry, so I will start my analysis from this point.
Serverless is not Straightforward
Similar to big data and AI, serverless is not straightforward. Everyone talks about it, nobody really knows how to do it, and everyone thinks everyone else is doing it, so everyone claims they are doing it.
Like many other concepts, serverless is not precisely defined and has no de facto standards. What is a de facto standard? Kubernetes is one. For Java programmers, Spring Boot and Spring Cloud are de facto standards.
A de facto standard is an idea or a methodology that has been widely implemented and adopted in the market. Implementation generally means the following two things:
- The solution is open-source. Therefore, it is used by everyone without the need to worry about being tied to a specific vendor.
- There are a lot of success stories. Many people have used it in key commercial systems, so it has been widely verified.
Are there de facto standards in the serverless and Function as a Service (FaaS) fields today? Not yet.
The following chart is from Google Trends. The red line indicates microservices, while the blue line indicates serverless applications.
Since the release of Lambda on AWS in 2014, developers and cloud vendors around the world have been enthusiastic about serverless applications and confident in the serverless vision. What is this vision?
The vision is an architecture without servers. However, engineers all know that servers really exist. Even if the servers are abstracted and invisible to us, they still perform their role.
Personally, I think the clearest description of the serverless vision is a metaphor from a paper published by UC Berkeley in February 2019:
In short, the way we operate cloud resources today is similar to the way a programmer wrote code in assembly languages decades ago.
If you haven’t learned assembly languages or have forgotten assembly languages, the following is a snapshot I took from a book:
Do the registers, stacks, program counters, and assembly instructions in the image look unfamiliar? If you had to write business logic in this language, your productivity would be very low.
Today, although the basic computer architecture has not essentially changed, the program runtime environment has fundamentally changed compared to that of 20 years ago. Twenty years ago, most programs ran on a single machine. Now, all our programs have to be designed to run on the cloud.
For this purpose, we need to perform supporting tasks, including applying for and recycling cloud resources (containers, caches, and queues) and controlling auto scaling. These tasks have nothing to do with business logic, but the R&D and O&M personnel must dedicate a great deal of time to them.
The following is an analogy:
- In the standalone era, the operating system (OS) managed hardware resources at the resource layer. High-level languages allowed programmers to describe businesses at the business layer. Compilers and VMs translated high-level languages into machine code and transmitted it to the OS.
- In the cloud era, containers, distributed queues, distributed caches, and distributed file systems are used instead of individual CPUs, memory, and hard disks.
The role of the cloud OS has been basically taken over by the Kubernetes ecosystem, but this is not the case for cloud compilers, VMs, development languages, and frameworks.
Today, when we migrate applications to the cloud (making them cloud-native), we often do two things:
- The first is to divide large applications into microservices.
- The second is to write many YAML files to manage cloud resources.
In essence, we are migrating applications developed based on the standalone architecture to the cloud. I think there are two huge gaps in this process, which are shown as gray boxes in the above figure:
1) Programming Languages and Frameworks
Currently, mainstream programming languages are basically designed for applications that run on a single machine: to work with distributed systems, you have to bolt on a framework, and the resources the language manages are still standalone resources such as CPU and memory. There are exceptions, such as Erlang/OTP, which was designed for distributed systems from the very beginning.
In the cloud era, the basic units of resources have changed from CPU and memory to containers, functions, and distributed queues. In addition, cloud-native systems are inherently distributed. Therefore, the synchronization model that was widely used in the standalone era is no longer suitable.
Programmers should not spend too much time on compiling YAML files. Instead, these resource-oriented YAML files should be generated by machines, which I call cloud compilers. High-level programming languages are used to express the domain model and logic of the business. Cloud compilers are used to compile the languages into resource descriptions.
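To make the “cloud compiler” idea concrete, here is a minimal sketch of the translation step it implies: business code states what it needs in a high-level form, and a tool emits the resource-oriented YAML on the programmer’s behalf. Every name here (`CloudCompiler`, `QueueSpec`, the `example.dev/v1` API group) is hypothetical and only illustrates the idea, not any real product.

```java
// A toy "cloud compiler": business code declares what it needs,
// and the compiler emits resource-oriented YAML for it.
public class CloudCompiler {
    // High-level declaration: "my service needs a queue with this capacity".
    record QueueSpec(String name, int maxLengthBytes) {}

    // Translate the declaration into a resource description that a
    // Kubernetes-style control plane could consume.
    static String compile(QueueSpec spec) {
        return String.join("\n",
            "apiVersion: example.dev/v1",   // hypothetical API group
            "kind: Queue",
            "metadata:",
            "  name: " + spec.name(),
            "spec:",
            "  maxLengthBytes: " + spec.maxLengthBytes());
    }

    public static void main(String[] args) {
        System.out.println(compile(new QueueSpec("order-events", 1048576)));
    }
}
```

The point is the division of labor: the programmer only ever touches `QueueSpec`; the YAML is a compilation artifact, just as machine code is today.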
Personally, I am very optimistic about Erlang’s Actor model. The model is also implemented elsewhere: in Elixir, which has a Ruby-like syntax and runs on the Erlang VM (OTP), as well as in Akka on the JVM and Orleans on .NET.
Unlike models retrofitted onto other languages, the Actor model was designed for distributed systems from the beginning. Therefore, I think it is feasible to replace the resources underlying this model with pure cloud resources.
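The essence of the Actor model is easy to show in a few lines of plain Java: each actor owns a mailbox drained by a single thread, so its state is never shared, and all interaction happens through messages. This is an illustrative sketch of the idea only, not the API of Akka, Orleans, or Erlang; `TinyActor` is a made-up name.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Minimal actor: a mailbox drained by one dedicated thread, so state
// is only ever touched by that thread and all interaction is message
// passing. Distribution-friendly by construction: a remote actor just
// needs its mailbox fed over the network instead of in-process.
public class TinyActor<M> {
    private final BlockingQueue<M> mailbox = new LinkedBlockingQueue<>();
    private final Thread loop;

    public TinyActor(Consumer<M> behavior) {
        loop = new Thread(() -> {
            try {
                while (true) behavior.accept(mailbox.take());
            } catch (InterruptedException e) {
                // actor stopped
            }
        });
        loop.setDaemon(true);
        loop.start();
    }

    // Asynchronous, fire-and-forget send -- the only way in.
    public void tell(M message) { mailbox.add(message); }
}
```

Because the mailbox is the only coupling point, swapping the in-process queue for a cloud queue changes the actor’s deployment, not its code, which is exactly why the model maps so naturally onto cloud resources.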
In short, I think the serverless vision can be expressed as follows:
Write locally, compile to the cloud.
What Is Everyone Doing?
In addition to the vision of serverless, we should also look at what others are doing.
To describe relevant products and technologies clearly, let’s divide the tasks performed in the serverless field into three layers from bottom to top: the resource layer, the DevOps layer, and the framework and runtime layer.
Tasks at the resource layer focus on the lifecycle management and security isolation of resources such as containers. In the Kubernetes ecosystem, projects such as Firecracker and gVisor implement lightweight security sandboxes. This layer focuses on how to provision resources faster and ensure security.
At the DevOps layer, the focus is on change management, traffic management, auto scaling, and interconnectivity with event-based models and the cloud ecosystem. The core aim of this layer is to make O&M unnecessary (NoOps). Although all cloud vendors have their own products (various FaaS offerings), I personally prefer Knative, an open-source product, for two reasons:
- Firstly, it provides complete models.
- Secondly, its ecosystem is developing in a rapid, healthy manner. The products of all cloud vendors will probably have to be compatible with Knative in the future, just as they are all compatible with Kubernetes today.
The following figure shows the increase in Knative contributors and contributions over the past year. The data is sourced from the speech “Knative a Year Later: Serverless, Kubernetes and You.”
As for the framework and runtime layer, given my personal experience, I am only concerned with the Java field. Much of the work at this layer aims to address the slow startup of Java applications (for example, through GraalVM). Of course, it is also very important to avoid depending on the framework of a particular vendor. We are all wary of committing to a single cloud vendor’s products, because we would have to modify our code when migrating to another cloud. Spring Cloud Function focuses on solving this problem.
What Are Users’ Immediate Needs?
A successful product must have its own core competencies, which often involve the ability to solve a problem that has not been solved by other products. I call these types of problems users’ immediate needs. Therefore, what immediate needs can serverless applications meet? Let’s first take a quick look at the users themselves:
Many technical products have gone through the following four stages of development:
Startup Stage: A small team determines the appropriate technology to quickly launch a new business in a trial-and-error manner.
At this stage, the team is very small with two or three members, and all the code is placed in one application with no need for distribution or isolation.
Mature Stage: The business has achieved some initial success, the number of users is increasing, and the business is becoming increasingly complex.
The team now has dozens to hundreds of members. All team members are in the same department, with ample mutual trust and easy communication. Since a single application can no longer meet the need for collaboration, architects begin to split it into multiple systems, isolating services at the process level.
Platform Stage: The business is highly successful, and the team hopes to apply the capabilities they have built up to other similar businesses.
Compared with the mature stage, the platform stage sees some new changes. Firstly, the number of developers increases, often reaching into the hundreds or even thousands. Secondly, most of the developers are no longer members of the core product team. They are often from different departments, resulting in less mutual trust and communication between developers.
The core team can no longer directly organize and oversee developers from other departments. Therefore, technical isolation is prioritized to prevent platform-wide failures caused by the mistake of a single developer.
When isolation measures are taken, costs become a concern. When hundreds of plug-ins on the platform and the platform itself run in the same process, resources are naturally reused and we just need to roughly estimate the required resources. However, when hundreds of plug-ins are isolated and run in independent containers, we have to deploy additional scheduling systems to control and optimize the required resources.
Cloud Product Stage: This stage begins when a platform becomes so successful that the team wants to develop it into a cloud product to support similar businesses in the market.
Isolation is an important but not always necessary requirement at the platform stage (isolation is not really implemented on many platforms), but products at the cloud product stage must be highly isolated.
The more significant reason for isolation at the platform stage is stability (avoiding platform-wide failures), while the most important reason for isolation at the cloud product stage is security.
As shown in the figure, the product developers are no longer in the same organization as the product team, and some developers may even seek to harm the organization that owns the cloud product. Therefore, isolation must be implemented at the level of containers, VMs, and networks.
As technical products develop and the number of developers increases, the core team’s control over these developers decreases, and the communication and trust between the developers decrease. This leads to ever-increasing risks to stability and security, requiring continuously enhanced isolation. With the introduction of isolation measures and the continuous growth of required resources, costs become a concern. To better allocate resources and reduce costs, scheduling is required.
Therefore, for products at the platform stage and cloud product stage, technical isolation and scheduling capabilities are immediate needs.
Framework and Runtime Innovation
All the preceding discussion has focused on stability, security, and resource costs. Now, we need to discuss another topic: development efficiency. Technology-specific development efficiency is reflected in the framework.
We can further divide the frameworks into two categories:
1) Frameworks that are designed to solve technical problems and improve the development efficiency: For example, Spring deals with object assembly through dependency injection, HSF deals with synchronous communication between distributed systems, RocketMQ deals with asynchronous communication between distributed systems, and Hystrix deals with network unreliability caused by distributed communication. With these frameworks, the inherent complexity of the technology is largely shielded from upper-layer systems.
2) Frameworks that are designed to solve business problems and improve the development efficiency: Many Alibaba business platform teams develop business frameworks based on their own scenarios, such as transactions, stores, and supply chains. This allows for fast development and business iteration.
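As a concrete example of the first category, dependency injection can be sketched in a few lines: components declare what they need, and a container does the assembly, so business code never constructs its own collaborators. This is only a toy illustration of the pattern Spring implements far more completely (via annotations and classpath scanning); `TinyContainer` is a made-up name, not any real framework’s API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// A toy dependency-injection container: factories receive the
// container itself so they can look up their own dependencies,
// and business code asks for types instead of calling `new`.
public class TinyContainer {
    private final Map<Class<?>, Object> beans = new HashMap<>();

    // Register a component; its factory may pull dependencies
    // from the container (so dependencies register first).
    public <T> void register(Class<T> type, Function<TinyContainer, T> factory) {
        beans.put(type, factory.apply(this));
    }

    public <T> T get(Class<T> type) { return type.cast(beans.get(type)); }
}
```

The "dependencies register first" limitation is exactly the kind of accidental complexity a real framework shields you from, which is the point of the paragraph above.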
Generally, frameworks oriented to a technical problem are developed by a team, while business problem-oriented frameworks are provided by various business platform teams. This again proves the correctness of Conway’s law. Conway’s law is just the formal statement of the old adage “the tail wags the dog”. Technical teams are not willing to solve business problems, and business platform teams are not as adept in solving technical problems. As a result, the two frameworks suffer from a serious disconnect.
You may have heard a story like the following:
An evil dragon required the village to sacrifice a virgin every year. Every year, a young hero from the village would fight the evil dragon, but no one ever survived. When yet another hero set out to fight the dragon, someone followed him. The dragon cave was full of gold and silver treasures. The hero killed the dragon with his sword and then sat down on the dragon’s body, looking at the shining jewels. The hero slowly grew scales, a tail, and claws, and finally turned into an evil dragon.
This is a fairy tale, but it reflects a real process we see in many areas of life, including the mainstream frameworks used in some large and medium-sized R&D organizations. These frameworks have been vital to the development of those organizations. However, even as cloud businesses grow and these organizations build cloud-native applications and cloud-based business systems, their frameworks still lock users into a particular language (such as Java) and require platform logic to be integrated into user applications before any business-oriented design can begin. Some frameworks even require that user code be deployed into the platform’s own large applications.
These limitations have helped business developers realize business objectives and create business value in a short period of time, but they do not encourage framework innovation in the long run. Such organizations are accustomed to waiting for the proper team to solve problems, while the members of this “proper team” may not be aware of these problems.
Frameworks that focus on the short-term business value within a specific organization will not provide many advantages when migrated to the cloud and the community, where frameworks are expected to address universal needs.
Traditional frameworks and runtime environments only manage standalone resources. When everyone is building their own businesses based on cloud-based services, frameworks and runtime environments need to manage cloud resources instead of standalone resources.
Many such products are already available in this industry, including Terraform and Pulumi, but I think they are not enough. I would describe an ideal cloud-native framework as follows:
- It should free developers from managing cloud resources. Developers do not like to write YAML files in the same way as an assembly language. Therefore, the framework must support features such as resource allocation, recycling, and orchestration.
- The framework should be purely asynchronous and event-driven. This is dictated by the distributed nature of cloud-native systems. If the programming paradigm remains the synchronous model of the standalone era, such a framework cannot be built.
- It should not tie users to a specific vendor. Only a vendor-neutral development framework can be widely used. The framework can define application programming interfaces (APIs) for which specific vendors provide relevant drivers.
- In addition, the framework should provide the necessary programming paradigm for cloud resource management and large-scale software development. The term “programming paradigm” may not be the perfect description, but I can’t think of a better term. Object-oriented design is the most popular programming paradigm, and Spring is built around this programming paradigm. Solving two problems with one framework will give developers an excellent experience.
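Two of these properties, vendor neutrality and a purely asynchronous surface, can be sketched together in plain Java: the framework defines an API, vendors supply drivers behind it, and every call returns a future instead of blocking. All names here (`CloudQueue`, `InMemoryQueueDriver`) are hypothetical; no real framework is being described.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;

// Vendor-neutral API defined by the framework: purely asynchronous,
// so no call ever blocks the caller's thread.
interface CloudQueue {
    CompletableFuture<Void> publish(String message);
}

// An in-memory driver for local development; a vendor's driver would
// talk to its managed queue service behind the same interface, so
// switching clouds means swapping drivers, not rewriting code.
class InMemoryQueueDriver implements CloudQueue {
    final ConcurrentLinkedQueue<String> store = new ConcurrentLinkedQueue<>();

    public CompletableFuture<Void> publish(String message) {
        return CompletableFuture.runAsync(() -> store.add(message));
    }
}
```

Business code depends only on `CloudQueue` and composes futures (`thenCompose`, `thenAccept`) rather than waiting, which is the asynchronous, event-driven shape argued for above.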
Serverless seems extremely elegant but is very complicated to implement. This complexity is due to the wide range of engineering technologies involved, the great diversity in user expectations, and the great differences in people’s ideas of the future.
As I further explore this field with my team, I will need to constantly revisit what I have previously learned and thought, so I plan for this article to be the first in a series that carries on this discussion.