Getting Started with Service Mesh: Origin, Development, and Current Status
Although service mesh is no longer a new concept, it is still a popular area of exploration. This article describes the core concepts of service mesh, the reasons why we need it, and its main implementations. I hope this article can help you better understand the underlying mechanism and trends of service mesh.
With the advent of the cloud-native era, the microservices architecture and the container deployment model have become increasingly popular, gradually evolving from buzzwords into the technological benchmarks for modern IT enterprises. The giant monolithic application that was once taken for granted is carefully split into small but independent microservices by architects, then packaged into Docker images with their own dependencies by engineers through container deployment, and finally deployed and run in the well-known Kubernetes system through the mysterious DevOps pipeline.
This all sounds easy enough. But there is no such thing as a free lunch. Every coin has two sides, and microservices are no exception.
- In the past, only a monolithic application needed to be deployed and managed. Now, it has been split into several parts, causing O&M costs to rise exponentially.
- Originally, modules could interact directly through calls within the application (inter-process communication). Now, they are split into different processes or even different nodes and can only communicate through remote procedure calls (RPCs).
Does this mean microservices only look good on the outside, while causing headaches for developers and operators? This is obviously not the case. Those “lazy” programmers have a lot of tricks to overcome these difficulties. For example, the DevOps and containerization advocated by cloud native are an almost perfect solution to the first problem. Multiple applications can be integrated, built, and deployed quickly through the automated CI/CD pipeline. In addition, resource scheduling and O&M are facilitated through Docker images and Kubernetes orchestration. As for the second problem, let’s look at how service mesh ensures communication between microservices.
What Is Service Mesh?
The Birth of Service Mesh
From concept to implementation? No, from implementation to concept.
On September 29, 2016, just before a holiday, at an internal sharing session on microservices at Buoyant, “Service Mesh”, the buzzword that would dominate the cloud-native field over the next few years, was coined.
Naming is really an art. The step from “microservices” to “service mesh” links past and future and lets development take its course. From the name alone, you can tell what it does: it connects the various service nodes of a microservices system through a mesh. In this way, microservices that were split into small parts are closely connected again by the service mesh. Though separated into different processes, they are as tightly connected as they were inside the monolithic application, which makes communication easier.
Unlike most concepts, this one had substance before it had a name. Linkerd, the first service mesh implementation, was initially released on January 15, 2016. It joined the Cloud Native Computing Foundation (CNCF) on January 23, 2017, and Linkerd 1.0 was released on April 25 of the same year. For Buoyant, this may have been a small step, but it was a big step towards maturity in the cloud-native field. Today, the concept of service mesh has taken root, with a variety of production-level implementations and large-scale practices. But we must remember the hero behind all of this: William Morgan, the CEO of Buoyant and a pioneer of the service mesh, and his definition of and ideas about it. His thoughts are presented in the article: What is a service mesh? And why do I need one?
Service mesh can be defined in one sentence as follows:
A service mesh is a dedicated infrastructure layer for handling service-to-service communication. It is responsible for the reliable delivery of requests through the complex topology of services that comprise a modern and cloud native application. In practice, the service mesh is typically implemented as an array of lightweight network proxies that are deployed alongside application code, without the application needing to be aware.
This is how the CEO of Buoyant defined the service mesh. Let’s look at some key phrases.
- Dedicated infrastructure layer: A service mesh is not designed to solve business issues, but is a dedicated infrastructure layer (middleware).
- Service-to-service communication: A service mesh is designed to handle service-to-service communication.
- Reliable delivery of requests: Why is special processing necessary for service-to-service communication? A service mesh aims to ensure the reliable delivery of requests between services even when the network is unreliable.
- Cloud native application: From the beginning, service mesh was created for modern cloud native applications and targeted future technology development trends.
- Network proxies: Typically, a service mesh is implemented as an array of lightweight network proxies, without the awareness of the application.
- Deployed alongside application code: The network proxies must be deployed alongside the application, one proxy per instance. If communication between the application and its proxy were itself remote and unreliable, the original problem would simply be reintroduced.
The Pattern of Service Mesh
The Chinese have a saying: “Building roads is the first step to becoming rich.” However, in today’s world, the roads we build must be for cars, not the horses of the past.
The figure on the left shows the deployment of Linkerd. On the host or pod where each microservice is located, a Linkerd proxy is deployed for RPCs between microservice application instances. The application is not aware of any of this. It initiates its own RPCs as usual and does not need to know the address of the peer server because the service discovery is handled by the proxy node.
On the right is a higher-dimension abstract figure to help us better understand the logic of a service mesh. Imagine that this is a large-scale microservices cluster at the production level, where hundreds of service instances and corresponding proxy nodes of the service mesh are deployed. Service-to-service communication relies on these dense proxy nodes, which together form the pattern of a modern traffic grid.
Why Do We Need Service Mesh?
The Rise of Microservices
The chaos after the Big Bang.
Most of us grew up in the era of the monolithic application. Monolithic means that all components are packed into the same application. Therefore, these components are naturally and closely linked. They are developed based on the same technology stack, access shared databases, and support joint deployment, O&M, and scaling. Moreover, the communication among these components tends to be frequent and coupled, taking the form of function calls. There was nothing wrong with this architecture. After all, the software systems of that time were relatively simple. A monolithic application with 20,000 lines of code could easily handle all business scenarios.
Those long divided, must unite; those long united, must divide. Thus it has ever been. The inherent limitations of monolithic applications began to be exposed as modern software systems grew more complex and collaborators multiplied. Just like the singularity before the Big Bang, monolithic applications expanded at an accelerating rate until, a few years ago, they reached the critical point and exploded with a “bang”. In this way, the advent of microservices has made software development “small but beautiful” again.
- Single responsibility: After an application is split up, a single microservice is usually only responsible for a single highly cohesive and self-contained function. Therefore, it is easy to develop, understand, and maintain.
- Flexible architecture: Different microservice applications are basically independent in terms of technology selection, allowing each to select the most suitable technology stack.
- Isolated deployment: In contrast to a giant monolithic application, an individual microservice application has significantly less code and output, facilitating continuous integration and rapid deployment. At the same time, through process-level isolation, microservice applications can continue to function if a peer application encounters a fault. This makes them more fault tolerant than monolithic applications.
- Independent expansion: In the era of monolithic applications, if a module has a resource bottleneck (such as CPU or memory), it could only be expanded by expanding the entire application, resulting in significant resource waste. In the era of microservices, we can expand individual microservices as needed for more precise scaling.
However, microservices are not a solution to every problem. Although the Big Bang ended the hegemony of monolithic applications, the era that followed was a chaotic period in which various architectures coexisted and competed. Developers from the era of monolithic applications were forced to adapt to the changes brought about by microservices. The biggest change was in service-to-service communication.
How Can We Find Service Providers?
Microservice communication must be implemented through RPCs (HTTP and REST are essentially RPC protocols). When one application needs to consume the services of another, it cannot obtain a service instance through a simple in-process mechanism (such as Spring dependency injection) as in a monolithic application. It does not even know whether such a service provider exists.
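To make this concrete, here is a toy sketch in Python of what a service registry lookup does. The service names and addresses are invented for illustration; a real system would query a registry such as Consul, ZooKeeper, or the Kubernetes API rather than an in-process dictionary.

```python
# Toy registry mapping service names to live instance addresses.
# In production this data lives in an external registry, not in memory.
registry = {
    "user-service": ["10.0.0.5:8080", "10.0.0.6:8080"],
}

def discover(service_name):
    """Return the known instances of a service, or fail loudly if
    no such provider has ever registered itself."""
    instances = registry.get(service_name)
    if not instances:
        raise LookupError(f"no provider registered for {service_name!r}")
    return instances

print(discover("user-service"))  # ['10.0.0.5:8080', '10.0.0.6:8080']
```

Everything else in this section (reliability, latency, security) builds on top of this basic question of who to talk to.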
How Can We Ensure the Reliability of RPCs?
RPCs must travel over the IP network. As we all know, the network (compared with computing and storage) is the least reliable thing in the world. Despite the guarantees of the Transmission Control Protocol (TCP), packet loss, switch failures, and even severed cables occur all the time. And even if the network connection is healthy, what happens if the remote machine goes down or its process is overloaded?
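Much of this unreliability is absorbed by client-side resilience logic, which a mesh proxy later takes over. As a rough Python sketch, with a simulated `flaky_rpc` standing in for a real network call:

```python
import random
import time

def call_with_retries(rpc, attempts=3, base_delay=0.05):
    """Invoke an unreliable zero-argument callable, retrying with
    exponential backoff and jitter on connection errors."""
    for attempt in range(attempts):
        try:
            return rpc()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure
            # Back off exponentially, with jitter to avoid retry storms.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

# Simulated RPC that drops the first two "packets", then succeeds.
calls = {"count": 0}
def flaky_rpc():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated packet loss")
    return "pong"

print(call_with_retries(flaky_rpc))  # pong
```

Note that blind retries are only safe when the request is idempotent, a point the Linkerd discussion below returns to.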
How Can We Reduce the Service Call Latency?
Network communication suffers from latency in addition to being unreliable. Microservice applications in the same system are usually deployed together, so the latency of any single call within the same data center is low. However, on complex service chains it is very common for one user request to involve dozens of RPCs, and the accumulated latency becomes severe.
How Can We Ensure the Security of Service Calls?
In addition to unreliability and latency, network communication is not secure. In the Internet age, you never know who you are actually communicating with. Similarly, if you simply adopt a bare communication protocol for service-to-service communication, you will never know whether the peer is authentic or if communication is monitored by a man-in-the-middle.
Service Communication: Stone Age
Chairman Mao once said, “If you work with your own hands, you will have enough food and clothing.”
To adopt the microservice architecture described above, the earliest engineers each had to reinvent these wheels by hand:
- Service discovery: helps you find services to call.
- Circuit breaker: mitigates unreliable dependencies between services.
- Load balancing: delivers requests more promptly by evenly distributing traffic.
- Secure communication: involves transport layer security (TLS), identification (certificates and signatures), and role-based access control (RBAC).
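As an illustration of one of these hand-built wheels, here is a minimal circuit-breaker sketch in Python. It is a deliberate simplification: real breakers also use half-open states and time-based resets, and the threshold here is invented for illustration.

```python
class CircuitBreaker:
    """After `threshold` consecutive failures, reject calls immediately
    instead of hammering a dependency that is probably down."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    def call(self, fn):
        if self.failures >= self.threshold:
            raise RuntimeError("circuit open: dependency assumed down")
        try:
            result = fn()
        except Exception:
            self.failures += 1  # count consecutive failures
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

A production breaker would also re-probe the dependency after a cool-down period (the “half-open” state) rather than staying open forever; this sketch only shows the fail-fast idea.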
Programmers are used to writing code to solve problems. This is their job. But, where does all their time go?
- Reinventing the wheel: How can we concentrate on business innovation if we need to write and maintain a large amount of non-functional code?
- Business coupling: Service communication logic and business code logic are mixed together, resulting in some strange distributed bugs.
Service Communication: Modern Times
Sharing and reuse is key.
The more conscientious engineers could not sit still. This, they felt, violated the principle of sharing and reuse; how could anyone face the pioneers of GNU? As a result, a variety of high-quality, standardized, and universal products were developed, including Apache Dubbo, Spring Cloud, Netflix OSS, and gRPC.
These reusable class libraries and frameworks have greatly improved quality and efficiency, but are they good enough? Not quite. These have the following problems:
- Not completely transparent: Programmers still need to understand and use these libraries correctly, so the learning cost and the probability of errors remain high.
- Limited technology selection: After adopting one of these technologies, an application is often bound to the corresponding language and framework (vendor lock-in).
- High maintenance costs: Whenever a library is upgraded, the application must be rebuilt and redeployed. This is annoying and can cause faults.
Service Communication: The New Age
Service Mesh is just a porter.
The service mesh solves all of the preceding problems. That sounds amazing; how is it done? In short, the service mesh strips these functions out of the class libraries and frameworks in the application by introducing the sidecar pattern and sinks them into the infrastructure layer. This follows the ideas of abstraction and layering in classic operating systems (applications do not need to implement a specific network protocol stack themselves) and the everything-as-a-service idea of modern cloud computing platforms, where capabilities are turned into managed services from the bottom up (IaaS -> CaaS -> PaaS -> SaaS).
The diagrams of service mesh evolution used above were taken from Service Mesh Pattern, which you can refer to for more information.
Mainstream Implementations of Service Mesh
Note: The following contents are collected for reference only. For further study, see the latest authoritative materials.
Overview of Mainstream Implementations
- Linkerd: This implementation was developed by Buoyant in Scala. On January 15, 2016, it was initially released and joined CNCF on January 23, 2017. On May 1, 2018, Linkerd 1.4.0 was released.
- Envoy: This implementation was developed by Lyft in C++ 11. On September 13, 2016, it was initially released and joined CNCF on September 14, 2017. On March 21, 2018, Envoy 1.6.0 was released.
- Istio: This implementation was developed by Google and IBM in Go. On May 10, 2017, it was initially released. On March 31, 2018, Istio 0.7.1 was released.
- Conduit: This implementation was also developed by Buoyant in Rust and Go. On December 5, 2017, it was initially released. On April 27, 2018, Conduit 0.4.1 was released.
The core component of Linkerd is a service proxy. Therefore, as long as we understand its request processing flow, we can master its core logic.
- Dynamic routing: The downstream target service can be determined based on the upstream service request parameters. In addition to conventional service routing policies, Linkerd can support canary release, A/B testing, environment isolation, and other such scenarios through its dynamic routing capabilities.
- Service discovery: After the target service is determined, the next step is to obtain the address list of the corresponding instances (such as by querying the service registry).
- Load balancing: If there are multiple addresses in the list, Linkerd selects an appropriate low-latency instance through a load balancing algorithm (such as Least Loaded or Peak EWMA).
- Request execution: Send a request to the instance selected in the preceding step and record the latency and response.
- Retry: If the request does not receive a response, select another instance to retry it (Linkerd must know that the request is idempotent).
- Circuit breaking: If requests sent to an instance often fail, the instance is automatically removed from the address list.
- Timeout: If the request times out (no result is returned before the given deadline), a failure response is returned automatically.
- Observability: Linkerd continuously collects and reports behavioral data, including Metrics and Tracing.
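The load-balancing step in the flow above can be sketched in Python as follows. This is a simplified stand-in for Peak EWMA, not Linkerd's actual algorithm (the real one also weights in-flight load), and the instance addresses are invented.

```python
class EwmaBalancer:
    """Pick the instance with the lowest exponentially weighted moving
    average of observed latency. New instances start at 0.0, so they
    get probed first; a real implementation handles this cold start
    more carefully."""

    def __init__(self, instances, alpha=0.3):
        self.alpha = alpha  # weight given to the newest observation
        self.ewma = {inst: 0.0 for inst in instances}

    def pick(self):
        # Route the next request to the instance that has been
        # responding fastest on average.
        return min(self.ewma, key=self.ewma.get)

    def observe(self, instance, latency_ms):
        # Fold the newest latency sample into the moving average.
        old = self.ewma[instance]
        self.ewma[instance] = (1 - self.alpha) * old + self.alpha * latency_ms

lb = EwmaBalancer(["10.0.0.5:8080", "10.0.0.6:8080"])
lb.observe("10.0.0.5:8080", 50.0)  # slow instance
lb.observe("10.0.0.6:8080", 5.0)   # fast instance
print(lb.pick())  # 10.0.0.6:8080
```

Because the average is exponentially weighted, a recently degraded instance quickly loses traffic, which dovetails with the circuit-breaking step above.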
Envoy is a type of high-performance service mesh software with the following features:
- High performance: Envoy is written in native code (C++ 11), making it faster than the Scala-based Linkerd.
- Scalable: Both L4 and L7 proxy functions are based on the pluggable Filter Chain mechanism (similar to Netfilter and servlet filter).
- Protocol upgrade: It supports transparent, bidirectional proxying between HTTP/1.1 and HTTP/2.
- Other features: service discovery (ensures eventual consistency), load balancing (supports region awareness), stability (supports retry, timeout, circuit breaking, speed limit, and anomaly detection), observability (supports statistics, logs, and tracing), and easy debugging.
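The Filter Chain idea can be sketched in a few lines of Python. The filters below are invented for illustration; Envoy's real filters are C++ classes operating on network and HTTP events, but the composition pattern is the same.

```python
def logging_filter(request, next_filter):
    # Record that the request passed through, then delegate.
    request.setdefault("trace", []).append("logged")
    return next_filter(request)

def auth_filter(request, next_filter):
    if not request.get("token"):
        return {"status": 401}  # short-circuit: never reaches the router
    return next_filter(request)

def build_chain(filters, terminal):
    """Compose filters right to left around the terminal handler,
    so the first filter in the list runs first."""
    handler = terminal
    for f in reversed(filters):
        # Bind f and the current handler explicitly to avoid the
        # late-binding pitfall of closures created in a loop.
        handler = (lambda flt, nxt: lambda req: flt(req, nxt))(f, handler)
    return handler

route = build_chain([logging_filter, auth_filter],
                    lambda req: {"status": 200})
print(route({"token": "abc"}))  # {'status': 200}
print(route({}))                # {'status': 401}
```

The appeal of this design is that proxy behavior is extended by adding filters, not by modifying the core; Istio's Mixer, discussed below, hooks into exactly this kind of expansion point.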
Istio is a complete service mesh suite that separates the control plane and data plane, and consists of the following components.
- Envoy: forms the data plane (other components form the control plane) and can be replaced by other proxies (such as Linkerd or nginMesh).
- Pilot: is responsible for traffic management and provides the platform-specific service model definition, APIs, and implementations.
- Mixer: is responsible for policies and controls. Its core functions include PreCheck, quota management, and telemetry reporting.
- Istio-Auth: supports RBAC permission control at multiple granularities and two-way SSL authentication, including identification, communication security, and key management capabilities.
Istio Component — Pilot
Pilot is the navigator in the Istio service mesh. It manages the traffic rules and service discovery on the data plane. A typical application scenario is canary release or blue-green deployment. Based on the rule API provided by Pilot, developers issue traffic routing rules to the Envoy proxy on the data plane, accurately allocating multi-version traffic (such as by allocating 1% of traffic to the new service).
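A weighted traffic split of this kind boils down to something like the following Python sketch. The version names and weights are invented for illustration; in Istio the rule would be declared in routing configuration pushed by Pilot, not hand-coded in the application.

```python
import random

def pick_version(routes):
    """Choose a version in proportion to its weight."""
    total = sum(routes.values())
    point = random.uniform(0, total)
    for version, weight in routes.items():
        point -= weight
        if point <= 0:
            return version
    return version  # guard against floating-point edge cases

# Canary rule: send roughly 1% of traffic to the new version.
routes = {"v1": 99, "v2-canary": 1}
counts = {"v1": 0, "v2-canary": 0}
for _ in range(10_000):
    counts[pick_version(routes)] += 1
print(counts)  # about 99% to v1, about 1% to v2-canary
```

Shifting the rollout forward is then just a matter of pushing new weights to the proxies, with no application redeployment.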
Istio Component — Mixer
Mixer is the tuner in the Istio service mesh. It implements traffic policies (such as access control and speed limit) and observes and analyzes the traffic based on logs, monitoring, and tracing. These features are achieved by the Envoy Filter Chain mentioned earlier. Mixer mounts its own filters to the pre-routing expansion point and the post-routing expansion point.
Istio Component — Auth
Auth is the security officer in Istio service mesh. It authenticates and authorizes the communication between service nodes. For authentication, Auth supports two-way SSL authentication between services, which allows both sides in the communication to recognize each other’s identity. For authorization, Auth supports the popular RBAC, which enables convenient and fine-grained multi-level access control based on users, roles, and permissions.
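The RBAC model reduces to two lookups, as this toy Python sketch shows. The users, roles, and permissions here are invented for illustration and are not Istio's actual configuration format.

```python
# Roles grant permissions; users hold roles. Authorization is then
# a question of whether any role of the user grants the permission.
role_permissions = {
    "viewer": {"orders.read"},
    "admin": {"orders.read", "orders.write"},
}
user_roles = {"alice": {"admin"}, "bob": {"viewer"}}

def is_allowed(user, permission):
    """True if any of the user's roles grants the permission."""
    return any(permission in role_permissions.get(role, set())
               for role in user_roles.get(user, set()))

print(is_allowed("alice", "orders.write"))  # True
print(is_allowed("bob", "orders.write"))    # False
```

The indirection through roles is what makes the scheme manageable at scale: granting a permission to a role updates every user who holds it.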
Conduit is a next-generation service mesh produced by Buoyant. As a challenger to Istio, Conduit adopts a similar overall architecture, clearly separating the control plane from the data plane. In addition, it has the following key features.
- Ultralight and blazingly fast: The data plane of Conduit is written in Rust, making it incredibly small, fast, and secure. Compared with C and C++, the biggest advantage of Rust is its security. Each proxy requires less than 10 MB of memory (RSS) and has sub-millisecond p99 latency, providing the functionality of service mesh without the cost.
- Security from the start: From Rust’s memory security to default TLS, Conduit is built to provide secure cloud-native environments from the ground up.
- End-to-end visibility: Conduit automatically measures and aggregates service success rates, latencies and request volumes, giving you an unfettered view of service behavior across your infrastructure without having to change the application code.
- Kubernetes enhanced: Conduit adds reliability, visibility, and security to your Kubernetes cluster, giving you control of the runtime behavior of your applications.
Starting from a consideration of microservice communication in the cloud-native era, this article describes the origin, development, and current status of service mesh in the hope of giving you a basic understanding of it.