Unveiled for Double 11, StarAgent: Alibaba’s Automatic O&M System

Image for post

By Song Jian (nickname: Song Yi), a technical expert on Alibaba Cloud's O&M middle-end team.

As an engineer on the IT operations and maintenance (O&M) team of a major internet company, do you still remember those nights when you had to get up in the small hours just to restart servers? If so, you may be wondering how Alibaba keeps millions of cloud servers running securely, stably, and efficiently under the huge traffic spikes of the annual Double 11 online shopping event.

Today, as a technical expert on Alibaba's O&M middle-end team, I'd like to show you how it was done this year with StarAgent, Alibaba Cloud's all-new automatic O&M system, which helped keep this year's Double 11 Global Shopping Festival afloat. In this article, I'll show how StarAgent manages millions of servers and why it is as essential a piece of Alibaba's O&M infrastructure as water, gas, or electricity is in our daily lives.

Song Jian, the original author of this article, is a technical expert on Alibaba's O&M middle-end team, with deep insight into the theory and practice of the large-scale O&M systems, including StarAgent, that run at Alibaba. Song joined Alibaba around a decade ago. In that time, he built Alipay's basic monitoring system from its earliest stages, drove the integration of Alibaba Group's monitoring systems, and led the PE team for O&M tools and testing.

StarAgent

Within the R&D Collaboration 2.0 intelligent O&M platform (StarOps for short), O&M is implemented on two platforms: the basic O&M platform and the application O&M platform. The basic O&M platform is StarAgent, which serves as Alibaba's IT O&M infrastructure.

As our fleet grew from 10 thousand to 100 thousand and then to millions of servers, the significance of the O&M infrastructure became increasingly clear. When our O&M systems could no longer keep up with the rapid growth of servers and business in terms of stability, performance, and capacity, we upgraded the architecture in 2015. StarAgent was born, raising the success ratio from 90% to 99.995% and the number of daily calls from 10 million to over 100 million.

At that time, few enterprises in the world operated millions of servers, and many of those that did were divided into business units that managed their servers separately, so scenarios in which a single system managed millions of servers were rare. With little outside experience to learn from, we mostly explored on our own, and the system gradually evolved into what it is today.

Product Introduction

Image for post

As shown in the preceding figure, StarAgent comprises the server layer, the O&M layer, and the business layer, with different teams cooperating at each layer. The figure also shows StarAgent's position within Alibaba Group: it is the only agent officially recognized by the Group.

  • Server layer: This layer comprises all servers, on which StarAgent is installed by default.

Scenarios

Image for post

StarAgent functions throughout the lifecycle of a server:

  • Asset check: After a server is racked, it is set to boot from the network and loads a mini operating system that runs in memory and includes StarAgent. Once the operating system starts, StarAgent issues a command to retrieve the server's hardware information for asset checking, including the CPU, memory, and the manufacturer and size of its disks.

  • Routine O&M: StarAgent can be used to perform routine O&M tasks, including server logon, single-server operations, and batch operations, for example, cleaning up before an application goes offline.
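As an illustration of the asset-check step, the report an agent sends back could be modeled like this minimal sketch. The field names, the `build_asset_report` helper, and all values are invented for the example; the real report format is internal to StarAgent:

```python
from dataclasses import dataclass, asdict

@dataclass
class HardwareReport:
    sn: str
    cpu_cores: int
    memory_gb: int
    disks: list  # list of (manufacturer, size_gb) tuples

def build_asset_report(sn, cpu_cores, memory_gb, disks):
    """Package the hardware facts the asset-check command retrieves,
    ready to be compared against the expected inventory record."""
    return asdict(HardwareReport(sn, cpu_cores, memory_gb, list(disks)))

report = build_asset_report("SN12345", 64, 256,
                            [("Samsung", 960), ("Samsung", 960)])
```

The asset check then amounts to comparing such a report against what the inventory system expects for that SN.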

Product Data

Image for post

Some of the data in the preceding figure comes from Alibaba. StarAgent performs over 100 million server operations every day, can manipulate 500 thousand servers per minute, and manages millions of servers. It offers over 150 plug-ins, keeps the agent's resource footprint small, and supports mainstream editions of Linux and Windows.

Features

Image for post

StarAgent provides two core features: management channel and system configuration.

This is similar to configuration and management products such as SaltStack, Puppet, and Ansible. However, StarAgent is more refined.

  • Management channel: All O&M operations are converted into commands that can be run on servers. This is the only command channel on the entire network, and it provides capabilities such as user permission control, operation auditing, and high-risk command blocking.
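The high-risk command blocking mentioned above can be sketched as a pattern check applied before a command enters the channel. The patterns below are hypothetical examples, not StarAgent's actual blocklist:

```python
import re

# Hypothetical high-risk rules; a real blocklist would be far richer
# and centrally maintained.
HIGH_RISK_PATTERNS = [
    r"^rm\s+-rf\s+/\s*$",   # wipe the root filesystem
    r"^reboot\b",            # reboot issued outside a change process
    r"\bmkfs\.",             # reformat a disk
]

def is_blocked(command: str) -> bool:
    """Return True if the command matches any high-risk pattern."""
    return any(re.search(p, command.strip()) for p in HIGH_RISK_PATTERNS)
```

A command that matches would be rejected (and audited) instead of being forwarded to the agents.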
Image for post

In line with their respective roles, portals are mainly used by field development and O&M personnel, APIs are called by upper-layer O&M systems, and agents host the capabilities that can be used directly on each server.

Portal

  • O&M Market: Also referred to as the plug-in platform, this works like the app market on a mobile phone. The leader of each business unit can find the desired gadgets in the market and install them with a click, after which the gadgets are automatically installed on the corresponding servers, including any servers added later during scale-out. The gadgets are developed by field developers, and you can also upload your own gadgets to O&M Market to share with others.

API

  • CMD: Given the target server and a command, this API makes the server run the command. Any command that could be run after logging on to a server can be issued through the CMD API.
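A CMD call might carry a payload like the following sketch. The field names (`target`, `cmd`, `timeout`) and the `build_cmd_request` helper are assumptions for illustration; the real StarAgent API is internal:

```python
import json

def build_cmd_request(target_sn: str, command: str, timeout_s: int = 60) -> str:
    """Assemble a CMD-style payload: which server to reach and what to run.
    Field names are illustrative, not StarAgent's actual wire format."""
    return json.dumps({
        "target": target_sn,   # server identifier (for example, its SN)
        "cmd": command,        # the command to execute on that server
        "timeout": timeout_s,  # give up if no result arrives in time
    })
```

An upper-layer O&M system would post such a payload to the proxy, which routes it to the right channel and agent.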

Agent

  • Hostinfo: This agent acquires information about a server, including the server name, IP address, and SN.
Image for post

In the figure, the left side shows the web terminal, which is authenticated automatically and can be embedded into any web page using JavaScript. The right side shows the batch command feature: after a set of servers is selected, a command entered on this page is issued to all of them.

Architecture

Logical Architecture

Image for post

StarAgent has a three-layer architecture. An agent is installed on each server, which has a persistent connection to the channel. Then, the channel regularly reports the information about the connected agent to the center. The center maintains the complete relational data between the agent and the channel. There are two processes to explain here:

1. Agent registration

An agent has a default configuration file. After the agent starts, it connects to ConfigService first and reports its IP address and SN to ConfigService. Then, ConfigService calculates the channel cluster to which the agent can connect and sends a channel list to the agent. After receiving the channel list, the agent disconnects from ConfigService and establishes a persistent connection to a channel.
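The two-step handshake above can be sketched as follows, with ConfigService modeled as an in-memory lookup. The IDC-to-cluster topology, the channel names, and the hash-based pick are all invented for the example:

```python
import hashlib

# Hypothetical topology; the real IDC -> channel-cluster mapping lives
# inside ConfigService.
CHANNEL_CLUSTERS = {
    "idc-hangzhou": ["ch-hz-1", "ch-hz-2", "ch-hz-3"],
    "idc-shanghai": ["ch-sh-1", "ch-sh-2"],
}

def config_service(agent_sn: str, idc: str) -> list:
    """Step 1: the agent reports its SN/IP; ConfigService answers with the
    channel list of the cluster it may connect to."""
    return CHANNEL_CLUSTERS[idc]

def pick_channel(agent_sn: str, channels: list) -> str:
    """Step 2: the agent drops the ConfigService connection and opens a
    persistent connection to one channel (hashed here for a stable spread)."""
    index = int(hashlib.md5(agent_sn.encode()).hexdigest(), 16) % len(channels)
    return channels[index]
```

Hashing on the SN is just one way to spread agents evenly; the article does not specify how the real agent chooses within the list.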

2. Command issuance

An external system must call the proxy to issue a command request. After receiving the command request, the proxy searches for the corresponding channel based on the target server, and then distributes the task to the channel. Then, the channel forwards the command to an agent for execution.
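The proxy's routing step can be sketched against the agent-to-channel table that the center maintains. The table contents and function names here are illustrative:

```python
# Pretend snapshot of the center's relational data: which channel each
# agent (keyed by SN) is currently connected to.
AGENT_TO_CHANNEL = {"SN1": "ch-hz-1", "SN2": "ch-sh-2"}

def route_command(target_sn: str, command: str) -> tuple:
    """Proxy step: find the channel holding the target agent's persistent
    connection, then hand the command off to that channel."""
    channel = AGENT_TO_CHANNEL.get(target_sn)
    if channel is None:
        raise LookupError(f"no channel registered for {target_sn}")
    return (channel, target_sn, command)
```

The returned tuple stands in for the hand-off to the channel, which in turn forwards the command to the agent.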

Deployment Architecture

Image for post

Internet data centers (IDCs) sit at the bottom of the architecture, with a channel cluster deployed in each IDC; an agent can establish a persistent connection to any channel in its cluster. The center sits at the upper layer and is deployed across two equipment rooms for disaster recovery. Both rooms serve traffic, and a failure in either one does not affect services.

Problems and Challenges

The following figure shows the problems we encountered during system reconstruction in the year before last.

Image for post

The first three problems are similar and were mainly caused by incorrect task status. The manager in version 1.0 corresponds to the proxy in version 2.0, and a server corresponds to a channel. A large number of systems issue commands online all the time, and in version 1.0, if the manager, a server, or an agent restarted, every task on that link failed. For example, after a server restarted, the agents connected to it were disconnected because the link was lost, so any commands issued through that server produced no output. Restarting a server could also unbalance the load, which is the sixth problem. Suppose an IDC has 10 thousand computers, with 5 thousand connected to each of two servers. If one of the two servers restarts, all 10 thousand computers end up connected to the other.

Whenever an API call to issue a command failed, users would ask us to identify the cause. Sometimes the failure was caused by a system problem, but in most cases it was caused by environment problems such as server breakdown, SSH disconnection, heavy load, or exhausted disks. With a million servers in total, a daily problem rate of just 1% means 10 thousand servers pending troubleshooting, so you can imagine how many inquiries we received. We were extremely stressed at that time: half of our team was busy answering inquiries every day, and if there were disconnection drills at midnight, we still had to get up and restart services for recovery.

So how did we solve these problems? We classified them into system problems and environment problems.

Image for post

System Problems

We reconstructed the entire system and used a distributed message architecture. For example, each command issuance is considered as a task. In version 2.0, we added a status for each task. After receiving a command request, the proxy records the command request, sets the task status to received, and then sends the command request to an agent. Once it receives the task, the agent sends a response. After receiving the response from the agent, the proxy sets the task status to executing. After completing the task, the agent actively reports the task execution result to the proxy. After receiving the result, the proxy sets the task status to completed.
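The task lifecycle described above can be sketched as a small state machine. The three states (received, executing, completed) come from the text; the transition table and `Task` class are a sketch, not StarAgent's code:

```python
# Legal task-status transitions in the version 2.0 design described above.
VALID_TRANSITIONS = {
    None: {"received"},          # proxy records the incoming request
    "received": {"executing"},   # agent acknowledged the task
    "executing": {"completed"},  # agent reported the execution result
}

class Task:
    def __init__(self, task_id: int):
        self.task_id = task_id
        self.status = None  # not yet recorded by the proxy

    def advance(self, new_status: str) -> None:
        """Move the task forward; reject any out-of-order transition."""
        if new_status not in VALID_TRANSITIONS.get(self.status, set()):
            raise ValueError(f"illegal transition {self.status} -> {new_status}")
        self.status = new_status
```

Tracking an explicit status per task is what lets a restarted component pick up where it left off instead of silently dropping work.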

In this process, an acknowledgement mechanism is applied to messages between the proxy and the agent: if either side fails to receive an acknowledgement, it retries. As a result, if either side restarts during task execution, the task itself is barely affected.
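The acknowledge-and-retry idea can be sketched in a few lines. The retry count and function shape are assumptions; only the principle (resend until acknowledged) comes from the text:

```python
def send_with_retry(send_once, max_retries: int = 3) -> int:
    """Call send_once() until it reports an acknowledgement.
    send_once() returns True when the peer acked the message.
    Returns the number of attempts it took; raises if all retries fail."""
    for attempt in range(1, max_retries + 1):
        if send_once():
            return attempt
    raise TimeoutError("no acknowledgement after retries")
```

With this in place, a peer that restarts mid-task simply misses one delivery and receives the retried copy once it is back.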

In version 2.0, servers in a channel cluster communicate with each other to regularly report information including the number of connected agents. Based on the received information and its own information, a server will determine whether it is connected to too many agents. If yes, the server automatically disconnects from agents that have not executed any tasks recently. In this way, load balance can be achieved. The central node has a persistent connection to each channel and records the number of agents connected to each channel. If the central node detects that a data center has a channel that is abnormal or consumes a large capacity, the central node automatically triggers scale-out or borrows a channel from another data center. After the capacity surge ends, the central node automatically removes the extra channel.
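The self-balancing rule for channels can be sketched as a simple threshold check: a channel that holds noticeably more agents than the cluster average sheds some recently idle connections. The 10% tolerance below is an invented parameter:

```python
def agents_to_shed(my_agents: int, cluster_counts: list,
                   tolerance: float = 0.10) -> int:
    """How many agents this channel should disconnect (preferring ones
    that have not run tasks recently) to get within `tolerance` of the
    cluster-wide average."""
    average = sum(cluster_counts) / len(cluster_counts)
    ceiling = average * (1 + tolerance)
    return max(0, int(my_agents - ceiling))
```

In the restart scenario from earlier (two channels, 10 thousand agents all piled onto one), the overloaded channel would shed agents until both sides are near the 5-thousand average.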

Environment Problems

In version 2.0, detailed error codes are available for the proxy, channel, and agent layers. These error codes can be used to troubleshoot causes for task errors.

To solve server problems, we connected to data in the monitoring system. When a task fails, an environment check is triggered to examine information such as downtime, disk capacity, and load. If a problem is detected, the API directly returns the server problem and the server owner, so users can understand the cause and reach the appropriate person from the returned result. We also provide diagnosis capabilities through a DingTalk robot: in your daily work, you can troubleshoot and pinpoint problems by mentioning (@) the robot in a chat group.
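The triage flow above can be sketched as follows. The error-code prefixes, the monitoring fields, and the owner lookup are all illustrative, not StarAgent's real codes:

```python
# Pretend feed from the monitoring system, keyed by server SN.
MONITORING = {
    "SN1": {"down": False, "disk_full": True, "owner": "alice"},
}

def diagnose(target_sn: str, error_code: str) -> dict:
    """Classify a failed task: system errors are returned as-is, while
    environment errors trigger a check of monitoring data so the caller
    gets the concrete problem and the server owner."""
    if error_code.startswith("SYS"):
        return {"kind": "system", "detail": error_code}
    host = MONITORING.get(target_sn, {})
    problems = [k for k, v in host.items() if k != "owner" and v]
    return {"kind": "environment", "problems": problems,
            "owner": host.get("owner", "unknown")}
```

Returning the owner alongside the cause is what lets the caller route the issue without filing an inquiry with the platform team.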

Image for post

Stability

As this introduction shows, StarAgent is O&M infrastructure in the same way that water, electricity, and gas are infrastructure in daily life: every operation on a server depends on it. If StarAgent failed at the same time as online services, those service faults could not be rectified, because servers could not be operated and no release or change could be made. Because of these demanding stability requirements, StarAgent is deployed in two data centers in the same city and in multiple IDCs across regions for disaster recovery. It depends on storage systems such as MySQL, Redis, and HBase, which are themselves highly available, and we deploy these storage systems redundantly so that the failure of any single one does not affect services. In my opinion, few systems in the industry are as stable as StarAgent.

Security

StarAgent can manipulate 500 thousand servers per minute: tens of thousands of servers can be operated by entering a single command and pressing Enter. If such an operation were malicious, its impact would be enormous. Therefore, StarAgent provides a feature for blocking high-risk commands, which automatically identifies and blocks dangerous operations. In addition, the entire calling link is encrypted and signed so that no third party can crack or tamper with it. To limit the damage from a disclosed API account, we developed the command mapping function, which changes the mapping of commands in the operating system. For example, to run the reboot command, you might have to input a1b2, and the mapping is different for every API account.
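The command-mapping idea can be sketched as a per-account alias table. The a1b2-to-reboot pair comes from the article; the account names and other entries are made up, and the real mechanism is internal:

```python
# Each API account gets its own alias table, so a leaked account name
# alone does not reveal which alias maps to which real command.
ACCOUNT_MAPPINGS = {
    "acct-a": {"a1b2": "reboot", "c3d4": "shutdown"},
    "acct-b": {"x9y8": "reboot"},  # different mapping per account
}

def resolve_command(account: str, alias: str) -> str:
    """Translate an account-specific alias into the real OS command;
    reject aliases that do not belong to this account."""
    try:
        return ACCOUNT_MAPPINGS[account][alias]
    except KeyError:
        raise PermissionError(f"unknown alias {alias!r} for account {account!r}")
```

Note that an alias valid for one account ("a1b2" for acct-a) resolves to nothing for another, which is the point of keeping the mappings distinct.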

Environment

As mentioned earlier, environment problems such as breakdown can be solved by connecting to monitoring data. Next, I will explain how we handle inconsistency between data entered in the configuration management database (CMDB) and the data collected by an agent. This data includes basic information such as the SN and IP address. Usually, you read server information from the CMDB and then call StarAgent; if the data is inconsistent, the call fails. Why, then, do SN and IP address inconsistencies occur?

Generally, data in the CMDB is input manually or by other systems. However, the data in StarAgent is obtained from actual servers. The main boards of certain servers are not programmed with any SNs and certain servers have multiple NICs. Therefore, the environment is complex.

To solve these problems, we established unified specifications, including specifications for acquiring SNs and IP addresses, which allow you to define the SNs and IP addresses of servers. An acquisition tool is also provided to work with the specifications. Beyond StarAgent, this tool can be used in any other scenario in which server information needs to be acquired. When the specifications are updated, we update the gadgets synchronously, keeping the change transparent to upper-layer businesses.
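The consistency check between CMDB records and agent-collected facts can be sketched as a field-by-field comparison. The record shape and field names here are assumptions for illustration:

```python
def find_mismatches(cmdb_record: dict, agent_record: dict,
                    keys: tuple = ("sn", "ip")) -> dict:
    """Return the fields on which the CMDB and the agent disagree,
    mapped to (cmdb_value, agent_value) pairs for troubleshooting."""
    return {k: (cmdb_record.get(k), agent_record.get(k))
            for k in keys if cmdb_record.get(k) != agent_record.get(k)}
```

An empty result means the two sources agree; otherwise the pairs show exactly which values (say, the IP of a multi-NIC server) need to be reconciled under the unified specifications.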

