Unveiled for Double 11, StarAgent: Alibaba’s Automatic O&M System
By Song Jian, nickname Song Yi, technical expert for Alibaba Cloud O&M middle end.
As an engineer on an IT operations and maintenance (O&M) team at a major internet company, do you still remember those nights when you had to get up in the small hours just to restart servers? Along the same lines, you may be wondering how Alibaba kept millions of cloud servers running securely, stably, and efficiently through the huge traffic spikes of the annual Double 11 online shopping event.
Well, today, as a technical expert on Alibaba’s O&M middle-end team, I’d like to show you how it was done this year with StarAgent, Alibaba’s automatic O&M system. StarAgent helped keep this year’s Double 11 Global Shopping Festival afloat. In this article, I’m going to show you how StarAgent manages millions of servers, and why it is as important a piece of Alibaba’s O&M infrastructure as water, gas, or electricity is in our daily lives.
The original writer of this article, Song Jian, is a technical expert on Alibaba’s O&M middle-end team. He has deep insight into the theory and practice of large-scale O&M systems and of StarAgent, the automatic O&M system implemented at Alibaba. Song joined Alibaba around a decade ago. Since then, he has worked on building Alipay’s basic monitoring system from its very earliest stages, driven the integration of Alibaba Group’s monitoring systems, and led the PE team for O&M tools and testing.
In the R&D Collaboration 2.0 intelligent O&M platform (StarOps for short), O&M is implemented on two platforms: the basic O&M platform and the application O&M platform. The basic O&M platform is StarAgent, which serves as the IT O&M infrastructure of Alibaba.
As we grew from 10 thousand to 100 thousand and then to millions of servers, the significance of the O&M infrastructure gradually became apparent. When our O&M systems could no longer keep up with the rapid growth of servers and business in terms of stability, performance, and capacity, we upgraded our architecture in 2015. The upgraded StarAgent increased the success rate from 90% to 99.995% and the number of daily calls from 10 million to over 100 million.
At that time, few enterprises in the world ran millions of servers, and many of those that did were divided into business units that managed their servers separately. Scenarios in which a single system managed millions of servers were rarer still, so we had little outside experience to learn from. For the most part, we explored the way by ourselves, and our system gradually evolved into what it looks like today.
As shown in the preceding figure, the architecture around StarAgent comprises the server layer, O&M layer, and business layer, and different teams cooperate by layer. The figure also shows the position of StarAgent in Alibaba Group: it is the only agent officially recognized by Alibaba Group.
- Server layer: This layer comprises all servers that are installed with StarAgent by default.
- O&M layer: This layer comprises various O&M and management systems, including the application O&M system, database O&M system, middleware O&M system, and security system. Each specialty product has an independent portal where StarAgent is used to manipulate servers.
- Business layer: This layer comprises all business units. Most business units directly use the O&M and management systems at the O&M layer, but some business units have special requirements. In this case, these business units develop their own O&M portals on top of the underlying capabilities.
StarAgent functions throughout the lifecycle of a server:
- Asset check: After a server is racked, it is set to boot from the network and loads a mini operating system that runs in memory. This operating system includes StarAgent. After the operating system starts, a command is issued through StarAgent to retrieve the hardware information of the server for asset checking, including the CPU, memory, and the manufacturer and size of the disks.
- Operating system installation: After a server is delivered to a business unit, an operating system is installed on the server. The type of operating system to install is determined by a command issued by the in-memory StarAgent.
- Environment configuration: After the operating system is successfully installed, StarAgent initializes the basic environment of the server, including the account, common O&M scripts, and scheduled tasks.
- Application release: StarAgent can release application configurations and software packages.
- Runtime monitoring: After applications and services go online, StarAgent installs monitoring scripts for them and installs a monitoring agent.
- Routine O&M: StarAgent can be used to perform routine O&M tasks, including server logon, single-server operations, and batch operations, for example, cleaning up before an application goes offline.
The preceding figure shows some of Alibaba's figures. StarAgent performs over 100 million server operations every day, can operate on 500 thousand servers per minute, and manages millions of servers. StarAgent has over 150 plug-ins, keeps the agent's resource usage low, and supports mainstream editions of Linux and Windows.
StarAgent provides two core features: management channel and system configuration.
- Management channel: All O&M operations are converted to commands that can be run on servers. This is the only command channel on the entire network, which makes it the place to provide capabilities such as controlling user permissions, auditing operations, and blocking high-risk commands.
- System configuration: After startup, StarAgent automatically initializes configurations such as common O&M scripts, scheduled tasks, system accounts, and monitoring agents. The operating system contains StarAgent by default. Therefore, StarAgent can automatically initialize the basic O&M environment of a server after startup.
StarAgent's features are grouped into portals, APIs, and agents. Portals are mainly used by frontline development and O&M personnel, APIs are called by upper-layer O&M systems, and agents host the capabilities that can be used directly on each server.
- O&M Market: It is also referred to as the plug-in platform, which is similar to the app market on mobile phones. The leader of each business unit may find desired gadgets in the market and install the gadgets by clicking Install. Then, the gadgets are automatically installed on corresponding servers. These gadgets can also be automatically installed on scale-out servers. The gadgets are developed by field developers. You can also upload your own gadgets to O&M Market to share with others.
- File Distribution: It is easy to understand and is not described here.
- Scheduled Tasks: They are similar to crontab scheduled tasks. However, our scheduled tasks support second-level granularity and can be staggered across servers. For example, suppose a scheduled task that runs once a minute is added for a group of servers. With crontab, the task always runs in the first second of each minute on all of the servers at once. Our scheduled task also runs once a minute, but at a different time on each server.
- Server Account: It covers three kinds of accounts: an individual's account for logging on to a server, a public account such as the admin account on a server, and the SSH channel that connects one server to other servers.
- API Account: It is closely related to the APIs in the next column. To use these APIs, you must apply for an API account in advance.
- CMD: When this API is called with the target server and a command as input, the server runs the command. Any command that you could run after logging on to a server can be run through the CMD API.
- Plugin: This API corresponds to O&M Market in the left column. If you install certain scripts on a server by using O&M Market, the Plugin API can be called to execute these scripts.
- File and Store: Both APIs are used for file distribution. They differ in that the File API depends on a download source, whereas the Store API lets you post the script content directly in the HTTP API call. The File API is implemented based on peer-to-peer (P2P) technology. Alibaba has a product named “Dragonfly”, which is used for file downloading. Its advantage is that it fetches from the source only once even when hundreds or thousands of servers are downloading the same file at the same time, which puts little load on the source. Dragonfly also supports shared downloading between servers and has been open-sourced.
- Action: This API is called to combine the preceding APIs. For example, this API can call the File API to download a script, and then call the CMD API to execute the script. The Action API also supports AND and OR conditions between steps. With an AND condition, for instance, it calls the CMD API to execute the script only after the script is successfully downloaded.
- Hostinfo: This agent acquires information about a server, including the server name, IP address, and SN.
- Data channel: The output data of commands or scripts that are executed on each server is directly sent to the data channel. Then, the data channel uploads the data to the center where the data will be consumed.
- Incremental logs and P2P files: Both agents are developed by third parties, which can be installed on each server as plug-ins from O&M Market.
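The staggering described under Scheduled Tasks above can be sketched as follows. This is only an illustration: the article does not say how StarAgent picks each server's run time, so hashing the server identifier into an offset is an assumption, and `start_offset` and `next_run` are hypothetical names.

```python
import hashlib


def start_offset(server_id: str, task_name: str, period_s: int = 60) -> int:
    """Return a stable offset in [0, period_s) for this server and task.

    Assumption: the offset is derived from a hash, so it needs no coordination
    between servers and stays the same across agent restarts.
    """
    digest = hashlib.sha256(f"{server_id}:{task_name}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % period_s


def next_run(now_s: int, server_id: str, task_name: str, period_s: int = 60) -> int:
    """Return the first run time >= now_s that falls on this server's offset."""
    offset = start_offset(server_id, task_name, period_s)
    period_start = now_s - (now_s % period_s)
    candidate = period_start + offset
    return candidate if candidate >= now_s else candidate + period_s
```

With a one-minute period, every server still runs the task once a minute, but the load on any shared downstream system is spread over the whole minute instead of landing in the first second.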
StarAgent has a three-layer architecture: agent, channel, and center. An agent is installed on each server and holds a persistent connection to a channel. Each channel regularly reports information about its connected agents to the center, which maintains the complete mapping between agents and channels. There are two processes to explain here:
1. Agent registration
An agent has a default configuration file. After the agent starts, it connects to ConfigService first and reports its IP address and SN to ConfigService. Then, ConfigService calculates the channel cluster to which the agent can connect and sends a channel list to the agent. After receiving the channel list, the agent disconnects from ConfigService and establishes a persistent connection to a channel.
2. Command issuance
An external system must call the proxy to issue a command request. After receiving the command request, the proxy searches for the corresponding channel based on the target server, and then distributes the task to the channel. Then, the channel forwards the command to an agent for execution.
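The two processes above can be modeled in a few lines. This is a minimal in-memory sketch, not StarAgent's implementation: the article does not say how ConfigService calculates the channel cluster, so matching on an IP prefix is an assumption, and the class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ConfigService:
    # Assumption: clusters are keyed by IP prefix, one cluster per IDC.
    clusters: dict[str, list[str]]

    def register(self, ip: str, sn: str) -> list[str]:
        """Return the channel list for the cluster this agent may connect to."""
        for prefix, channels in self.clusters.items():
            if ip.startswith(prefix):
                return list(channels)
        raise LookupError(f"no channel cluster for agent {sn} at {ip}")


@dataclass
class Center:
    # Agent SN -> channel address, rebuilt from the channels' regular reports.
    agent_to_channel: dict[str, str] = field(default_factory=dict)

    def report(self, channel: str, connected_sns: list[str]) -> None:
        """Record which agents a channel currently holds connections to."""
        stale = [sn for sn, ch in self.agent_to_channel.items() if ch == channel]
        for sn in stale:
            del self.agent_to_channel[sn]
        for sn in connected_sns:
            self.agent_to_channel[sn] = channel


@dataclass
class Proxy:
    center: Center

    def route(self, target_sn: str, command: str) -> tuple[str, str, str]:
        """Find the channel holding the target's connection and hand it the task."""
        channel = self.center.agent_to_channel.get(target_sn)
        if channel is None:
            raise LookupError(f"agent {target_sn} is not connected to any channel")
        return (channel, target_sn, command)
```

The point of the split is that the proxy never needs a connection to an agent itself. It only needs the center's agent-to-channel mapping to decide where to forward a task.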
Internet data centers (IDCs) are at the bottom of the architecture. A channel cluster is deployed for each IDC, and an agent can establish a persistent connection to any channel in its cluster. The center is at the upper layer and is deployed in two data centers for disaster recovery. Both data centers serve traffic, so the failure of one does not affect services.
Problems and Challenges
The following figure shows the problems we encountered during system reconstruction in the year before last.
The first three problems are similar, and are mainly caused by incorrect task status. The manager in version 1.0 corresponds to the proxy in version 2.0, and a server in version 1.0 corresponds to a channel. A large number of systems are issuing commands online all the time. In version 1.0, if the manager, a server, or an agent restarted, all tasks on that link failed. For example, after a server restarted, the agents connected to it were disconnected because the link was lost, so any commands issued through that server returned no output. Restarting a server could also cause an unbalanced load, which is the 6th problem. Assume that there are 10 thousand machines in an IDC, with 5 thousand connected to each of two servers. After one of the two servers restarts, all 10 thousand machines end up connected to the other server.
Whenever an API call to issue a command failed, users asked us to identify the cause. Sometimes the failure was caused by a system problem, but in most cases it was caused by environment problems such as server breakdown, SSH disconnection, heavy load, or exhausted disks. With a million servers in total, a daily problem rate of just 1% leaves 10 thousand servers needing troubleshooting. You can easily imagine how many inquiries that generates. At that time, we were extremely stressed. Half of our team members were busy answering inquiries every day. If there were disconnection drills at midnight, we still had to get up to restart services for recovery.
So how did we solve these problems? We classified them into system problems and environment problems.
We reconstructed the entire system and used a distributed message architecture. For example, each command issuance is considered as a task. In version 2.0, we added a status for each task. After receiving a command request, the proxy records the command request, sets the task status to received, and then sends the command request to an agent. Once it receives the task, the agent sends a response. After receiving the response from the agent, the proxy sets the task status to executing. After completing the task, the agent actively reports the task execution result to the proxy. After receiving the result, the proxy sets the task status to completed.
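The task lifecycle described above is a small state machine. The following sketch uses the three statuses named in the text. The class and method names are hypothetical, and rejecting out-of-order transitions is an assumption about how such a proxy might guard its records.

```python
from enum import Enum


class TaskStatus(Enum):
    RECEIVED = "received"
    EXECUTING = "executing"
    COMPLETED = "completed"


# Each status may only move forward to the next one.
_ALLOWED = {
    TaskStatus.RECEIVED: {TaskStatus.EXECUTING},
    TaskStatus.EXECUTING: {TaskStatus.COMPLETED},
    TaskStatus.COMPLETED: set(),
}


class Task:
    def __init__(self, task_id: str, command: str):
        # The proxy records the request as soon as it arrives.
        self.task_id = task_id
        self.command = command
        self.status = TaskStatus.RECEIVED
        self.result = None

    def _advance(self, new_status: TaskStatus) -> None:
        if new_status not in _ALLOWED[self.status]:
            raise ValueError(f"{self.status.value} -> {new_status.value} not allowed")
        self.status = new_status

    def on_agent_response(self) -> None:
        """The agent confirmed it received the task."""
        self._advance(TaskStatus.EXECUTING)

    def on_agent_result(self, result: str) -> None:
        """The agent reported the execution result."""
        self._advance(TaskStatus.COMPLETED)
        self.result = result
```

Because the status is stored with the task rather than implied by a live connection, a caller can always ask what state a task is in, even after a component on the link has restarted.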
In this process, an acknowledgement mechanism is applied to messages between the proxy and the agent. If either role fails to receive an acknowledgement, it retries. As a result, if either role restarts during task execution, the task itself is only slightly affected.
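The acknowledge-and-retry idea can be illustrated with a simulated lossy link. Everything here is a sketch under assumptions: the article does not describe the wire protocol, so the message ID, the deduplication on the receiving side, and the names `send_with_retry`, `Receiver`, and `flaky_link` are all hypothetical.

```python
from typing import Callable


def send_with_retry(send: Callable[[str, str], bool], msg_id: str,
                    payload: str, max_attempts: int = 5) -> int:
    """Resend until acknowledged. Returns the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        if send(msg_id, payload):
            return attempt
    raise TimeoutError(f"no acknowledgement for {msg_id} after {max_attempts} attempts")


class Receiver:
    """Handles each message ID once, so a retried message is harmless."""

    def __init__(self):
        self.seen = set()
        self.handled = []

    def deliver(self, msg_id: str, payload: str) -> bool:
        if msg_id not in self.seen:
            self.seen.add(msg_id)
            self.handled.append(payload)
        return True  # always acknowledge, including duplicates


def flaky_link(receiver: Receiver, drop_first_acks: int) -> Callable[[str, str], bool]:
    """Simulated link that delivers every message but loses the first N acks."""
    calls = {"n": 0}

    def send(msg_id: str, payload: str) -> bool:
        calls["n"] += 1
        acked = receiver.deliver(msg_id, payload)
        return acked and calls["n"] > drop_first_acks

    return send
```

Retrying alone would risk processing a result twice when only the acknowledgement was lost, which is why the receiver in this sketch deduplicates by message ID.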
In version 2.0, the servers in a channel cluster communicate with each other and regularly report information such as the number of connected agents. Based on the received information and its own, each server determines whether it is connected to too many agents. If so, it automatically disconnects agents that have not executed any tasks recently. In this way, load balancing is achieved. The central node has a persistent connection to each channel and records the number of agents connected to each channel. If the central node detects that a channel in a data center is abnormal or is running short of capacity, it automatically triggers a scale-out or borrows a channel from another data center. After the surge ends, the central node automatically removes the extra channel.
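The rebalancing decision a channel server makes can be sketched as a pure function. The text only says that an overloaded server drops agents that have not run tasks recently. The fair-share calculation, the 10% tolerance, and the name `agents_to_drop` are assumptions for illustration.

```python
def agents_to_drop(own_agents: dict[str, float], peer_counts: list[int],
                   tolerance: float = 0.1) -> list[str]:
    """Decide which agents this channel server should disconnect.

    own_agents maps agent ID -> timestamp of the agent's last task.
    peer_counts holds the agent counts reported by the other servers.
    Returns the IDs to disconnect, least recently active first.
    """
    total = len(own_agents) + sum(peer_counts)
    fair_share = total / (len(peer_counts) + 1)
    # Allow some slack above the fair share before shedding connections.
    limit = int(fair_share * (1 + tolerance))
    excess = len(own_agents) - limit
    if excess <= 0:
        return []
    idlest_first = sorted(own_agents, key=own_agents.get)
    return idlest_first[:excess]
```

Dropping the least recently active agents first keeps the chance of interrupting a running task low. In this sketch, a disconnected agent is expected to reconnect to another channel from its channel list, which is also an assumption.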
In version 2.0, detailed error codes are available for the proxy, channel, and agent layers. These error codes can be used to troubleshoot causes for task errors.
To solve environment problems, we connected StarAgent to the data in the monitoring system. When a task fails, an environment check is triggered to check information such as downtime, disk capacity, and load. If a problem is detected, the API directly returns the server problem and the server owner. In this way, users can understand the cause and reach the appropriate person based on the returned result. In addition, we provide diagnosis capabilities through a DingTalk robot. You can troubleshoot and pinpoint problems in your daily work by mentioning (@) the DingTalk robot in a chat group.
As you can see from this introduction, StarAgent is the O&M infrastructure, much like water, electricity, and gas in our daily lives. All your operations on servers depend on StarAgent. If StarAgent fails at the same time as online services, the service faults cannot be fixed, because servers cannot be operated and no release or change can be made. Because of these demanding stability requirements, StarAgent is deployed in two data centers in the same city and in multiple IDCs in different regions for disaster recovery. StarAgent depends on storage systems such as MySQL, Redis, and HBase, which are highly available themselves. In addition, we deploy the storage systems redundantly so that services are not affected by the failure of a single storage system. In my opinion, few systems in the industry are as stable as StarAgent.
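A failed task's environment check could look like the sketch below. The three checks are the ones named above, but the record fields, the thresholds, and the shape of the returned result are hypothetical. The article does not describe the monitoring system's data format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServerHealth:
    # Hypothetical snapshot pulled from the monitoring system.
    owner: str
    is_down: bool
    disk_used_pct: float
    load_1m: float


def diagnose(health: ServerHealth, disk_limit: float = 95.0,
             load_limit: float = 50.0) -> Optional[dict]:
    """Return the environment problem and the server owner, or None if healthy.

    The thresholds are illustrative defaults, not values from the article.
    """
    if health.is_down:
        problem = "server down"
    elif health.disk_used_pct >= disk_limit:
        problem = "disk full"
    elif health.load_1m >= load_limit:
        problem = "load too high"
    else:
        return None
    return {"problem": problem, "owner": health.owner}
```

Returning the owner alongside the problem is what lets a caller contact the right person directly instead of asking the StarAgent team.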
StarAgent can operate on 500 thousand servers per minute, and tens of thousands of servers can be operated on by typing one command and pressing Enter. If such an operation is malicious, the impact could be huge. Therefore, StarAgent can automatically identify and block high-risk commands. In addition, the entire calling link is encrypted and signed to ensure that no third party can crack or tamper with it. To guard against leaked API accounts, we developed the command mapping function, which changes the names by which operating system commands are invoked. For example, to run the reboot command, you must input a1b2. In addition, each API account has a different mapping.
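Command mapping and high-risk blocking can be combined in one gate in front of execution. This is a sketch only: the article does not say how StarAgent identifies high-risk commands, so the name-based blocklist is an assumption, and `CommandGate` and the account names are hypothetical. The alias `a1b2` for `reboot` comes from the example above.

```python
import shlex

# Assumption: high-risk commands are recognized by name from a fixed blocklist.
HIGH_RISK = {"rm", "mkfs", "dd"}


class CommandGate:
    def __init__(self, mappings: dict[str, dict[str, str]]):
        # API account -> {alias: real command}. Each account has its own table.
        self.mappings = mappings

    def resolve(self, account: str, command_line: str) -> list[str]:
        """Translate the account's alias and reject blocked commands."""
        argv = shlex.split(command_line)
        if not argv:
            raise ValueError("empty command")
        table = self.mappings.get(account, {})
        name = argv[0]
        if name in table:
            argv[0] = table[name]
        elif name in table.values():
            # A mapped command may only be invoked through its alias.
            raise PermissionError(f"{name} must be called by its mapped alias")
        if argv[0] in HIGH_RISK:
            raise PermissionError(f"high-risk command blocked: {argv[0]}")
        return argv
```

For example, with `{"acct-a": {"a1b2": "reboot"}}`, account `acct-a` must send `a1b2` to reboot a server. The same alias sent by another account is not translated, so knowing one account's mapping does not help with another.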
As mentioned earlier, environment problems such as breakdowns can be identified by connecting to monitoring data. Next, I will explain how we resolve inconsistencies between the data entered in the configuration management database (CMDB) and the data collected by an agent. This data includes basic information such as the SN and IP address. Usually, you read server information from the CMDB and then call StarAgent. If the data is inconsistent, the call fails. So why do the SN and IP address become inconsistent?
Generally, data in the CMDB is input manually or by other systems. However, the data in StarAgent is obtained from actual servers. The main boards of certain servers are not programmed with any SNs and certain servers have multiple NICs. Therefore, the environment is complex.
To solve these problems, we established unified specifications, including specifications for acquiring SNs and IP addresses. These specifications define what counts as the SN and IP address of a server. An acquisition tool is provided to implement the specifications. Besides StarAgent, this tool can be used in any other scenario in which server information needs to be acquired. When the specifications are updated, we update the tool at the same time so that the change is transparent to upper-layer businesses.