Unveiled for Double 11, StarAgent: Alibaba’s Automatic O&M System

StarAgent

Product Introduction

  • Server layer: This layer comprises all servers that are installed with StarAgent by default.
  • O&M layer: This layer comprises various O&M and management systems, including the application O&M system, database O&M system, middleware O&M system, and security system. Each of these specialized systems has its own portal and uses StarAgent to operate on servers.
  • Business layer: This layer comprises all business units. Most business units directly use the O&M and management systems at the O&M layer. Business units with special requirements develop their own O&M portals on top of the underlying capabilities.

Scenarios

  • Asset check: After a server is racked, it is set to boot from the network and loads a mini operating system that runs in memory. This operating system includes StarAgent. After the operating system starts, StarAgent issues a command to retrieve the hardware information of the server for asset checking, including the CPU, memory, and the manufacturer and size of each disk.
  • Operating system installation: After a server is delivered to a business unit, an operating system is installed on it. The type of operating system to install is determined by a command issued by the StarAgent instance running in memory.
  • Environment configuration: After the operating system is successfully installed, StarAgent initializes the basic environment of the server, including the account, common O&M scripts, and scheduled tasks.
  • Application release: StarAgent can release application configurations and software packages.
  • Running monitoring: After applications and services go online, StarAgent installs monitoring scripts and a monitoring agent for them.

Product Data

Features

  • Management channel: All O&M operations are converted to commands that can be run on servers. StarAgent is the only command channel on the entire network, so it is the single place to control user permissions, audit operations, and block high-risk commands.
  • System configuration: The operating system contains StarAgent by default. After the server starts, StarAgent automatically initializes the basic O&M environment, including common O&M scripts, scheduled tasks, system accounts, and monitoring agents.
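
The role of the management channel as a single gate can be illustrated with a minimal Python sketch. The article does not describe how StarAgent implements permission control, auditing, or command blocking, so everything here is an assumption for illustration only: the `CommandChannel` class, the deny-list in `HIGH_RISK_PREFIXES`, and the user-to-server permission model are all hypothetical.

```python
import shlex
from dataclasses import dataclass, field

# Hypothetical deny-list; the article does not say which commands count as high-risk.
HIGH_RISK_PREFIXES = [("rm", "-rf", "/"), ("mkfs",), ("shutdown",), ("reboot",)]


@dataclass
class CommandChannel:
    # Illustrative permission model: user name -> set of servers the user may operate on.
    permissions: dict
    audit_log: list = field(default_factory=list)

    def submit(self, user, server, command):
        """Return (allowed, reason) and record every attempt in the audit log."""
        tokens = tuple(shlex.split(command))
        if server not in self.permissions.get(user, set()):
            verdict = (False, "permission denied")
        elif any(tokens[:len(prefix)] == prefix for prefix in HIGH_RISK_PREFIXES):
            verdict = (False, "high-risk command blocked")
        else:
            verdict = (True, "ok")
        # Rejected attempts are audited too, not only the commands that run.
        self.audit_log.append((user, server, command, verdict[1]))
        return verdict


channel = CommandChannel(permissions={"alice": {"web-01"}})
print(channel.submit("alice", "web-01", "uptime"))
print(channel.submit("alice", "web-01", "rm -rf /"))
print(channel.submit("bob", "web-01", "uptime"))
```

Because every command passes through one `submit` call in this sketch, the permission check, the blocking rule, and the audit record cannot be bypassed by choosing a different entry point, which is the property the single channel is meant to provide.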

Portal

  • O&M Market: Also referred to as the plug-in platform, it is similar to the app market on mobile phones. The leader of each business unit can find the desired gadgets in the market and click Install. The gadgets are then automatically installed on the corresponding servers, including servers that are added later through scale-out. The gadgets are developed by developers in the field. You can also upload your own gadgets to O&M Market to share with others.
  • Web Terminal: If you click a server on the portal, a terminal appears automatically. This is equivalent to logging on to the server over SSH. The terminal is automatically authenticated based on the current logon account. It can also be embedded into any other web pages by using JavaScript.
  • File Distribution: It is easy to understand and is not described here.
  • Scheduled Tasks: They are similar to crontab scheduled tasks, with two differences: they can be scheduled at second-level granularity, and their execution is staggered across servers. For example, suppose a task that runs once a minute is added for a group of servers. With crontab, the task starts in the first second of each minute on every server. Our scheduled task also runs once a minute, but at a different second on each server.
  • Server Account: It covers the personal account that an individual uses to log on to a server, public accounts such as the admin account on a server, and the SSH channel that connects one server with other servers.
  • API Account: It is closely related to the APIs described in the API section. To use these APIs, you must apply for an API account in advance.
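
The staggered execution described under Scheduled Tasks can be sketched in Python as follows. The article does not say how StarAgent chooses the execution time for each server. This sketch assumes a stable offset derived from a hash of the host name; the function names `stagger_offset` and `next_run` are hypothetical.

```python
import hashlib


def stagger_offset(hostname: str, period_seconds: int = 60) -> int:
    """Map a host name to a stable offset in [0, period_seconds)."""
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % period_seconds


def next_run(hostname: str, now: int, period_seconds: int = 60) -> int:
    """Return the first epoch second >= now at which this host should run the task."""
    offset = stagger_offset(hostname, period_seconds)
    period_start = now - (now % period_seconds)
    candidate = period_start + offset
    return candidate if candidate >= now else candidate + period_seconds


# Each host keeps the same offset from run to run, but offsets differ between hosts.
for host in ["web-01", "web-02", "db-01"]:
    print(host, stagger_offset(host))
```

With a hash-based offset, each server still runs the task once per period, but the load on any shared dependency is spread over the whole period instead of arriving in the same second.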

API

  • CMD: When you call this API with the target server and a command, the server runs the command. Any command that you could run after logging on to a server can also be run through the CMD API.
  • Plugin: This API corresponds to O&M Market on the portal. If you install certain scripts on a server by using O&M Market, you can call the Plugin API to execute these scripts.
  • File and Store: Both APIs are used for file distribution. They differ in that the File API requires a download source, whereas the Store API lets you post script content directly in the HTTP API call. The File API is implemented based on peer-to-peer (P2P) technology. Alibaba has a file downloading product named “Dragonfly”. When hundreds or thousands of servers download a file at the same time, Dragonfly returns to the source only once, which keeps the pressure on the source light. Dragonfly also supports shared downloading between servers and has been made open-source.
  • Action: This API combines the preceding APIs. For example, it can call the File API to download a script, and then call the CMD API to execute the script. The Action API also supports AND and OR conditions. For example, it calls the CMD API to execute the script only after the script is successfully downloaded.
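
The way the Action API chains other APIs with AND and OR conditions can be sketched as short-circuit composition. This is only an illustration of the idea, not the real API: the article gives no request format or signatures, so `all_of`, `any_of`, and `make_step` are hypothetical names, and the File and CMD calls are replaced by stand-in steps that only record that they ran.

```python
from typing import Callable, List

Step = Callable[[], bool]  # a step returns True on success


def all_of(*steps: Step) -> Step:
    """AND: run steps in order and stop at the first failure."""
    return lambda: all(step() for step in steps)


def any_of(*steps: Step) -> Step:
    """OR: run steps in order and stop at the first success."""
    return lambda: any(step() for step in steps)


def make_step(name: str, succeeds: bool, log: List[str]) -> Step:
    """Stand-in for a real File or CMD call; it only records that it ran."""
    def step() -> bool:
        log.append(name)
        return succeeds
    return step


log: List[str] = []
# Download fails, so the command step is never reached.
action = all_of(make_step("file", False, log), make_step("cmd", True, log))
print(action(), log)
```

In this sketch, a failed download short-circuits the AND chain, so the command step never runs, which matches the behavior described for the Action API.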

Agent

  • Hostinfo: This agent acquires information about a server, including the server name, IP address, and SN.
  • Data channel: The output data of commands or scripts that are executed on each server is directly sent to the data channel. Then, the data channel uploads the data to the center where the data will be consumed.
  • Incremental logs and P2P files: Both agents are developed by third parties and can be installed on each server as plug-ins from O&M Market.

Architecture

Logical Architecture

Deployment Architecture

Problems and Challenges

System Problems

Environment Problems

Stability

Security

Environment

Original Source

Alibaba Cloud
