How Has Cloud Computing Revolutionized O&M?
Now is the best time to develop operation and maintenance (O&M) more professionally.
Lu Tang, who is in charge of elastic computing stability at Alibaba Cloud, mentioned that, “Ops was created only with experience without officially being considered as a specific field. Now, based on the previous experience, it provides standardized services to the outside world.”
He believes that this era helps make O&M truly professional. Accordingly, O&M personnel can reduce costs and improve efficiency for enterprises with their skills and experience. However, this can be a challenging time for many O&M professionals. Today, they need to know not only the business architecture and code but also the kernel and hardware, together with various O&M tools.
With the development of cloud computing, DevOps, and other technology trends, O&M personnel are facing more and more challenges. The problems they faced and the ways they worked before are also being redefined.
This article discusses the concept of O&M in the following three aspects:
- New changes and trends in O&M
- How are the working boundary and path of O&M personnel redefined in the cloud computing era?
- What characteristics should a good O&M system have?
1) New Trends in O&M
Apart from a small number of IT O&M personnel in early large enterprises, China’s O&M industry emerged in line with the development of the internet industry in the 1990s. Therefore, O&M in the internet industry takes the lead and sets the trends and directions for the O&M field.
After 20 years, the internet era has now entered a mature stage. Traditional enterprises are prioritizing digitization, and the challenges and environments that O&M personnel need to deal with have transformed a lot.
1.1) The increasing complexity of enterprise IT systems and greater O&M challenges require a higher degree of automation.
Enterprise IT systems are becoming more and more complex due to increasing digitalization and business growth. A large number of network devices, servers, middleware, and microservices of business systems make the job of IT O&M personnel challenging. Even when they work overtime on maintenance, deployment, and management, the business often faces interruptions due to various failures, which seriously impacts operations.
Meanwhile, as the market becomes more competitive, businesses must iterate faster to seize market opportunities, especially in the internet industry. The speed of bringing products to market and iterating on them has become integral to success. Supporting rapid business iterations has become another challenge for O&M personnel. Manual O&M is barely sustainable under these conditions, so the O&M industry in China turned to automation.
As the saying goes, a handy tool makes a handyman. With technological developments and automated O&M tools, including event monitoring and early warning, automated deployment, automated orchestration, and self-diagnosis, enterprises can improve O&M efficiency.
1.2) With the evolution and extensive adoption of cloud computing, O&M objects, O&M tools, and even the required skills have changed, and the rise of DevOps has attracted attention.
Generally, many enterprises divide O&M into two levels:
Infrastructure O&M: It mainly focuses on IT infrastructure management, including monitoring, notifications and alarms, maintenance, and deprovisioning of physical resources such as servers, switches, and networks.
Application O&M: It mainly focuses on business O&M, including the release and removal of some business applications, release deployment, scaling, and other functions.
For a business, efficiency improvements in application O&M accelerate business iterations and growth more directly. Infrastructure O&M serves as the foundation, and enterprises that build their own data centers focus their O&M primarily on it.
Cloud computing is characterized by “software or services define everything.” Cloud vendors undertake the maintenance and virtualization of the underlying infrastructure. After migrating to the cloud, the main objective of enterprise O&M has shifted from hardware — such as servers — to service API-oriented O&M, including host O&M and application O&M. Accordingly, automated deployment pipelines and continuous delivery DevOps are gaining more and more attention.
Technological development continuously tries to shield the underlying infrastructure and drive the developers’ attention away from the underlying resources, as is the case in serverless, function computing, and other trending topics.
In the earlier stages, there might be several O&M personnel in an enterprise jointly responsible for the maintenance of specific applications from “the bottom to the top.” However, even with more O&M personnel recruited due to business expansion, such a strategy of personnel working together for O&M is unsustainable. In fact, in many large enterprises, much O&M work has begun the preliminary “platformization,” namely to centralize the management of the underlying resources to reduce management costs. Such “platformization” has also promoted services and standards for public components within some enterprises. However, this method is not as efficient as the scale effect of cloud vendors.
The external form of platformization is cloudification. Observing from the inside of enterprises, cloudification has become an irreversible trend. To quote from an article, one of the important features of cloud computing is that it is “out-of-the-box,” with the cloud providers offering centralized O&M management and delivering to end-users as a service. This frees cloud users from much of the tedious daily O&M work and allows them to pay attention to their business development, thereby improving the entire industry’s operational efficiency.
1.3) The Rise of artificial intelligence and big data
In recent years, common O&M concepts not only included DevOps but also DataOps and AIOps, reflecting the need for intelligent and data-driven O&M.
Intelligence is the next step beyond automation and saves O&M personnel even more time. Artificial intelligence (AI) is now widespread in all fields that can be automated, and the O&M field is no exception, so intelligence is bound to be one of its essential directions for development. However, intelligent O&M is still some way off, because most enterprises have not yet fully automated their O&M or even begun to codify it at scale.
2) New Working Boundary and Implementation Path
Changes in the environment have brought about three trends: automation and standardization, development and operations (DevOps), and artificial intelligence for IT operations (AIOps). The enterprise O&M system needs to adopt the relevant concepts, or even undergo a complete transformation. The Alibaba Cloud ECS team believes that to build a future-oriented O&M system, besides focusing on the above new trends, it is necessary to pay attention to the changes in the working boundary and implementation paths of enterprise O&M in the cloud era.
Among many trends in this era, the large-scale popularization of cloud computing undoubtedly had the most significant impact on O&M. The migration of business applications to the cloud led to fewer underlying O&M tasks and a large number of discussions about the O&M personnel crisis.
The ultimate goal of O&M personnel is to help the business realize business value by efficiently coordinating IT resources.
The four aspects that O&M cares about most are efficiency improvement, stability, security, and cost optimization.
Nowadays, these four aspects are still what the O&M personnel are pursuing. However, in the cloud computing era, the working boundaries, implementation methods, and paths have undergone tremendous changes.
2.1) Continuous efficiency improvement: Single-point automation to standardization
Originally, it was common to improve efficiency by writing shell scripts or with the help of open source tools. However, this approach is often single-point, fragmented, and non-standardized. Sometimes, two engineers use different scripts and tools for the same task. Additionally, due to different O&M organizational structures and divisions of work in an enterprise, duplicated capability building or information silos may occur, lowering O&M efficiency.
Therefore, we can say that O&M in the past was built on experience and was not systematic enough, and that experience often depended on individual accumulation.
DevOps, GitOps, and Infrastructure as Code (IaC), which makes infrastructure programmable, are emerging today to change this single-point, non-systematic “automation.” Cloud computing provides various out-of-the-box tools that drive DevOps forward, on top of shielding the underlying hardware. This changes the keywords for O&M efficiency improvement to coding and standardization. O&M personnel abstract and productize their experience to fit the characteristics of their own enterprises, and offer it to R&D engineers as a platform.
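The core idea of IaC, declaring a desired state in code and letting a tool work out the changes, can be illustrated with a minimal Python sketch. The `ServerSpec` fields and the `plan` function are hypothetical names chosen for illustration; they are not the API of any particular orchestration tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerSpec:
    """Desired configuration of one server (hypothetical fields)."""
    name: str
    instance_type: str
    image: str


def plan(desired: list[ServerSpec], actual: list[ServerSpec]) -> dict[str, list[str]]:
    """Compare the declared state with the observed state and list the changes."""
    desired_by_name = {spec.name: spec for spec in desired}
    actual_by_name = {spec.name: spec for spec in actual}
    # Declared but not running: must be created.
    create = sorted(desired_by_name.keys() - actual_by_name.keys())
    # Running but no longer declared: must be removed.
    delete = sorted(actual_by_name.keys() - desired_by_name.keys())
    # Present on both sides but with different settings: must be updated.
    update = sorted(
        name
        for name in desired_by_name.keys() & actual_by_name.keys()
        if desired_by_name[name] != actual_by_name[name]
    )
    return {"create": create, "update": update, "delete": delete}


# The desired state lives in version control; the actual state is what is running.
desired = [ServerSpec("web-1", "small", "img-v2"), ServerSpec("web-2", "small", "img-v2")]
actual = [ServerSpec("web-1", "small", "img-v1"), ServerSpec("old-1", "small", "img-v1")]

changes = plan(desired, actual)
print(changes)
```

Because the desired state is plain data, two engineers reviewing the same change see the same plan, which is the standardization that ad hoc scripts lack.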
2.2) Stability and reliability with less attention to the underlying layer and more attention to applications and services
Stability is the cornerstone of O&M. Traditional O&M deals with physical machines and network devices. In addition, enterprises must build disaster recovery, monitoring, and alerting systems to ensure stable service operations.
Today, cloud computing platform-based technologies such as large-scale geo-disaster recovery and hot migration have achieved relatively high service levels. The O&M personnel of enterprises may only need a few simple API calls or clicks, based on the suggestions of cloud vendors, to avoid the impact of infrastructure on business.
However, since service stability equals infrastructure stability plus code stability, the O&M teams spend more time focusing on the stability of applications and services. At last year’s Global O&M Conference, the concepts “technical operations” and “BizOps” began to emerge, both of which are new value directions for O&M.
The era of O&M focusing on machines is over. Technical operations require that O&M personnel participate more in business and improve the user experience. For example, they should consider whether the cluster should be scaled out during promotions, whether the bandwidth is sufficient, and what the stress testing data shows. BizOps advocates feedback and interaction between the application O&M engineers, who know the system health best, and the demand-side business personnel. The idea is that “O&M makes good systems.”
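The promotion questions above (scale out or not, is the bandwidth enough) reduce to simple arithmetic once stress testing data is available. The following Python sketch shows one way to do that calculation. The function names, the 30% headroom, and all traffic figures are assumptions for illustration only.

```python
import math


def instances_needed(peak_qps: float, qps_per_instance: float, headroom: float = 0.3) -> int:
    """Instances required to serve the expected peak with spare capacity.

    qps_per_instance is assumed to come from stress testing a single instance.
    """
    if qps_per_instance <= 0:
        raise ValueError("qps_per_instance must be positive")
    return math.ceil(peak_qps * (1 + headroom) / qps_per_instance)


def bandwidth_sufficient(peak_qps: float, avg_response_kb: float, bandwidth_mbps: float) -> bool:
    """Check whether outbound bandwidth covers the peak response traffic."""
    # KB per second -> kilobits per second -> megabits per second.
    required_mbps = peak_qps * avg_response_kb * 8 / 1000
    return required_mbps <= bandwidth_mbps


# Hypothetical promotion forecast: 12,000 requests/s at peak,
# 800 requests/s per instance from the stress test, 50 KB average response.
needed = instances_needed(peak_qps=12_000, qps_per_instance=800)
enough = bandwidth_sufficient(peak_qps=12_000, avg_response_kb=50, bandwidth_mbps=5_000)
print(needed, enough)
```

If the cluster currently runs fewer instances than `needed`, that is the signal to scale out before the promotion rather than during it.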
2.3) Security: Self-responsibility to shared responsibility
Security has many dimensions, ranging from vulnerability prevention and network attack prevention to the code checking, permission management, and log auditing commonly used by enterprises, and all the way to higher-level capabilities such as trusted computing and full-link encryption.
In large enterprises, dedicated security teams may have to handle these tasks. For example, for log auditing, a security team needs to collect each log record, analyze and match the records one by one, and update the auditing rules as the business code iterates. This high complexity also makes many small businesses give up or turn to expensive third-party solutions.
Cloud services directly provide multi-level, comprehensive security processes and support fine-grained permission management. For example, all operations on the cloud are recorded and can be audited and traced afterward, a capability that would undoubtedly cost enterprises significantly to build on their own. Alibaba Cloud’s VPC network provides convenient network isolation and traffic control for enterprises. The latest cloud servers of Alibaba Cloud Elastic Compute Service (ECS) are all equipped with security chips to enable the trusted startup of servers without tampering. On this basis, an enclave, an isolated environment for encrypted computing, is used to ensure that data is available but invisible. This also meets financial-level security requirements.
When the internet data center (IDC) was popular, enterprises were solely responsible for IT security. In recent years, the shared responsibility model for cloud security has been accepted universally in the industry. Cloud vendors are accountable for the security of the cloud infrastructure, while users are responsible for the security of business applications above the virtualized layer. Users can select appropriate products in the cloud security market to secure their content, platform, application, system, and network. At the same time, users must implement permission control well to avoid problems such as accidental or malicious database deletion.
2.4) Cost optimization: Fixed cost to FinOps
Technically, the “software defines everything” feature of cloud computing has changed the working methods of O&M and developers. Its “elastic” feature also provides a “cost optimization method” for enterprises to minimize idle resources.
In terms of business model, the “leasing” model of cloud computing is different from the traditional IT hardware procurement. The financial needs of enterprises should transform from capital expenditure (CAPEX) to operational expenditure (OPEX). Cloud computing has a wide range of cost calculation modes, further helping enterprises achieve the best balance between IT flexibility and low cost. Therefore, for O&M personnel, operations on the cloud mean a shift in thinking about cost optimization.
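A small worked example makes the difference between the two cost models concrete. The Python sketch below compares provisioning for peak demand all day with paying only for the instances in use each hour. The demand profile and the price are invented for illustration and do not reflect any vendor's pricing.

```python
def fixed_cost_cents(hourly_demand: list[int], price_cents_per_instance_hour: int) -> int:
    """Cost when capacity is provisioned for the peak during every hour."""
    return max(hourly_demand) * len(hourly_demand) * price_cents_per_instance_hour


def elastic_cost_cents(hourly_demand: list[int], price_cents_per_instance_hour: int) -> int:
    """Cost when capacity follows demand hour by hour (pay-as-you-go)."""
    return sum(hourly_demand) * price_cents_per_instance_hour


# Hypothetical day: 4 instances are enough for 16 hours, 10 are needed for 8 peak hours.
demand = [4] * 16 + [10] * 8
price = 10  # cents per instance-hour, an assumed figure

fixed = fixed_cost_cents(demand, price)      # 10 instances * 24 hours
elastic = elastic_cost_cents(demand, price)  # 64 + 80 instance-hours
idle_instance_hours = max(demand) * len(demand) - sum(demand)
print(fixed, elastic, idle_instance_hours)
```

The gap between the two numbers is exactly the idle capacity that the elastic model avoids paying for. In practice the per-hour price of elastic capacity is often higher than that of reserved capacity, so the real comparison depends on how spiky the demand is.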
As enterprises migrate more of their core business applications from data centers to the cloud, there is an increasing need for budgeting, cost accounting, and optimization in cloud-based environments. It is a significant conceptual and technological change from a fixed cost model to a variable and pay-as-you-go model of cloud billing. However, most enterprises do not yet have a clear understanding of cloud financial management and technical means. The State of FinOps Report 2020 indicated that nearly half of the respondents (49%) have little or no automated methods to manage cloud spending.
FinOps is introduced to help enterprises better understand cloud costs and IT benefits. It is a way of cloud financial management and a change in the enterprise IT operating model. Its goal is to improve the organization’s understanding of cloud costs to enable better decision-making. In August 2020, the Linux Foundation announced the establishment of the FinOps Foundation to advance the cloud financial management discipline through best practices, education, and standards.
A practitioner in the FinOps community shared a practice case from the banking industry. By transforming an application’s architecture to serverless, the bank reduced the monthly cost by 60% compared with on-premises deployment. He pointed out that results in reducing cloud costs have been mixed, because they depend on the level of cloud cost optimization in each enterprise. He divided the adoption of cloud technologies into three stages: crawling, walking, and running. After adopting cloud-based cost optimization methods, enterprises can achieve significant cost optimization results.
Cloud vendors are currently improving their support for FinOps, helping enterprise financial processes better adapt to the variability and dynamic nature of cloud resources. For example, AWS Cost Explorer and Alibaba Cloud cost centers can help enterprises better analyze and allocate costs. Moreover, enterprises also need to reduce costs using agile auto-scaling, service selection, infrastructure as a service (IaaS), and flexible cost calculation modes to get the most out of the cloud.
3) O&M System — Four Features
To recap, cloud vendors handle the monitoring and scheduling of hardware devices in the cloud. The focus of enterprise O&M has changed to the design and construction of the internal O&M system. In other words, it is necessary to abstract and productize the experience in deep combination with the enterprise characteristics to form an O&M system that fits the enterprise itself.
Based on the new trends in O&M automation, DevOps, AIOps, and DataOps, as well as the changes in O&M boundary in the cloud era, a good O&M system should have the following four characteristics.
3.1) Automation and standardization: Using concepts such as DevOps and Infrastructure as Code (IaC)
The foundation of DevOps is not just IaC, but everything as code. Only with code can standardization be achieved, so that O&M platforms and developers can communicate smoothly through standard APIs. Coding is also the basis for the ultimate goal of “AIOps” or “NoOps.”
The ECS automated O&M kit released by Alibaba Cloud ECS embodies the concept and design of IaC. Resource Orchestration Service (ROS) and Operation Orchestration Service (OOS) allow users to implement automated deployment and batch O&M operations through templates, together with drag-and-drop operations for convenience. Gartner, a research organization, recognized “automated cloud orchestration and optimization” as one of the top 10 cloud computing trends in 2021. Alibaba Cloud’s ROS and OOS, AWS’s CloudFormation, and Terraform are similar automated orchestration tools.
The Alibaba Cloud ECS automated O&M kit provides comprehensive monitoring of underlying resources and makes the results available to users in the form of events. Users can subscribe to the events of their cloud resources through OpenAPI or CloudMonitor, which serves as a foundation for building an event-driven automated O&M system.
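The event-driven pattern described here can be sketched in a few lines of Python. This is an in-memory illustration only: the `EventBus` class, the event type strings, and the handlers are hypothetical and do not correspond to the Alibaba Cloud OpenAPI or CloudMonitor interfaces.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], str]


class EventBus:
    """In-memory stand-in for a cloud event subscription (hypothetical)."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event: dict) -> list[str]:
        # Run every handler registered for this event type, in subscription order.
        return [handler(event) for handler in self._handlers.get(event["type"], [])]


def restart_instance(event: dict) -> str:
    # A real handler would call the provider's API; here we only describe the action.
    return f"restart {event['instance_id']}"


def notify_oncall(event: dict) -> str:
    return f"notify on-call about {event['type']} on {event['instance_id']}"


bus = EventBus()
bus.subscribe("instance.crashed", restart_instance)
bus.subscribe("instance.crashed", notify_oncall)
bus.subscribe("disk.nearly_full", notify_oncall)

actions = bus.publish({"type": "instance.crashed", "instance_id": "i-demo-1"})
print(actions)
```

The point of the design is that remediation logic is attached to event types rather than run on a schedule, so new responses can be added without touching the monitoring side.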
3.2) Specific permission management and rapid-integrated security capabilities
With operation tracing and auditing, permission management can effectively control enterprise security risks, prevent database deletion and other events, and enable investigation and review afterward.
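As a rough illustration of how permission checks and audit trails work together, the Python sketch below records every authorization decision, whether allowed or denied. The role names, action strings, and the `AuditedGate` class are assumptions made for this example, not a real cloud permission model.

```python
from dataclasses import dataclass, field

# Hypothetical role-to-permission mapping.
POLICIES: dict[str, set[str]] = {
    "dba": {"db:read", "db:backup", "db:drop"},
    "developer": {"db:read"},
}


@dataclass
class AuditedGate:
    """Checks permissions and keeps a trace of every attempt."""
    audit_log: list[dict] = field(default_factory=list)

    def authorize(self, user: str, role: str, action: str) -> bool:
        allowed = action in POLICIES.get(role, set())
        # Denied attempts are logged too, so they can be reviewed afterward.
        self.audit_log.append(
            {"user": user, "role": role, "action": action, "allowed": allowed}
        )
        return allowed


gate = AuditedGate()
dev_can_drop = gate.authorize("alice", "developer", "db:drop")
dba_can_backup = gate.authorize("bob", "dba", "db:backup")
denied = [entry for entry in gate.audit_log if not entry["allowed"]]
print(dev_can_drop, dba_can_backup, denied)
```

Restricting the destructive action to a narrow role prevents the incident, and the log of the denied attempt supports the investigation and review afterward.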
The cloud assistant of the Alibaba Cloud ECS automated O&M kit records all the operations in ECS. Orchestration tools such as ROS and OOS also support permission management. As mentioned above, Alibaba Cloud has comprehensive security capabilities. In practice, enterprises build their O&M systems on Alibaba Cloud with these automated tools while leveraging the underlying intelligent O&M capabilities of Alibaba Cloud. Together, these make up the complete O&M system that enterprises enjoy on Alibaba Cloud.
3.3) Comprehensive coverage, including automated performance management and cloud finance management tools, which can assist cloud cost optimization.
In the early days, O&M usually focused on single-point automation. However, the O&M system should focus on automating the overall process, covering the entire lifecycle of resources and business applications.
The Alibaba Cloud ECS automated O&M kit provides the entire lifecycle management for cloud servers, including cloud migration, deployment, routine O&M, and elastic scaling. Elastic Scaling Service (ESS) and Auto Provisioning Group (APG) perform resource scaling based on different scenarios. Resource optimization advisors can identify resources with low usage, and users can adjust these resources to improve resource utilization and reduce costs.
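Identifying low-usage resources is, at its simplest, a threshold check over utilization samples. The Python sketch below shows that idea. The instance names, the CPU samples, and the 10% threshold are invented for illustration; how the actual resource optimization advisor works is not described in this article.

```python
from statistics import mean


def underused(cpu_samples: dict[str, list[float]], threshold: float = 10.0) -> list[str]:
    """Return instances whose average CPU utilization (%) is below the threshold.

    Instances without samples are skipped rather than guessed at.
    """
    return sorted(
        name
        for name, samples in cpu_samples.items()
        if samples and mean(samples) < threshold
    )


# Hypothetical utilization samples collected over a monitoring window.
samples = {
    "ecs-a": [3.0, 5.0, 4.0],
    "ecs-b": [55.0, 60.0],
    "ecs-c": [],
}

candidates = underused(samples)
print(candidates)
```

Instances on the resulting list are candidates for downsizing or release, which is where the cost reduction comes from.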
3.4) Intelligence and digitalization
Full implementation of AIOps is still an ideal rather than a reality for most enterprises. Still, an O&M system should at least have the basis for upgrading to intelligence, that is, standardization through code or some intelligent functions. In the Alibaba Cloud ECS automated O&M kit, intelligence is mainly used in the cloud keeper. Cloud keeper refers to a series of intelligent functions of Alibaba Cloud ECS that users can hardly perceive, covering automatic fault diagnosis and repair, automatic monitoring, and analysis and optimization of resources. In addition, intelligent O&M capabilities such as hot migration at the underlying layer of Alibaba Cloud ECS are also included.
DevOps and cloudification are the two main trends in the move from the IDC-host era to the cloud-host era and then to the build-on-cloud era. Neither O&M and R&D personnel nor enterprises and cloud vendors can reverse them.
It is better to embrace new technological trends and turn them into technological dividends and competitiveness than to lament the rapid pace of the times. Practitioners are actively gaining relevant knowledge, while Alibaba Cloud, as a cloud vendor, also hopes to implement DevOps in China and help Chinese enterprises improve their digitization and automation capabilities.