Why Are the Top Internet Companies Choosing SRE over Traditional O&M?
By Yang Zeqiang, nicknamed Zhujian at Alibaba.
The successful conclusion of last year’s Double 11 Shopping Festival, the world’s largest online shopping event, has once again thrown a bright spotlight on Alibaba’s technological innovations and cloud computing capabilities, which supported 268.4 billion CNY (more than 38 billion USD) in sales in a single 24-hour span while maintaining an excellent customer experience for several million online users. When it comes to this year’s Double 11 technology, we cannot forget to mention that Alibaba Group moved all of its core systems to the cloud in 2019 as well.
As the underlying product that made Alibaba Group’s migration to the cloud possible, Elastic Compute Service (ECS) represents the core of Alibaba Group’s cloud migration infrastructure. The long-term efforts of the ECS team were essential to ensuring superior stability and performance during the cloud migration process. As a member of the site reliability engineering team, this post’s author, Yang Zeqiang, participated in this transformation. In this post, Yang shares some insights into the site reliability system put into place at Alibaba Cloud and explains why so many big tech firms are going all in on site reliability engineering, or SRE.
What Is Site Reliability Engineering or SRE?
Site reliability engineering (SRE) refers to the engineering involved in keeping websites reliable. It was first proposed and applied by Google more than a decade ago, and in recent years it has been widely adopted by leading Internet companies inside and outside China. As we see it, Google and Netflix have the two most successful implementations of SRE in the industry: Google built a strong system and became the global authority in the field, while Netflix has taken SRE to new heights in practice. It is reported that, with fewer than 10 core site reliability engineers, Netflix supports the O&M of a service that runs in 190 countries, serves hundreds of millions of paying customers, and involves tens of thousands of microservice instances. With the development of DevOps in recent years, SRE has become a familiar concept in the industry. Leading Internet companies in China, like Baidu, Alibaba, Tencent, and Meituan, have all gradually incorporated SRE into their organizational structures and recruitment. In Alibaba Group, for example, different business units have set up their own SRE teams. However, SRE responsibilities differ from department to department. So what do site reliability engineers ultimately do for a company?
The Responsibilities of Site Reliability Engineers
At Google, site reliability engineers are primarily responsible for service availability, system performance, and capacity-related matters for all of Google’s core business systems. Based on the company’s publication “Site Reliability Engineering”, the work of Google’s SRE teams includes but is not limited to:
- Infrastructure capacity planning
- Production system monitoring
- Load balancing for production systems
- Release and change engineering management
- On-call and firefighting (emergency fault response)
- Work with the business team to solve difficult problems
In China, many SRE departments have responsibilities similar to those of traditional O&M departments: in essence, they handle the technical O&M work that supports Internet services behind the scenes. Unlike this traditional O&M-style SRE, we have been exploring and practicing SRE within the business R&D team for more than a year. I think the core of business-team SRE is the redefinition of R&D and O&M through a software engineering methodology to drive and empower business evolution. In what follows, I will describe some practices for implementing SRE in Alibaba Cloud Elastic Compute Service and the thinking behind them.
Why Should We Establish an SRE System?
As the central product of Alibaba Cloud, Elastic Compute Service (ECS) is the core infrastructure behind Alibaba’s business and e-commerce platforms, running the Group’s internal cloud services and cloud products. As the largest cloud computing vendor in Asia, Alibaba Cloud serves large, medium, and small enterprise customers all over the world, including various private domains and private cloud systems. As the core scheduling brain, the ECS management and scheduling system is of immense importance. With the acceleration of Alibaba Group’s cloud migration and the deployment of cloud products on ECS, hundreds of millions of API calls are made and millions of ECS instances are created each day. Because of all this, the ECS management and scheduling system needs to overcome several challenges in terms of capacity, system performance, and service availability:
- Databases encounter bottlenecks, as even high-specification database instances cannot support business growth for a period of six months.
- As the number of slow SQL statements increases exponentially, application stability is at risk.
- Up to 200 alerts are generated every day across the entire system, showing that hidden risks in the systems are gradually building up to a crisis.
- The workflow framework faces bottlenecks and cannot support business volumes for three months. Therefore, manual O&M is highly risky.
- The high frequency of manual O&M reduces satisfaction with R&D work.
- Long-tail requests seriously affect service quality, and the number of 5xx status code errors continues to rise, affecting the customer experience.
- Inconsistent resources cannot be converged in the long term, so the asset loss problem cannot be resolved.
The Birth of SRE
Before we created the ECS SRE team, one major challenge for my team was how to build a highly available and stable system while also allowing for rapid growth and supporting business development over the next three to five years. Before the SRE team was established, the ECS team was divided into several business domains: instances, storage, images, networks, executive support systems, and ROS. With this organizational structure, the R&D team was able to dig deep into different vertical channels of development. However, the team as a whole lacked the global perspective needed to understand all of our systems, so it was difficult to see the overall picture.
Conway’s law holds that organizations design systems that mirror their own communication structures. Simply put, an organization’s structure tends to be reflected in its system architecture, with the design of the two being more or less the same when boiled down to the fundamentals. Following this logic, we needed an organizational structure that took in the overall perspective across all of our business systems and teams in order to build the stability system, as this would be the best organizational guarantee of the architecture we wanted. This is how the ECS SRE team came into being.
What Does SRE Do?
As we discussed earlier, the responsibilities of Google’s SRE teams include capacity planning, distributed system monitoring, load balancing, service fault tolerance, on-call, firefighting, and business collaboration support. We also briefly described Chinese SRE teams that focus on system O&M. While exploring the implementation of ECS SRE, we learned from past successes in the industry and formed a unique methodology and system of practices based on the business and team characteristics of the ECS team. My personal opinion is that there is no universal standard: we need to constantly explore solutions that fit the present situation, business characteristics, and team features. The following section describes how the ECS SRE team has worked to build a stability system.
Overview of the ECS SRE System
Capacity and Performance
Given that ECS receives up to hundreds of millions of API calls every day, and the peak number of ECS instances created in a single day can be as high as a few million, the capacity and performance of the management services face severe challenges. For example, database capacity could be exhausted, and the system must deal with frequent long-tail requests. With Alibaba Group’s cloud migration, the deployment of cloud products on ECS, and the rapid development of the cloud-native ecosystem, we urgently needed to take measures to prepare for future problems. Consider the workflow engine, which is central to ECS control: with the rapid growth of business volumes, the data in a workflow task table can exceed 3 TB in just one month, meaning that even high-specification databases cannot support several months of business development. In addition to workflows, the core order, purchase, and resource tables all face the same problem. Above all, in periods of rapid business development, ensuring business continuity is the most pressing issue we face. To solve the current capacity and performance problems and prepare for further expansion in the future, we upgraded and renovated the basic components developed by ECS, including the workflow engine and the idempotence, cache, and data cleansing frameworks. To empower other cloud products and teams in the future, all basic components are output in a standard manner as second-party packages.
Basic component upgrade: We upgraded the architecture of the basic business components developed by the ECS team to cope with future large-scale business growth. More specifically, we did the following:
- Workflow engine: In 2014, the ECS team began using a lightweight workflow engine it developed independently, similar to AWS SWF. After its renovation in 2018, the engine supports the creation of hundreds of millions of workflows. We are still working to improve its framework availability, capacity, and overall performance.
- Lightweight idempotence framework: Users can customize business idempotence rules using annotations, and the framework supports business idempotence through a lightweight and persistent method.
- Asynchronous data cleansing framework: Users can use annotations to configure business data cleansing policies.
- Cache framework: The framework supports business-defined cache hit and invalidation rules through annotations and supports batch caching.
Performance optimization: We started to use multidimensional performance optimization policies to improve the performance metrics of control services.
- JVM parameter tuning: We constantly adjust and optimize JVM parameters to reduce the number of FGC activities and the STW time, shorten the service unavailability time, and improve the user service experience.
- Data caching: Multi-level caching of core links reduces database I/O and improves service performance.
- SQL performance tuning: We optimize SQL execution efficiency to improve business performance.
- Core link RT optimization: Business optimization ensures the performance of core ECS creation and startup links.
- API batch service capabilities: Batch service capabilities improve overall service performance.
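The batching idea in the last item can be sketched as follows. The `backend_create` stand-in and the batch size are assumptions; the point is simply that grouping requests amortizes the fixed per-call overhead.

```python
import itertools

_id_counter = itertools.count()

def backend_create(specs):
    """Stand-in for the real control-plane RPC; each call pays a fixed per-call overhead."""
    return [{"id": f"i-{next(_id_counter)}", "spec": s} for s in specs]

def create_instances_batch(specs, batch_size=100):
    """Send up to batch_size specs per backend call to amortize per-call overhead."""
    results = []
    for start in range(0, len(specs), batch_size):
        results.extend(backend_create(specs[start:start + batch_size]))
    return results

out = create_instances_batch([{"type": "ecs.g6.large"}] * 250, batch_size=100)
assert len(out) == 250  # served by 3 backend calls instead of 250
```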
End-to-end Stability Governance
Systematic stability construction has been the most important aspect of our efforts over the past year. I believe that, in the area of stability governance, we must adopt an overall, end-to-end perspective with precise subdivisions from top to bottom. The following section will briefly introduce the stability governance system for ECS control.
1. Database Stability Governance
Databases are the vital lifelines of applications. For ECS control, all core businesses run on ApsaraDB for RDS. If a database fails, the damage to an application is critical on both the control plane and the data plane. Therefore, the first thing SRE does is to maintain these vital lifelines by exercising comprehensive control over database stability. First, let’s take a look at the problems faced by databases for the ECS team in large-scale business scenarios:
- The need for space is growing too fast, and databases will be unable to meet business development needs in the near future.
- The high frequency of slow SQL statements seriously affects application stability.
- The failure rate of database changes is high, and a high proportion of these faults are caused by DDL table changes.
- We often observe abnormalities in RDS performance metrics and various database performance metrics.
- The RDS alert configurations are chaotic, and alert information is often missing or false positives are reported.
When faced with database problems, our strategy is to focus on both databases and businesses. It is not enough to simply optimize the database or perform business optimization alone. For example, in a typical large table problem, a great deal of space is occupied and queries are slow. If we simply provide more space at the database level and perform index optimization, this can solve the problem in the short term. However, when the business scale is large enough, database optimization can only go so far, so business optimization is required. I will now briefly introduce some ideas for optimization:
- When a database occupies a large amount of space, we can use two methods: reduce the space occupied by the current database and control the future growth of the data space. By archiving historical data and releasing data holes, we can reuse data pages and control the disk space used by the database. However, deleting data does not release table space. To release large tables that already occupy a large amount of space, we have to change the business itself and solve the problem through intermediate-table rotation.
- To cope with frequent slow SQL queries, we need to optimize the database and transform the business. At the database level, index optimization improves query efficiency and reduces invalid data to reduce the number of scanned rows. At the application level, caching reduces the number of database reads and business code optimization reduces and avoids slow SQL statements.
- When facing a high failure rate of database changes, we must enhance the control process and add a review process. There are many types of DDL changes, and developers’ limited expertise in and sensitivity to databases increases the risk of database changes. In this case, we can add a checklist and review process for DDL changes to control that risk.
- Facing database performance metric and configuration problems, we can improve database health using a project-based approach and control database alert configurations in a unified manner.
- We can also look into SQL throttling and fast recovery. Slow SQL statements may result in the total unavailability of RDS. To ensure database stability, we are currently exploring ways to limit slow SQL statements through automated and semi-automated methods.
The following figure illustrates several approaches for managing database stability in ECS.
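As a rough illustration of the archive-then-delete approach described above, the following Python sketch moves historical rows in small batches so that each transaction stays short and the primary is not locked for long. It uses SQLite and hypothetical table names for self-containment; a production MySQL/RDS job would differ in detail (throttling by replication lag, pt-archiver-style tooling, and so on).

```python
import sqlite3
import time

def archive_old_rows(conn, cutoff, batch_size=1000, pause=0.0):
    """Move rows older than `cutoff` into an archive table in small batches,
    keeping each transaction short so the database stays responsive."""
    total = 0
    while True:
        ids = [r[0] for r in conn.execute(
            "SELECT id FROM workflow_task WHERE finished_at < ? LIMIT ?",
            (cutoff, batch_size))]
        if not ids:
            break
        marks = ",".join("?" * len(ids))
        conn.execute(f"INSERT INTO workflow_task_archive "
                     f"SELECT * FROM workflow_task WHERE id IN ({marks})", ids)
        conn.execute(f"DELETE FROM workflow_task WHERE id IN ({marks})", ids)
        conn.commit()
        total += len(ids)
        time.sleep(pause)  # throttle to protect the database under load
    return total

# Demo on an in-memory SQLite database with a hypothetical workflow task table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE workflow_task (id INTEGER PRIMARY KEY, finished_at INTEGER)")
conn.execute("CREATE TABLE workflow_task_archive (id INTEGER PRIMARY KEY, finished_at INTEGER)")
conn.executemany("INSERT INTO workflow_task VALUES (?, ?)", [(i, i) for i in range(10)])
conn.commit()
archived = archive_old_rows(conn, cutoff=5, batch_size=3)
assert archived == 5
```

Note that, as the text says, deleting rows does not shrink the table on disk; reclaiming space from an already-bloated table still requires intermediate-table rotation at the business level.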
2. Alert Monitoring and Governance
Alert monitoring and governance are critical for the discovery of problems and faults. Especially in large-scale distributed systems, precise and timely monitoring can help R&D personnel identify problems as soon as possible and mitigate or even avoid faults. However, invalid, redundant, and inaccurate low-quality alerts not only waste time, but also affect the satisfaction of SRE on-call personnel and seriously affect fault diagnosis. The main causes of the low quality of ECS alerts include:
- There are too many alerts: up to 200+ per day, with an average of more than 100, many of which are false positives.
- The many alert channels result in a large number of repeated alerts and high interference.
- Configuration exceptions produce a high risk of alert loss.
- Human resources are wasted because the repeated occurrence of alerts demands a large amount of inefficient work.
- The risks of CLI operations are high, and a large number of CLI operations increase production and O&M risks.
In response to the preceding issues, our strategy is to systematically organize alerts to ensure their authenticity, accuracy, precision, and high quality. Our approach involves the following steps:
- Delete invalid alerts and clear them from the monitoring platform history to increase the authenticity of alerts.
- Optimize alert configurations.
- Unify alert recipients to ensure that alerts are delivered only to the correct people.
- Optimize alert levels so that every alert has a rational severity.
- Configure alert channels by alert type and severity, for example, phone calls for critical alerts, SMS for severe alerts, and DingTalk for common alerts.
- Automate the handling of alerts that currently require manual intervention. For example, if disk space runs short due to a large number of business logs, the fix can be automated by optimizing the log rolling policy.
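The deduplication and channel-routing steps above might look roughly like this in Python. The severity names, the suppression window, and the channel mapping are illustrative assumptions, not the actual ECS configuration.

```python
import time

_last_sent = {}  # alert name -> timestamp of last delivery, for deduplication

def should_send(alert, window=300, now=None):
    """Suppress repeats of the same alert within a `window`-second interval."""
    now = time.time() if now is None else now
    last = _last_sent.get(alert["name"])
    if last is not None and now - last < window:
        return False                      # duplicate within the window: drop it
    _last_sent[alert["name"]] = now
    return True

def route_alert(alert):
    """Pick a delivery channel by severity: critical -> phone, severe -> SMS, else DingTalk."""
    return {"critical": "phone", "severe": "sms"}.get(alert["severity"], "dingtalk")

assert route_alert({"name": "rds-down", "severity": "critical"}) == "phone"
assert should_send({"name": "disk-full"}, now=0) is True
assert should_send({"name": "disk-full"}, now=100) is False   # duplicate suppressed
assert should_send({"name": "disk-full"}, now=400) is True    # outside the window
```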
3. Fault Diagnosis
We adopt a 1–5–10 model for fault recovery: problems are detected in 1 minute, located in 5 minutes, and resolved in 10 minutes. One-minute detection depends on the high-quality monitoring and alert systems mentioned previously, while five-minute locating depends on the system’s fault diagnosis capabilities. To quickly diagnose faults based on existing alert information, we must confront a series of challenges:
- Long system call links: From the console or API to the underlying virtualization or storage systems, there are usually at least 10 layers of RPC calls relying on more than 30 systems, which makes it difficult to stitch the links together across services.
- Complex dependencies between systems: The ECS control system itself adopts a multi-layer architecture with mutual dependencies on the group’s order and billing systems.
- Difficult impact scope analysis: The effects of faults cannot be quantified.
We divided the construction of the fault diagnosis system into three initial phases:
- Establish an end-to-end trace model: We use trace IDs to establish call links and connect services in series.
- Fault diagnostic models for core application scenarios: We must train diagnostic models for current core business links and scenarios prone to faults to make a point-to-plane breakthrough.
- Fault impact scope models: These models automatically analyze the users, resources, and funds that are affected by a fault, facilitating rapid fault recovery and post-fault measures.
4. End-to-end SLO
No system is 100% reliable, and for end users there is no essential difference between 99.999% and 100% availability. Our goal is to ensure a 99.999% availability service experience through continuous iteration and optimization. However, the actions initiated by end users pass through a series of intermediate systems, so a reliability problem in any one system affects the overall customer experience. We therefore need a way to measure the stability of all nodes. To this end, we have started building an end-to-end SLO system. The main strategies are as follows:
- Identify upstream and downstream business dependencies and sign SLO agreements.
- Build end-to-end SLO visualization capabilities.
- Encourage upstream and downstream dependencies to meet SLO standards.
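To illustrate how an availability target like 99.999% translates into something measurable, here is a small request-based error-budget calculation. The formulas are the standard SRE error-budget arithmetic; the traffic numbers are assumptions, not ECS figures.

```python
def error_budget(slo, total_requests):
    """Failures allowed under a request-based SLO over the measurement period."""
    return total_requests * (1 - slo)

def budget_consumed(slo, total, failed):
    """Fraction of the error budget already spent; above 1.0 the SLO is breached."""
    return failed / error_budget(slo, total)

# At 99.999% over 100 million calls, roughly 1,000 failed calls are tolerable.
assert round(error_budget(0.99999, 100_000_000)) == 1000
assert budget_consumed(0.99999, 100_000_000, 500) < 1.0  # still within budget
```

Signing SLO agreements with upstream and downstream dependencies then amounts to agreeing on these targets per interface, and the visualization layer tracks budget consumption over time.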
5. Resource Consistency Governance
The Consistency, Availability, and Partition tolerance (CAP) properties of a distributed system cannot all be achieved simultaneously. As a large-scale distributed system, the ECS control system also faces resource consistency problems. Specifically, data inconsistencies exist for ECS instances, disks, and bandwidth across systems such as ECS control, orders, and metering. To ensure availability, distributed systems usually implement eventual consistency at the expense of the real-time consistency of some data. Based on the technical architecture and business characteristics of ECS, we implement the following policies to ensure resource consistency:
- Data-driven: First, we establish an end-to-end visual reconciliation system to fully digitize all inconsistent resources.
- Finance and assets: We measure consistency issues from two perspectives, resources and asset losses.
- Offline (T+1) and real-time (hourly) reconciliation: We use these two methods to stop losses in a timely manner.
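A reconciliation job of the kind described above boils down to diffing two systems’ views of the same resources. The record shapes, field names, and system pairing below are assumptions for illustration:

```python
def reconcile(control_records, billing_records):
    """Diff two systems' views of the same resources by ID and report what each side is missing."""
    control = {r["id"] for r in control_records}
    billing = {r["id"] for r in billing_records}
    return {
        "missing_in_billing": sorted(control - billing),  # running but unbilled: potential asset loss
        "missing_in_control": sorted(billing - control),  # billed but absent from the control plane
    }

diff = reconcile(
    [{"id": "i-1"}, {"id": "i-2"}],
    [{"id": "i-2"}, {"id": "i-3"}],
)
assert diff == {"missing_in_billing": ["i-1"], "missing_in_control": ["i-3"]}
```

Running this offline over full T+1 snapshots and hourly over recent changes gives the two loss-stopping cadences mentioned above.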
SRE Process System Construction
In a context where nearly 100 people develop concurrently, core applications are released daily, and several thousand releases occur throughout the year, a complete set of R&D and change process assurances has been one of the key factors allowing ECS to reduce its failure rate each year. The following section briefly describes some explorations the ECS SRE team has made concerning the R&D and change process systems.
R&D process: Increase the standardization of the R&D process throughout the software lifecycle.
1. Design Process and Specifications
From the perspective of software engineering, the earlier problems are addressed, the lower the labor and economic costs. The subsequent maintenance costs of a poorly designed system are much higher than the cost of implementing a better design up front. To control design quality, we have explored the following design processes and specifications:
- Strengthen the preliminary design and unify the design templates. The complete design should include the architecture design, detailed design, test cases, horizontal design, monitoring and alerts, phased release plan, release plan, and other plans.
- Design reviews are conducted using a parallel online (DingTalk live broadcasts) and offline model.
2. Code Review Upgrade
Previously, ECS code review was mainly performed on GitLab. The main problems were that GitLab’s integration with the relevant internal Alibaba continuous integration components was unstable, and admission gates could not be configured on a schedule. The Aone Code Review platform solves the integration problem with the Aone lab and provides scheduled configuration for code integration. In addition, we have defined a code review process for ECS control, as shown below:
- We have implemented unified R&D specifications, including development environment specifications, coding specifications, and group development specifications.
- A code review checklist is provided.
- Git commit allows the association of issues, so that code can be associated with requirements, tasks, or defects, and then the issues can be tracked.
- Unobstructed static scanning is implemented.
- The UT pass rate is 100%, and the coverage rate is not lower than that of the trunk (focus on the UT coverage rate).
- The code specifications comply with Alibaba code specifications.
- Review of key business points is implemented.
- Merge requests (MRs) must provide standard information describing the change content.
3. End-to-end CI Standardization
We have migrated all the core ECS control applications to the standard CI platform in a unified manner and in accordance with a standard mode. This greatly improved the CI success rate and reduced the waste of human resources due to the need for frequent manual intervention. Our solution is as follows:
- Provide a standardized CI access mode and CI pipeline.
- Prepare the environment.
- Run unit tests.
- Run coverage analysis.
We also transformed the UT running mode to run tests in parallel, improving UT execution efficiency.
4. End-to-end Daily/Isolated Environment Interconnection
The ECS deployment environment is extremely complex. Specifically, the deployment architecture is complex and there are many deployment tools and dependencies. The environment is dependent on all the core middleware and applications of Alibaba Group and Alibaba Cloud. By adopting an approach where nearly 100 people perform R&D in parallel, the stable and reliable end-to-end daily environment is the basic guarantee of R&D efficiency and quality. The transformation of the end-to-end daily environment cannot be achieved overnight. Our current construction approaches are roughly as follows:
- Comprehensive containerization supports both daily and isolated environments.
- Third-party services depend on Mock.
- The end-to-end test environment is interconnected.
5. Staging Environment Usage Specifications
The staging environment and production environment use the same database, so staging testing can easily affect the stability of production services. Given that data cannot be isolated between the staging and production environments, our short-term solution was to improve the quality of the staging code through standardized processes to minimize or avoid such problems.
- Staging is treated as equivalent to production: code can be deployed to staging only after passing CI and basic daily verification.
- DDL changes and large-table queries can be run in staging only after review. This prevents slow staging SQL statements from impacting RDS stability and affecting the production environment.
6. FVT Release Access
We verify the stability of staging code by running API-based functional test (FVT) cases early each morning. This is the last line of defense for daily release access: a 100% FVT pass rate goes a long way toward ensuring the success of daily releases for ECS core control.
7. Exploration of Unattended Release
In the current release mode, the on-duty staff pulls a release branch from the develop branch the night before the release and deploys it to staging. On the day of the release, we check that the FVT success rate is 100% and then release the branch in batches through Aone, with staff observing business monitoring indicators, alerts, and error logs for each batch. In this mode, core applications are released every day, and the process takes about half a person-day of work. To further improve the efficiency of human resources, we have looked into automating the release process:
- Assembly line automated staging deployment.
- Automated release access verification: The FVT success rate is used for automated release based on business rules.
- Unattended release: This is the ideal release model, with automated release after continuous integration and when all release access card points pass verification.
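The release access card points might be combined into a single automated gate along these lines. The specific checks and thresholds here are illustrative assumptions, not the actual Aone configuration:

```python
def release_gate(fvt_pass_rate, new_error_logs, active_alerts):
    """Automated release admission: proceed only when every card point passes."""
    checks = {
        "fvt": fvt_pass_rate == 1.0,    # FVT pass rate must be 100%
        "errors": new_error_logs == 0,  # no new error-log growth in the observed batch
        "alerts": active_alerts == 0,   # no active business alerts during observation
    }
    failed = [name for name, ok in checks.items() if not ok]
    return len(failed) == 0, failed

ok, failed = release_gate(1.0, 0, 0)
assert ok and failed == []
ok, failed = release_gate(0.98, 0, 2)
assert not ok and failed == ["fvt", "alerts"]  # release is blocked, with reasons
```

Unattended release is then just this gate evaluated automatically after continuous integration, batch by batch, with no human in the loop unless a check fails.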
Change process: We improve change efficiency while ensuring change quality by standardizing the change process, connecting to GOC for strong control, moving changes to a GUI (white-screen operations), and automating the process.
- Standard Control Process Definition
We ensure that all changes can be monitored, pre-released, and rolled back by restricting existing control change behaviors such as hot upgrades, configuration changes, DDL changes, constraint configuration changes, data correction, and CLI operations.
- Strong Control Access
By connecting to the group’s strong control system, we ensure that all changes can be traced and reviewed. We also hope to connect the platform to the strong control system to eliminate cumbersome manual change work.
- GUI-based Development
By integrating all ECS resources, management systems, diagnostics, performance, O&M, visualization, and the ECS Operations and Maintenance System capabilities, we have created a unified, secure, and efficient O&M platform for elastic computing.
- Change Automation
We will automate all the cumbersome tasks that require manual intervention.
Stable Operation System
During the construction of the stability system, the capacity and performance optimization of basic components, the construction of the end-to-end stability system, and the upgrade of the R&D and change processes provide the foundation for stable operations. For long-term planning and efficient operation, a stability culture and continuous operation are also indispensable. The following are some of the approaches the ECS SRE team has taken to build a stability operation system.
On-call rotation: Google SRE adopts a 24/7 on-call rotation system responsible for monitoring and handling production system alerts and for firefighting. SREs are essentially software engineers. In the ECS control team, each SRE must handle online alerts, respond to emergencies, and participate in troubleshooting difficult problems while also performing R&D work. To ensure that the core R&D work of SREs is not interrupted, we are working to implement an on-call rotation mechanism.
- On-call responsibilities
- Monitoring and alert handling: The staff must monitor and handle production environment alerts in a timely manner.
- Firefighting: The staff must coordinate with business teams to solve stability problems in the production environment.
- Stability problem troubleshooting: The staff must investigate potential stability risks in the production system in depth.
- End-to-end stability inspection: This includes the inspection of SLO metrics, error logs, RDS health status, and slow SQL statements for core services of the production system.
- Participation in fault replay: These faults include GOC faults and online stability problems.
- Experience output: The staff must summarize the fault diagnosis, troubleshooting, and fault replay experience gained in the on-call process.
- How Can New Employees Quickly Join On-call
- Based on on-call templates, new personnel can follow the instructions and accomplish clear objectives.
- New employees can use the on-call knowledge base and training manual.
- The new employees participate in rotations to learn from practice.
How is fault replay performed?
The fault replay mechanism reviews, after the event, issues that occurred or affected internal stability. In ECS, problems affecting production stability are uniformly defined as “internal faults”. Our view is that all internal faults have the potential to become real faults, so they must receive sufficient attention. For this reason, we often communicate and cooperate with the group’s fault team and have studied internal fault replay and management modes. The following describes some basic concepts of fault replay and some ECS control practices. Fault replay is not intended to assign blame, but to discover the underlying technical and management problems behind the fault symptoms:
- Avoid accusation
- Focus on events, not people
- Reward people who do the right thing
- Collaborate and share knowledge
- Fault Replay Method
- The responsible person fills out the fault replay report.
- SREs and stakeholders participate in the review (hold offline meetings in the case of serious faults).
- Fault stakeholders ensure that the remediation actions are implemented according to the ETA.
- Fault Replay Document Library
The summaries of fault replays are an important knowledge asset. Internally, we produce an in-depth summary of each fault replay and form the internal knowledge base “Learn From Failure”.
- Stable Daily Operations
Stability itself is a product that requires daily, continuous operation. The main ECS control modes include daily stability reports and biweekly stability reports.
- The daily stability reports and the T+1 FBI reports summarize core metric data across the entire system, such as workflows, API success rates, resource consistency, and asset losses. These reports are mainly used to promptly detect system risks and promote solutions.
- Biweekly stability reports are published every two weeks by email. They provide periodic summaries of end-to-end stability issues, including faults and internal stability problems, as well as announcements of core issues, and analyses of core link indicators.
My Understanding of SRE
The previous content briefly introduced some practical experience of the ECS SRE team. As an SRE, I have participated in ECS stability governance and R&D work since 2018. Next, I will share some of my thoughts about SRE practices over the past year. These are simply my own personal opinions.
Several Misconceptions About SRE
1. SRE is O&M
SRE is more than just O&M. It is true that, in some companies, the responsibilities of SREs are similar to traditional O&M or system engineers. However, generally and certainly in the future, SRE is a position that demands a wide range of skills beyond O&M capabilities, such as software engineering, technical architecture, coding, project management, and team collaboration.
2. SREs Do Not Require Business Skills
If you lack a business architecture, you lack a soul! SREs are unqualified if they do not understand the business. SREs must participate in the optimization and future planning of the technology and O&M architecture. At the same time, they should coordinate with the business team to perform troubleshooting and solve difficult problems. These tasks cannot be performed well without a clear understanding of the business.
SRE Capability Model
When addressing the misconception that SRE is nothing more than O&M, I mentioned that SREs require a wide range of capabilities. I now want to present my idea of an SRE capability model for the future. This is only a preliminary idea to be used for reference.
1. Technical Capabilities
- R&D capabilities
Business team SREs must first possess R&D capabilities. For instance, elastic computing SREs need to develop common middleware components, such as workflow frameworks, idempotence frameworks, cache frameworks, and data cleansing frameworks. R&D capability is the most essential skill for an SRE.
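As a hedged illustration of what one such middleware component might look like (a minimal in-memory sketch, not the actual ECS idempotence framework), the helper below records the result of an operation under a client-supplied token so that retries do not repeat side effects:

```python
import threading

class IdempotenceStore:
    """Minimal idempotence helper: the first call with a given token
    executes the operation; retries with the same token return the
    recorded result instead of running it again. A production version
    would persist tokens and handle expiry."""

    def __init__(self):
        self._results = {}
        self._lock = threading.Lock()

    def execute(self, token, operation):
        # Run under a lock so concurrent retries see a single execution.
        with self._lock:
            if token not in self._results:
                self._results[token] = operation()
            return self._results[token]
```

This mirrors the pattern behind idempotent create APIs: the caller supplies a token, and the server guarantees at-most-once execution per token.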
- O&M capabilities
SRE evolved from O&M in the DevOps development process. For both manual and automated O&M, SREs must possess comprehensive O&M capabilities. In the elastic computing team, SREs are responsible for ensuring the stability of the production environment (networks, servers, databases, middleware, and so on). During daily on-call and fault emergency response work, O&M capabilities are essential.
- Architectural capabilities
SREs should not only focus on the current stability and performance of the business, but also plan the capacity and performance of the business from a future perspective. This requires familiarity with the business system architecture and excellent architectural design capabilities. For an elastic computing SRE, one important task is to plan the technical architecture from a future-oriented perspective and provide an executable roadmap.
- Engineering capabilities
Here, engineering capabilities mainly refer to software engineering and reverse engineering abilities. First, SREs must be able to think like software engineers and carry out large-scale software engineering tasks. In addition, one of the core daily tasks of an SRE is handling stability problems and other difficult problems. Reverse engineering capabilities play a critical role in troubleshooting abnormal behavior in a large-scale distributed system, especially when handling unfamiliar problems.
2. Soft Skills
- Business capabilities
An SRE who does not understand the business is not qualified to be an SRE. In particular, business team SREs can better carry out architecture planning and troubleshooting only when they are familiar with the business’ technical architecture, development status, and even the details of the business modules. For instance, an elastic computing SRE must be familiar with the current elastic computing business dashboard, future development plans, and even the business logic of the core modules.
- Communication skills
As an engineer, there is no doubt that communication skills are essential. Most of the work done by SREs involves different teams and even different business units, so communication skills are particularly important. In the elastic computing team, SREs must communicate and cooperate closely with multiple business teams to ensure business stability. Externally, we must cooperate with the group’s unified R&D platform, basic O&M, monitoring platform, middleware, and network platform teams. Sometimes we even interact directly with external customers. Therefore, I cannot stress the importance of communication skills enough.
- Team collaboration
SREs must appreciate the importance of team collaboration, especially during fault emergencies, when we must cooperate closely with multiple teams to reduce the mean time to recovery (MTTR). During daily work, SREs must actively coordinate with the business team and external dependency teams to guide and promote stability-related work.
- Project management
The work of an SRE is technically complex and operationally tedious. When daily on-call and firefighting responsibilities are added, project management becomes very important at the team level, ensuring that all work is carried out in an orderly and healthy manner. From a personal point of view, time management is extremely valuable. In my own elastic computing SRE team, we carried out several small projects over the past year to implement the stability system quickly, and the results were very good. Currently, we are also managing virtual organizations and long-term projects.
3. Ways of Thinking
As mentioned earlier, SREs require team collaboration and engineering capabilities. At the same time, SRE personnel must upgrade how they think. For example, they should be able to think in reverse, have an awareness of cooperation, show empathy, and quickly adapt to new situations.
Core Concepts of SRE
From my own experience, I think that these are the core concepts of SRE:
- We use a software engineering methodology to solve the stability problems of production systems.
- We try to automate all the tasks that consume the team’s time.
- We view stability as a product.
- We believe team collaboration and empowerment are critical.
- We do not look for silver bullets, but seek solutions suitable for businesses and teams.
- If we prioritize the 20% of our work that is most important, we can solve 80% of core problems.
The Spirit of an SRE
We are the last line of defense. Site reliability engineers must have a strong sense of responsibility and mission. As the guardians of stability, we should be fearless and determined in the process of team collaboration.
- We must bravely face challenges. First, site reliability engineers should always keep stability, our ultimate goal, in focus, and have the courage to refuse any behavior that undermines it. Second, we should look at problems from the perspective of the future in order to innovate bravely and face challenges head-on.
- We must stand in awe of production. Site reliability engineers are the guardians of the production environment, but they can also be its destroyers. The organization gives site reliability engineers broad authority to change the production environment, but this power comes with great responsibility. More than anyone else, site reliability engineers must be careful and cautious.
- We must embrace risks. No matter how professional and cautious we may be, faults will still occur. Site reliability engineers view risks as learning opportunities, face them in a scientific manner, and avoid failure by constantly learning how to respond to risks.
In this era of explosive information growth, technology is developing rapidly. Technicians must not only maintain their enthusiasm for technology, but also have the ability to think. There are no universal solutions, only provisional solutions tailored to local conditions and constraints. We can expect the road ahead to be difficult and full of twists and turns.