By Shanlie, Alibaba Cloud Solution Architect
According to the “Research Report on Chinese Children’s Programming Industry” and the “Analysis and Forecast Report on Chinese Children’s Programming Market for 2017–2023”, the children’s programming market is so promising that it is expected to reach up to RMB 50 billion within 3–5 years.
In today’s information age, artificial intelligence has brought many changes to society. Parents in the Internet era are different from parents of the previous generations. They pay more attention to children’s quality education and their competencies in artificial intelligence. Therefore, children’s programming education has developed rapidly.
Walnut Programming has taken the lead in the children’s programming education industry. It is committed to promoting programming education through science and technologies. It also aims to inspire Chinese children to learn through advanced technologies and scientific educational strategies, such as artificial intelligence and adaptive learning. Since August 2017, Walnut Programming has enjoyed rapid business development, with the number of paid students exceeding 2 million and the monthly revenue exceeding RMB 100 million within three years.
With the rapid growth of the Walnut Programming business, the scale and complexity of its core applications have also changed considerably. The Walnut Technical Team has kept the overall system architecture technologically current by continuously adopting emerging technologies. Within three years, the team carried out at least six major restructurings of the overall system architecture, involving important technologies such as microservices, containerization, and distributed databases. The team has also tried to improve the elastic scalability of the system through Serverless. During the pandemic, Walnut Programming’s system architecture withstood the sudden surge in system workload.
As the system architecture becomes more complex, Walnut Programming has run into a long-standing problem in the Internet field: “How can we improve the observability of a distributed system?” In online programming teaching scenarios, a simple user operation may involve multiple interactions between the frontend and backend systems and calls between multiple microservice applications on the server, which may in turn be affected by third-party service interfaces. A failure or performance bottleneck in any link leads to a drastic decline in user experience. Because user experience is the core element of the brand image, the Walnut Technical Team has to meet several requirements when building system observability to guarantee an excellent user experience:
- Know the performance and quality of each external system interface comprehensively and in real time.
- Use data to understand the system health that end users perceive while interacting with the system.
- When the system is unhealthy, be notified in a timely manner so that the team can start working on a solution.
- When solving problems, locate system bottlenecks and failure sources quickly.
It is very difficult for any technical team to build a distributed observability system that covers these aspects from scratch. Fortunately, the industry offers many mature methodologies and open-source projects for building distributed observability that can serve as references.
The observability widely recognized by the industry consists of three core elements: Logging (discrete log information), Metrics (aggregated indexes), and Distributed Tracing. Centering on these three core elements, many open-source projects can help developers build a distributed observability system quickly.
The Walnut Technical Team has established a complete distributed observability system using open-source technologies such as SkyWalking and Prometheus. It implements end-to-end tracing for complex microservice applications on the server and collects and analyzes business logs through a unified log service system. This improves system stability and user experience. When a link on the server side fails or a performance bottleneck appears, the system notifies the Technical Team immediately so that the problem can be located and solved quickly.
The best way to build frontend observability is to choose a complete solution provided by a cloud computing vendor. Over the years, Alibaba has developed a unified frontend monitoring solution that is available to all internal business departments. Frontend applications in the form of HTML pages, whether a PC website, a mobile website, or an HTML5 page embedded in a mobile app, can be connected to this frontend monitoring solution in a non-intrusive way.
This monitoring solution is also provided externally through Alibaba Cloud. It has become an important part of the overall observability solution of Alibaba Cloud, serving external users.
There are two client-side monitoring products: ARMS frontend monitoring and app monitoring. ARMS frontend monitoring focuses on web-based experience data. It monitors the health of web pages, including page loading speed, page stability, and the success rate of external service calls. It helps reduce page loading time and JS errors and improves the user experience.
This solution can make up for Walnut Programming’s weaknesses in client-side monitoring. Therefore, the Walnut Technical Team connected Alibaba Cloud ARMS frontend monitoring to some of its businesses on a trial basis. Before long, the benefits of this solution for the user experience began to show.
Metrics such as first paint time, first meaningful paint, and DOM Ready are performance metrics specific to HTML pages, and they follow the business metrics definition. These metrics are closely related to the health of the frontend pages and affect every interaction between end users and the system.
The waterfall plot of page loading shows the response time in each stage based on the page loading order. These metrics include the performance metrics of the network. Performance bottlenecks on the network, for example, the access bandwidth of an application system unable to support the user access traffic, cannot be detected only by server-side monitoring. Instead, the client-side real-time monitoring data is needed to report such bottlenecks. Through ARMS frontend monitoring, Walnut Programming can grasp the end-to-end health of each application system during page production (server-side state), page loading, and page running.
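The way a waterfall view splits page loading into stages can be sketched as follows. This is a minimal illustration, not ARMS's own computation: the attribute names come from the W3C Navigation Timing API, while the stage labels, the `waterfall` and `slowest_stage` helpers, and the sample timestamps are assumptions made for the example.

```python
from typing import Dict, List, Tuple

# (stage label, start attribute, end attribute).
# Attribute names follow the W3C Navigation Timing API; the stage labels
# are illustrative and not taken from the ARMS console.
STAGES: List[Tuple[str, str, str]] = [
    ("dns", "domainLookupStart", "domainLookupEnd"),
    ("tcp", "connectStart", "connectEnd"),
    ("request", "requestStart", "responseStart"),    # wait for first byte
    ("response", "responseStart", "responseEnd"),    # content download
    ("dom_parse", "responseEnd", "domInteractive"),
    ("resources", "domContentLoadedEventEnd", "loadEventStart"),
]


def waterfall(timing: Dict[str, int]) -> Dict[str, int]:
    """Return the duration in milliseconds of each page-loading stage."""
    return {name: timing[end] - timing[start] for name, start, end in STAGES}


def slowest_stage(timing: Dict[str, int]) -> str:
    """Return the label of the stage that took the longest."""
    durations = waterfall(timing)
    return max(durations, key=durations.get)


# Hypothetical timestamps (ms since navigation start) for one page view.
sample = {
    "domainLookupStart": 5, "domainLookupEnd": 25,
    "connectStart": 25, "connectEnd": 65,
    "requestStart": 66, "responseStart": 186,
    "responseEnd": 486, "domInteractive": 686,
    "domContentLoadedEventEnd": 700, "loadEventStart": 950,
}

print(waterfall(sample))
print(slowest_stage(sample))
```

In this made-up sample the content download stage dominates, which is the kind of network-side signal that server-side monitoring alone would not reveal.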
ARMS frontend monitoring can aggregate and analyze performance metrics by geographic location, browser, operating system, resolution, network operator, and application version to help Walnut Programming locate performance bottlenecks more precisely. For example, the geographical distribution view aggregates data by location to show the average first paint time of pages in each province in China. When the CDN of a region fails, the geographical distribution view helps Walnut Programming locate the cause of the problem quickly. By contrast, traditional monitoring cannot cover these scenarios.
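Aggregation by dimension can be illustrated with a small sketch. It assumes each page view is reported as a record with a `province` field and a `first_paint_ms` field; the field names, the `average_by` and `outliers` helpers, the "twice the median" rule, and the sample data are all hypothetical and only show the idea behind a geographical distribution view.

```python
import statistics
from collections import defaultdict
from typing import Dict, Iterable, List


def average_by(samples: Iterable[dict], dimension: str,
               metric: str = "first_paint_ms") -> Dict[str, float]:
    """Group samples by one dimension and average the metric per group."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for sample in samples:
        groups[sample[dimension]].append(sample[metric])
    return {key: sum(values) / len(values) for key, values in groups.items()}


def outliers(averages: Dict[str, float], factor: float = 2.0) -> List[str]:
    """Return groups whose average exceeds `factor` times the median average.

    The factor is an arbitrary illustrative threshold.
    """
    median = statistics.median(averages.values())
    return sorted(key for key, value in averages.items() if value > factor * median)


# Hypothetical page-view records.
page_views = [
    {"province": "Zhejiang", "first_paint_ms": 400},
    {"province": "Zhejiang", "first_paint_ms": 600},
    {"province": "Beijing", "first_paint_ms": 450},
    {"province": "Beijing", "first_paint_ms": 550},
    {"province": "Sichuan", "first_paint_ms": 1800},
    {"province": "Sichuan", "first_paint_ms": 2200},
]

by_province = average_by(page_views, "province")
print(by_province)
print(outliers(by_province))
```

The same `average_by` call works for any other dimension in the record, such as browser or network operator, which is why a single reporting format can feed several views.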
After adopting the frontend observability provided by ARMS, Walnut Programming made frontend page health metrics an acceptance criterion for daily business iteration, combined with the gray release plans of all business lines. Every version upgrade in the production environment goes through gray release. First, a small share of user traffic is routed to the new version to verify functionality, stability, and health. Traffic to the new version is increased gradually only when the predefined metrics are met. Otherwise, the version is rolled back immediately. Frontend health metrics are very important and cannot be fully collected through ordinary pre-release tests. By incorporating frontend health into the measurement of business iteration, Walnut Programming applies gray release, observability, and rollback throughout the process. These are also the three widely promoted principles for production safety at Alibaba.
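The gating logic of such a gray release can be sketched as a simple loop. This is a hedged illustration rather than Walnut Programming's actual pipeline: the `Health` fields, the thresholds in `is_healthy`, the traffic steps, and the `observe` callback are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Health:
    js_error_rate: float        # fraction of page views with a JS error
    avg_first_paint_ms: float   # average first paint time
    api_success_rate: float     # fraction of successful API calls


def is_healthy(health: Health) -> bool:
    """Check the predefined metrics; the thresholds are illustrative only."""
    return (health.js_error_rate <= 0.01
            and health.avg_first_paint_ms <= 1000
            and health.api_success_rate >= 0.99)


def gray_release(steps: List[int],
                 observe: Callable[[int], Health]) -> Tuple[str, List[int]]:
    """Raise the traffic share step by step and roll back on the first failure.

    `observe(percent)` stands in for reading frontend health metrics after
    `percent` of user traffic has been routed to the new version.
    """
    applied: List[int] = []
    for percent in steps:
        applied.append(percent)
        if not is_healthy(observe(percent)):
            return "rolled_back", applied
    return "released", applied


def always_good(percent: int) -> Health:
    return Health(0.001, 600, 0.999)


def degrades_at_20(percent: int) -> Health:
    # Simulates a regression that only shows up under more traffic.
    return Health(0.05, 600, 0.999) if percent >= 20 else Health(0.001, 600, 0.999)


print(gray_release([5, 20, 50, 100], always_good))
print(gray_release([5, 20, 50, 100], degrades_at_20))
```

The second run stops at the 20% step, so the remaining users never see the faulty version.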
In addition to actively observing and analyzing frontend service metrics in the ARMS console, it is even more important to receive timely notifications and alerts when user experience problems occur. This can be implemented easily through the mature alert mechanism of ARMS. Based on its understanding of frontend health and industry-wide methodologies, Walnut Programming has created alert rules across various dimensions, such as “the average response time for first paint in the last five minutes is greater than one second.” When a rule is triggered, the system sends an alert notification to the specified contact group in the specified alerting mode, informing the Technical Team so that it can take timely action. Together with the grading and classification of production failures, these alert rules help the Walnut Technical Team establish a complete set of response mechanisms for production failures. As a result, online problems can be discovered within 5 minutes, isolated within 10 minutes, and solved within 30 minutes.
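The example rule quoted above, average first paint over the last five minutes greater than one second, can be expressed as a sliding-window check. The sketch below is not how ARMS evaluates rules internally; the `FirstPaintAlertRule` class and its methods are hypothetical, and it assumes samples are recorded in timestamp order.

```python
from collections import deque
from typing import Deque, Tuple


class FirstPaintAlertRule:
    """Fires when the average first paint time in the window exceeds a threshold."""

    def __init__(self, window_seconds: float = 300, threshold_ms: float = 1000):
        self.window_seconds = window_seconds
        self.threshold_ms = threshold_ms
        # (timestamp in seconds, first paint in ms), assumed to arrive in order.
        self._samples: Deque[Tuple[float, float]] = deque()

    def record(self, timestamp: float, first_paint_ms: float) -> None:
        self._samples.append((timestamp, first_paint_ms))

    def evaluate(self, now: float) -> bool:
        """Drop samples older than the window, then compare the average."""
        cutoff = now - self.window_seconds
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()
        if not self._samples:
            return False  # no data in the window, so nothing to alert on
        average = sum(value for _, value in self._samples) / len(self._samples)
        return average > self.threshold_ms


rule = FirstPaintAlertRule()
rule.record(0, 3000)    # one very slow page view
rule.record(100, 800)
rule.record(200, 900)

fired_early = rule.evaluate(now=250)   # slow sample still inside the window
fired_late = rule.evaluate(now=400)    # slow sample has aged out
print(fired_early, fired_late)
```

A real rule would then hand the result to a notification step for the configured contact group; that part is omitted here because it depends on the alerting channel.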
Walnut Programming is also actively exploring unified tracing across the frontend and backend. It links each API request sent from the frontend with the calls it triggers in the backend into a single trace and reproduces the complete code execution scenario. This is achieved by automatically injecting trace information into frontend API requests. When automatic API reporting is enabled, ARMS frontend monitoring can add an automatically generated TraceID to the request header of each API request as the identifier that connects the frontend and backend traces. The call timeline shows whether network transmission or the backend call accounts for an excessive request time. With the thread profiling function for backend applications, the complete backend call chain of each request can be examined clearly. This is very helpful for troubleshooting system failures and performance bottlenecks.
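The idea of joining frontend and backend records through a shared trace identifier can be sketched in a few lines. This is only a conceptual illustration: the header name `X-Trace-Id`, the helper functions, and the timing figures are assumptions, and the actual header that ARMS injects is not specified in this article.

```python
import uuid
from typing import Dict

# Hypothetical header name used only for this sketch.
TRACE_HEADER = "X-Trace-Id"


def with_trace_id(headers: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the headers with a trace ID, keeping any existing one."""
    traced = dict(headers)
    traced.setdefault(TRACE_HEADER, uuid.uuid4().hex)
    return traced


def split_request_time(frontend_total_ms: float, backend_ms: float) -> Dict[str, object]:
    """Attribute the request time seen by the browser to network or backend.

    `frontend_total_ms` is the duration measured in the browser and
    `backend_ms` is the server-side duration recorded under the same trace ID.
    """
    network_ms = max(frontend_total_ms - backend_ms, 0)
    dominant = "backend" if backend_ms >= network_ms else "network"
    return {"network_ms": network_ms, "backend_ms": backend_ms, "dominant": dominant}


request_headers = with_trace_id({"Accept": "application/json"})
print(request_headers[TRACE_HEADER])

# Hypothetical timings for two requests joined by trace ID.
print(split_request_time(frontend_total_ms=900, backend_ms=200))
print(split_request_time(frontend_total_ms=900, backend_ms=750))
```

Because the backend logs the same identifier it receives in the header, the two durations can be matched per request, which is what makes the network-versus-backend comparison possible.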
The improved frontend system with observability helps reduce the O&M workload of Walnut Programming by over 30% and shortens the average time for failure locating by over 60%. It improves the user experience significantly and lays a solid foundation for sustainable business development. The Walnut Technical Team will continue to explore more cutting-edge cloud-native technologies based on their technical characteristics and the benefits of cloud computing.