Serverless Assistance in “Top War”
By Shanlie, Wang Yongmeng, and Zhang Yu
RiverGame Co. Ltd is an emerging mobile gaming enterprise. Since its establishment in 2018, it has targeted the global gaming market and earned a place in this highly competitive industry by creating engaging game experiences. After only two years, RiverGame Co. Ltd has become one of the top 30 Chinese game manufacturers, with the majority of its overseas success coming from its game, Top War. As indicated by its slogan, "better Chinese games," RiverGame Co. Ltd is enriching its game categories and hoping to bring more happiness to players worldwide.
The system scale and complexity of the game servers are undergoing major changes because of the rapid growth of the business. Fortunately, RiverGame Co. Ltd has a small (but powerful) technical team that has always explored cutting-edge technologies in its field. The team also adopts multiple methods to improve the system architecture to better support business needs and reduce IT costs.
An important task across the multiple iterations of the technical architecture has been to abstract common business capabilities in game scenarios out of the main game server and into a unified service layer. Capabilities separated from the main server include account management, Instant Messaging (IM), content security, the membership system, push notifications, and game behavior analysis. This separation reduces the business complexity of the main game server and lets it focus on supporting core game scenarios. In addition, the common capabilities can be reused across multiple game categories, which reduces R&D costs and improves R&D efficiency.
The capability splitting and the resulting decrease in business coupling have facilitated continuous iteration and pre-research on new technologies. They have also created an opportunity for RiverGame to explore the cloud-native Serverless field in depth. Serverless architecture takes full advantage of the rapid elasticity of computing resources and is an important development direction of cloud computing. In games, the main game servers run complex core business logic that requires long-term operation and low-latency data interaction among many player terminals, so virtual machines or containers are still needed for the main servers. The peripheral business scenarios separated from the Top War main server therefore became the first choice for piloting the Serverless technical architecture.
New Requirements for Online Translation
The online translation service was the first scenario piloted on Serverless, and it is closely related to the company's globalization strategy. Top War, the company's flagship title, faces the global market and attracts players worldwide. On the in-game chat interface, players from different countries discuss various game-related topics in different languages.
In this business scenario, worldwide players are brought together through a simple online translation function that offers an excellent user experience. The simple and easy-to-use design is also one of the reasons why Top War has been highlighted repeatedly by players in various major app markets.
It is impossible for RiverGame to develop a real-time translation tool that supports dozens of languages from scratch. Fortunately, communications between game players are usually brief, so the translation results do not need to be perfectly accurate; the timeliness of backend processing is the real focus. After simple preprocessing of players' requests, the translation work can be forwarded to a third-party platform, since platforms like Google Translate already provide powerful online translation capabilities.
This is a simple function, but it still faces certain challenges during the implementation of the technical architecture. The number of online players in each period is not the same, and there are peaks and valleys. When the number of online players is relatively large, there will be a large number of chats. Moreover, the number of chats is not proportional to the number of players online. When encountering some hot events, heated discussions among global players will be triggered, and there will also be an upsurge in the number of messages requiring online translation. Thus, scalable architecture is required to process players’ translation requests.
In this architecture, the main applications, written in PHP, perform a series of preprocessing steps on players' translation requests, including replacing symbol codes and filtering sensitive content. The requests are then forwarded to a third-party translation platform to obtain the translation results. This is a widely used technical architecture with high concurrent processing capabilities. In the era of cloud computing, thanks to the scalable characteristics of cloud resources, the throughput of the entire cluster is adjusted dynamically as business volume changes. However, from a cloud-native perspective, this architecture still has some imperfections when running in a large-scale production environment:
- Heavy Maintenance Workload: The maintenance process involves virtual machines, networks, SLB components, operating systems, and applications. It takes considerable effort to ensure the high availability and stability of the system. For example, when an application instance fails, how can the failure be located and the instance removed from the computing cluster as quickly as possible? A complete monitoring mechanism plus fault isolation and recovery mechanisms are needed to solve this problem.
- Delayed Scalability: Scaling cannot be triggered at the granularity of actual requests; it must rely on scheduled tasks or metric thresholds, such as CPU utilization and memory usage. When chat requests increase sharply, scaling lags behind demand. Even when optimized with technologies such as Kubernetes and reserved resource pools, it often takes several minutes to bring up a new instance.
- Low Resource Utilization: Delayed scalability leads to relatively conservative scaling policies, which reduces resource utilization and increases resource costs.
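The preprocessing step mentioned above (replacing symbol codes and masking sensitive content before forwarding to a third-party translator) can be sketched roughly as follows. The team originally ran this logic in PHP and later rewrote the service in Java, so the sketch below uses Java; the symbol-code mapping and the block list are illustrative assumptions, not RiverGame's actual rules.

```java
import java.util.List;
import java.util.Map;

// Minimal sketch of chat-message preprocessing: replace game-specific symbol
// codes with plain text and mask sensitive words before the message is
// forwarded to a third-party translation API. All mappings are hypothetical.
public class ChatPreprocessor {
    // Hypothetical symbol codes used in chat markup; the real mapping is game-specific.
    private static final Map<String, String> SYMBOL_CODES = Map.of(
            "[:smile:]", ":)",
            "[:gold:]", "gold");

    // Hypothetical block list; a production filter would be far more sophisticated.
    private static final List<String> SENSITIVE_WORDS = List.of("badword");

    public static String preprocess(String message) {
        String result = message;
        for (Map.Entry<String, String> e : SYMBOL_CODES.entrySet()) {
            result = result.replace(e.getKey(), e.getValue());
        }
        for (String word : SENSITIVE_WORDS) {
            // Mask each sensitive word with asterisks of the same length.
            result = result.replace(word, "*".repeat(word.length()));
        }
        return result;
    }
}
```

Because this step is stateless and operates on one short message at a time, it is a natural fit for a per-request function invocation.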
Advantages of the Serverless Solution Based on Alibaba Cloud Function Compute (FC)
Is there a solution that helps the technical team focus on implementing business logic while allocating resources precisely, based on real-world requests from players, to maximize resource utilization? As cloud computing develops rapidly, major cloud vendors are actively exploring new solutions that solve cost and efficiency problems in a more "cloud-native" manner. The Serverless solution based on Alibaba Cloud FC is prominent in this field.
FC is an event-driven, fully managed computing service. When using FC, developers only need to write and upload code, without managing infrastructure such as servers. FC automatically prepares computing resources and runs the business logic in a scalable and reliable manner. Meanwhile, FC provides additional features, such as log query, performance monitoring, and alerting, to ensure stable system operations.
Compared with a traditional application server, which must keep running to provide services, the biggest difference is that FC pulls up computing resources on demand to process tasks and recycles them automatically once the tasks are completed. This aligns with the Serverless concept: it maximizes resource utilization and reduces both the system maintenance workload and the usage cost. Since there is no need to apply for computing resources in advance, users do not need to worry about capacity or scalability under the pay-as-you-go model.
Implementation of Serverless in the Game Field
For simple business logic, such as online translation, it is easy to migrate from the traditional architecture to the Serverless architecture. Each translation request from a player is processed as an FC task: FC pulls up the corresponding computing resources to process the request and releases them automatically after the task is completed. Since the RiverGame team is most familiar with Java, it rewrote the online translation service in Java during the move to Serverless, making full use of the Java ecosystem. FC does not mandate any specific development language or business logic; it supports all mainstream development languages. After the Serverless transformation, the online translation system architecture became simpler.
Functions configured with HTTP triggers respond directly to requests sent by players and schedule the corresponding computing resources in a scalable and reliable way. Since task allocation in FC fully matches frontend traffic changes, the SLB layer is no longer needed and can be removed from the architecture. Long-running application clusters are no longer needed either: the FC platform can quickly pull up large amounts of computing resources to execute tasks concurrently and ensure the high availability of the entire architecture. In this process, Redis caches simple, high-frequency statements to reduce dependencies on third-party platforms. The biggest surprise brought by this architecture is that the team no longer needs to carry out capacity planning or auto scaling management, so it can focus on meeting business needs and achieving business innovation in more fields.
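The Redis caching described above follows a cache-aside pattern: look up the phrase first, and only call the third-party translator on a miss. The sketch below is a minimal, self-contained illustration in which a ConcurrentHashMap stands in for Redis, and translateRemote() is a hypothetical placeholder for the third-party call; the real function would use a Redis client instead.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Cache-aside lookup for high-frequency phrases. A ConcurrentHashMap stands in
// for Redis so the example is self-contained; translateRemote() is a stub for
// the request to the third-party translation platform.
public class CachedTranslator {
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    public String translate(String targetLang, String text) {
        String key = targetLang + ":" + text;
        // Serve cached phrases directly; translate and cache on a miss.
        return cache.computeIfAbsent(key, k -> translateRemote(targetLang, text));
    }

    // Placeholder for the third-party translation request.
    private String translateRemote(String targetLang, String text) {
        return "[" + targetLang + "] " + text;  // stub result
    }
}
```

Repeated greetings and stock phrases are thus answered from the cache, which cuts both latency and third-party API spend.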
Compared to Node.js and some other languages, Java instances take longer to initialize and load classes. FC enables computing resources to be pulled up in milliseconds through various optimizations, but it often takes a few seconds for a Java program to start, which is a problem for delay-sensitive services, such as online translation. The solution proposed by Alibaba Cloud is to use single-instance multiple-concurrency together with reserved instances for delay-sensitive businesses.
With single-instance multiple-concurrency, each FC instance can process up to 100 tasks concurrently, reducing the average execution time, costs, and the probability of a cold start. With reserved instances, FC allocates computing resources in advance based on function load changes. As a result, the system can use reserved instances to process requests while on-demand instances are scaling out, eliminating the delay caused by cold starts completely.
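The single-instance multiple-concurrency idea can be pictured as a per-instance slot pool: one instance serves many requests at once instead of one request per instance. The limit of 100 comes from the text above; the rest of this class is an illustrative local sketch, not FC's actual scheduler.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// One "instance" that handles up to MAX_CONCURRENCY requests simultaneously.
// Fewer instances are needed for the same traffic, so cold starts are rarer.
public class ConcurrentInstance {
    static final int MAX_CONCURRENCY = 100;
    private final Semaphore slots = new Semaphore(MAX_CONCURRENCY);
    private final AtomicInteger active = new AtomicInteger();
    private final AtomicInteger peak = new AtomicInteger();

    public void handle(Runnable task) {
        slots.acquireUninterruptibly();             // wait for a free in-instance slot
        try {
            int now = active.incrementAndGet();
            peak.accumulateAndGet(now, Math::max);  // record observed concurrency
            task.run();
        } finally {
            active.decrementAndGet();
            slots.release();
        }
    }

    public int peakConcurrency() { return peak.get(); }
}
```

In this local sketch, callers beyond the 100th simply block until a slot frees up; on the FC platform, sustained excess traffic would instead trigger the creation of additional instances.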
The transformed online translation service uses a Serverless architecture with on-demand computing resources, making full use of the scalability of cloud computing. In terms of cost, since the application no longer needs to run continuously to provide external services, cloud resource usage can be fully matched to real-world changes in business volume, so average resource utilization improves significantly. In terms of system throughput, FC can pull up computing resources across tens of thousands of instances in a short period and support massive concurrency during peak hours or when user requests surge, with no preliminary capacity assessment required. In terms of system maintenance, there is no need to reserve computing resources or maintain the underlying software and hardware, which reduces operational costs significantly. Thus, the RiverGame technical team can focus on the implementation of complex business logic and technological innovations. In online translation scenarios, the Serverless solution based on FC saves more than 40% of IT costs compared with traditional architectures.
Another R&D efficiency improvement comes from the version and alias management feature provided by FC. A version is a snapshot of a service, and an alias points to one or more versions, which lets users release a service through its aliases. This enables continuous integration and release across the software development lifecycle and makes grayscale iteration of services convenient.
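The grayscale release described above amounts to a weighted traffic split between two versions behind one alias. FC performs this routing on the platform side; the local simulation below only illustrates the idea, and the version names, the canary weight, and the random seed are made up for the example.

```java
import java.util.Random;

// Simulates an alias that sends a configurable fraction of traffic to a new
// (canary) version while the rest continues to hit the stable version.
public class AliasRouter {
    private final String stableVersion;
    private final String canaryVersion;
    private final double canaryWeight;   // fraction of traffic sent to the canary
    private final Random random;

    public AliasRouter(String stable, String canary, double weight, long seed) {
        this.stableVersion = stable;
        this.canaryVersion = canary;
        this.canaryWeight = weight;
        this.random = new Random(seed);  // seeded for reproducibility in this sketch
    }

    // Pick a version for one incoming request.
    public String route() {
        return random.nextDouble() < canaryWeight ? canaryVersion : stableVersion;
    }
}
```

Raising the weight gradually from a few percent to 100% promotes the canary version; dropping it back to zero rolls the release back without redeploying anything.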
In later architecture optimizations, RiverGame Co. Ltd will try to preprocess the original content as much as possible using machine learning technology to reduce dependencies on third-party platforms. In the AI inference field, the advantages of Serverless architecture can also be applied to schedule large amounts of computing resources in a short time for large-scale concurrent processing through pre-trained deep learning models.
After the successful pilot of Serverless in online translation scenarios, RiverGame continues to explore scenarios that match Serverless in more business areas. Now, Serverless technology has been introduced into fields, such as push services, content security, and game behavior analysis. In the future, RiverGame will continue to explore the Serverless architecture based on its technical characteristics, so it can enjoy the benefits of cloud computing while embracing new technologies.