Fighting Coronavirus: Freshippo Reveals 12 Key Technologies to Achieve 0 Faults over a Year (Part 1)
By Zhang Peng (nicknamed Zhangpeng)
Alibaba’s self-operated grocery chain, Hema Fresh, also known as Freshippo, has become a lifeline for many local residents in China during the coronavirus outbreak. Freshippo committed to a policy of remaining open for business, not raising prices and remaining stocked, particularly in the 18 stores across Wuhan. Additionally, Freshippo, together with food delivery chain Ele.me, teamed up with local restaurant chains to provide free meals and necessities to the hospital staff in Wuhan and emergency response teams.
Despite facing such an unprecedented scale of disruption, Freshippo was able to carry on business as usual, thanks to its robust technology. Even during the outbreak, Freshippo adapted to local shopping habits and needs, and was able to extend its digitally enabled grocery-shopping experience to more segments of consumers, all with the same mobile app. In this blog, we’ll cover 12 technologies that let Freshippo achieve 0 faults throughout an entire year.
1. Stability Is King
Freshippo has demanding stability requirements for offline operations. If a store’s point of sale (POS) machine cannot accept payment, long checkout lines upset customers. Likewise, if the deliveryman cannot collect the ordered items, hungry customers who ordered lunch will have the phones ringing off the hook and may even leave negative reviews online. Therefore, for Freshippo, safe production is crucial and stability comes first.
Freshippo’s intelligent delivery scheduling is responsible for assigning orders to deliverymen. It is the first operational link in delivery, and the most important one. If scheduling problems occur, many orders cannot be handled in a timely manner. Customers then complain and cancel their orders, leading to revenue loss. Theoretically, making no changes to the system reduces the stability risk. However, given the need for improved efficiency and reduced costs in Freshippo’s delivery, as well as the development and O&M difficulties caused by two years of patching the intelligent scheduling system, restructuring the system architecture became urgent. This put us under great pressure: we had to keep restructuring and transforming the system and support growing and expanding business requirements, all while improving operating efficiency and reducing costs.
The intelligent delivery scheduling system was completely restructured this year, migrating important businesses such as O2O, B2C, and cold chain, and supporting new requirements for mall scheduling. The restructuring supports the efficiency improvement and cost reduction requirements of pre-scheduling and real-time capacity scheduling, and supports the white-box transformation of algorithm data. On this basis, we went further and produced a strategic operation plan, a simulation transformation plan, and an intelligent troubleshooting plan for algorithm results. In practice, even while the system was developing rapidly, the intelligent scheduling system achieved zero faults throughout the year.
2. Analysis on the Intelligent Scheduling Link
Ensuring system stability starts with identifying the key links of the system, including both its external dependencies and the services it provides. Therefore, when we took over the upgrade and transformation of the intelligent scheduling system, we conducted a comprehensive analysis of the intelligent scheduling links and kept that analysis up to date as new requirements were raised. This gave us a complete picture of the system, whether for promotional periods or daily monitoring, and whether for our own O&M or for partners providing O&M backup. With a unified link diagram, we have a shared basis for discussion and can troubleshoot and solve problems when alerts occur.
Delivery O2O smart scheduling involves many elements, such as the scheduling system, pressure system, basic information, deliveryman platform, algorithm strategy, distributed computing, path planning, and document distribution. It also involves storage components and middleware such as Data Transmission Service (DTS), Diamond, Tair, job databases, and degradation databases. Given that the link is very long, we have drawn an O2O intelligent scheduling sequence diagram. By using this comprehensive diagram, system stability risks can be evaluated before major promotions and system changes in terms of product, technology, testing, and algorithm.
3. Stability Factor Analysis and Relevant Practices
With the call links of the system mapped in detail, we can review the stability of each element in the links by type.
3.1 Database Dependency
(1) Slow SQL statements
Database dependency analysis examines the stability of the databases we depend on. We first check whether the database has slow SQL statements. Most faults in the early days of Freshippo were caused by slow SQL statements. Later, centralized database management gradually brought this source of instability under control, but managing slow SQL statements is a long-term task. Whether it is pre-analyzing the SQL of new businesses or checking how the database copes with natural traffic growth, it is important to check the database on a regular basis.
(2) Large number of logical reads
Some problematic SQL statements are not slow but involve massive logical reads, for example, reading more than 100 thousand rows. As the business grows naturally, these statements tend to become slow ones. If you fail to deal with them early, they will eventually demand far more of your attention.
(3) CloudDBA properties, such as the CPU, memory, load, and QPS trends
Check whether the CPU usage of the database is normal, whether the load is high, and whether there are many QPS or TPS spikes. Before a big promotion, we found that the walle database, which intelligent delivery scheduling depends on, had a low overall water level but very high and very regular CPU spikes (reaching almost 60%). After troubleshooting, we found that the spikes were caused by a non-discrete grid job. Fortunately, the issue was raised with the DBA. If traffic had tripled on Double 11, the database would not have been able to bear it. Because the issue was discovered in time, we made an emergency release that discretized the grid tasks.
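Discretizing a job presumably means spreading the start times of its tasks across the scheduling period instead of firing them all at the same instant. The following is a minimal sketch of that idea only. The shard keys, the 60-second period, and the function name are assumptions for illustration, not details of the walle-grid job.

```python
import zlib

def start_offset(shard_key: str, period_s: int = 60) -> int:
    """Map a shard to a stable start offset (in seconds) within the period.

    crc32 is used instead of hash() so that the offset is identical
    across processes and restarts.
    """
    return zlib.crc32(shard_key.encode("utf-8")) % period_s

# Hypothetical shard keys; the article does not say how the job is partitioned.
shards = [f"station-{i}" for i in range(200)]
offsets = [start_offset(s) for s in shards]

# Undiscretized, all 200 shards would query the database in the same second.
# Discretized, the busiest second carries only a fraction of them.
busiest = max(offsets.count(sec) for sec in range(60))
print(f"busiest second runs {busiest} of {len(shards)} shards")
```

The total work is unchanged. Only the peak drops, which is the part that matters when the concern is a regular CPU spike.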
(4) Non-isolated databases
When databases are not isolated, core and non-core data often sit together. High QPS or heavy logical reads from non-core workloads then affect the stability of the entire database. As a solution, the delivery database was split into core, secondary core, non-core, and archive databases to layer business priorities.
Cold and hot data are not isolated. A report generated by an infrequently run, ultra-slow SQL statement uses the core database. This has occurred many times in the history of Freshippo. For example, statistics queries and report exports often use the core database.
Reads and writes are not separated. The aforementioned report, front-end display, and dashboard requirements are read workloads, whereas the core of offline operations is writing to the database. Problems therefore occur when reads and writes are not isolated.
(5) Database degradation
In the smart scheduling system, we depend on the receipts in the core job database. Core jobs have a higher stability requirement than smart scheduling. To protect core jobs, we configured a degradation policy for the database, which uses Data Replication System to synchronize a copy of the data to a read database. Database degradation limits the impact that database instability has on the services that depend on the database.
(6) Inconsistent sizes and types of the same business field upstream and downstream
If the upstream and downstream databases use the same business field but with different capacities, extreme scenarios can arise in which data cannot be inserted. For example, the delivery note field in delivery is a combination field that contains all delivery notes associated with a batch. Normally, one batch is associated with 1 to 3 delivery notes. During a big promotion, 7 were associated for convenience. The resulting string exceeded the field length of 64 characters, so the association could not be written. We fixed the problem by extending the field length to 256 characters.
In another case, the product weight field in delivery is of the int type and is measured in milligrams. During a major promotion, a warehouse employee entered the weight of one bottle of beverage as 2 tons. (At that time, no business scenario used the product weight; it was effectively a reserved field.) The delivery sub-order table then stored sub-orders containing multiple bottles of that beverage, the total weight overflowed the int field, and the delivery system failed to create a waybill.
The same business field must be consistent upstream and downstream. If escaping is performed on a field, you need to develop field truncation, alerting, and emergency plans so that you can resolve exceptions quickly while protecting your own system.
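The weight incident can be checked by hand. Assuming the int column is a signed 32-bit integer (the article says only "int"), its maximum is 2,147,483,647. Two metric tons is 2,000,000,000 mg, so one bottle still fits, but the total for two bottles does not. A sketch of a guard that catches this before the write, with hypothetical function names:

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1           # 2,147,483,647; assumes a signed 32-bit column
MG_PER_TON = 1_000_000_000      # 1 metric ton = 1,000 kg = 10^9 mg

def fits_int32(value: int) -> bool:
    return INT32_MIN <= value <= INT32_MAX

def total_weight_mg(unit_weight_mg: int, quantity: int) -> int:
    """Return the sub-order weight, rejecting totals the column cannot store."""
    total = unit_weight_mg * quantity
    if not fits_int32(total):
        # In a real system this is where alerting and the emergency plan kick in.
        raise ValueError(f"total weight {total} mg does not fit the int column")
    return total

bottle_mg = 2 * MG_PER_TON      # the mistyped weight: 2 tons for one bottle
print(fits_int32(bottle_mg))    # one bottle is still within range
```

Calling `total_weight_mg(bottle_mg, 2)` raises `ValueError`, which surfaces the bad master data at the boundary instead of at waybill creation.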
(7) Database capacity, indexing, and so on
Following DBA recommendations, keep the number of rows in a single table below 5 million, and delete unused indexes to avoid any impact on write performance.
(8) Results from database changes
Modifying an index, changing the type of a field, or expanding a field may lock the table or even affect primary-secondary synchronization. If business reports depend on the secondary database, they become unavailable. In one case, no stop time was set for a schema change that started in the early morning, so the change kept running into the morning and affected the business. In another, a data change updated gmt_modified in batches. Every index entry that maps to this field then had to be re-indexed, which degraded database performance and delayed everything that depends on Data Replication System.
To assess the impact of a database change, ask the DBA to clarify any details you are unsure about, and consult experienced colleagues before making a decision.
(9) The DB table is not in utf8mb4 format
If you store text containing emoji in a table that does not use utf8mb4, querying the emoji returns no results.
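The underlying issue is byte width. MySQL's legacy utf8 character set stores at most 3 bytes per character, whereas most emoji take 4 bytes in UTF-8, which utf8mb4 accommodates. A quick way to check whether a piece of text needs a 4-byte-capable column (the helper name is illustrative):

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any character takes 4 bytes in UTF-8 (outside the BMP)."""
    return any(len(ch.encode("utf-8")) == 4 for ch in text)

print(needs_utf8mb4("leave at the door 😀"))  # contains a 4-byte character
print(needs_utf8mb4("配送"))                   # CJK characters take 3 bytes each
```

A check like this can sit in front of free-text fields such as delivery remarks while older tables are being migrated.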
3.2 HSF Dependency
(1) HSF service timeout
The HSF timeout period cannot be too long, especially for APIs with high QPS and high availability requirements. A long HSF timeout period lets slow calls fill up the HSF thread pool. During data collection by the smart scheduling algorithms, simultaneous access to ADB under high QPS was prone to jitters. With the HSF service timeout left at the default of 3 seconds, the HSF thread pool filled up and the data collection function became unavailable in the pre-release environment.
(2) The retry mechanism upon HSF timeout
Retries help you ride out HSF timeouts caused by network jitters. A relatively short timeout period combined with a retry mechanism is more reliable than the default timeout period. With the default 3-second timeout and no retries, about 25 out of 5 million requests per day failed. After we switched to a 500-ms timeout with up to 2 retries, the failure rate dropped to about one failure per week.
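The short-timeout-plus-retry pattern can be sketched independently of HSF. The wrapper below is an illustration, not HSF configuration: `call` stands for any remote invocation that accepts a timeout and raises `TimeoutError` when it expires, and both the name and the signature are assumptions.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

def call_with_retry(call: Callable[[float], T],
                    timeout_s: float = 0.5,
                    retries: int = 2) -> T:
    """Invoke call(timeout_s); on TimeoutError, retry up to `retries` more times."""
    if retries < 0:
        raise ValueError("retries must be >= 0")
    last_error = None
    for _attempt in range(1 + retries):
        try:
            return call(timeout_s)
        except TimeoutError as err:   # only timeouts are retried
            last_error = err
    raise last_error

# Simulated jitter: the first two attempts time out, the third succeeds.
attempts = {"n": 0}

def flaky(timeout_s: float) -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("network jitter")
    return "ok"

print(call_with_retry(flaky))
```

Three attempts at 500 ms add up to at most 1.5 seconds of waiting, half of a single default 3-second timeout, so a thread is released sooner even in the worst case. Retrying is only safe when the called interface is idempotent.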
(3) Service caching
Pre-caching is required when accessing interfaces whose data is relatively stable and whose calls take a long time. It reduces access dependencies while ensuring stability. Interfaces that you depend on heavily should use post-caching wherever applicable: every time an access request succeeds, the system updates the cache, and when an access attempt fails, the system falls back to the cached data.
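Post-caching as described here fits in a small wrapper. This is a sketch under assumptions: the class name is made up, the cache is an in-process dictionary (a shared store such as Tair could play the same role), and no expiry is applied to the last good value.

```python
from typing import Any, Callable, Dict, Hashable

class PostCache:
    """Call the remote service first; serve the last good value if it fails."""

    def __init__(self, fetch: Callable[[Hashable], Any]) -> None:
        self._fetch = fetch
        self._last_good: Dict[Hashable, Any] = {}

    def get(self, key: Hashable) -> Any:
        try:
            value = self._fetch(key)
        except Exception:
            if key in self._last_good:
                return self._last_good[key]   # degraded: possibly stale data
            raise                             # nothing cached yet, surface the error
        self._last_good[key] = value          # refresh on every success
        return value

# Simulated dependency that can be switched off.
state = {"up": True}

def remote(key: str) -> str:
    if not state["up"]:
        raise ConnectionError("service unavailable")
    return f"profile-of-{key}"

cache = PostCache(remote)
cache.get("courier-1")        # success populates the cache
state["up"] = False
print(cache.get("courier-1")) # served from the cache while the service is down
```

Unlike pre-caching, this does not reduce the number of remote calls. It only keeps the caller working when the dependency fails.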
(4) Service degradation
For interfaces that you depend on heavily, set up a degradation mechanism for when the service is unavailable, such as degrading to the cache described above, to other interfaces, to Diamond, or to memory. Make sure the service link stays usable, and pair degradation with interface alerting so that service dependency issues are identified in a timely manner.
(5) Service isolation
Core applications cannot depend on non-core services. Similarly, non-core applications cannot depend on core applications. Either kind of dependency may affect core services. For example, both delivery warehouse integrated scheduling and intelligent O2O scheduling use the walle-grid computing service. However, integrated scheduling is less important and only affects the scheduling result, whereas the O2O service is more important and affects assignment. We isolate the services by version number. If necessary, services can also be isolated by group rules or custom routing rules.
(6) Traffic prediction and stress testing
If a new feature adds a dependency or is expected to bring a large amount of new traffic, estimate the traffic and stress test the affected interface accordingly. For example, enabling pre-scheduling was estimated to increase the number of batches seen by the similarity scoring service for delivery batch groups by 0.5 times, that is, to 1.5 times the original. Because the service scores batches against each other (a Cartesian product), calls grow with the square of the batch count: 1.5 × 1.5 = 2.25. We therefore estimated that full pre-scheduling could increase traffic by about 3 times. The current system could support these traffic peaks without adding servers, and we verified that the system worked fine after pre-scheduling was enabled.
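The 2.25 figure can be reproduced in a few lines, assuming the scoring service is called once per unordered pair of batches. The baseline of 1,000 batches is an arbitrary illustration, not a Freshippo number.

```python
def pair_calls(n: int) -> int:
    """Number of unordered pairs among n batches: n*(n-1)/2."""
    return n * (n - 1) // 2

def growth_factor(n_before: int, batch_growth: float) -> float:
    """How much pairwise call volume grows when the batch count grows."""
    if n_before < 2:
        raise ValueError("need at least 2 batches to form a pair")
    n_after = round(n_before * batch_growth)
    return pair_calls(n_after) / pair_calls(n_before)

# 50% more batches means about 1.5^2 = 2.25 times as many pairwise calls.
print(growth_factor(1000, 1.5))
```

For large batch counts the ratio approaches the square of the growth in batches, which is why a moderate increase in batches calls for a much larger stress-test target.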
3.3 HSF Service Provision
(1) Service timeout
A service provider must commit to a controllable timeout period to prevent the thread pool from filling up. The committed timeout period can be derived from a system stress test, or we can start with a reasonably reliable value based on EagleEye statistics and keep optimizing it. Whenever possible, avoid the default 3-second timeout period in areas with high stability requirements.
(2) Traffic throttling
In addition to setting a relatively reliable timeout period, you also need to limit the traffic that services can provide. For core services, you must add Sentinel for throttling. Sentinel can perform throttling for each server with a granularity of the QPS value or the number of HSF threads. Generally, traffic is throttled based on the QPS, and microservices can be throttled based on the number of threads.
Intelligent scheduling calls the deliveryman service. Under normal circumstances, the deliveryman service performs very well, and it should easily handle simple unique key (UK) database queries. However, during a major promotion, a primary-secondary database switchover occurred (the database was not on a dedicated instance at that time, and was affected by other databases), leaving the database unavailable for a long time. High QPS traffic then flowed in and jammed the application thread pool of the deliveryman service. Throttling intercepted the excess traffic, and the upstream scheduling system kept responding normally with the help of post-caching. Without throttling, the jammed thread pool of the deliveryman service would have made the entire system unavailable.
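To make QPS-based throttling concrete, here is a minimal fixed-window limiter. It is a teaching sketch, not Sentinel's API or algorithm: the class name and the injectable clock are assumptions, and it is not thread-safe.

```python
import time
from typing import Callable, Optional

class QpsLimiter:
    """Allow at most `max_qps` calls in each one-second window."""

    def __init__(self, max_qps: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_qps = max_qps
        self._clock = clock
        self._window: Optional[int] = None
        self._count = 0

    def try_acquire(self) -> bool:
        window = int(self._clock())       # index of the current one-second window
        if window != self._window:        # a new window resets the counter
            self._window, self._count = window, 0
        if self._count >= self._max_qps:
            return False                  # reject quickly instead of queueing
        self._count += 1
        return True

limiter = QpsLimiter(max_qps=2)
print([limiter.try_acquire() for _ in range(3)])
```

A rejected request returns immediately, so it never occupies a worker thread. That is what keeps a slow database from jamming the thread pool, and the caller can then fall back to its post-cache.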
(3) Retries and idempotence
The upstream service performs retries, and the downstream service ensures idempotence. Idempotence can be as simple as deduplicating a single ticket, or it can involve more complex logic, such as for batch request interfaces. The idempotence logic and the value returned for a repeated request must be agreed upon with the upstream service. A repeated request needs to return success, and the server must handle exceptions by itself.
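A minimal sketch of the simple, single-ticket case follows. The service, its method, and the caller-supplied request ID are hypothetical, and a real implementation would typically enforce the check with a unique key in the database rather than an in-memory dictionary.

```python
from typing import Dict

class AssignmentService:
    """Hypothetical downstream service that deduplicates by request ID."""

    def __init__(self) -> None:
        self._results: Dict[str, dict] = {}
        self.executions = 0               # counts how often the side effect ran

    def assign(self, request_id: str, order_id: str, courier_id: str) -> dict:
        if request_id in self._results:
            # A retried request returns the original success, not an error.
            return self._results[request_id]
        self.executions += 1              # the side effect happens exactly once
        result = {"status": "success", "order_id": order_id, "courier_id": courier_id}
        self._results[request_id] = result
        return result

service = AssignmentService()
first = service.assign("req-1", "order-9", "courier-3")
retry = service.assign("req-1", "order-9", "courier-3")   # upstream retry after a timeout
print(first == retry, service.executions)
```

Returning the stored success on a repeat is what lets the upstream retry blindly after a timeout without knowing whether the first attempt went through.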
(4) Service caching
Service providers use pre-caching to increase the traffic the system can support. This method applies to services with relatively stable return values. Whether to use pre-caching can be evaluated based on the cache hit rate, although a low hit rate is acceptable in some cases if it helps support high QPS traffic. For example, the delivery scoring service is a Cartesian product service: n tickets result in n*(n-1)/2 calls. Assuming the service is called every 20 seconds, the caching period can be set to 1 minute. Each result is then computed once and reused for the next two calls, so the theoretical hit rate is only 67%, but that is enough to double the traffic the service can support, which effectively reduces the number of servers required.
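The 67% hit rate follows directly from the two intervals. A small simulation of a single cached key, assuming calls arrive exactly every 20 seconds and an entry stops being valid once its 60-second period has elapsed:

```python
def simulate_hit_rate(call_interval_s: int, ttl_s: int, calls: int) -> float:
    """Fraction of calls answered from the cache for one repeatedly requested key."""
    expires_at = None
    hits = 0
    for i in range(calls):
        now = i * call_interval_s
        if expires_at is not None and now < expires_at:
            hits += 1                    # still fresh: served from the cache
        else:
            expires_at = now + ttl_s     # miss: compute the score and cache it
    return hits / calls

print(simulate_hit_rate(call_interval_s=20, ttl_s=60, calls=300))
```

Out of every three calls, the first is a miss and the next two are hits, so only one call in three reaches the scoring logic.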
Service post-caching is usually used for service degradation logic. The core service that returns the maximum number of collected orders has three levels of backup logic (the deliveryman value, the city value, and the memory value) to ensure high availability.
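The three-level backup can be expressed as an ordered chain of lookups that ends in a value held in memory. The sketch below assumes each level is a function that may fail or return nothing. The function names and the numbers are illustrative only.

```python
from typing import Callable, Optional, Sequence

def first_available(sources: Sequence[Callable[[], Optional[int]]],
                    default: int) -> int:
    """Try each source in priority order; fall back to the in-memory default."""
    for source in sources:
        try:
            value = source()
        except Exception:
            continue                  # this level is unavailable, try the next one
        if value is not None:
            return value
    return default

# Hypothetical lookups for the maximum number of orders a courier may collect.
def courier_level() -> Optional[int]:
    raise ConnectionError("courier configuration unavailable")

def city_level() -> Optional[int]:
    return 8

MEMORY_DEFAULT = 5                    # last resort, always available

print(first_available([courier_level, city_level], default=MEMORY_DEFAULT))
```

Because the last level never leaves the process, the service can always answer, only with a coarser value as each level degrades.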
(5) Service isolation: Same as above
(6) Traffic estimation and stress testing: Same as above
Continue reading Part 2 of this blog series to find out how Freshippo managed to achieve 0 faults over a year through technological innovation.
While continuing to wage war against the worldwide outbreak, Alibaba Cloud will play its part and will do all it can to help others in their battles with the coronavirus. Learn how we can support your business continuity at https://www.alibabacloud.com/campaign/supports-your-business-anytime