Safeguarding Double 11 with MTEE3 Intelligent Risk Control Platform
By Ran Guang (Huameng)
“It was an incredible experience! If the 48% failure did happen on Double 11, it would have been a disaster for the whole Security Department.” Speaking swiftly, Zhiming beamed with excitement, like a developer who had just fixed a major bug.
What is MTEE3? Why is that 48% failure so critical?
Zhiming, a senior expert in business security product technology at Alibaba Security and the technical lead for the intelligent risk control platform, MTEE3, narrates the complete story exclusively for this article.
MTEE3: High-Performance and Intelligence
MTEE3 is an intelligent risk control platform for business security. The numeral “3” denotes version 3.0, the latest release. It provides a range of protection services, including account security, anti-scalping, anti-fraud for marketing campaigns, content security, and CAPTCHA, to safeguard the core businesses of the Alibaba Group ecosystem. During the 2017 Double 11 Shopping Festival, MTEE3 ran over 30 billion business risk scans with a peak rate of 2 million scans per second. These figures illustrate the scale at which MTEE3 operates.
To begin with, Zhiming describes the concept of business security and the rapidly growing concerns regarding the various security aspects.
Put simply, Zhiming defines MTEE3 as “a platform that protects the security of businesses.” Traditional security threats such as account takeovers and fake accounts (accounts registered in large batches by programs) disrupt a website’s operations and degrade its overall performance. Cybercriminals use such accounts to grab coupons or to profit by exploiting loopholes.
Zhiming explains that at Alibaba, “the tool that defends against the ‘wool-pulling party’ (羊毛党) is called marketing anti-fraud. There are many hot-sale products on our marketplaces such as alcohol and mobile phones, and therefore we need to defend against scalping. We also offer defensive tools such as content protection and captcha that easily identify automated behaviors. All these protection measures work above the network layer, and hence are called business security.”
Alibaba’s business security service relies on real-time analysis and modeling technology based on big data to deliver quick and effective protection. It uses technologies such as a rule engine, a model engine, relational networks, group analysis, device profiling, semantic analysis, and machine vision to compute a large number of metrics for each user behavior in real time.
MTEE3 is the platform that meets all these computing requirements and supports the related technologies. A large number of rules and models are deployed on MTEE3 to protect multiple businesses in the Alibaba ecosystem. Each user behavior is called an ‘event’; examples include user registration, user login, basic information modification, chats, order placement, payment, shipment, product delivery, and review. “At Alibaba, we conduct the protection and control of every event,” said Zhiming. This end-to-end defense capability allows MTEE3 to recognize malicious accounts with ease. With millisecond-level responsiveness, MTEE3 completes risk scanning during the order placement step within 10 milliseconds, which is almost imperceptible to customers.
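The idea of scanning each event against the rules deployed for its type can be sketched in a few lines of Python. This is a minimal illustration only: the `Event` class, the two rules, their thresholds, and the `scan` function are hypothetical names invented here, not MTEE3's actual API, and the 10 ms budget is taken from the figure quoted above.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Event:
    """One user behavior, e.g. a login or an order placement (hypothetical shape)."""
    kind: str
    account_id: str
    attrs: Dict[str, float] = field(default_factory=dict)


# Hypothetical rules; the attribute names and thresholds are made up for illustration.
def too_many_orders(event: Event) -> bool:
    return event.attrs.get("orders_last_minute", 0) > 20


def new_device_large_order(event: Event) -> bool:
    return (event.attrs.get("device_age_days", 365) < 1
            and event.attrs.get("order_amount", 0) > 5000)


# Rules are registered per event type, so each event only runs the rules that apply to it.
RULES: Dict[str, List[Callable[[Event], bool]]] = {
    "order_placement": [too_many_orders, new_device_large_order],
}


def scan(event: Event, budget_ms: float = 10.0) -> dict:
    """Run every rule for this event type and report whether the scan met its time budget."""
    start = time.perf_counter()
    hits = [rule.__name__ for rule in RULES.get(event.kind, []) if rule(event)]
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return {
        "decision": "reject" if hits else "pass",
        "hits": hits,
        "within_budget": elapsed_ms <= budget_ms,
    }
```

For example, `scan(Event("order_placement", "a1", {"orders_last_minute": 50}))` would be rejected by the `too_many_orders` rule, while an event type with no registered rules simply passes.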
MTEE3 is not just powerful but highly intelligent as well. Zhiming explains this in detail.
MTEE3 analyzes a set of variables (metrics) to determine whether an account is a normal user, a machine, or a scalper. These variables have multiple dimensions including account number, device, environment, content, and user behavior.
“MTEE3 completes real-time computation and analysis of the information in a very short period of time,” said Zhiming. Alibaba Security engineers designed MTEE3 to work in a compute-while-storing mode. Instead of first storing all the data in databases and then querying it, MTEE3 performs computation directly on information streams. It computes the results, returns them to the transaction, and stores them as well. Zhiming said, “MTEE3 has the stream computing capability.”
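A rough way to picture the compute-while-storing mode is a metric that is updated in memory as each event arrives, so the answer never depends on a database query. The sketch below assumes a sliding-window count per account; the `StreamingCounter` class, its 60-second window, and the in-memory `store` list (a stand-in for real persistence) are all illustrative assumptions, not MTEE3 internals.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Tuple


class StreamingCounter:
    """Counts events per account over a sliding time window, updated as events arrive."""

    def __init__(self, window_seconds: float = 60.0) -> None:
        self.window = window_seconds
        self.times: Dict[str, Deque[float]] = defaultdict(deque)
        # Stand-in for a persistent store; a real system would write elsewhere.
        self.store: List[Tuple[str, float, int]] = []

    def on_event(self, account_id: str, ts: float) -> int:
        q = self.times[account_id]
        q.append(ts)
        # Drop timestamps that have fallen out of the window.
        while q and q[0] <= ts - self.window:
            q.popleft()
        count = len(q)  # computed from in-memory state, no database read
        self.store.append((account_id, ts, count))  # persisted alongside the computation
        return count  # handed back to the calling transaction
```

The design point is that the read path never waits on storage: the count comes from the in-memory window, and the write to the store only records what was already computed.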
While rule-based and model-based security have been used for Double 11 every year, the Security Strategy Center team made a breakthrough by launching Decision Balance, which uses machine learning algorithms to support intelligence-based decision-making. Its debut at Double 11 was a success. Based on multiple factors including risk control, user experience, and business considerations, Decision Balance uses a global optimization algorithm to find the optimal solution. It employs a reinforcement learning algorithm to adjust the optimal solution as the risk distribution changes and produces the risk disposal decision for the next moment. Furthermore, with real-time computing capability, the system automatically executes the decision and updates decision plans in seconds. “With Decision Balance as the prototype, we are building the next generation of risk control models,” said Zhiming.
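The article does not describe Decision Balance's objective function or its algorithms, so the following is only a toy illustration of the trade-off it is said to balance: blocking too little lets fraud through, while blocking too much hurts legitimate users. It uses a plain exhaustive search over a risk-score threshold, not reinforcement learning, and the function name, cost weights, and sample data are all assumptions made for this sketch.

```python
from typing import List, Tuple


def best_threshold(
    scored: List[Tuple[float, bool]],
    miss_cost: float = 10.0,      # assumed cost of letting one fraudulent event through
    friction_cost: float = 1.0,   # assumed cost of blocking one legitimate event
) -> Tuple[float, float]:
    """Return (threshold, total_cost) minimizing cost; events with score >= threshold are blocked."""
    # Candidate thresholds: every observed score, plus one above all scores (block nothing).
    candidates = sorted({score for score, _ in scored} | {1.01})
    best = None
    for t in candidates:
        cost = 0.0
        for score, is_fraud in scored:
            blocked = score >= t
            if is_fraud and not blocked:
                cost += miss_cost
            elif not is_fraud and blocked:
                cost += friction_cost
        # Keep the first (lowest) threshold that achieves the minimum cost.
        if best is None or cost < best[1]:
            best = (t, cost)
    return best
```

Re-running such a search on a fresh window of labeled events would move the threshold as the risk distribution shifts, which is the general kind of adjustment the article attributes to Decision Balance's learning component.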
Flashback of 2017 Double 11
For Zhiming and his team, the achievements of the 2017 Double 11 were quite impressive.
Performance was the first challenge to surmount, and it would not have seemed very difficult if they could simply have added more servers. However, things were not so straightforward: Zhiming and his team had to handle a Double 11 transaction peak twice the size of the previous year’s with only a slight increase in resources.
Alibaba Security engineers virtually redeveloped the computing engine to double its performance. To improve overall efficiency, they optimized the policy system deployment so that the computing engine could coordinate with other security layers (the network layer, for instance) in real time.
Additionally, the Security Policy Center team and the Product Technology team worked together to reconstruct the policy system for building a layered and systematic policy framework. They eliminated policy silos and combined rules with the machine learning model to build a brand-new defense perimeter with increased risk coverage and accuracy.
Incentive plans were subject to change even in the last two days before Double 11, and strategies, models, and rules would change accordingly. Moreover, the team had no idea where attackers would emerge. Although these uncertainties put the Alibaba Security team under tremendous pressure, Zhiming and his team managed to propose a solution. “This year we are going to tolerate changes because of these uncertainties, in particular in the computing engine. In the event of a policy change, we want to ensure that the system’s performance holds steady and that resource consumption does not grow linearly,” Zhiming said. The MTEE3 team worked hard to reconstruct the rule engine and the model engine, completely redeveloping the rule engine. After the transformation, the performance of MTEE3 increased manyfold.
“Double 11 was an important milestone for this project, but we did not stop there. We were also upgrading for the reconstruction of policies. We implemented the upgrade while the computing engine was running. This was as challenging as changing the engine of an airplane during the flight,” Zhiming said.
MTEE3 was launched in March 2017. However, it was not used during the June 18 promotion and instead debuted at the September 9 Wine Festival. And then came Double 11.
Surprisingly, MTEE3 received its last requirements change on November 8, a point at which no new requirement changes should have been accepted. However, after an extensive assessment, the leaders of the relevant teams decided that this change had to be implemented.
As a result, at 22:00 hours of November 9, Zhiming and his teammates were still busy testing MTEE3. It was not until 07:00 hours the next day that all function points were verified after several rounds of testing.
Everything was ready. However, at 00:00 hours of November 10, a security policy engineer discovered a major problem: the security policies failed to intercept 48% of threats in the order placement scenario. Worse, the security engineers were not sure whether all of the policies were wrong or just one. It was less than 24 hours before the 2017 Double 11. Zhiming said, “We were planning a day off before the shopping festival, but we had no choice but to wake everyone up for troubleshooting. We worked frantically until 03:00 hours and were eventually able to confirm that it was a false alarm. It was really a heart-stopping experience.”
MTEE3 protects funds worth hundreds of millions. If the 48% failure had happened during Double 11, the result would have been catastrophic. Zhiming states, “This year was different, as we had more pressure during the preparation stage, especially with the dramatic 48% failure in mind. If the failure did happen on Double 11, it would have been a disaster for the whole Security Department.”
Into the evening of November 10, Zhiming was still discussing issues about defense target groups with the Policy Center team, and all policies were ultimately finalized at 20:00 hours.
By 00:00 hours of November 11, the engineers in charge of MTEE3 felt rather relaxed. Zhiming recollects, “Last year, we stayed up for 38 hours, with cross-border activities included. This year, most of our teammates returned home just after 02:00 hours and had a good night’s sleep.”