Alibaba Cloud

Nov 30, 2017

Protecting Websites through Semantics-Based Malware Detection

Abstract: Malware detection is a fundamental feature of web security and serves as the first line of defense for most websites. For the past decade, malware detection using rule-based detection engines has dominated the market. However, this approach is vulnerable to modern web attack techniques, such as virus obfuscation. To cope with these attacks, semantics-based malware detection engines have become increasingly popular over the past two years.

Malware detection is a fundamental feature of web security and serves as the first line of defense for most websites. For the past decade, malware detection using rule-based detection engines has dominated the market. To date, most web application firewalls have been designed to detect attacks based on rules. In this design, a detection engine runs each session through a series of security tests by comparing it against a number of rules. Requests that match the rules are denied.

However, this approach is vulnerable to modern web attack techniques, such as virus obfuscation. To cope with these attacks, semantics-based malware detection engines have become increasingly popular over the past two years. They offer more accurate, intelligent, and human-like detection, and are capable of detecting a threat regardless of how it is presented. Alibaba Cloud has recently released a web application firewall (WAF) equipped with a semantically intelligent detection engine.

A rule-based WAF can effectively prevent known security issues. However, to achieve this, security O&M personnel must first understand all the features of an attack and then develop rules according to these features. A rule-based WAF therefore requires a powerful rule library, and that library must be updated constantly to deal with the latest attacks.

Although a rule-based WAF sounds quite simple, its weakness lies in its lack of flexibility. For security O&M personnel, the number and complexity of rules can be overwhelming, which makes the rule library difficult to maintain. Moreover, O&M personnel often cannot explain what some rules mean or why they were written the way they were. For example, look at the following rule:

It is challenging to decipher the rule shown in the figure above without any context.

In addition, maintaining a large rule library is an enormous challenge for O&M personnel. This is because the descriptive power of WAF rules is limited, so some attack methods and scenarios may not be fully covered by the rules.

Let us examine the shortcomings of rule-based detection. Suppose that the string “The greatest truth dwell in ultimate simplicity!” is an attack request. After the WAF blocks this request once, it adds a rule: any request that contains the strings “ultimate,” “simplicity,” and “!” is an attack. However, if a normal request later arrives with the string “Simplicity is the ultimate sophistication!”, the WAF will categorize it as an attack as well, producing a false alarm.

In addition, attackers may obfuscate attacks to achieve their goals. It is impossible for a rule to cover every single form of an attack. In the example above, if the attacker obfuscates the attack into “The greatest truth lies in plainness”, the WAF will not be able to detect it, and the attack goes unreported (a false negative).
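The two failure modes above can be reproduced with a few lines of Python. This is only a minimal sketch of the analogy: the `RULE_KEYWORDS` tuple and the `rule_matches` function are hypothetical names introduced for illustration, not part of any real WAF.

```python
# Hypothetical keyword rule derived from one blocked request: a request is
# flagged when it contains every one of these strings.
RULE_KEYWORDS = ("ultimate", "simplicity", "!")


def rule_matches(request: str) -> bool:
    """Return True when all rule keywords occur in the request (case-insensitive)."""
    text = request.lower()
    return all(keyword in text for keyword in RULE_KEYWORDS)


original_attack = "The greatest truth dwell in ultimate simplicity!"
benign_request = "Simplicity is the ultimate sophistication!"
obfuscated_attack = "The greatest truth lies in plainness"

for label, request in [
    ("original attack", original_attack),
    ("benign request", benign_request),
    ("obfuscated attack", obfuscated_attack),
]:
    print(f"{label}: blocked={rule_matches(request)}")
```

The benign request is blocked (a false alarm) because it happens to contain all three strings, while the obfuscated attack passes because it contains none of them.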

Many attacks expose the defects and shortcomings of regular expressions. Take a simple web SQL injection attack request, for example:

' union select version() from dual

This web SQL injection attack request reads the database version information. A regular expression to describe this attack may be written like this:


The role of “\s+” is to match one or more whitespace characters, such as spaces and line breaks. Attackers familiar with SQL statements can bypass this regular expression check using database features, for example by replacing the space with a comment such as “/*11*/” or “--%0”. To cope with this bypass, the expression can be improved by writing:


The updated regular expression is clearly much more complicated, and its logic is not immediately obvious. More importantly, it is still possible to bypass such a protection rule.
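The original expressions are not reproduced here, so the following is only a sketch of the same progression. `NAIVE_RULE` and `PATCHED_RULE` are hypothetical patterns written for this illustration: the first separates keywords with `\s+` only, and the second also accepts `/*...*/` comments as separators.

```python
import re

# Hypothetical first-generation rule: keywords separated by whitespace only.
NAIVE_RULE = re.compile(
    r"union\s+select\s+version\(\)\s+from\s+dual", re.IGNORECASE
)

# Hypothetical patched rule: a separator may be whitespace or a /*...*/ comment.
SEPARATOR = r"(?:\s|/\*.*?\*/)+"
PATCHED_RULE = re.compile(
    SEPARATOR.join(["union", "select", r"version\(\)", "from", "dual"]),
    re.IGNORECASE | re.DOTALL,
)

plain_payload = "' union select version() from dual"
commented_payload = "' union/*11*/select/*11*/version()/*11*/from/*11*/dual"

for name, rule in [("naive", NAIVE_RULE), ("patched", PATCHED_RULE)]:
    for payload in (plain_payload, commented_payload):
        print(f"{name:8} blocked={bool(rule.search(payload))}  {payload}")
```

The naive pattern misses the commented payload, and the patched pattern catches it only at the cost of a separator expression that is already harder to read than the attack it describes.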

Simple fuzzing reveals further “features” of MySQL that can be used to bypass the security rule. For example, the normal structure of a MySQL function call is “function_name()”, but other syntax features are also supported. The forms function_name/*111111*/(), function_name(), function_name`(), function_name--%0a(), function_name/**/(), and function_name/*111*/--11%0() are all equivalent.

Therefore, the above regular expression rules need to be upgraded:


This simple example reveals that using regular expressions to describe web SQL injection attacks has many flaws. Typically, the discovery of a new “feature” leads to the update of more than 100 rules, which is a difficult process for operations personnel. This is why engines that use regular expressions for detection are at risk of being bypassed.
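To make the maintenance burden concrete, the sketch below checks a few function-call variants against two hypothetical patterns. `RULE_BEFORE` expects the literal `version()`, while `RULE_AFTER` also tolerates whitespace or a `/*...*/` comment between the function name and its parentheses. Both patterns and the payload list are assumptions made for this illustration.

```python
import re

SEP = r"(?:\s|/\*.*?\*/)+"  # one or more spaces or /*...*/ comments
OPT = r"(?:\s|/\*.*?\*/)*"  # the same separators, but optional
FLAGS = re.IGNORECASE | re.DOTALL

# Hypothetical rule before the upgrade: the call must be written "version()".
RULE_BEFORE = re.compile(
    SEP.join(["union", "select", r"version\(\)", "from", "dual"]), FLAGS
)
# Hypothetical rule after the upgrade: separators may precede the parentheses.
RULE_AFTER = re.compile(
    SEP.join(["union", "select", "version" + OPT + r"\(\)", "from", "dual"]), FLAGS
)

variants = [
    "' union select version() from dual",
    "' union select version/**/() from dual",
    "' union select version/*111111*/() from dual",
    "' union select version () from dual",
]

for payload in variants:
    before = bool(RULE_BEFORE.search(payload))
    after = bool(RULE_AFTER.search(payload))
    print(f"before={before!s:5} after={after!s:5} {payload}")
```

Even the upgraded pattern still misses a call written with a `--` comment followed by a line break, so each newly discovered syntax feature means another round of patching across every rule that mentions a function call.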

Rule-based detection cannot effectively guard against unknown threats, such as attack variants and zero-day vulnerabilities. In addition, for corporate security O&M personnel, regular expression engines carry high maintenance pressure and costs. Ideally, a detection engine should be able to cope with ever-changing attacks with minimal manual maintenance.

To cope with these attacks, Alibaba Cloud has developed an intelligent detection engine capable of understanding the semantics of attacks.

The unique power of the semantics-based detection engine lies in the fact that it takes semantics, sequences, and situations into account, as natural language does. It identifies a feature as an attack in one situation and sequence, but not in all situations or sequences. Let’s go back to the example of the attack string “The greatest truth dwell in ultimate simplicity!”

Rule-based engine:

This method identifies attacks by checking for the strings “ultimate,” “simplicity,” and “!” in a request, with a high false alarm rate.

Intelligent semantics-based detection engine:

1. “The greatest truth dwell in ultimate simplicity!” is an attack.
2. “Simplicity is the ultimate sophistication!” is not an attack because its semantics do not match those of the attack.
3. The engine is able to identify, through semantics, that “The greatest truth lies in plainness” is a variant of the attack.

This example shows that semantics-based detection can anticipate multiple forms of an attack. An intelligent semantics-based detection engine is particularly effective at preventing unknown threats such as zero-day vulnerabilities.

Let’s explore how intelligent semantics-based detection works.


The normalization process aggregates attack behaviors and features of the same type into a single attack feature. Multiple behavioral features of an attack form a specific permutation and combination that represents one type of attack, so that we can understand and describe that type of attack with natural-language semantics. This permutation and combination of attack features is the semantization of the attack.

By describing the attack behavior semantically, the engine can look past the many variants of a complex attack.

The following is an example of the semantic analysis of a web SQL injection attack. First, the SQL statement is normalized and analyzed semantically, and the result of the analysis is looked up in the abnormal attack set. If the result is found there, the request is a web SQL injection attack.

For example, consider the following rule:


After normalization, the rule can be described as: select from “sensitive keyword” “function operation()”. You can use characters such as a, b, c, d, and e to represent these elements. Such attacks can then be semantically described as: SQL expressions with sensitive keywords and function operations.
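The normalize-then-look-up idea can be sketched as follows. Everything in this example is an assumption made for illustration: the symbol assignment (a for union, b for select, c for a function call, d for from, e for a sensitive name), the `SENSITIVE_NAMES` set, and the single-entry `ABNORMAL_SIGNATURES` set. The actual engine's symbol table and attack set are not described in this article.

```python
import re

COMMENT = re.compile(r"/\*.*?\*/|--[^\n]*", re.DOTALL)  # /*...*/ and -- comments
TOKEN = re.compile(r"[a-z_][a-z0-9_]*|\(")  # identifiers and "("

# Hypothetical symbol table: one letter per normalized feature.
KEYWORD_SYMBOLS = {"union": "a", "select": "b", "from": "d"}
FUNCTION_SYMBOL = "c"
SENSITIVE_SYMBOL = "e"
SENSITIVE_NAMES = {"dual", "information_schema"}

# Hypothetical abnormal attack set, keyed by normalized signature.
ABNORMAL_SIGNATURES = {"abcde"}


def normalize(sql: str) -> str:
    """Strip comments, tokenize, and map each recognized feature to a letter."""
    tokens = TOKEN.findall(COMMENT.sub(" ", sql.lower()))
    symbols = []
    for index, token in enumerate(tokens):
        if token == "(":
            continue
        next_token = tokens[index + 1] if index + 1 < len(tokens) else ""
        if token in KEYWORD_SYMBOLS:
            symbols.append(KEYWORD_SYMBOLS[token])
        elif next_token == "(":  # an identifier followed by "(" is a call
            symbols.append(FUNCTION_SYMBOL)
        elif token in SENSITIVE_NAMES:
            symbols.append(SENSITIVE_SYMBOL)
    return "".join(symbols)


def is_injection(sql: str) -> bool:
    """Look the normalized signature up in the abnormal attack set."""
    return normalize(sql) in ABNORMAL_SIGNATURES


print(normalize("' union select version() from dual"))
print(is_injection("' union/*11*/select/*11*/version/**/()/*11*/from/*11*/dual"))
print(is_injection("Please select a seat from the union hall"))
```

Because comments and spacing are removed before the lookup, the plain and obfuscated payloads collapse to the same signature, whereas a harmless sentence that merely contains the same words in a different order does not.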

Exception-Based Protection

Preventing only known web security issues is passive and always lags behind the attackers. Exception-based protection is more effective.

The basic idea of exception-based protection is to establish a statistical model based on valid application data, and then use the model to identify whether actual traffic contains an attack.

In theory, the system can detect any anomaly as soon as it occurs. This eliminates the need for rule libraries, and the detection of zero-day vulnerabilities is no longer a problem.
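As a rough illustration of this idea, the sketch below learns a profile of one request parameter from valid samples and flags values that deviate strongly from it. The two features (value length and share of special characters), the `ParameterModel` class, the threshold of 3 standard deviations, and the `MIN_STDEV` floor are all assumptions chosen for this toy example and say nothing about the model Alibaba Cloud WAF actually uses.

```python
from statistics import mean, pstdev

MIN_STDEV = 0.05  # assumed floor so a near-constant feature cannot divide by zero


def special_ratio(value: str) -> float:
    """Fraction of characters that are neither alphanumeric nor whitespace."""
    if not value:
        return 0.0
    special = sum(1 for ch in value if not (ch.isalnum() or ch.isspace()))
    return special / len(value)


class ParameterModel:
    """Toy statistical profile of one request parameter, learned from valid data."""

    def __init__(self, valid_samples, threshold=3.0):
        self.threshold = threshold
        columns = list(zip(*(self._features(s) for s in valid_samples)))
        self.means = [mean(column) for column in columns]
        self.stdevs = [max(pstdev(column), MIN_STDEV) for column in columns]

    @staticmethod
    def _features(value):
        return (float(len(value)), special_ratio(value))

    def is_anomalous(self, value) -> bool:
        """Flag the value if any feature lies too many deviations from the mean."""
        scores = [
            abs(x - m) / s
            for x, m, s in zip(self._features(value), self.means, self.stdevs)
        ]
        return max(scores) > self.threshold


# Valid values observed for a hypothetical "username" parameter.
model = ParameterModel(["alice", "bob", "carol99", "dave", "eve_01", "mallory"])

for candidate in ["frank", "bob'--", "' union select version() from dual"]:
    print(f"{candidate!r}: anomalous={model.is_anomalous(candidate)}")
```

No attack signature appears anywhere in the model: the injection strings are flagged only because they look unlike the valid data, which is why this style of protection can in principle catch attacks that no rule has described yet.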

Alibaba Cloud Security WAF builds a library of intelligent semantic exception attacks based on Alibaba Cloud’s own operation data. WAF builds models for normal web applications, differentiates exceptions from the normal model, and then forms an exception attack set by extracting an exception attack model from a large number of web attacks.


In the future, the intelligent semantics-based detection engine will evolve into a real-time big data analysis engine. Key factors in this evolution include the optimization of algorithms, increases in computing capability, reductions in cost, and advances in data clustering and cleaning technologies.

The Alibaba Cloud Web Application Firewall (WAF) intelligent detection engine detects various web attacks through semantic analysis of attack behavior and exception-based protection. By describing attacks semantically, the engine is able to deal with a wide variety of attacks and their complex variants. Additionally, the security model based on statistical exception detection can prevent not only known web security threats but also unknown ones.