Fighting an Endless War with Crawlers

Sep 14, 2018

By Jun Yang, Alibaba Web Application Firewall (WAF) Team

Crawler = Crawling Data?

I raise the old-fashioned question “what is a crawler” because of a recent discussion in our user group. A few days ago, a user whose verification code interface had been hit by automated refreshes was discussing protection schemes there. In his view, that traffic was not a crawler; a crawler is a program that crawls data, such as air ticket, hotel, and accommodation prices, news, novels, comics, comments, and SKUs.

The user has a valid point, and that is the traditional definition of a crawler. In this article, however, a crawler is any program that automates a series of web requests for some purpose. These purposes include, but are not limited to, rigging a poll to win an online competition, cracking verification codes (or forwarding them to a “coding” platform that solves them), and simulating normal users to book tickets without paying, which blocks genuine ticket purchases.

Nowadays, crawlers are openly “benefit-seeking”. Their activities range from acquiring core business information (such as pricing strategy and user information) to disrupting normal user activities (such as snapping up limited stock and malicious ticketing), and they have caused losses in business revenue, corporate reputation, and core data.

Is the Confrontation between Anti-crawler and “Anti-Anti-Crawler” Endless?

This issue is dialectical. According to the Alibaba Cloud security team, the answer depends on the level of crawler you are fighting against. Crawler traffic sources on the Internet fall into the following categories:

Professional black hat hackers
Advanced hackers using a large number of proxy IP addresses
Attackers using simulators
Attackers good at disguise
Beginners

As you can see, the list is arranged in descending order of confrontation difficulty.

Fortunately, low-tech crawlers make up the larger share. Both offense and defense cost money, so most crawler operators who fail to get data from you will simply move on to another target rather than study advanced anti-crawler policies. After all, most sites are not protected against crawlers.

Unprotected websites, apps, and APIs are targets for crawlers. You have to protect all of your websites, mobile apps, and APIs so that none of them becomes the entry point for a crawler attack. As the saying goes, a chain is only as strong as its weakest link. Therefore, an anti-crawler system that applies in all scenarios is crucial.

What Is Depth?

You must adapt your methods to the situation. As discussed above, different levels of attackers call for different approaches. In our countless days and nights of fighting web crawlers, we have seen everything from large-scale credential stuffing campaigns that could be stopped by simply banning an IP address to round-the-clock black-market teams equipped with sophisticated monitoring systems and skilled technicians.

From the perspective of confrontation, no system is absolutely secure and impossible to bypass. What we can do is continuously raise the bypass cost for attackers, a cost that grows exponentially as more layers of protection are stacked.

The anti-crawler ideas can be organized into four layers: direct banning based on a feature library, transparent JavaScript human-computer recognition, abnormal behavior detection, and a threat intelligence database.

1. Feature Detection

Experienced security personnel can quickly detect abnormal behaviors in the access log, such as:

Requests go directly to a page without any Referer, which normal users rarely do.
Requests that are redirected from the primary domain do not carry any cookies.
The User-Agent (UA) contains Python/Java/xxBot/Selenium.
A lot of overseas IP addresses appear on a provincial local-life forum.
Request bodies contain the same phone number a large number of times.

These “features”, whether obvious or inconspicuous, can serve as the first crawler detection policy: feature banning. The features may be various HTTP headers, bodies, and combinations of them. The Alibaba Cloud crawler risk management product provides flexible Layer 7 access control policies.
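To make the idea of feature banning concrete, here is a minimal Python sketch that checks a request against a few of the features listed above. The `Request` class, the token list, and the threshold are all assumptions made for illustration; they are not part of any Alibaba Cloud product or API.

```python
from dataclasses import dataclass, field

# Hypothetical substrings that suggest an automation tool in the User-Agent.
# Note that "bot" also matches legitimate search engine crawlers.
SUSPICIOUS_UA_TOKENS = ("python", "java", "bot", "selenium")


@dataclass
class Request:
    """Minimal stand-in for a parsed HTTP request (not a real framework type)."""
    path: str
    headers: dict = field(default_factory=dict)
    cookies: dict = field(default_factory=dict)


def matched_features(req: Request) -> list:
    """Return the names of the suspicious features this request exhibits."""
    hits = []
    ua = req.headers.get("User-Agent", "").lower()
    if any(token in ua for token in SUSPICIOUS_UA_TOKENS):
        hits.append("automation_user_agent")
    # A deep page requested with no Referer is unusual for a browser user.
    if req.path != "/" and not req.headers.get("Referer"):
        hits.append("missing_referer")
    # A deep page requested with no cookies at all is also unusual.
    if req.path != "/" and not req.cookies:
        hits.append("missing_cookies")
    return hits


def should_block(req: Request, threshold: int = 2) -> bool:
    """Block only when several features combine, to limit false positives."""
    return len(matched_features(req)) >= threshold
```

Requiring a combination of features rather than a single match reflects the point above that features may be combined; the threshold of 2 is arbitrary and would need tuning against real traffic.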

2. JavaScript Transparent Human-Computer Recognition

In addition to access control, another common approach is to use JavaScript to collect operating behaviors, device hardware information, and fingerprints in the web environment, and from them determine whether a request comes from an automated tool. The idea is simple, but in front-end confrontation nothing can be kept secret from the attacker, so even professional security teams must work painstakingly to ensure the accuracy of the collected information and of the risk judgment model.

Alibaba has integrated the verification code technology it has accumulated over the years into its anti-crawler products. Users can enable it without any changes to their business code and acquire the same human-computer recognition capability as Taobao in one click. The products perform well, while remaining imperceptible to legitimate users, in scenarios such as protection against spam registration, credential stuffing, brute-force cracking, verification code refreshing, and malicious ordering, as shown in the following figure.

3. Abnormal Behavior Detection

When it comes to abnormal behaviors, most of us think of speed limiting (rate limiting) for overly active clients. That’s right, but speed limiting involves a lot of details. For example, which path is used as the speed limiting condition?
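The question of which path to use as the limiting condition can be illustrated with a small sliding-window limiter whose key is chosen by the caller. This is only a sketch under assumed names (`SlidingWindowLimiter`, the example IP and paths); it keeps state in memory and is not meant as a production design.

```python
import time
from collections import defaultdict, deque
from typing import Hashable, Optional


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each key."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self._hits = defaultdict(deque)  # key -> timestamps of accepted requests

    def allow(self, key: Hashable, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[key]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over the limit; the rejected request is not recorded
        hits.append(now)
        return True


# Keying on (client IP, path) limits a sensitive endpoint such as a
# hypothetical SMS verification API without throttling ordinary page views.
limiter = SlidingWindowLimiter(limit=3, window=60.0)
allowed = limiter.allow(("203.0.113.7", "/api/sms/send"))
```

Keying on the IP alone would throttle a whole office behind one NAT address, while keying on the path alone would throttle every user of that path; the choice of key is one of the details the paragraph above alludes to.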

Of course, abnormal behavior detection is much more than speed limiting. It is also a good idea to model traffic from the perspective of user behavior analysis (UBA) with machine learning methods. Machine learning combines multiple angles and dimensions to identify crawlers, which raises the cost of bypassing and confrontation; this is its advantage over rule-based protection. In addition, machine learning can compensate for a weakness of rule-based detection: deliberately low-frequency requests spread across scattered IP addresses.

Currently, the Alibaba Cloud security algorithm team has launched more than a dozen models to identify malicious crawlers in various scenarios, covering timing exceptions, request distribution exceptions, business logic exceptions, context exceptions, and fingerprint anomalies. The Alibaba Cloud platform provides powerful real-time computing capability to support real-time abnormal behavior detection. Real time is crucial: as crawlers become increasingly smart, an advanced crawler may no longer be identifiable by the time a conventional, non-real-time algorithm produces its result, which seriously reduces detection efficiency.
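As an illustration of what a timing exception might look like, the sketch below flags clients whose requests arrive at almost perfectly regular intervals. It uses a plain statistic (the coefficient of variation of the gaps between requests) as a stand-in for a trained model; the function names and thresholds are assumptions, and this is not how the Alibaba Cloud models are described as working.

```python
from statistics import mean, pstdev


def interval_cv(timestamps: list) -> float:
    """Coefficient of variation of the gaps between consecutive requests."""
    if len(timestamps) < 2:
        raise ValueError("need at least two timestamps")
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    avg = mean(gaps)
    # All requests at the same instant: treat as perfectly regular.
    return pstdev(gaps) / avg if avg > 0 else 0.0


def looks_scripted(timestamps, min_requests: int = 10,
                   cv_threshold: float = 0.1) -> bool:
    """Flag clients whose request spacing is almost perfectly regular."""
    if len(timestamps) < min_requests:
        return False  # too little evidence to judge
    return interval_cv(sorted(timestamps)) < cv_threshold
```

A crawler that adds random jitter to its delays would defeat this single signal, which is one reason to combine several dimensions instead of relying on any one of them.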

4. Threat Intelligence Capabilities

Collaborative threat defense is a powerful way to ward off crawlers. In the aviation industry, air tickets are always a focus of crawlers. A crawler operated by a scalper or travel agency typically visits all the major airlines to obtain the most complete fare information. So when we detect a crawler visiting airlines A, B, C, D, and E, is it likely to crawl airline X as well? Yes, of course. This is not an assumption but a fact we have observed in actual traffic.

We can extend this idea. A crawler library generated in real time (note the words “real time”) from suspicious behavior observed across a number of airline websites, based on a certain model, is a typical collaborative defense model on the cloud. When airline X newly joins the service, we can use this intelligence to protect it in advance.
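The collaborative idea can be sketched as a shared store that records which sites have flagged an IP address and treats the address as a known crawler once enough distinct sites agree within a recent time window. The class name, the threshold of three sites, and the one-hour window are assumptions chosen for illustration.

```python
from collections import defaultdict


class SharedCrawlerIntel:
    """Track which sites have flagged an IP, counting only recent sightings."""

    def __init__(self, min_sites: int = 3, ttl: float = 3600.0):
        self.min_sites = min_sites
        self.ttl = ttl  # seconds a sighting stays relevant
        self._seen = defaultdict(dict)  # ip -> {site: last_flagged_time}

    def report(self, ip: str, site: str, now: float) -> None:
        """Record that `site` flagged `ip` as a crawler at time `now`."""
        self._seen[ip][site] = now

    def is_known_crawler(self, ip: str, now: float) -> bool:
        """True if enough distinct sites flagged the IP within the TTL."""
        sightings = self._seen.get(ip, {})
        recent = [s for s, t in sightings.items() if now - t <= self.ttl]
        return len(recent) >= self.min_sites


intel = SharedCrawlerIntel(min_sites=3, ttl=3600.0)
for site in ("airline-a", "airline-b", "airline-c"):
    intel.report("198.51.100.9", site, now=100.0)

# A newly onboarded site can consult the shared store before it has
# observed any traffic of its own from this address.
block_on_new_site = intel.is_known_crawler("198.51.100.9", now=200.0)
```

The expiry window matters because proxy IP addresses are rented and recycled; an address that was malicious last week may belong to an ordinary user today.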

This is easy to understand even from the attacker’s perspective. Although today’s professional crawlers rent large-scale proxy IP pools and broadband IP pools to avoid detection, attack cost and resource reuse still have to be considered, so different attackers may obtain the same IP pool from the same IP reseller. As a result, once proxy IP pools grow to a certain size, a high degree of overlap emerges in the sources of malicious behavior.

At present, the Alibaba Cloud security team has derived a large number of threat intelligence databases from analysis of cloud traffic. Backed by the computing power of the cloud platform, these databases are recomputed over the past hour, day, or week of traffic (depending on the scenario) to keep up with the rapidly changing resource pools of the black and gray industries. This is another important component of our anti-crawler system.

A Good Anti-Crawler System Should Reflect the Value of Users

Attack and defense are always dynamic. No policy is ideal for all scenarios. Therefore, a good security product should reflect users’ value and help security engineers make full use of their expertise and experience. The Alibaba Cloud security team is committed to providing crawler risk management products and creating a set of flexible “tools”, helping users to skip cumbersome implementation details and deploy protection rules directly at the policy or even business level. In addition, Alibaba uses its massive data and computing power, elastic capacity expansion, and threat intelligence on the cloud to help users customize anti-crawler systems quickly.

Furthermore, with Alibaba Cloud’s Log Service, we can quickly analyze current traffic or set up custom monitoring and alarms on service indicators, such as ranking the IP addresses under a domain name by number of requests in the last half hour, how often a given policy is hit or bypassed, the number of registrations or orders per minute, and how often slider challenges are shown and passed. In this way, we complete a closed loop of detection, disposal, monitoring, and confrontation.
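The first of these indicators, ranking IP addresses by access count under a domain in the last half hour, can be sketched in a few lines of Python. The record layout (a tuple of timestamp, domain, and IP) is a hypothetical simplification of parsed access-log entries, and this sketch does not use the Log Service query interface.

```python
from collections import Counter
from datetime import datetime, timedelta


def top_ips(records, domain, now, window=timedelta(minutes=30), n=10):
    """Rank client IPs by request count for `domain` within the last `window`.

    `records` is an iterable of (timestamp, domain, ip) tuples.
    """
    cutoff = now - window
    counts = Counter(
        ip for ts, dom, ip in records
        if dom == domain and cutoff <= ts <= now
    )
    return counts.most_common(n)


now = datetime(2018, 9, 14, 12, 0)
records = [
    (now - timedelta(minutes=5), "example.com", "203.0.113.7"),
    (now - timedelta(minutes=10), "example.com", "203.0.113.7"),
    (now - timedelta(minutes=20), "example.com", "198.51.100.9"),
    (now - timedelta(minutes=45), "example.com", "198.51.100.9"),  # too old
    (now - timedelta(minutes=1), "other.example", "203.0.113.7"),  # other domain
]
ranking = top_ips(records, "example.com", now)
```

An alarm could then be attached to this ranking, for example by alerting when the top address exceeds a request count that is unusual for the site.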

Conclusion

Anti-crawler and anti-anti-crawler technologies are locked in an endless war. Like any war, it is a fight over resources. Because the large majority of attacks are at a basic level, hardening the security of your websites, apps, and APIs can protect you from most crawlers. For advanced crawlers and anti-anti-crawler techniques, professional products such as those from the Alibaba Cloud Security Team can provide an additional layer of protection.

Reference:

https://www.alibabacloud.com/blog/fighting-an-endless-war-with-crawlers_593977?spm=a2c4.12011610.0.0