Utilize the free WAF SafeLine to solve the problem of "Crawlers Occupying Network Bandwidth"

ButterflyI8 - Aug 26 - Dev Community

1. Background

Related Terms: Frequency Limiting, Access Control, Crawlers, Anti-crawling, WAF, SafeLine

Automated bots and malicious crawlers tend to hit a website frequently and over long periods. In the cloud server's management console, you will often find that most of the network traffic comes from one or a few IP addresses. This can usually be handled with a straightforward approach: per-IP frequency limiting on the server.

However, IP frequency limiting usually has little to do with business logic, and developers often prefer not to maintain an IP access frequency table themselves. Moreover, manually tracking every visitor in a distributed, concurrent environment carries a significant development cost.

Chaitin's WAF SafeLine addresses these problems. It provides frequency limiting, port forwarding, and manual IP blacklisting and whitelisting, in addition to its core function of defending against web attacks.

2. Installation

The official website provides several installation methods, which are not covered in this document. For details, please refer to: SafeLine

3. Configuring Sites and Frequency Limiting Functions

3.1 Site Configuration on SafeLine

SafeLine's site configuration is fairly comprehensive: it covers automatic uploading of TLS certificates and private keys, specifying multiple forwarding ports, and so on, so developers do not need to configure nginx forwarding themselves.


3.2 Configure the frequency limit function

The blocking strategy can be customized. A recommended setting is to limit each visitor to 100 requests within 10 seconds and to ban the offender for 10 minutes when the limit is exceeded.


By the way, if you trigger the ban while testing, or if it turns out to be a false alarm, the ban can be lifted manually.
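To make the "100 requests in 10 seconds, ban for 10 minutes" policy concrete, here is a minimal sketch of the same idea in plain Python. It is not SafeLine's implementation; the class `IpRateLimiter` and its method names are hypothetical, and it only works inside a single process, which is exactly the limitation that makes a hand-rolled version costly in distributed setups.

```python
import time
from collections import defaultdict, deque


class IpRateLimiter:
    """Hypothetical sliding-window limiter: too many hits in the window starts a ban."""

    def __init__(self, max_requests=100, window_seconds=10, ban_seconds=600):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.ban_seconds = ban_seconds
        self._hits = defaultdict(deque)   # ip -> timestamps of recent requests
        self._banned_until = {}           # ip -> time at which the ban expires

    def allow(self, ip, now=None):
        """Return True if the request may pass, False if the IP is banned."""
        now = time.monotonic() if now is None else now

        # Reject immediately while a ban is still active.
        banned_until = self._banned_until.get(ip)
        if banned_until is not None:
            if banned_until > now:
                return False
            del self._banned_until[ip]

        # Drop timestamps that have fallen out of the window.
        hits = self._hits[ip]
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()

        hits.append(now)
        if len(hits) > self.max_requests:
            # Over the limit: start the ban and reset the counter.
            self._banned_until[ip] = now + self.ban_seconds
            hits.clear()
            return False
        return True

    def unban(self, ip):
        """Lift a ban manually, e.g. after a false alarm."""
        self._banned_until.pop(ip, None)
        self._hits.pop(ip, None)


# 101 requests from one address within about one second.
limiter = IpRateLimiter()
results = [limiter.allow("203.0.113.7", now=i * 0.01) for i in range(101)]
print(results.count(True), results[-1])
```

With the default settings, the first 100 requests pass and the 101st is rejected, after which the address stays banned until the ban expires or `unban` is called.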

4. Testing and Other

4.1 Testing

A simple server is set up as the backend, exposing a "hello" endpoint that takes a parameter named "a".

Write a simple crawler for testing:
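The backend itself is not shown in the article. As a rough stand-in, a server with this behavior could look like the sketch below, which uses only Python's standard library. The handler, the helper names, and the port are assumptions; the JSON shape is taken from the output printed later in this section.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import parse_qs, urlparse


def build_hello_body(path):
    """Return the JSON body for a /hello request path, or None if the path does not match."""
    parsed = urlparse(path)
    if parsed.path != "/hello":
        return None
    # Echo the first value of the "a" query parameter (empty string if absent).
    value = parse_qs(parsed.query).get("a", [""])[0]
    return json.dumps({"a": value}, separators=(",", ":")).encode("utf-8")


class HelloHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = build_hello_body(self.path)
        if body is None:
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(host="127.0.0.1", port=8000):
    """Create (but do not start) the test server; the port is an arbitrary choice."""
    return ThreadingHTTPServer((host, port), HelloHandler)
```

Calling `make_server().serve_forever()` starts it; SafeLine would then be configured to forward the site's traffic to this address.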

import random
import requests

# Stand-in for the author's Config.get_global_config().header, which is not shown here.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0"}

def send_request(url, request_method="GET", header=None, data=None):
    """Send one HTTP request; return the response, or None on failure."""
    if header is None:
        header = DEFAULT_HEADERS
    try:
        return requests.request(request_method, url, headers=header, data=data, timeout=10)
    except requests.RequestException as err:
        print(err)
        return None

if __name__ == '__main__':
    # Fire 100 requests in a row, each with a random single-letter parameter.
    for _ in range(100):
        letter = random.choice('abcdefghijklmnopqrstuvwxyz')
        resp = send_request("http://a.com/hello?a=" + letter)
        if resp is not None:
            print(resp.content)

Printed output:

b'{"a":"u"}'
b'{"a":"m"}'
b'{"a":"y"}'
b'{"a":"o"}'
b'<!DOCTYPE html>\n\n<html lang="zh">\n  <head>\n .... #

At this point, if you revisit the page, you will find that it has been blocked.
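When running such a test, it can help to spot automatically where the responses switch from JSON to the block page. The sketch below is only a heuristic based on the output shown above: it treats any body that is not valid JSON as blocked. The function names are hypothetical, and it deliberately does not rely on a status code, since the article does not show which one SafeLine returns.

```python
import json


def looks_blocked(body: bytes) -> bool:
    """Heuristic: the test endpoint returns JSON, so any other body counts as a block page."""
    try:
        json.loads(body.decode("utf-8"))
        return False
    except (UnicodeDecodeError, json.JSONDecodeError):
        return True


def first_blocked_index(bodies):
    """Return the index of the first response that looks blocked, or None."""
    for index, body in enumerate(bodies):
        if looks_blocked(body):
            return index
    return None


# Bodies shaped like the printed output above (the HTML one is shortened).
sample = [b'{"a":"u"}', b'{"a":"m"}', b'<!DOCTYPE html>\n\n<html lang="zh">']
print(first_blocked_index(sample))  # prints 2
```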


4.2 What if some cunning crawlers falsify the X-Forwarded-For request header?

In SafeLine you can select "Socket Connection" under "Global Settings" -> "Get Attack IP From". With this setting, the source IP is taken from the TCP connection itself rather than from a request header.
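The difference between the two sources can be illustrated with a small sketch. This is not SafeLine code; `resolve_client_ip` and its parameters are hypothetical, and the headers are assumed to be a plain dict with the exact key `X-Forwarded-For`. It shows why a limiter keyed on the header can be sidestepped while one keyed on the socket address cannot.

```python
def resolve_client_ip(peer_ip, headers, trust_forwarded_header=False):
    """Pick the address that frequency limiting should be keyed on.

    peer_ip: source address of the TCP connection.
    headers: request headers; X-Forwarded-For is set by whoever sends the request.
    """
    if trust_forwarded_header:
        forwarded = headers.get("X-Forwarded-For", "")
        first_hop = forwarded.split(",")[0].strip()
        if first_hop:
            return first_hop
    return peer_ip


# A crawler connecting from 198.51.100.9 claims to be 10.0.0.1.
spoofed = {"X-Forwarded-For": "10.0.0.1"}
print(resolve_client_ip("198.51.100.9", spoofed, trust_forwarded_header=True))  # prints 10.0.0.1
print(resolve_client_ip("198.51.100.9", spoofed))  # prints 198.51.100.9
```

If the header is trusted, the crawler can present a fresh address on every request and never reach the limit; keyed on the socket address, all of its requests count against the same IP.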


You might ask, "What if the crawler is extremely cunning and forges the TCP source IP as well?" In that case the server's replies go to the forged address instead of to the crawler, so the TCP handshake that HTTP depends on fails outright. The crawler loses its ability to crawl anything, and nothing it sends gets served by nginx.
