Stop 404 prying bots with HAProxy

wasteofserver - Feb 17 - Dev Community

This post was originally published at https://wasteofserver.com/stop-404-prying-bots-with-haproxy/, where you will find newer revisions and additional comments.

Your logs are filled with 404 hits on /.DS_Store, /backup.sql, /.vscode/sftp.json and a multitude of other URLs. These requests are mostly harmless (unless, of course, your server actually does have something to offer at those locations), but you should still keep the bots at bay.

Why?

Hitting a server is a resource-intensive task and, given that those bots work through an extensive list of different URLs, no caching mechanism can help you. Besides, stopping bots is always a sensible security measure.

We've previously used HAProxy to mitigate attacks on the WordPress login page. The idea is to extend that approach to also cover 404 errors.

I've taken inspiration from Sasa Tekovic, namely on not blocking actual search engine crawlers and on allowing 404s on static resources, so that genuinely missing resources (an error on your part) don't end up blocking legitimate users.
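To make the exclusion rule concrete, here is a minimal Python sketch of the decision, not part of the HAProxy setup itself. The extension list and crawler pattern are copied from the configuration used later in this post. The function name `is_tracked` is made up for illustration, and `path` is assumed to be the URL path without its query string.

```python
import re

# Extensions and crawler pattern copied from the HAProxy ACLs in this post.
STATIC_EXTENSIONS = (
    ".css", ".js", ".jpg", ".jpeg", ".gif", ".ico", ".png", ".bmp",
    ".webp", ".csv", ".ttf", ".woff", ".svg", ".svgz",
)
CRAWLER_PATTERN = re.compile(r"(yahoo|yandex|kagi|(google|bing)bot)", re.IGNORECASE)


def is_tracked(path: str, user_agent: str) -> bool:
    """Return True when a request would count towards the 404 limit."""
    # Suffix match on the path, similar in spirit to `path_end`.
    is_static = path.endswith(STATIC_EXTENSIONS)
    # Case-insensitive regex search, similar in spirit to `hdr_reg(user-agent) -i`.
    is_crawler = CRAWLER_PATTERN.search(user_agent) is not None
    return not is_static and not is_crawler
```

With this sketch, a request for /backup.sql from curl is tracked, while a missing /img/logo.png or any request announcing itself as Googlebot is not.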

Before implementing, it's always good to spin up a local test environment. Let's start HAProxy and Apache using Docker. We do need an actual backend server to give us those 404s.

version: '3'

services:
    haproxy:
        image: haproxy:3.1.3-alpine
        ports:
            - "8100:80"
        volumes:
            - "./haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg"
        networks:
            - webnet
    apache:
        image: httpd:latest
        container_name: apache1
        ports:
            - "8080:80"
        volumes:
            - ./html:/usr/local/apache2/htdocs/
        networks:
            - webnet

networks:
    webnet:

Then, simply run docker-compose up, and you can access localhost:8100 in your browser.

The haproxy.cfg file is pretty much self-explanatory:

global
    log stdout format raw daemon debug

defaults
    log global
    mode http

frontend main
    bind *:80

    acl static_file path_end .css .js .jpg .jpeg .gif .ico .png .bmp .webp .csv .ttf .woff .svg .svgz
    acl excluded_user_agent hdr_reg(user-agent) -i (yahoo|yandex|kagi|(google|bing)bot)

    # track IPs but exclude hits on static files and search engine crawlers
    http-request track-sc0 src table mock_404_tracking if !static_file !excluded_user_agent
    # increment gpc0 if response code was 404
    http-response sc-inc-gpc0(0) if { status 404 }
    # checks if the 404 error rate limit was exceeded
    http-request deny deny_status 403 content-type text/html lf-string "404 abuse" if { sc0_gpc0_rate(mock_404_tracking) ge 5 }

    # whatever backend you're using
    use_backend apache_servers

backend apache_servers
    server apache1 apache1:80 maxconn 32

# mock backend to hold a stick table
backend mock_404_tracking
    stick-table type ip size 100k expire 10m store gpc0,gpc0_rate(1m)

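Once a client racks up 5 or more 404 responses within a minute, HAProxy answers its requests with a 403 until that rate drops back below the threshold. The 10-minute expire only controls how long an idle entry stays in the stick table.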
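The counting logic can be modelled in a few lines of Python. This is a simplified sketch, not how HAProxy is implemented: it keeps an exact sliding window of timestamps per IP, whereas HAProxy computes its rate counters with its own approximation. The class and method names are hypothetical; the window and threshold mirror `gpc0_rate(1m)` and `ge 5` from the configuration.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60  # mirrors gpc0_rate(1m)
THRESHOLD = 5        # mirrors "ge 5"


class NotFoundLimiter:
    """Simplified model: exact sliding window of 404 timestamps per IP."""

    def __init__(self):
        self._hits = defaultdict(deque)

    def _rate(self, ip, now):
        hits = self._hits[ip]
        # Forget 404s that have fallen out of the window.
        while hits and now - hits[0] >= WINDOW_SECONDS:
            hits.popleft()
        return len(hits)

    def allow(self, ip, now):
        """Request phase: deny once the current rate has reached the threshold."""
        return self._rate(ip, now) < THRESHOLD

    def record_404(self, ip, now):
        """Response phase: count a 404 returned by the backend."""
        self._hits[ip].append(now)


# A bot probing one missing URL per second from a single address.
limiter = NotFoundLimiter()
results = []
for second in range(7):
    allowed = limiter.allow("203.0.113.7", second)
    results.append(allowed)
    if allowed:
        limiter.record_404("203.0.113.7", second)  # every probe misses

# results -> [True, True, True, True, True, False, False]
```

In this model, denied requests never reach the backend, so they add no new 404s, and the client is let through again once its oldest hits leave the one-minute window.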


As it stands, this setup effectively rate-limits bots generating excessive 404s. However, we also want to integrate it with our previous example, where we used HAProxy to block attacks on WordPress.

global
    log stdout format raw daemon debug

defaults
    log global
    mode http

frontend main
    bind *:80

    # We may, or may not, be running this with Cloudflare acting as a CDN.
    # If Cloudflare is in front of our servers, user/bot IP will be in
    # 'CF-Connecting-IP', otherwise user IP will be in 'src'. So we make
    # sure to set a variable 'txn.actual_ip' that has the IP, no matter what
    http-request set-var(txn.actual_ip) hdr_ip(CF-Connecting-IP) if { hdr(CF-Connecting-IP) -m found }
    http-request set-var(txn.actual_ip) src if !{ hdr(CF-Connecting-IP) -m found }

    # gets the actual IP on logs
    log-format "%ci\ %hr\ %ft\ %b/%s\ %Tw/%Tc/%Tt\ %B\ %ts\ %r\ %ST\ %Tr IP:%{+Q}[var(txn.actual_ip)]"

    # common static files where we may get 404 errors and also common search engine
    # crawlers that we don't want blocked
    acl static_file path_end .css .js .jpg .jpeg .gif .ico .png .bmp .webp .csv .ttf .woff .svg .svgz
    acl excluded_user_agent hdr_reg(user-agent) -i (yahoo|yandex|kagi|google|bing)

    # paths where we will rate limit users to prevent WordPress abuse
    acl is_wp_login path_end -i /wp-login.php /xmlrpc.php /xmrlpc.php
    acl is_post method POST

    # 404 abuse blocker
    # track IPs but exclude hits on static files and search engine crawlers
    # increment gpc0 counter if response status was 404 and deny if rate exceeded
    http-request track-sc0 var(txn.actual_ip) table mock_404_track if !static_file !excluded_user_agent
    http-response sc-inc-gpc0(0) if { status 404 }
    http-request deny deny_status 403 content-type text/html lf-string "404 abuse" if { sc0_gpc0_rate(mock_404_track) ge 5 }

    # wordpress abuse blocker
    # track IPs if the request hits one of the monitored paths with a POST request
    # increment gpc1 counter if path was hit and deny if rate exceeded
    http-request track-sc1 var(txn.actual_ip) table mock_wplogin_track if is_wp_login is_post
    http-request sc-inc-gpc1(1) if is_wp_login is_post 
    http-request deny deny_status 403 content-type text/html lf-string "login abuse" if { sc1_gpc1_rate(mock_wplogin_track) ge 5 }

    # your backend, here using apache for demonstration purposes
    use_backend apache_servers

backend apache_servers
    server apache1 apache1:80 maxconn 32

# mock backends for storing stick tables
backend mock_404_track
    stick-table type ip size 100k expire 10m store gpc0,gpc0_rate(1m)
backend mock_wplogin_track
    stick-table type ip size 100k expire 10m store gpc1,gpc1_rate(1m)

We are now running with two stick tables and stopping both threats.
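The client IP selection at the top of the frontend can also be sketched in Python to make the fallback explicit. This is an illustration under assumptions: `actual_ip` is a made-up helper, `headers` is assumed to be a plain dict of request headers, and unlike `hdr_ip` the sketch does not validate that the header holds a well-formed address.

```python
def actual_ip(headers: dict, src: str) -> str:
    """Pick the tracking key the way the two set-var rules do."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    cf_ip = normalised.get("cf-connecting-ip")
    # Use the Cloudflare header when present, otherwise the connection source.
    return cf_ip if cf_ip is not None else src
```

Both stick tables are then keyed on this value, so a bot behind Cloudflare is counted by its own address rather than by the address of the Cloudflare edge.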

And there you have it. HAProxy is once again used for much more than a simple reverse proxy. It's a little Swiss Army knife!

