Battle of the WAFs: Testing Detection and Performance Across Open-Source Firewalls

Lulu · Aug 27 · Dev Community

Recently, I had the opportunity to recommend some valuable security tools to my clients, with Web Application Firewalls (WAFs) being among the top suggestions. WAFs are essential for protecting web applications by filtering and monitoring HTTP traffic, and their attack protection capabilities are crucial.

In this article, I'll walk you through how to scientifically test the effectiveness of a WAF, ensuring you make an informed decision when selecting one for your web environment.

Testing Methodology

To keep the test fair and unbiased, all the target machines, testing tools, and test samples I used are open-source. The test results are evaluated based on four key metrics:

  1. Detection Rate: This reflects how comprehensive the WAF’s detection capabilities are. A lower detection rate indicates a higher number of undetected threats, known as "false negatives."

  2. False Positive Rate: This measures how often the WAF mistakenly blocks legitimate traffic, known as "false positives." A higher false positive rate could interfere with normal web traffic.

  3. Accuracy Rate: This combines correct detections and correct pass-throughs into a single figure, giving an overall picture of the WAF’s reliability and balancing missed detections against false alarms.

  4. Detection Latency: This reflects the WAF’s performance overhead; higher latency means slower request processing.

Calculation of Metrics

Detection latency is straightforward and can be measured directly by the testing tool. The other metrics need a bit more care and are easiest to explain with standard binary-classification terms from statistics:

  • TP (True Positives): The number of attack samples that were correctly intercepted.
  • TN (True Negatives): The number of legitimate requests that were correctly allowed through.
  • FN (False Negatives): The number of attack samples that slipped through the WAF undetected.
  • FP (False Positives): The number of legitimate requests that were wrongly intercepted.

Here’s how you can calculate the key metrics:

  • Detection Rate: TP/(TP+FN)
  • False Positive Rate: FP/(FP+TN)
  • Accuracy Rate: (TP+TN)/(TP+TN+FP+FN)

To reduce measurement error from random outliers, I report "detection latency" as two figures: "90% average latency" and "99% average latency."
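
To make the arithmetic concrete, here is a minimal Python sketch of the metric calculation. The Counts structure and function names are my own illustration, the sample numbers are hypothetical, and I'm interpreting "90% average latency" as the mean over the fastest 90% of requests (an assumption; a percentile reading would also be defensible):

    from dataclasses import dataclass

    @dataclass
    class Counts:
        tp: int  # attack samples correctly intercepted
        tn: int  # legitimate requests correctly allowed through
        fp: int  # legitimate requests wrongly intercepted
        fn: int  # attack samples that slipped through

    def detection_rate(c: Counts) -> float:
        return c.tp / (c.tp + c.fn)

    def false_positive_rate(c: Counts) -> float:
        return c.fp / (c.fp + c.tn)

    def accuracy(c: Counts) -> float:
        return (c.tp + c.tn) / (c.tp + c.tn + c.fp + c.fn)

    def trimmed_mean_latency(latencies_ms, keep):
        # Mean latency over the fastest `keep` fraction of requests,
        # e.g. keep=0.90 for the "90% average latency" figure.
        ordered = sorted(latencies_ms)
        k = max(1, int(len(ordered) * keep))
        return sum(ordered[:k]) / k

    # Hypothetical numbers, for illustration only
    c = Counts(tp=550, tn=60000, fp=707, fn=50)
    print(f"detection rate:      {detection_rate(c):.2%}")
    print(f"false positive rate: {false_positive_rate(c):.2%}")
    print(f"accuracy:            {accuracy(c):.2%}")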

Test Samples

All test data was sourced from my own browser, and I used Burp Suite to capture the traffic. Here’s how I gathered the samples:

  • White Samples: These represent normal traffic. I collected 60,707 HTTP requests by browsing various forums, totaling 2.7 GB of data (this took about 5 hours).

  • Black Samples: These represent attack traffic. I used four different methods to gather a total of 600 HTTP requests, also over 5 hours.

    • Simple common attack traffic: I deployed the DVWA (Damn Vulnerable Web Application) target machine and tested all common vulnerabilities one by one.
    • Common attack traffic: I used all the attack payloads provided by PortSwigger's official website.
    • Targeted vulnerability traffic: I deployed VulHub target environments and attacked classic vulnerabilities using their default PoCs (proofs of concept).
    • Evasion attack traffic: I raised DVWA's protection level and repeated the attacks under the medium and high security settings.

Testing Setup

With the indicators and samples ready, I needed three components: the WAF, a target machine to receive traffic, and testing tools.

  • WAF Configuration: All WAFs were tested with their default configurations.
  • Target Machine: I used Nginx, configured to return a 200 status code for any received request:

    location / {
        # Always answer 200 so any non-200 result can be attributed to the WAF
        default_type text/plain;
        return 200 'hello WAF!';
    }
    
  • Testing Tool Requirements: The tool I used had to do the following (a rough sketch of such a tool follows this list):

    • Parse Burp’s export results.
    • Repackage data according to the HTTP protocol.
    • Remove the Cookie Header (since the data will be open-sourced).
    • Modify the Host Header to ensure the target machine receives traffic normally.
    • Determine if a request was intercepted by the WAF based on whether a 200 status code was returned.
    • Send packets evenly after mixing black and white samples.
    • Automatically calculate the testing indicators.
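
The post doesn't name the tool, so the following is only a rough Python sketch of that workflow. It assumes Burp's XML "Save items" export (base64-encoded raw requests) and the requests library; the file names, target address, and helper names are illustrative, and the even pacing of mixed samples plus the metric calculation are omitted for brevity:

    import base64
    import time
    import xml.etree.ElementTree as ET

    import requests

    def parse_burp_export(path):
        # Pull the raw HTTP requests out of a Burp "Save items" XML export,
        # which is assumed to store each request base64-encoded per <item>.
        raw_requests = []
        for item in ET.parse(path).getroot().iter("item"):
            req = item.find("request")
            if req is None or not req.text:
                continue
            data = req.text.encode()
            raw_requests.append(base64.b64decode(data) if req.get("base64") == "true" else data)
        return raw_requests

    def replay(raw, target):
        # Re-send one captured request through the WAF to the Nginx backend.
        # Returns (intercepted, latency_ms); "intercepted" means the backend's
        # unconditional 200 response never came back.
        head, _, body = raw.partition(b"\r\n\r\n")
        lines = head.decode("iso-8859-1").split("\r\n")
        method, request_path, _ = lines[0].split(" ", 2)

        headers = {}
        for line in lines[1:]:
            name, _, value = line.partition(": ")
            if name.lower() in ("cookie", "host", "content-length"):
                continue  # cookies are stripped; Host and Content-Length are rebuilt
            headers[name] = value

        start = time.perf_counter()
        resp = requests.request(method, target + request_path,
                                headers=headers, data=body, timeout=10)
        latency_ms = (time.perf_counter() - start) * 1000
        return resp.status_code != 200, latency_ms

    # Hypothetical file names and target address, for illustration only
    results = []
    for sample_file, is_attack in [("white.xml", False), ("black.xml", True)]:
        for raw in parse_burp_export(sample_file):
            intercepted, latency = replay(raw, "http://127.0.0.1:8080")
            results.append((is_attack, intercepted, latency))
    print(f"replayed {len(results)} requests")

From the (is_attack, intercepted) pairs you can tally TP, TN, FP, and FN and feed them into the metric functions shown earlier.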

Testing Results

SafeLine WAF
(result screenshot)

Coraza
(result screenshot)

ModSecurity
(result screenshot)

Baota WAF
(result screenshot)

nginx-lua-waf
(result screenshot)

SuperWAF
(result screenshot)

Comparison Table

(comparison screenshot)

Conclusion

Among the tested WAFs, SafeLine WAF showed the best overall performance, with the fewest false positives and false negatives. Coraza and ModSecurity had high detection rates but suffered from too many false positives, which may not be ideal for real-world applications.

It’s important to note that different test samples and methods can lead to varying results. Therefore, it’s crucial to choose the right samples and methods that align with your specific environment.

These results are meant to provide insights but should not be the sole factor in evaluating WAFs. Always consider your unique needs when choosing a security solution.
