5 Ways To Check For Duplicates In Collections, With Benchmarks

Milan Jovanović - Sep 10 '23 - - Dev Community

In this week's newsletter, we will take a look at five different ways to check if a collection contains duplicates.

I'm going to explain the idea behind each algorithm , discuss the algorithm complexity (Big O Notation), and at the end, we'll look at some benchmark results.

The five approaches for finding a duplicate will use the:

  • foreach loop
  • LINQ Any method
  • LINQ All method
  • LINQ Distinct method
  • LINQ ToHashSet method

Let's see how we can implement each approach.

Check For Duplicates With ForEach Loop

The first implementation will use the foreach loop and the HashSet data structure.

Here's the code for the ContainsDuplicates method:

public bool ContainsDuplicates<T>(IEnumerable<T> enumerable)
{
   HashSet<T> set = new();

   foreach(var element in enumerable)
   {
      if (!set.Add(element))
      {
         return true;
      }
   }

   return false;
}
Enter fullscreen mode Exit fullscreen mode

The idea is simple:

  • Loop through the collection
  • Add each element to the HashSet
  • When HashSet.Add returns false we found a duplicate
  • If we loop through the entire collection there are no duplicates

In terms of algorithm complexity , this would be O(n) or linear complexity. This is because there's only one iteration through the collection.

Adding an element to a HashSet is a constant operation - O(1). So it doesn't affect the overall complexity.

Check For Duplicates With LINQ Any

We'll combine the idea from the previous implementation of using the HashSet and pair it with the LINQ Anymethod to iterate over the collection.

Here's the implementation for the ContainsDuplicates method:

public bool ContainsDuplicates<T>(IEnumerable<T> enumerable)
{
   HashSet<T> set = new();

   return enumerable.Any(element => !set.Add(element));
}
Enter fullscreen mode Exit fullscreen mode

You can see this implementation is significantly shorter. But it works the same as the one with the foreach loop.

If any element in the collection satisfies the specified expression,Any will short-circuit and return true. Otherwise, it will iterate over the entire collection and return false.

We're still looking at linear complexity here, O(n).

Check For Duplicates With LINQ All

For our third implementation, we will use the opposite of the LINQ Any method - the LINQ All method.

Here's the implementation with LINQ All:

public bool ContainsDuplicates<T>(IEnumerable<T> enumerable)
{
   HashSet<T> set = new();

   return !enumerable.All(set.Add);
}
Enter fullscreen mode Exit fullscreen mode

The idea here is a little different than in the previous implementation.

All will return true if all elements in a collection satisfy the specified expression.

If at least one element doesn't satisfy the condition - in our case when a duplicate is found - it will short-circuit and return false.

This is still linear complexity, O(n).

Check For Duplicates With LINQ Distinct

So far, we've seen a few implementations using the HashSet data structure. Now let's consider a different approach.

We can use the LINQ Distinct method to check for duplicates.

Here's the code for the ContainsDuplicates method:

public bool ContainsDuplicates<T>(IEnumerable<T> enumerable)
{
   return enumerable.Distinct().Count() != enumerable.Count();
}
Enter fullscreen mode Exit fullscreen mode

The idea is first find the Distinct elements and Count them, and then compare that to the number of all elements.

If the number of distinct elements is not equal to the number of all elements, we have a duplicate value.

In terms of algorithm complexity , this is still linear complexity.

But we have at least two iterations through the collection or three in the worst-case scenario.

We have one iteration for Distinct and one more iteration for the call to Count right after that. The last call to Count can return in constant time, if the collection is an array or List.

Check For Duplicates With LINQ ToHashSet

For the last implementation we will use the LINQ ToHashSet method.

It takes a collection and creates a HashSet instance from it.

Here's what the ContainsDuplicates implementation looks like:

public bool ContainsDuplicates<T>(IEnumerable<T> enumerable)
{
   return enumerable.ToHashSet().Count != enumerable.Count();
}
Enter fullscreen mode Exit fullscreen mode

We compare the number of elements in the HashSet to the number of elements in the collection.

If they are different, we have a duplicate value.

This is also linear complexity, O(n).

Benchmark Results

Now that we've seen our implementations let's put them to the test.

I ran the benchmark for collections of varying sizes:

  • 100
  • 1,000
  • 10,000

Each collection contains exactly one duplicate value located somewhere around the middle of the collection.

Here are the results:

Benchmark results

The approach using the foreach loop comes out as the clear winner in terms of performance.

However, I would lean towards using the implementations with LINQ Any or All because of their simplicity.

You can find the source code for the benchmarkon my GitHub. Feel free to submit a PR with a faster implementation if you can think of one.


P.S. Whenever you're ready, there are 2 ways I can help you:

  1. Pragmatic Clean Architecture: This comprehensive course will teach you the system I use to ship production-ready applications using Clean Architecture. Learn how to apply the best practices of modern software architecture. Join 950+ students here.

  2. Patreon Community: Think like a senior software engineer with access to the source code I use in my YouTube videos and exclusive discounts for my courses. Join 820+ engineers here.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .