Introduction
The Google Hash Code is a team coding competition organized by Google. The qualification round usually takes place in February and the organizers usually publish a practice challenge to warm you up. In this post, I explain how I solved this challenge with just 70 lines of Python code.
Description of the challenge
As usual, the practice challenge for the Hash Code 2022 is about Pizzas. This time, the goal is to find the only pizza that appeals to as many people as possible.
The input of the challenge is a list of potential customers with the list of ingredients they like and the list of ingredients they dislike. For example:
3
2 cheese peppers
0
1 basil
1 pineapple
2 mushrooms tomatoes
1 basil
means that we have 3 customers :
- The first potential customers likes 2 ingredients, namely "cheese" and "peppers" and dislikes nothing (0 ingredient).
- The second potential customers likes 1 ingredient, namely "basil" and also dislikes 1 ingredient, namely "pineapple".
- The third potential customers likes 2 ingredients, namely "mushrooms" and "tomatoes" and dislikes 1 ingredient, namely "basil".
A customer would buy the pizza if it has all the ingredients he likes and none of the ingredient he dislikes.
So using the set theory notation, given a pizza with the set of ingredients , the customer with the set of ingredients he likes and the set of ongredients he dislikes would buy it if and only if and .
To submit the solution, we just write a single line with the number of ingredients followed by the list of the ingredients on the pizza. For example:
4 cheese mushrooms tomatoes peppers
means that the only pizza we serve has 4 ingredients, namely "cheese", "mushrooms", "tomatoes" and "peppers".
Analysis
One possible solution for this problem would be to try all possible combination of ingredients. This would work if the number of ingredients is rather small, but the most difficult input has 10000 ingredients and there are possible combinations (this is a 2 followed by 3010 zeros)!
As usually with such challenges, we need another approach. The most basic one consists of computing many pizzas with a random set of ingredients and keeping the one with the best score. This would work and would be very fast, but it will not converge toward the best solution. But we can keep the idea and use a genetic algorithm to improve the convergence. This is what we are going to do!
Coding
The Hash Code can be solved with any programming language, but for this example, I will use Python 3. Python 3 is a rather simple language and there is a good library for genetic algorithm. So let's begin.
I start by implementing two classes :
- The
customer
(with the ingredients he likes and dislikes) - The
challenge
itself (with the list of customers, the set of all ingredients, the solution, and the score)
from dataclasses import dataclass, field
@dataclass
class Customer:
likes: set[str]
dislikes: set[str]
@dataclass
class Challenge:
customers: list[Customer] = field(default_factory=list)
all_ingredients: set = field(default_factory=set)
solution: set = field(default_factory=set)
score: int = 0
Then in the Challenge
class, I implement a method to reset the challenge (if I want to keep the instance of this class for several inputs):
class Challenge:
def reset(self):
self.customers = []
self.all_ingredients = set()
self.solution = set()
self.score = 0
I also implement a method add_customer
that appends the customer to the list and update the set of all ingredients with the one that this customer likes. Note that there is no need to add the ingredients that he dislikes because there is no benefit in having them on our pizza.
class Challenge:
def add_customer(self, customer):
self.customers.append(customer)
self.all_ingredients |= customer.likes
Now I need a method that computes the score of a given pizza. The score is the number of customers who would buy this pizza. To do that, we iterate over all customers and test if they would buy it with the formula that we found earlier (
and
):
class Challenge:
def evaluate(self, pizza: set[str]) -> int:
result = 0
for c in self.customers:
if c.likes & pizza == c.likes and c.dislikes & pizza == set():
result += 1
return result
Now this is where the magic occurs! In the method solve
, we implement a fitness function that takes a solution candidate as argument and returns the score. The candidate is a NumPy
binary vector of the size of the set of all ingredients and where a 1 means that this ingredient is chosen and a 0 means that the ingredient is not chosen.
We let then the PyGAD package do the job of selecting the best candidate:
class Challenge:
def solve(self, generations=100):
ingr_list = sorted(list(self.all_ingredients))
def fitness_func(solution, solution_idx):
pizza = set([ingr_list[k] for (k,v) in enumerate(solution) if v == 1])
return self.evaluate(pizza)
ga_instance = pygad.GA(
num_generations=generations,
num_parents_mating=2,
sol_per_pop=3,
num_genes=len(ingr_list),
fitness_func=fitness_func,
init_range_low=0,
init_range_high=2,
random_mutation_min_val=0,
random_mutation_max_val=2,
mutation_by_replacement=True,
gene_type=int)
ga_instance.run()
solution, solution_fitness, solution_idx = ga_instance.best_solution()
self.solution = set([ingr_list[k] for (k,v) in enumerate(solution) if v == 1])
self.score = solution_fitness
If we plot the fitness function for the difficult challenge (600 ingredients), we see that we almost reach a score of 1550 points in 2000 generations:
We then add a process
method that deal with file reading and writing and this is the final code:
from dataclasses import dataclass, field
import pygad
@dataclass
class Customer:
likes: set[str]
dislikes: set[str]
@dataclass
class Challenge:
customers: list[Customer] = field(default_factory=list)
all_ingredients: set = field(default_factory=set)
solution: set = field(default_factory=set)
score: int = 0
def reset(self):
self.customers = []
self.all_ingredients = set()
self.solution = set()
self.score = 0
def add_customer(self, customer):
self.customers.append(customer)
self.all_ingredients |= customer.likes
def evaluate(self, pizza: set[str]) -> int:
result = 0
for c in self.customers:
if (c.likes & pizza == c.likes and
c.dislikes & pizza == set()):
result += 1
return result
def solve(self, generations=100):
ingr_list = sorted(list(self.all_ingredients))
def fitness_func(solution, solution_idx):
pizza = set([ingr_list[k] for (k,v) in enumerate(solution) if v == 1])
return self.evaluate(pizza)
ga_instance = pygad.GA(
num_generations=generations,
num_parents_mating=2,
sol_per_pop=3,
num_genes=len(ingr_list),
fitness_func=fitness_func,
init_range_low=0,
init_range_high=2,
random_mutation_min_val=0,
random_mutation_max_val=2,
mutation_by_replacement=True,
gene_type=int)
ga_instance.run()
solution, solution_fitness, solution_idx = ga_instance.best_solution()
self.solution = set([ingr_list[k] for (k,v) in enumerate(solution) if v == 1])
self.score = solution_fitness
def process(self, filename, generations=100):
self.reset()
with open(f"in/{filename}.in.txt") as f:
n = int(f.readline())
for i in range(n):
customer = Customer(
set(f.readline().strip().split()[1:]),
set(f.readline().strip().split()[1:])
)
self.add_customer(customer)
print(len(self.all_ingredients))
self.solve(generations)
with open(f"out/{filename}.out-{self.score}.txt", "w") as f:
f.write(f"{len(self.solution)} ")
f.write(" ".join(self.solution))
c = Challenge()
c.process("a_an_example", 1000)
c.process("b_basic", 1000)
c.process("c_coarse", 1000)
c.process("d_difficult", 1000)
c.process("e_elaborate", 1000)
The input files are expected in the ./in
directory and you need a ./out
directory where the program can write the results.
Conclusion
We were able to solve the Hash Code 2022 Practice Challenge with just 70 lines of Python 3 code and the magic of the PyGAD. The code is not yet optimized and I am sure that we can improve the operations on sets and also tune the settings of PyGAD to get better and faster results, but this goes beyond the scope of this post.
Credit : Cover photo by Vitalii Chernopyskyi on Unsplash