Using Probability and Statistics to Predict Sportive Results

Lisandra Melo - Aug 13 '20 - Dev Community

This article is an English translation of my article, which was written in Brazilian Portuguese and posted on my dev.to profile.

Initial Considerations

In this article we will use mathematical concepts such as Expected Value and Probability Distribution. If you don't know much about them, you should still be able to follow everything done in the article. If you want to learn more, I recommend the Khan Academy website, especially the modules on Probability Distribution and Average - Expected Value; they contain short and very clear videos about these concepts.

Introduction to the Project

In this project, we will use probability and statistics to predict the results of football matches. For this, we will use Python and its NumPy library, along with concepts of probability and statistics.

The process is as follows. We will read a file containing the results of all AFC Ajax matches in the Dutch football league (Eredivisie) during the 18/19 season. Then, for each round, we will predict the score of the team's next match. This prediction consists of the Expected Value (EV) of goals scored by the club and the EV of goals conceded.

What Happens If We Guess Random Values?

For future reference, we will look at what would happen if we tried to guess the results of the matches using random values.

We will consider that Ajax can score from 0 (the minimum number of goals scored in a match in our data) to 8 (the maximum number scored by Ajax in a match in our data) goals, a total of 9 possible results, and concede from 0 (the recorded minimum) to 6 (the recorded maximum) goals, a total of 7 possibilities. Getting a match score prediction right means getting both the number of goals scored and the number of goals conceded right, so if we choose both values uniformly at random within these intervals, and independently of each other, we will have:

$$
P(goalsScored) = \frac{1}{9} \newline
P(goalsAllowed) = \frac{1}{7} \newline
P(matchScore) = P(goalsScored) \cdot P(goalsAllowed) = \frac{1}{9} \cdot \frac{1}{7} = \frac{1}{63} \approx 0.0159
$$

Each team in the Dutch league plays a total of 34 matches. We will not make a prediction for the first round, as we have no previous data to base it on. That leaves 33 matches to predict, and multiplying 33 by the probability of getting a match score right gives an expected value of about 0.5238 correct scores. This means that, using random values, we expect to get the exact score right in fewer than one of the 33 matches analyzed. For the number of goals scored in a match, the expected number of correct guesses is 3.6667 (33 * 1/9), and for goals conceded it is 4.7143 (33 * 1/7).
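The baseline figures above can be reproduced with a few lines of Python. This is only a minimal sketch: the variable names are chosen here for illustration and are not part of the project code, and it assumes the two guesses are uniform and independent, as described above.

```python
from fractions import Fraction

# Ranges taken from the text: 0-8 goals scored (9 outcomes), 0-6 conceded (7 outcomes)
p_goals_scored = Fraction(1, 9)
p_goals_conceded = Fraction(1, 7)

# Both numbers are guessed independently, so the probabilities multiply
p_match_score = p_goals_scored * p_goals_conceded

predicted_matches = 33  # 34 rounds minus the first one, which has no prior data

expected_right_scores = float(predicted_matches * p_match_score)
expected_right_scored = float(predicted_matches * p_goals_scored)
expected_right_conceded = float(predicted_matches * p_goals_conceded)

print(f"P(matchScore) = {p_match_score} = {float(p_match_score):.4f}")
print(f"Expected right match scores:   {expected_right_scores:.4f}")
print(f"Expected right goals scored:   {expected_right_scored:.4f}")
print(f"Expected right goals conceded: {expected_right_conceded:.4f}")
```

Using `Fraction` keeps the intermediate values exact (1/63) and converts to a decimal only for display.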

So let's try to improve these values (which are very low) using math and programming.

Project Implementation

To create our project, we will first create our scores file. Each line of this file holds the result of one match, written in the following format:

goalsscored,goalsconceded

For example, if Ajax scored 4 goals and conceded 2 in a match we will have in the file:

4,2

This file will be named resultados.txt, and it is available in the project repository.

Now we are going to start the coding part of our project! We will begin by importing the necessary library.

import numpy as np

Then we will open our scores file.

# Opening the file with our scores
fileResults = open("resultados.txt", "r")

After opening the file, we will insert its contents into a list called matchesScores using a list comprehension, a compact Python syntax for building a list from an iterable. With this tool, we can loop over the values and fill a list within a single line of code.

At the end of the iteration, we will close the file (resultados.txt) that was opened at the beginning of our code.

# Declaring our score list
matchesScores = []

# The for loop will work with one line of the file in each iteration
for lineofFile in fileResults:
    # The next line of code will add the contents of a file line.
    # Inside the brackets we have a list comprehension, which
    # does the exact same work as the following code:
    #     scores = []
    #     for x in lineofFile.split(","):
    #         scores.append(int(x))
    #     matchesScores.append(scores)
    matchesScores.append([int(x) for x in lineofFile.split(",")])

# Then we will close our file
fileResults.close()

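As a possible variation on the file-reading step, the same parsing can be wrapped in a function that uses a `with` block, so the file is closed automatically even if a line fails to parse. This is a sketch, not the code used in the project: the function name `load_scores` is hypothetical, the sample data is made up, and skipping blank lines is an added assumption.

```python
import os
import tempfile


def load_scores(path):
    """Read 'goalsscored,goalsconceded' lines into a list of [scored, conceded] pairs."""
    with open(path, "r") as file_results:
        # Skip blank lines so a trailing empty line does not break int()
        return [
            [int(x) for x in line.split(",")]
            for line in file_results
            if line.strip()
        ]


# Illustrative data only, not the real Ajax results
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("4,2\n1,0\n\n")
    sample_path = tmp.name

sample_scores = load_scores(sample_path)
os.remove(sample_path)
print(sample_scores)  # [[4, 2], [1, 0]]
```

In the project itself the call would presumably be `load_scores("resultados.txt")`.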

Now we will start analyzing the data obtained. But first, we will initialize some variables that will store our formatted data.

# We will declare two lists, one containing the goals scored and one with the goals conceded
goals_scored = []
goals_conceded = []

# We will count how many times we got the goals scored, the goals conceded and both of them right
right_round = 0
right_goals_scored = 0
right_goals_conceded = 0

We will then iterate through the entire matchesScores list, separating the values it contains in goals scored and conceded and then calculating the expected value of each of these categories to calculate a score prediction for the next round.

To do this, we will obtain the frequency of each number of goals, that is, how many times the team has scored 0 goals, 1 goal, 2 goals, and so on. We will do the same with the goals conceded. With the frequency of each number of goals, we will have the data to calculate our expected value.

For example, we can have a frequency like the one shown in the graph below (This is not the actual frequency of the data).

Figure: Example of what the frequency could look like
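To make the idea of a frequency count concrete, here is a small sketch using `np.unique` with `return_counts=True`, the same call used in the project code. The list of goals is made up for illustration and is not the real Ajax data.

```python
import numpy as np

# Made-up goals from six matches, for illustration only
example_goals = [2, 0, 3, 2, 1, 2]

# values holds the sorted distinct goal counts, counts how often each occurred
values, counts = np.unique(example_goals, return_counts=True)

# Convert NumPy integers to plain Python ints for a cleaner printout
example_frequency = {int(v): int(c) for v, c in zip(values, counts)}
print(example_frequency)  # {0: 1, 1: 1, 2: 3, 3: 1}
```

Here the team scored 2 goals in three of the six matches, so 2 is the most frequent value.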

To define the goals scored and conceded we will code:

"""
We will go through our list of scores per round
and calculate the expected value of goals scored
and conceded for each round,
we will predict with these values and
then we will check if these values correspond
to the result that happened in the match.
"""
for round in range(len(matchesScores)):
    goals_scored.append(matchesScores[round][0])
    goals_conceded.append(matchesScores[round][1])

    # Now we will get the frequency of the number of goals scored so far
    num_goals, freq_num_goals = np.unique(goals_scored, return_counts=True)
    # For organizational reasons, we will transform our values into a dictionary 'goals': frequency
    dic_goals_scored = dict(zip(num_goals, freq_num_goals))

    # We will do the same with the goals conceded
    num_goals, freq_num_goals = np.unique(goals_conceded, return_counts=True)
    # For organizational reasons, we will transform our values into a dictionary 'goals': frequency
    dic_goals_conceded = dict(zip(num_goals, freq_num_goals))

After that, we will calculate the expected value of the goals, that is, the number of goals expected in the next match given the results of the previous rounds. To calculate this value, we multiply each key in the dictionary (a number of goals) by its probability of occurrence (its frequency divided by the number of rounds played so far) and add up the products, which gives us our expected value.

    expected_scored=0
    for goal in dic_goals_scored.keys():
        expected_scored += goal*(dic_goals_scored[goal]/len(goals_scored))

    expected_conceded=0 
    for goal in dic_goals_conceded:
        expected_conceded += goal*(dic_goals_conceded[goal]/len(goals_conceded))
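The same weighted sum can also be written with NumPy arrays instead of a dictionary loop. The sketch below uses made-up data and illustrative variable names. It also shows that, because the probabilities are the observed frequencies, the expected value computed this way is simply the average of the goals so far.

```python
import numpy as np

# Made-up goals from six matches, for illustration only
example_goals = [2, 0, 3, 2, 1, 2]

values, counts = np.unique(example_goals, return_counts=True)

# Probability of each number of goals = frequency / number of matches
probabilities = counts / counts.sum()

# Expected value = sum of (value * probability)
expected_value = float(np.sum(values * probabilities))
sample_mean = float(np.mean(example_goals))

print(f"{expected_value:.4f}")  # 1.6667
print(f"{sample_mean:.4f}")     # 1.6667
```

Keeping the explicit frequency step, as the article does, makes the link to the probability distribution visible, even though `np.mean` would give the same number.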

After calculating our expected values, we will print our prediction and compare it with the result of the next round to see if we got the result of the match, the number of goals scored and the number of goals conceded right with our prediction.

    # After calculating our prediction we will print it and compare to the real result

    # The next line will round our values to the closest integer
    expected_scored = int(np.around(expected_scored))
    expected_conceded = int(np.around(expected_conceded))

    """
    If we are in the last round we have no future round
    to predict so we will stop our iteration
    """
    if (round+1 == len(matchesScores)):
        break
    """
    Now we will print our expected value for the next round
     as lists start at number 0 we have to add
     1 to the round value to get the round currently being read,
     that is, we have to add 2 to the number of the `round`
     to get the value of the NEXT round.
    """
    print(f'At the {round+2} round we predicted a result of Ajax  {expected_scored} x {expected_conceded} opponent')
    print(f'At the {round+2} we got a result of Ajax  {matchesScores[round+1][0]} x {matchesScores[round+1][1]} opponent')

    # We will check the results
    if(expected_scored==matchesScores[round+1][0] and expected_conceded==matchesScores[round+1][1]):
        right_round += 1
    if(expected_scored==matchesScores[round+1][0]):
        right_goals_scored += 1
    if(expected_conceded==matchesScores[round+1][1]):
        right_goals_conceded += 1
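One detail of the rounding step is worth knowing: `np.around` rounds values that lie exactly halfway between two integers to the nearest even integer, not always upward. The sketch below illustrates this with a few sample values; the `half_up` alternative is shown only for comparison and is not what the project code does. Whether this changes any prediction depends on the data.

```python
import numpy as np

halves = [0.5, 1.5, 2.5]

# np.around rounds exact halves to the nearest even integer
half_to_even = [int(np.around(v)) for v in halves]

# Hypothetical alternative for comparison: always round halves up
half_up = [int(np.floor(v + 0.5)) for v in halves]

print(half_to_even)  # [0, 2, 2]
print(half_up)       # [1, 2, 3]
```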

After the loop execution, we will check our number of right guesses.

# We will print the results
print("We got {0:1d} of the matches results right, this is, {1:2.2f}%".format(right_round, (right_round/33)*100))

print("We got {0:1d} of the goals scored in a match right, this is, {1:2.2f}%".format(right_goals_scored, (right_goals_scored/33)*100))

print("We got {0:1d} of the goals conceded in a match right, this is, {1:2.2f}%".format(right_goals_conceded, (right_goals_conceded/33)*100))


The output of our program will look like this:

> At the 2 round we predicted a result of Ajax  1 x 1 opponent
> At the 2 we got a result of Ajax  1 x 0 opponent
...
> At the 34 round we predicted a result of Ajax  3 x 1 opponent
> At the 34 we got a result of Ajax  4 x 1 opponent
> We got 4 of the matches results right, this is, 12.12%
> We got 7 of the goals scored in a match right, this is, 21.21%
> We got 15 of the goals conceded in a match right, this is, 45.45%

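Note that we got the complete match score right 4 times, roughly 8 times more than expected with random values (0.5238). We also got the number of goals scored right 7 times, about 2 times the random expectation (3.6667), and the number of goals conceded right 15 times, about 3 times the random expectation (4.7143).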
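The comparison with random guessing can be checked with a short calculation. This sketch takes the hit counts from the program output and the random-guess probabilities from the beginning of the article; the dictionary names are illustrative.

```python
predicted_matches = 33

# Correct predictions reported by the program
hits = {"match score": 4, "goals scored": 7, "goals conceded": 15}

# Probability of a correct random guess for each category
chance = {"match score": 1 / 63, "goals scored": 1 / 9, "goals conceded": 1 / 7}

# Ratio between our hits and the hits expected from random guessing
improvement = {
    name: hits[name] / (predicted_matches * chance[name]) for name in hits
}

for name, ratio in improvement.items():
    print(f"{name}: {ratio:.2f}x the random baseline")
# match score: 7.64x the random baseline
# goals scored: 1.91x the random baseline
# goals conceded: 3.18x the random baseline
```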

The use of expected values helped a lot to improve our number of correct guesses. This shows how powerful simple concepts of probability and statistics can be in data analysis.

The program developed in this article is available in my GitLab repository. I hope this has helped you in some way. If you have any problems or questions, feel free to leave a comment on this post or send me an email ;).
