I'd like to get some help understanding vectorized operations on multi-dimensional arrays. Specifically, I've got a problem and some code that I think should work, but it's not working, and I'm sure it's because my thinking is wrong, but I can't figure out why.
Some caveats:
- This is for some homework. I really don't want a plop of code that I'm supposed to copy/paste without understanding. If I wanted that, I'd go to StackOverflow. I want the concepts.
- I want to do this using only `numpy`. I know that `scipy` and other ML libraries have fancy functions that would do what I'm asking about in a black box, but that's not what I want. This is a learning exercise.
The Scenario
I've got two datasets of Iris data (yes, that Iris data)--a training set and a test set. Both sets have 4 columns of float values, and an associated vector of labels classifying each of the data points.
    5.1,3.5,1.4,0.2,Iris-setosa
    4.9,3,1.4,0.2,Iris-setosa
    7,3.2,4.7,1.4,Iris-versicolor
    ...
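For completeness, here's roughly how I'm loading each file. The file name below is just a placeholder for my actual path; the idea is to read the four float columns and the string label column separately:

    import numpy as np

    # 'iris_train.csv' is a placeholder name for my actual training file
    # Features: the first four comma-separated columns, parsed as floats
    training_data = np.genfromtxt('iris_train.csv', delimiter=',',
                                  usecols=(0, 1, 2, 3))
    # Labels: the fifth column, kept as strings
    training_labels = np.genfromtxt('iris_train.csv', delimiter=',',
                                    usecols=(4,), dtype=str)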
We're doing a 1-Nearest-Neighbor classification. The goal is to do the following:
- For each data point in the testing set, compare it to all the points in the training set by calculating the "distance" between the two. Distance is calculated as

      import math

      def distance(x, y):
          return math.sqrt((x[0] - y[0])**2 + (x[1] - y[1])**2 +
                           (x[2] - y[2])**2 + (x[3] - y[3])**2)

  Also known as the root-sum-square of the feature-wise differences between the two points, i.e., plain Euclidean distance.
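So, for example, the distance between the first two sample rows above would be sqrt((5.1 - 4.9)**2 + (3.5 - 3)**2 + (1.4 - 1.4)**2 + (0.2 - 0.2)**2) = sqrt(0.29) ≈ 0.539.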
Now. Here's what I have so far:
My Code
    import numpy as np

    def distance(x, y):
        return np.sqrt(np.sum((x - y)**2, axis=1))
    def main():
        # ... blah blah load data
        # training_data is 75 rows x 4 cols of floats
        # testing_data is 75 rows x 4 cols of floats
        # training_labels is 75 rows x 1 col of strings
        # testing_labels is 75 rows x 1 col of strings

        # My thought is to use "broadcasting" to do it without loops
        # (so far, to me, "broadcasting" == "magic")
        training_data = training_data.reshape((1, 4, 75))
        testing_data = testing_data.reshape((75, 4, 1))

        # So this next bit should work like magic, producing a 75 x 1 x 75
        # matrix of distances between the testing data (row indices) and the
        # training data (column indices)
        distances = distance(testing_data, training_data)

        # And the column index of the minimum distance should in theory be the
        # index of the training point that is the "closest" to the given
        # testing point for that row
        closest_indices = distances.argmin(axis=1)

        # And this should build an array of labels corresponding to the
        # indices gathered above
        predicted_labels = training_labels[closest_indices]

        number_correct = np.sum(predicted_labels == testing_labels)
        accuracy = number_correct / len(testing_labels)
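To show where my head is at with broadcasting, here's a tiny shape-only sanity check I ran with made-up numbers (2 test points, 3 training points, 4 features), and the shapes seem to do what I expect:

    import numpy as np

    # Made-up toy arrays, constructed directly in the "broadcastable" shapes
    test = np.arange(8.0).reshape((2, 4, 1))    # 2 test points, 4 features each
    train = np.arange(12.0).reshape((1, 4, 3))  # 3 training points, 4 features each

    diff = test - train                        # broadcasts to shape (2, 4, 3)
    dists = np.sqrt(np.sum(diff**2, axis=1))   # summing out the feature axis leaves (2, 3)

    print(diff.shape)            # (2, 4, 3)
    print(dists.shape)           # (2, 3): one distance per (test, train) pair
    print(dists.argmin(axis=1))  # shape (2,): nearest training index per test point

One thing I noticed: np.sum drops the summed axis, so the result is (2, 3) rather than (2, 1, 3), but rows still index test points and columns still index training points, which matches my argmin(axis=1) above. So the shapes at least seem to line up the way I expected.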
And this all seems right to me.
But.
When I run it, per the assignment prompt, I should see an accuracy somewhere in the .94 range, and I'm getting something in the .33 range. Which is poop.
So. What am I missing? What key concepts am I totally misunderstanding?
Thank you!