How To / Python: Calculate Cosine Distance I/II

Suppose we have some multi-dimensional data at the country level and we want to see how similar two countries are. One way to do this is to calculate the Cosine distance between the two countries' vectors. Below you can find Python code to do just that.

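In this code, I use the SciPy library to take advantage of its built-in function cosine. This function returns 1 – Cosine Proximity (Cosine Proximity is also commonly called cosine similarity). This means that the results of this function range from 0 to 2, while Cosine Proximity ranges from -1 to 1. A distance of 0 means the two vectors point in the same direction, 1 means they are orthogonal, and 2 means they point in opposite directions.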
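
To make those ranges concrete, here is a minimal sketch that calls cosine on a few toy vectors. The vectors and variable names are my own illustration, not part of the country data used in the rest of this post.

```python
from scipy.spatial.distance import cosine

# Toy vectors, chosen only for illustration
a = [1.0, 2.0]
same_direction = [2.0, 4.0]   # a scaled by 2
orthogonal = [-2.0, 1.0]      # dot product with a is 0
opposite = [-1.0, -2.0]       # a scaled by -1

d_same = cosine(a, same_direction)   # approximately 0
d_orth = cosine(a, orthogonal)       # approximately 1
d_opp = cosine(a, opposite)          # approximately 2

print(d_same, d_orth, d_opp)
```

Note that the first pair has a distance of roughly 0 even though the vectors have different lengths: Cosine distance only looks at direction, not magnitude.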

import pandas as pd
from scipy.spatial.distance import cosine

datadict = {
'country': ['Argentina', 'Bolivia', 'Brazil', 'Chile', 'Ecuador', 'Colombia', 'Paraguay', 'Peru', 'Venezuela'],
'd1': [0.34, -0.19, 0.37, 1.17, -0.31, -0.3, -0.48, -0.15, -0.61],
'd2': [-0.57, -0.69, -0.28, 0.68, -2.19, -0.83, -0.53, -1, -1.39],
'd3': [-0.02, -0.55, 0.07, 1.2, -0.14, -0.85, -0.9, -0.47, -1.02],
'd4': [-0.69, -0.18, 0.05, 1.43, -0.02, -0.7, -0.72, 0.23, -1.08],
'd5': [-0.83, -0.69, -0.39, 1.31, -0.7, -0.75, -1.04, -0.52, -1.22],
'd6': [-0.45, -0.77, 0.05, 1.37, -0.1, -0.67, -1.4, -0.35, -0.89]}

pairsdict = {
'country1': ['Argentina', 'Venezuela', 'Ecuador', 'Peru'],
'country2': ['Bolivia', 'Chile', 'Colombia', 'Peru']}

df = pd.DataFrame(datadict)
pairs = pd.DataFrame(pairsdict) 

#Add data to the country pairs
pairs = pairs.merge(df, how='left', left_on=['country1'], right_on=['country'])
pairs = pairs.merge(df, how='left', left_on=['country2'], right_on=['country'])

#Convert data columns to list in a single cell
pairs['vector1'] = pairs[['d1_x','d2_x','d3_x','d4_x','d5_x','d6_x']].values.tolist()
pairs['vector2'] = pairs[['d1_y','d2_y','d3_y','d4_y','d5_y','d6_y']].values.tolist()

cosinedf = pairs[['country1', 'country2', 'vector1', 'vector2']].copy()  # .copy() avoids SettingWithCopyWarning

# Calculate Cosine distance for each pair, rounded to 2 decimals
cosinedf['cosine_dist'] = cosinedf.apply(lambda x: round(cosine(x['vector1'], x['vector2']), 2), axis=1)

cosinedf = cosinedf[['country1', 'country2', 'cosine_dist']]

The df dataframe contains 6 variables for each country. The pairs dataframe contains pairs of countries that we want to compare.

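In lines 21-22 (the two merge calls), we add the 6 variables (d1 to d6) to each country of the dyad. Because both merges bring in columns with the same names, pandas appends the suffixes _x (for country1) and _y (for country2). In lines 25-26, we convert the 6 columns for each country into a single column containing a list with the 6 values of variables d1 to d6. Finally, in line 31, we apply the cosine function from SciPy to each pair of countries and store the result in a new column called cosine_dist.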
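
If the _x and _y column names look mysterious, the following reduced sketch may help. It repeats the two merges with a single pair and a single variable; the dataframe names left and data are hypothetical stand-ins for pairs and df.

```python
import pandas as pd

# One pair and one variable, just to show the column names after merging
left = pd.DataFrame({'country1': ['Argentina'], 'country2': ['Bolivia']})
data = pd.DataFrame({'country': ['Argentina', 'Bolivia'], 'd1': [0.34, -0.19]})

# First merge attaches country1's data; second attaches country2's data.
# Columns that clash in the second merge get the default suffixes _x and _y.
merged = left.merge(data, how='left', left_on='country1', right_on='country')
merged = merged.merge(data, how='left', left_on='country2', right_on='country')

print(list(merged.columns))
print(merged[['country1', 'd1_x', 'country2', 'd1_y']])
```

Here d1_x holds the value for country1 and d1_y the value for country2, which is why the full code selects the _x columns for vector1 and the _y columns for vector2.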

As a result, we get the following table:

country1, country2, cosine_dist
Argentina, Bolivia, 0.26
Venezuela, Chile, 1.93
Ecuador, Colombia, 0.35
Peru, Peru, 0.00
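
As a sanity check, the first row of the table can be reproduced without SciPy's helper by applying the definition directly: 1 minus the dot product divided by the product of the two vector norms. This is only a sketch with NumPy; the values are copied from the datadict above, and the variable names are my own.

```python
import numpy as np
from scipy.spatial.distance import cosine

# d1..d6 for Argentina and Bolivia, copied from datadict
argentina = np.array([0.34, -0.57, -0.02, -0.69, -0.83, -0.45])
bolivia = np.array([-0.19, -0.69, -0.55, -0.18, -0.69, -0.77])

# Cosine Proximity: dot product divided by the product of the norms
similarity = argentina @ bolivia / (np.linalg.norm(argentina) * np.linalg.norm(bolivia))
manual_dist = float(1 - similarity)

print(round(manual_dist, 2))                          # 0.26, as in the table
print(round(float(cosine(argentina, bolivia)), 2))    # same value from SciPy
```

The Peru, Peru row works as a second check: a vector compared with itself has a Cosine Proximity of 1, so its distance is 0.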

This piece of code works well when you already have vectors of the same length, indexed by the same index. However, when your vectors have many elements equal to zero, you might have the data in a compressed (sparse) format. In those cases, we need to take a longer route to calculate the cosine distance. I explain this case in the next post.
