Suppose we have some multi-dimensional data at the country level and we want to see the extent to which two countries are similar. One way to do this is by calculating the Cosine distance between the countries. Here you can find a Python code to do just that.
In this code, I use the SciPy library to take advantage of the built-in function cosine. This function provides the result of 1 – Cosine Proximity. This means that the results of this function range from 0 to 2, while Cosine Proximity ranges from -1 to 1.
import pandas as pd from scipy.spatial.distance import cosine datadict = { 'country': ['Argentina', 'Bolivia', 'Brazil', 'Chile', 'Ecuador', 'Colombia', 'Paraguay', 'Peru', 'Venezuela'], 'd1': [0.34, -0.19, 0.37, 1.17, -0.31, -0.3, -0.48, -0.15, -0.61], 'd2': [-0.57, -0.69, -0.28, 0.68, -2.19, -0.83, -0.53, -1, -1.39], 'd3': [-0.02, -0.55, 0.07, 1.2, -0.14, -0.85, -0.9, -0.47, -1.02], 'd4': [-0.69, -0.18, 0.05, 1.43, -0.02, -0.7, -0.72, 0.23, -1.08], 'd5': [-0.83, -0.69, -0.39, 1.31, -0.7, -0.75, -1.04, -0.52, -1.22], 'd6': [-0.45, -0.77, 0.05, 1.37, -0.1, -0.67, -1.4, -0.35, -0.89]} pairsdict = { 'country1': ['Argentina', 'Venezuela', 'Ecuador', 'Peru'], 'country2': ['Bolivia', 'Chile', 'Colombia', 'Peru']} df = pd.DataFrame(datadict) pairs = pd.DataFrame(pairsdict) #Add data to the country pairs pairs = pairs.merge(df, how='left', left_on=['country1'], right_on=['country']) pairs = pairs.merge(df, how='left', left_on=['country2'], right_on=['country']) #Convert data columns to list in a single cell pairs['vector1'] = pairs[['d1_x','d2_x','d3_x','d4_x','d5_x','d6_x']].values.tolist() pairs['vector2'] = pairs[['d1_y','d2_y','d3_y','d4_y','d5_y','d6_y']].values.tolist() cosinedf = pairs[['country1', 'country2', 'vector1', 'vector2']] #Calculate Cosine distance cosinedf['cosine_dist'] = cosinedf.apply(lambda x: round(cosine(x['vector1'], x['vector2']),2), axis=1) cosinedf = cosinedf[['country1', 'country2', 'cosine_dist']] print(cosinedf)
The df dataframe contains 6 variables for each country. The pairs dataframe contains pairs of countries that we want to compare.
In lines 21-22, we add the the 6 variables (d1–d6) to each country of the dyad. In lines 25-26 we convert the 6 columns to one column containing a list with the 6 values of variables d1–d6. Finally, in line 31 we apply the cosine function from SciPy to each pair of countries and we store the result in the new column called cosine_dist.
As a result, we get the following table:
country1, country2, cosine_dist Argentina, Bolivia, 0.26 Chile, Venezuela, 1.93 Ecuador, Colombia, 0.35 Peru, Peru, 0.00
This piece of code works well when you already have vectors of the same length, indexed by the same index. However, when you have vectors with many elements equal to zero you might have the data in a compressed format. For those cases, we need to take a longer route to calculate the cosine distance. I explain this case in the next post.
Pingback: How To / Python: Calculate Cosine Distance II/II | francisco morales