This is the second part of this post.

Suppose now that we have incomplete information for each of the countries. Or suppose some elements are equal to zero and, instead of listing them, we omit them. Either way, the vectors no longer have the same length (i.e. they are not indexed in exactly the same way).

For example, we want to calculate the cosine distance between Argentina and Chile and the vectors are:

country, var, value
Chile, d1, 1.17
Chile, d2, 0.68
Chile, d4, 1.43
Chile, d6, 1.37
Argentina, d3, -0.02
Argentina, d4, -0.69
Argentina, d5, -0.83
Argentina, d6, -0.45

Note that the data is now in long format, whereas the previous post used data in wide format. I transform the data with *pd.melt* in line 37 of the code below.

Here you can see that Chile does not have rows for variables *d3* and *d5*, and Argentina does not have rows for *d1* and *d2*. Therefore, it gets a bit tricky if we want to use the cosine function from SciPy, which expects two vectors of the same length. In the code below I define two functions to get around this and manually calculate the cosine distance.

The function *mynorm* calculates the norm of a vector. The function *mydotprod* calculates the dot product between two vectors using *pd.merge*, which gets around the fact that Argentina and Chile do not have the same set of variables: an inner merge on *var* keeps only the elements that both countries share. Later, two more merges attach each country's norm to the table of pairs.
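The reason an inner merge is enough can be sketched without pandas. This is only an illustration, not part of the post's code: each country is a plain dict keyed by variable, and the names `dot_shared` and `dot_zero_filled` are made up for the example. It assumes an omitted variable counts as zero, so it adds nothing to the dot product.

```python
# Sparse vectors from the example above.
# Assumption: a variable that is omitted is treated as zero.
chile = {"d1": 1.17, "d2": 0.68, "d4": 1.43, "d6": 1.37}
argentina = {"d3": -0.02, "d4": -0.69, "d5": -0.83, "d6": -0.45}


def dot_shared(a, b):
    """Dot product over the variables present in both dicts (like an inner merge)."""
    return sum(a[k] * b[k] for k in a.keys() & b.keys())


def dot_zero_filled(a, b):
    """Dot product after aligning both dicts on the union of variables, with gaps set to 0."""
    keys = sorted(a.keys() | b.keys())
    return sum(a.get(k, 0.0) * b.get(k, 0.0) for k in keys)


shared = dot_shared(chile, argentina)
filled = dot_zero_filled(chile, argentina)

print(sorted(chile.keys() & argentina.keys()))  # variables both countries have
print(round(shared, 4), round(filled, 4))       # both routes give the same value
```

Both routes agree because every variable missing from one country is multiplied by zero in the zero-filled version.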

```python
import pandas as pd
from math import sqrt

# Function to calculate norm of vectors
def mynorm(table):
    elements = table['value'].sort_values(ascending=False)
    vector_elements = [(value)**2 for value in elements]
    norm = sqrt(sum(vector_elements))
    return norm

# Function to calculate the dot product of vectors using pd.merge
def mydotprod(a, b):
    dfa = df2[(df2.country == a)][['var', 'value']]
    dfb = df2[(df2.country == b)][['var', 'value']]
    mergeddf = dfa.merge(dfb, how='inner', on='var')
    mergeddf['prod'] = mergeddf['value_x'] * mergeddf['value_y']
    dotprod = float(mergeddf['prod'].sum())
    return dotprod

datadict = {'country': ['Argentina', 'Bolivia', 'Brazil', 'Chile', 'Ecuador', 'Colombia', 'Paraguay', 'Peru', 'Venezuela'],
            'd1': [0.34, -0.19, 0.37, 1.17, -0.31, -0.3, -0.48, -0.15, -0.61],
            'd2': [-0.57, -0.69, -0.28, 0.68, -2.19, -0.83, -0.53, -1, -1.39],
            'd3': [-0.02, -0.55, 0.07, 1.2, -0.14, -0.85, -0.9, -0.47, -1.02],
            'd4': [-0.69, -0.18, 0.05, 1.43, -0.02, -0.7, -0.72, 0.23, -1.08],
            'd5': [-0.83, -0.69, -0.39, 1.31, -0.7, -0.75, -1.04, -0.52, -1.22],
            'd6': [-0.45, -0.77, 0.05, 1.37, -0.1, -0.67, -1.4, -0.35, -0.89]}

pairsdict = {'country1': ['Argentina', 'Ecuador'],
             'country2': ['Chile', 'Colombia']}

df = pd.DataFrame(datadict)
pairs = pd.DataFrame(pairsdict)

print(df)
print(pairs)

df1 = pd.melt(df, id_vars=['country'], var_name='var', value_name='value')
df2 = df1[(df1['country'] == 'Chile') & (df1['var'] != 'd3') & (df1['var'] != 'd5')]
df2 = pd.concat([df2, df1[(df1['country'] == 'Argentina') & (df1['var'] != 'd1') & (df1['var'] != 'd2')]])
df2 = pd.concat([df2, df1[(df1['country'] == 'Ecuador') | (df1['country'] == 'Colombia')]])

# Group variable by country in order to calculate the norm of the country's vector
df3 = df2.groupby(['country'])
dfnorm = pd.DataFrame(df3.apply(mynorm)).reset_index()
dfnorm.rename(columns={0: 'norm'}, inplace=True)

# Add the norm values to the DataFrame containing the pairs of countries
df4 = pairs.merge(dfnorm, how='left', left_on='country1', right_on='country')
df4 = df4[['country1', 'country2', 'norm']]
df4 = df4.merge(dfnorm, how='left', left_on='country2', right_on='country')
df4 = df4[['country1', 'country2', 'norm_x', 'norm_y']]

# Calculate denominator and then apply the mydotprod function to obtain the dot product
df4['denom'] = df4['norm_x'] * df4['norm_y']
df4['dotprod'] = df4.apply(lambda row: round(mydotprod(row['country1'], row['country2']), 2), axis=1)
df4['dist'] = 1 - (df4['dotprod'] / df4['denom'])
```

In lines 38-40 I modify the original data from the previous post to build *df2*, so that I have the data shown at the beginning of this post (i.e. incomplete data for Argentina and Chile).

In lines 43-45 I calculate the norm of each country's vector: I group by country and then apply the *mynorm* function.

In lines 48-51 I add the norms to the pairs of countries I want to compare.

In line 54 I calculate the denominator of the formula (the product of both norms). In line 55 I apply the *mydotprod* function to obtain the dot product, rounded to two decimals. Finally, in line 56 I divide the dot product by the product of the norms and subtract this value from 1 to obtain the cosine distance (ranging from 0 to 2).
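Written out, the calculation in lines 54-56 corresponds to the formula below. The notation is introduced here for illustration and does not come from the original code: $K_a$ and $K_b$ are the sets of variables present for countries $a$ and $b$, and $a_k$, $b_k$ are their values.

$$
d(a, b) = 1 - \frac{\sum_{k \in K_a \cap K_b} a_k b_k}{\sqrt{\sum_{k \in K_a} a_k^2}\,\sqrt{\sum_{k \in K_b} b_k^2}}
$$

The numerator runs only over the shared variables, while each norm uses every variable the country has.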

As a result, we get the following table:

country1, country2, norm_x, norm_y, denom, dotprod, dist
Argentina, Chile, 1.169573, 2.398562, 2.805292, -1.60, 1.570351
Ecuador, Colombia, 2.326414, 1.732859, 4.031346, 2.64, 0.345132

Here you can see that the distance between Ecuador and Colombia is the same as the one we got in the previous post (0.35).
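As a cross-check on the Argentina and Chile row, here is a minimal sketch in plain Python. It is not the post's pandas code: the `cosine_distance` helper, its `ndigits` parameter and the dict layout are assumptions made for illustration. Without rounding the dot product it gives about 1.5715. Rounding the dot product to two decimals first, as line 55 does, reproduces the 1.570351 in the table.

```python
from math import sqrt


def cosine_distance(a, b, ndigits=None):
    """Cosine distance between two sparse vectors stored as {variable: value} dicts.

    The dot product uses only the shared variables; each norm uses all of a
    vector's own values. If ndigits is given, the dot product is rounded first.
    """
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    if ndigits is not None:
        dot = round(dot, ndigits)
    norm_a = sqrt(sum(v ** 2 for v in a.values()))
    norm_b = sqrt(sum(v ** 2 for v in b.values()))
    return 1 - dot / (norm_a * norm_b)


# Incomplete vectors from the beginning of the post.
argentina = {"d3": -0.02, "d4": -0.69, "d5": -0.83, "d6": -0.45}
chile = {"d1": 1.17, "d2": 0.68, "d4": 1.43, "d6": 1.37}

exact = cosine_distance(argentina, chile)               # no rounding
rounded = cosine_distance(argentina, chile, ndigits=2)  # dot product rounded to 2 decimals

print(round(exact, 4))
print(round(rounded, 6))
```

The difference between the two values is small here, but it suggests rounding only the final distance if more precision is needed.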
