How To / Python: Calculate Cosine Distance II/II

This is the second part of the post on calculating cosine distance in Python.

Suppose now that we have incomplete information for each of the countries, or that some elements are equal to zero and we omit them instead of listing them. The vectors then no longer have the same length (i.e., they are not indexed in exactly the same way).

For example, we want to calculate the cosine distance between Argentina and Chile and the vectors are:

country, var, value
Chile, d1, 1.17
Chile, d2, 0.68
Chile, d4, 1.43
Chile, d6, 1.37
Argentina, d3, -0.02
Argentina, d4, -0.69
Argentina, d5, -0.83
Argentina, d6, -0.45

Note that the data is now in long format, whereas the previous post used data in wide format. I transform the data in line 37 of the code below.
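To make the long format concrete, here is a minimal sketch, separate from the post's own code, that melts a small wide table and drops the zero entries. The table, its countries (A and B) and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical wide table: one row per country, one column per variable.
wide = pd.DataFrame({
    "country": ["A", "B"],
    "d1": [1.0, 0.0],
    "d2": [0.0, 2.0],
    "d3": [3.0, 4.0],
})

# Wide -> long: one row per (country, variable) pair.
long_df = pd.melt(wide, id_vars=["country"], var_name="var", value_name="value")

# Omit the zero entries, which gives the "incomplete" long format.
sparse_long = (
    long_df[long_df["value"] != 0]
    .sort_values(["country", "var"])
    .reset_index(drop=True)
)
print(sparse_long)
```

After the filter, country A keeps only d1 and d3, and country B keeps only d2 and d3, so the two countries are no longer indexed in the same way.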

Here you can see that Chile has no rows for variables d3 and d5, and Argentina has no rows for d1 and d2. This makes it a bit tricky to use the cosine function from SciPy (scipy.spatial.distance.cosine), which expects two arrays of the same length. In the code below I define two functions to get around this and calculate the cosine distance manually.

The function mynorm calculates the norm of a vector. The function mydotprod calculates the dot product of two vectors using pd.merge, which gets around the fact that Argentina and Chile do not have exactly the same variables: the inner merge keeps only the elements that both countries share, and an element missing from either country counts as zero, so it adds nothing to the dot product. Later, I make two more merges to attach the norms to each pair of countries.
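The same idea can be sketched without pandas. The following is a minimal illustration, not the code used in this post: each vector is stored as a dictionary that lists only the elements present, and the function name sparse_cosine_distance is a hypothetical helper. It assumes neither vector has a norm of zero.

```python
from math import sqrt

def sparse_cosine_distance(u, v):
    """Cosine distance between two sparse vectors stored as {index: value} dicts."""
    # Only indices present in both vectors contribute to the dot product;
    # an index missing from either vector is treated as zero.
    shared = u.keys() & v.keys()
    dot = sum(u[k] * v[k] for k in shared)
    # Each norm uses all the elements of its own vector.
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return 1 - dot / (norm_u * norm_v)

chile = {"d1": 1.17, "d2": 0.68, "d4": 1.43, "d6": 1.37}
argentina = {"d3": -0.02, "d4": -0.69, "d5": -0.83, "d6": -0.45}

dist = sparse_cosine_distance(argentina, chile)
print(round(dist, 4))
```

This gives a distance of roughly 1.57 for Argentina and Chile. It differs in the third decimal from the value in the result table later in this post, because the post's code rounds the dot product to two decimals before dividing.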

import pandas as pd
from math import sqrt

#Function to calculate norm of vectors
def mynorm(table):
    elements = table['value'].sort_values(ascending = False)
    vector_elements = [(value)**2 for value in elements]
    norm = sqrt(sum(vector_elements))
    return norm

#Function to calculate the dot product of vectors using pd.merge (reads the global df2 built further down)
def mydotprod(a,b):
    dfa = df2[(df2.country == a)][['var','value']]
    dfb = df2[(df2.country == b)][['var','value']]
    mergeddf = dfa.merge(dfb, how = 'inner', on = 'var')
    mergeddf['prod'] = mergeddf['value_x']*mergeddf['value_y']
    dotprod = float(mergeddf['prod'].sum())
    return dotprod

datadict = {'country': ['Argentina', 'Bolivia', 'Brazil', 'Chile', 'Ecuador', 'Colombia', 'Paraguay', 'Peru', 'Venezuela'],
            'd1': [0.34, -0.19, 0.37, 1.17, -0.31, -0.3, -0.48, -0.15, -0.61],
            'd2': [-0.57, -0.69, -0.28, 0.68, -2.19, -0.83, -0.53, -1, -1.39],
            'd3': [-0.02, -0.55, 0.07, 1.2, -0.14, -0.85, -0.9, -0.47, -1.02],
            'd4': [-0.69, -0.18, 0.05, 1.43, -0.02, -0.7, -0.72, 0.23, -1.08],
            'd5': [-0.83, -0.69, -0.39, 1.31, -0.7, -0.75, -1.04, -0.52, -1.22],
            'd6': [-0.45, -0.77, 0.05, 1.37, -0.1, -0.67, -1.4, -0.35, -0.89]}

pairsdict = {'country1': ['Argentina', 'Ecuador'],
             'country2': ['Chile', 'Colombia']}

df = pd.DataFrame(datadict)
pairs = pd.DataFrame(pairsdict)

print(df)
print(pairs)

df1 = pd.melt(df, id_vars=['country'], var_name='var', value_name='value')
df2 = df1[(df1['country'] == 'Chile') & (df1['var'] != 'd3') & (df1['var'] != 'd5')]
df2 = pd.concat([df2, df1[(df1['country'] == 'Argentina') & (df1['var'] != 'd1') & (df1['var'] != 'd2')]])
df2 = pd.concat([df2, df1[(df1['country'] == 'Ecuador') | (df1['country'] == 'Colombia')]])

#Group variable by country in order to calculate the norm of the country's vector
df3 = df2.groupby(['country'])
dfnorm = pd.DataFrame(df3.apply(mynorm)).reset_index()
dfnorm.rename(columns={0: 'norm'}, inplace = True)

#Add the norm values to the DataFrame containing the pairs of countries
df4 = pairs.merge(dfnorm, how = 'left', left_on = 'country1' , right_on = 'country')
df4 = df4[['country1', 'country2', 'norm']]
df4 = df4.merge(dfnorm, how = 'left', left_on = 'country2' , right_on = 'country')
df4 = df4[['country1', 'country2', 'norm_x', 'norm_y']]

#Calculate denominator and then apply the mydotprod function to obtain the dot product
df4['denom'] = df4['norm_x'] * df4['norm_y']
df4['dotprod'] = df4.apply(lambda row: round(mydotprod(row['country1'], row['country2']),2), axis=1)
df4['dist'] = 1 - (df4['dotprod'] / df4['denom'])

 

In lines 38-40 I modify the original data from the previous post so that it matches the data shown at the beginning of this post (i.e., incomplete data for Argentina and Chile).

In lines 43-45 I calculate the norm of each country's vector. I group by country and then apply the mynorm function.

In lines 48-51 I add the norms to the pairs of countries I want to compare.

In line 54 I calculate the denominator of the formula (the product of both norms). In line 55 I apply the mydotprod function to obtain the dot product, rounded to two decimals. Finally, in line 56 I divide the dot product by the product of the norms and subtract this value from 1 to obtain the cosine distance (ranging from 0 to 2).
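As a cross-check on these steps, one alternative is to pivot the long data back to wide format, fill the missing elements with zeros, and apply the formula directly. This is only a sketch under that zero-fill assumption, using NumPy, which the code above does not use; it covers just the Argentina and Chile rows from the beginning of this post.

```python
import numpy as np
import pandas as pd

# Long-format data for Chile and Argentina, as listed at the start of the post.
long_df = pd.DataFrame({
    "country": ["Chile"] * 4 + ["Argentina"] * 4,
    "var": ["d1", "d2", "d4", "d6", "d3", "d4", "d5", "d6"],
    "value": [1.17, 0.68, 1.43, 1.37, -0.02, -0.69, -0.83, -0.45],
})

# Long -> wide: variables a country lacks become NaN, which we replace with 0.
wide = long_df.pivot(index="country", columns="var", values="value").fillna(0.0)

a = wide.loc["Argentina"].to_numpy()
c = wide.loc["Chile"].to_numpy()

# Cosine distance: 1 - (a . c) / (|a| * |c|)
dist = 1 - a.dot(c) / (np.linalg.norm(a) * np.linalg.norm(c))
print(round(float(dist), 4))
```

Because the zero-filled elements add nothing to either the dot product or the norms, this should agree with the merge-based calculation up to the rounding applied in line 55.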

As a result, we get the following table:

country1, country2, norm_x, norm_y, denom, dotprod, dist
Argentina, Chile, 1.169573, 2.398562, 2.805292, -1.60, 1.570351
Ecuador, Colombia, 2.326414, 1.732859, 4.031346, 2.64, 0.345132

Here you can see that the distance between Ecuador and Colombia matches the one we got in the previous post (0.35).
