Monthly Archives: March 2017

How To / Python: Get geographic coordinates using Google (Geocode)

Thanks to pygeocoder, obtaining geographic coordinates from Google (geocoding) is very simple. The short Python script below does exactly that.

from pygeocoder import Geocoder
import pandas as pd

cities = {
'City': ['Coral Gables', 'Hawarden', 'Chapeco', 'Merida', 'Assisi'],
'Nation': ['United States', 'United Kingdom', 'Brazil', 'Spain', 'Italy']}

df = pd.DataFrame(cities)
info = pd.DataFrame([])

for index, row in df.iterrows():

    city = str(row['City'])
    nation = str(row['Nation'])
    location = city + ', ' + nation

    result = Geocoder.geocode(location)
    coords = str(result[0].coordinates)

    # Strip the parentheses (str.maketrans is the Python 3 way to do this)
    chars_to_remove = ['(', ')']
    coords = coords.translate(str.maketrans('', '', ''.join(chars_to_remove)))

    lat = float(coords.split(',')[0])
    long = float(coords.split(',')[1])

    line = pd.Series([city, nation, lat, long])

    info = pd.concat([info, line.to_frame().T], ignore_index=True)

For each row, I build the ‘address’ I want to geocode, in the format ‘City, Country’. I then pass it to Geocoder.geocode to obtain the coordinates, and extract the latitude and longitude by cleaning up and splitting the string form of the result.

The result is a list of cities with their respective latitude and longitude.

Coral Gables, United States, 25.721490, -80.268384
Hawarden, United Kingdom, 53.185690, -3.029807
Chapeco, Brazil, -27.100934, -52.615699
Merida, Spain, 38.939427, -6.365157
Assisi, Italy, 43.070702, 12.619597

If you are dealing with a large number of cities, you may want to add a couple of lines to handle exceptions (e.g. when pygeocoder cannot find a city). You could wrap the call in try and except and, on failure, set the lat and long variables to a sentinel value far outside the range of valid geographic coordinates (not pretty, but effective!).
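As a rough sketch of that idea (the helper name, the sentinel value, and the stand-in geocoder below are all made up for illustration; with pygeocoder you would pass something like lambda loc: Geocoder.geocode(loc)[0].coordinates as the callable):

```python
# Sentinel clearly outside the valid coordinate ranges
SENTINEL = (-9999.0, -9999.0)

def safe_coordinates(location, geocode_fn):
    """Return (lat, lon) from geocode_fn, or SENTINEL if the lookup fails."""
    try:
        return geocode_fn(location)
    except Exception:
        return SENTINEL

# Stand-in geocoder that only knows one city, to show both paths
def fake_geocoder(location):
    known = {'Assisi, Italy': (43.070702, 12.619597)}
    return known[location]  # raises KeyError for unknown cities

print(safe_coordinates('Assisi, Italy', fake_geocoder))    # (43.070702, 12.619597)
print(safe_coordinates('Atlantis, Ocean', fake_geocoder))  # (-9999.0, -9999.0)
```

After the loop you can filter out the sentinel rows before using the coordinates.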

You can also check Chris Albon’s website; he goes into more detail, and I probably borrowed from his code when writing this script.


How To / Python: Calculate Cosine Distance II/II

This is the second part of a two-part post.

Suppose now that we have incomplete information for each of the countries, or that some elements are equal to zero and are simply omitted instead of listed. Either way, we no longer have vectors of the same length (i.e. vectors indexed in exactly the same way).

For example, we want to calculate the cosine distance between Argentina and Chile and the vectors are:

country, var, value
Chile, d1, 1.17
Chile, d2, 0.68
Chile, d4, 1.43
Chile, d6, 1.37
Argentina, d3, -0.02
Argentina, d4, -0.69
Argentina, d5, -0.83
Argentina, d6, -0.45

Note that the data is now in long format, whereas the previous post used data in wide format. The melt call in the code below performs this transformation.

Here you can see that Chile does not have rows for variables d3 and d5, and Argentina does not have rows for d1 and d2. Therefore, it gets a bit tricky if we want to use the cosine function from SciPy. In the code below I define two functions to get around this and calculate the cosine distance manually.

Function mynorm calculates the norm of a vector. Function mydotprod calculates the dot product between two vectors using pd.merge; the inner merge keeps only the elements that both Argentina and Chile share, which gets around the fact that their vectors are not indexed identically.

import pandas as pd
from math import sqrt

#Function to calculate norm of vectors
def mynorm(table):
    vector_elements = [value ** 2 for value in table['value']]
    norm = sqrt(sum(vector_elements))
    return norm

#Function to calculate the dot product of vectors using pd.merge
def mydotprod(a,b):
    dfa = df2[(df2.country == a)][['var','value']]
    dfb = df2[(df2.country == b)][['var','value']]
    mergeddf = dfa.merge(dfb, how = 'inner', on = 'var')
    mergeddf['prod'] = mergeddf['value_x']*mergeddf['value_y']
    dotprod = float(mergeddf['prod'].sum())
    return dotprod

datadict = {'country': ['Argentina', 'Bolivia', 'Brazil', 'Chile', 'Ecuador', 'Colombia', 'Paraguay', 'Peru', 'Venezuela'],
            'd1': [0.34, -0.19, 0.37, 1.17, -0.31, -0.3, -0.48, -0.15, -0.61],
            'd2': [-0.57, -0.69, -0.28, 0.68, -2.19, -0.83, -0.53, -1, -1.39],
            'd3': [-0.02, -0.55, 0.07, 1.2, -0.14, -0.85, -0.9, -0.47, -1.02],
            'd4': [-0.69, -0.18, 0.05, 1.43, -0.02, -0.7, -0.72, 0.23, -1.08],
            'd5': [-0.83, -0.69, -0.39, 1.31, -0.7, -0.75, -1.04, -0.52, -1.22],
            'd6': [-0.45, -0.77, 0.05, 1.37, -0.1, -0.67, -1.4, -0.35, -0.89]}

pairsdict = {'country1': ['Argentina', 'Ecuador'],
             'country2': ['Chile', 'Colombia']}

df = pd.DataFrame(datadict)
pairs = pd.DataFrame(pairsdict)

print(df)
print(pairs)

df1 = pd.melt(df, id_vars=['country'], var_name='var', value_name='value')
df2 = df1[(df1['country'] == 'Chile') & (df1['var'] != 'd3') & (df1['var'] != 'd5')]
df2 = pd.concat([df2, df1[(df1['country'] == 'Argentina') & (df1['var'] != 'd1') & (df1['var'] != 'd2')]])
df2 = pd.concat([df2, df1[(df1['country'] == 'Ecuador') | (df1['country'] == 'Colombia')]])

#Group variable by country in order to calculate the norm of the country's vector
df3 = df2.groupby(['country'])
dfnorm = pd.DataFrame(df3.apply(mynorm)).reset_index()
dfnorm.rename(columns={0: 'norm'}, inplace = True)

#Add the norm values to the DataFrame containing the pairs of countries
df4 = pairs.merge(dfnorm, how = 'left', left_on = 'country1' , right_on = 'country')
df4 = df4[['country1', 'country2', 'norm']]
df4 = df4.merge(dfnorm, how = 'left', left_on = 'country2' , right_on = 'country')
df4 = df4[['country1', 'country2', 'norm_x', 'norm_y']]

#Calculate denominator and then apply the mydotprod function to obtain the dot product
df4['denom'] = df4['norm_x'] * df4['norm_y']
df4['dotprod'] = df4.apply(lambda row: round(mydotprod(row['country1'], row['country2']),2), axis=1)
df4['dist'] = 1 - (df4['dotprod'] / df4['denom'])

The df2 lines after the melt modify the original data from the previous post so that it matches the incomplete data for Argentina and Chile shown at the beginning of this post.

Next, I calculate the norm of each country’s vector: I group the long data by country and apply the mynorm function.

Then, with two merges against dfnorm, I attach both norms to each pair of countries I want to compare.

Finally, I calculate the denominator of the formula (the product of the two norms) and apply the mydotprod function to obtain the dot product. Dividing the dot product by the denominator and subtracting the result from 1 gives the cosine distance (ranging from 0 to 2).

As a result, we get the following table:

country1, country2, norm_x, norm_y, denom, dotprod, dist
Argentina, Chile, 1.169573, 2.398562, 2.805292, -1.60, 1.570351
Ecuador, Colombia, 2.326414, 1.732859, 4.031346, 2.64, 0.345132

Here you can see that the distance between Ecuador and Colombia is the same as the one we got in the previous post (0.35).
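If you want to double-check the Argentina-Chile figure without pandas, the same arithmetic can be reproduced in a few lines of plain Python (the numbers below are copied from the long table at the top of this post):

```python
from math import sqrt

# Chile's available variables (d1, d2, d4, d6) and
# Argentina's (d3, d4, d5, d6), from the long table above
chile = {'d1': 1.17, 'd2': 0.68, 'd4': 1.43, 'd6': 1.37}
argentina = {'d3': -0.02, 'd4': -0.69, 'd5': -0.83, 'd6': -0.45}

# Each norm uses all of a country's own elements
norm_c = sqrt(sum(v ** 2 for v in chile.values()))
norm_a = sqrt(sum(v ** 2 for v in argentina.values()))

# The dot product uses only the shared variables (d4 and d6),
# which is exactly what the inner merge in mydotprod achieves
shared = chile.keys() & argentina.keys()
dot = round(sum(chile[k] * argentina[k] for k in shared), 2)  # -1.60, as in the table

dist = 1 - dot / (norm_c * norm_a)
print(round(dist, 6))  # matches the Argentina-Chile dist above
```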

How To / Python: Calculate Cosine Distance I/II

Suppose we have some multi-dimensional data at the country level and we want to see the extent to which two countries are similar. One way to do this is by calculating the Cosine distance between the countries. Here you can find a Python code to do just that.

In this code, I use the SciPy library to take advantage of the built-in function cosine. This function returns 1 − cosine similarity, which means its results range from 0 to 2, whereas cosine similarity itself ranges from −1 to 1.
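A few quick sanity checks illustrate that range, just by calling SciPy’s cosine on toy vectors:

```python
from scipy.spatial.distance import cosine

# Identical vectors -> distance 0
print(round(cosine([1.0, 2.0], [1.0, 2.0]), 6))    # 0.0
# Orthogonal vectors -> distance 1
print(round(cosine([1.0, 0.0], [0.0, 1.0]), 6))    # 1.0
# Opposite vectors -> distance 2
print(round(cosine([1.0, 2.0], [-1.0, -2.0]), 6))  # 2.0
```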

import pandas as pd
from scipy.spatial.distance import cosine

datadict = {
'country': ['Argentina', 'Bolivia', 'Brazil', 'Chile', 'Ecuador', 'Colombia', 'Paraguay', 'Peru', 'Venezuela'],
'd1': [0.34, -0.19, 0.37, 1.17, -0.31, -0.3, -0.48, -0.15, -0.61],
'd2': [-0.57, -0.69, -0.28, 0.68, -2.19, -0.83, -0.53, -1, -1.39],
'd3': [-0.02, -0.55, 0.07, 1.2, -0.14, -0.85, -0.9, -0.47, -1.02],
'd4': [-0.69, -0.18, 0.05, 1.43, -0.02, -0.7, -0.72, 0.23, -1.08],
'd5': [-0.83, -0.69, -0.39, 1.31, -0.7, -0.75, -1.04, -0.52, -1.22],
'd6': [-0.45, -0.77, 0.05, 1.37, -0.1, -0.67, -1.4, -0.35, -0.89]}

pairsdict = {
'country1': ['Argentina', 'Venezuela', 'Ecuador', 'Peru'],
'country2': ['Bolivia', 'Chile', 'Colombia', 'Peru']}

df = pd.DataFrame(datadict)
pairs = pd.DataFrame(pairsdict) 

#Add data to the country pairs
pairs = pairs.merge(df, how='left', left_on=['country1'], right_on=['country'])
pairs = pairs.merge(df, how='left', left_on=['country2'], right_on=['country'])

#Convert data columns to list in a single cell
pairs['vector1'] = pairs[['d1_x','d2_x','d3_x','d4_x','d5_x','d6_x']].values.tolist()
pairs['vector2'] = pairs[['d1_y','d2_y','d3_y','d4_y','d5_y','d6_y']].values.tolist()

cosinedf = pairs[['country1', 'country2', 'vector1', 'vector2']].copy()

#Calculate Cosine distance
cosinedf['cosine_dist'] = cosinedf.apply(lambda x: round(cosine(x['vector1'], x['vector2']),2), axis=1)

cosinedf = cosinedf[['country1', 'country2', 'cosine_dist']]
print(cosinedf)

The df dataframe contains 6 variables for each country. The pairs dataframe contains pairs of countries that we want to compare.

First, we merge the six variables (d1-d6) onto each country of the dyad. Then we collapse the six columns into a single column containing a list of the six values. Finally, we apply the cosine function from SciPy to each pair of countries and store the result in a new column called cosine_dist.
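A side note on the _x/_y column names: pandas adds those suffixes automatically when both merged frames contain the same column, which is why the code selects d1_x…d6_x and d1_y…d6_y after the two merges. A minimal illustration:

```python
import pandas as pd

# When the same column name appears in both frames, merge suffixes
# it with _x (left frame) and _y (right frame)
left = pd.DataFrame({'key': ['a'], 'd1': [1.0]})
right = pd.DataFrame({'key': ['a'], 'd1': [2.0]})
merged = left.merge(right, on='key')
print(list(merged.columns))  # ['key', 'd1_x', 'd1_y']
```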

As a result, we get the following table:

country1, country2, cosine_dist
Argentina, Bolivia, 0.26
Venezuela, Chile, 1.93
Ecuador, Colombia, 0.35
Peru, Peru, 0.00

This piece of code works well when you already have vectors of the same length, indexed in the same way. However, when your vectors have many elements equal to zero, the data may be stored in a compressed format that omits them. For those cases we need to take a longer route to calculate the cosine distance, which I explain in the next post.

How To / Python: Combine multiple CSV files into one

If you have multiple CSV files with the same structure, you can append or combine them using a short Python script. Suppose you have several files whose names start with datayear, for instance datayear1980.csv, datayear1981.csv, datayear1982.csv.

import pandas as pd
import glob, os

os.chdir("C:/Folder")
results = pd.DataFrame([])

for file in glob.glob("datayear*"):
    namedf = pd.read_csv(file, skiprows=0, usecols=[1, 2, 3])
    results = pd.concat([results, namedf])

results.to_csv('C:/combinedfile.csv')

The glob pattern specifies the structure of the file names. The usecols argument of read_csv selects which columns you want to combine.
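If you prefer not to grow the DataFrame inside the loop, an equivalent approach is to read all the files into a list and concatenate them with a single pd.concat call. Here is a self-contained sketch; it first writes three small sample files to a temporary folder, since the datayear files are assumed to live on your disk:

```python
import glob
import os
import tempfile

import pandas as pd

# Work in a throwaway directory with three sample files
# (stand-ins for datayear1980.csv, datayear1981.csv, datayear1982.csv)
os.chdir(tempfile.mkdtemp())
for year in (1980, 1981, 1982):
    pd.DataFrame({'year': [year], 'value': [year * 2]}).to_csv(
        f'datayear{year}.csv', index=False)

# Read every matching file, then concatenate them in one call
frames = [pd.read_csv(f) for f in sorted(glob.glob('datayear*'))]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```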

How To / Python: Calculate Mahalanobis Distance

Suppose we have some multi-dimensional data at the country level and we want to see the extent to which two countries are similar. One way to do this is by calculating the Mahalanobis distance between the countries. Here you can find a Python code to do just that.

In this code, I use the SciPy library to take advantage of the built-in function mahalanobis.

import pandas as pd
import scipy as sp
from scipy.spatial.distance import mahalanobis

datadict = {
'country': ['Argentina', 'Bolivia', 'Brazil', 'Chile', 'Ecuador', 'Colombia', 'Paraguay', 'Peru', 'Venezuela'],
'd1': [0.34, -0.19, 0.37, 1.17, -0.31, -0.3, -0.48, -0.15, -0.61],
'd2': [-0.57, -0.69, -0.28, 0.68, -2.19, -0.83, -0.53, -1, -1.39],
'd3': [-0.02, -0.55, 0.07, 1.2, -0.14, -0.85, -0.9, -0.47, -1.02],
'd4': [-0.69, -0.18, 0.05, 1.43, -0.02, -0.7, -0.72, 0.23, -1.08],
'd5': [-0.83, -0.69, -0.39, 1.31, -0.7, -0.75, -1.04, -0.52, -1.22],
'd6': [-0.45, -0.77, 0.05, 1.37, -0.1, -0.67, -1.4, -0.35, -0.89]}

pairsdict = {
'country1': ['Argentina', 'Chile', 'Ecuador', 'Peru'],
'country2': ['Bolivia', 'Venezuela', 'Colombia', 'Peru']}

#DataFrame that contains the data for each country
df = pd.DataFrame(datadict)

#DataFrame that contains the pairs for which we calculate the Mahalanobis distance
pairs = pd.DataFrame(pairsdict)

#Add data to the country pairs
pairs = pairs.merge(df, how='left', left_on=['country1'], right_on=['country'])
pairs = pairs.merge(df, how='left', left_on=['country2'], right_on=['country'])

#Convert data columns to list in a single cell
pairs['vector1'] = pairs[['d1_x','d2_x','d3_x','d4_x','d5_x','d6_x']].values.tolist()
pairs['vector2'] = pairs[['d1_y','d2_y','d3_y','d4_y','d5_y','d6_y']].values.tolist()

mahala = pairs[['country1', 'country2', 'vector1', 'vector2']].copy()

#Calculate covariance matrix
covmx = df.drop(columns='country').cov()
invcovmx = sp.linalg.inv(covmx)

#Calculate Mahalanobis distance
mahala['mahala_dist'] = mahala.apply(lambda x: (mahalanobis(x['vector1'], x['vector2'], invcovmx)), axis=1)

mahala = mahala[['country1', 'country2', 'mahala_dist']]

The df dataframe contains 6 variables for each country. The pairs dataframe contains pairs of countries that we want to compare.

First, we merge the six variables (d1-d6) onto each country of the dyad. Then we collapse the six columns into a single column containing a list of the six values. Next, we compute the covariance matrix and invert it, since the Mahalanobis distance requires the inverse. Finally, we apply the mahalanobis function from SciPy to each pair of countries and store the result in a new column called mahala_dist.

As a result, we get the following table:

country1, country2, mahala_dist
Argentina, Bolivia, 3.003186
Chile, Venezuela, 3.829020
Ecuador, Colombia, 3.915868
Peru, Peru, 0.000000
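As a sanity check, SciPy’s mahalanobis agrees with the textbook formula d(x, y) = sqrt((x − y)ᵀ VI (x − y)), where VI is the inverse covariance matrix. The two-dimensional vectors and covariance matrix below are made up purely for illustration:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Made-up example: two 2-d vectors and an assumed covariance matrix
x = np.array([1.0, 2.0])
y = np.array([3.0, 5.0])
cov = np.array([[2.0, 0.3],
                [0.3, 1.0]])
VI = np.linalg.inv(cov)

# Textbook formula: sqrt((x - y)^T VI (x - y))
delta = x - y
manual = float(np.sqrt(delta @ VI @ delta))
builtin = mahalanobis(x, y, VI)

print(round(manual, 6), round(builtin, 6))  # the two values agree
```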