
How To / Python: Get geographic coordinates using Google (Geocode)

Thanks to pygeocoder, obtaining geographic coordinates from Google (i.e. geocoding) is very simple. The Python script below does just that.

from pygeocoder import Geocoder
import pandas as pd

cities = {
'City': ['Coral Gables', 'Hawarden', 'Chapeco', 'Merida', 'Assisi'],
'Nation': ['United States', 'United Kingdom', 'Brazil', 'Spain', 'Italy']}

df = pd.DataFrame(cities)
info = pd.DataFrame([])

for index, row in df.iterrows():

    city = str(row['City'])
    nation = str(row['Nation'])
    location = str(row['City'] + ', ' + row['Nation'])

    result = Geocoder.geocode(location)
    coords = str(result[0].coordinates)

    chars_to_remove = ['(' , ')']
    coords = coords.translate(str.maketrans('', '', ''.join(chars_to_remove)))

    lat = float(coords.split(',')[0])
    long = float(coords.split(',')[1])

    line = pd.Series([city, nation, lat, long])

    info = info.append(line, ignore_index = True)

Inside the loop, I first build the ‘address’ I want to geocode, in the format ‘City, Country’. I then pass it to pygeocoder to obtain the coordinates, and finally I extract the latitude and longitude by stripping the parentheses and splitting the geocoder’s result.
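
As a side note, pygeocoder’s result[0].coordinates is already a (latitude, longitude) tuple, so instead of converting it to a string and splitting it you could unpack it directly. A minimal sketch (assuming the request succeeds; with newer versions of the Google geocoding API you may also need to construct the Geocoder with an API key first):

from pygeocoder import Geocoder

result = Geocoder.geocode('Coral Gables, United States')
#coordinates is a (latitude, longitude) tuple, e.g. (25.721490, -80.268384)
lat, long = result[0].coordinates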

The result is a list of cities with their respective latitude and longitude.

Coral Gables, United States, 25.721490, -80.268384
Hawarden, United Kingdom, 53.185690, -3.029807
Chapeco, Brazil, -27.100934, -52.615699
Merida, Spain, 38.939427, -6.365157
Assisi, Italy, 43.070702, 12.619597

If you are dealing with a large number of cities, you may want to add a couple of lines to handle exceptions (e.g. when pygeocoder cannot find a city). You could wrap the geocoding call in a try/except block, replacing the lat and long variables with a sentinel value clearly outside the range of any geographic coordinate (not pretty, but effective!).
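
For example, a minimal sketch of that idea, dropped into the loop above in place of the geocoding call (the -999 sentinel is an arbitrary choice, and I am assuming pygeocoder raises GeocoderError when a lookup fails):

from pygeocoder import Geocoder, GeocoderError

try:
    result = Geocoder.geocode(location)
    lat, long = result[0].coordinates
except GeocoderError:
    #Lookup failed: flag the row with values no real coordinate can take
    lat, long = -999.0, -999.0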

You can also check Chris Albon’s website; he goes into more detail, and I probably borrowed from his code when writing this script.


How To / Python: Calculate Cosine Distance II/II

This is the second part of the previous post on calculating cosine distance.

Suppose now that we have incomplete information for each of the countries, or that some elements are equal to zero and, instead of listing them, we simply omit them. As a result, we no longer have vectors of the same length (i.e. vectors indexed in exactly the same way).

For example, suppose we want to calculate the cosine distance between Argentina and Chile, and the vectors are:

country, var, value
Chile, d1, 1.17
Chile, d2, 0.68
Chile, d4, 1.43
Chile, d6, 1.37
Argentina, d3, -0.02
Argentina, d4, -0.69
Argentina, d5, -0.83
Argentina, d6, -0.45

Note that the data is now in long format, whereas the previous post used data in wide format. I reshape the data from wide to long with pd.melt in the code below.

Here you can see that Chile does not have rows for variables d3 and d5, and Argentina does not have rows for d1 and d2. Therefore, it gets a bit tricky if we want to use the cosine function from SciPy. In the code below I define two functions to get around this and calculate the cosine distance manually.

The function mynorm calculates the norm of a country’s vector. The function mydotprod calculates the dot product between two vectors using pd.merge: the inner merge keeps only the set of variables that both Argentina and Chile share, which gets around the fact that the two countries do not have exactly the same vectors.
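
In other words, since an omitted element is treated as a zero, only the variables that both countries report end up contributing to the dot product:

$$a \cdot b = \sum_{k \in K_a \cap K_b} a_k \, b_k$$

where $K_a$ and $K_b$ are the sets of variables reported for each country, while each norm is computed over that country’s own reported variables.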

import pandas as pd
from math import sqrt

#Function to calculate the norm of a country's vector
def mynorm(table):
    squared_elements = [value**2 for value in table['value']]
    norm = sqrt(sum(squared_elements))
    return norm

#Function to calculate the dot product of vectors using pd.merge
def mydotprod(a,b):
    dfa = df2[(df2.country == a)][['var','value']]
    dfb = df2[(df2.country == b)][['var','value']]
    mergeddf = dfa.merge(dfb, how = 'inner', on = 'var')
    mergeddf['prod'] = mergeddf['value_x']*mergeddf['value_y']
    dotprod = float(mergeddf['prod'].sum())
    return dotprod

datadict = {'country': ['Argentina', 'Bolivia', 'Brazil', 'Chile', 'Ecuador', 'Colombia', 'Paraguay', 'Peru', 'Venezuela'],
            'd1': [0.34, -0.19, 0.37, 1.17, -0.31, -0.3, -0.48, -0.15, -0.61],
            'd2': [-0.57, -0.69, -0.28, 0.68, -2.19, -0.83, -0.53, -1, -1.39],
            'd3': [-0.02, -0.55, 0.07, 1.2, -0.14, -0.85, -0.9, -0.47, -1.02],
            'd4': [-0.69, -0.18, 0.05, 1.43, -0.02, -0.7, -0.72, 0.23, -1.08],
            'd5': [-0.83, -0.69, -0.39, 1.31, -0.7, -0.75, -1.04, -0.52, -1.22],
            'd6': [-0.45, -0.77, 0.05, 1.37, -0.1, -0.67, -1.4, -0.35, -0.89]}

pairsdict = {'country1': ['Argentina', 'Ecuador'],
             'country2': ['Chile', 'Colombia']}

df = pd.DataFrame(datadict)
pairs = pd.DataFrame(pairsdict)

print(df)
print(pairs)

df1 = pd.melt(df, id_vars=['country'], var_name='var', value_name='value')
df2 = df1[(df1['country'] == 'Chile') & (df1['var'] != 'd3') & (df1['var'] != 'd5')]
df2 = df2.append(df1[(df1['country'] == 'Argentina') & (df1['var'] != 'd1') & (df1['var'] != 'd2')])
df2 = df2.append(df1[(df1['country'] == 'Ecuador') | (df1['country'] == 'Colombia')])

#Group variable by country in order to calculate the norm of the country's vector
df3 = df2.groupby(['country'])
dfnorm = pd.DataFrame(df3.apply(mynorm)).reset_index()
dfnorm.rename(columns={0: 'norm'}, inplace = True)

#Add the norm values to the DataFrame containing the pairs of countries
df4 = pairs.merge(dfnorm, how = 'left', left_on = 'country1' , right_on = 'country')
df4 = df4[['country1', 'country2', 'norm']]
df4 = df4.merge(dfnorm, how = 'left', left_on = 'country2' , right_on = 'country')
df4 = df4[['country1', 'country2', 'norm_x', 'norm_y']]

#Calculate denominator and then apply the mydotprod function to obtain the dot product
df4['denom'] = df4['norm_x'] * df4['norm_y']
df4['dotprod'] = df4.apply(lambda row: round(mydotprod(row['country1'], row['country2']),2), axis=1)
df4['dist'] = 1 - (df4['dotprod'] / df4['denom'])


After melting the data into long format, I drop some rows for Argentina and Chile so that I end up with the data shown at the beginning of this post (i.e. incomplete data for those two countries), and I keep the complete rows for Ecuador and Colombia to check the result against the previous post.

Next, I calculate the norm of the countries’ vectors: I group the long data by country and then apply the mynorm function.

Then, with two merges, I add the norms to the pairs of countries I want to compare.

Finally, I calculate the denominator of the formula (the product of both norms), apply the mydotprod function to obtain the dot product, and divide the dot product by that denominator; subtracting this ratio from 1 gives the cosine distance (ranging from 0 to 2).
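
Putting the pieces together, the quantity computed in this last step is the usual cosine distance:

$$\text{distance}(a, b) = 1 - \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$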

As a result, we get the following table:

country1, country2, norm_x, norm_y, denom, dotprod, dist
Argentina, Chile, 1.169573, 2.398562, 2.805292, -1.60, 1.570351
Ecuador, Colombia, 2.326414, 1.732859, 4.031346, 2.64, 0.345132

Here you can see that the distance between Ecuador and Colombia is the same as the one we got in the previous post (0.35).

How To / Python: Calculate Cosine Distance I/II

Suppose we have some multi-dimensional data at the country level and we want to see the extent to which two countries are similar. One way to do this is by calculating the Cosine distance between the countries. Here you can find a Python code to do just that.

In this code, I use the SciPy library to take advantage of the built-in function cosine. This function returns 1 minus the cosine similarity (sometimes called cosine proximity), so its results range from 0 to 2, while the similarity itself ranges from -1 to 1.
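
As a quick sanity check of what the function returns (toy vectors, not taken from the data below):

from scipy.spatial.distance import cosine

print(cosine([1, 0], [1, 0]))   #identical direction -> 0.0
print(cosine([1, 0], [0, 1]))   #orthogonal vectors  -> 1.0
print(cosine([1, 0], [-1, 0]))  #opposite direction  -> 2.0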

import pandas as pd
from scipy.spatial.distance import cosine

datadict = {
'country': ['Argentina', 'Bolivia', 'Brazil', 'Chile', 'Ecuador', 'Colombia', 'Paraguay', 'Peru', 'Venezuela'],
'd1': [0.34, -0.19, 0.37, 1.17, -0.31, -0.3, -0.48, -0.15, -0.61],
'd2': [-0.57, -0.69, -0.28, 0.68, -2.19, -0.83, -0.53, -1, -1.39],
'd3': [-0.02, -0.55, 0.07, 1.2, -0.14, -0.85, -0.9, -0.47, -1.02],
'd4': [-0.69, -0.18, 0.05, 1.43, -0.02, -0.7, -0.72, 0.23, -1.08],
'd5': [-0.83, -0.69, -0.39, 1.31, -0.7, -0.75, -1.04, -0.52, -1.22],
'd6': [-0.45, -0.77, 0.05, 1.37, -0.1, -0.67, -1.4, -0.35, -0.89]}

pairsdict = {
'country1': ['Argentina', 'Chile', 'Ecuador', 'Peru'],
'country2': ['Bolivia', 'Venezuela', 'Colombia', 'Peru']}

df = pd.DataFrame(datadict)
pairs = pd.DataFrame(pairsdict) 

#Add data to the country pairs
pairs = pairs.merge(df, how='left', left_on=['country1'], right_on=['country'])
pairs = pairs.merge(df, how='left', left_on=['country2'], right_on=['country'])

#Convert data columns to list in a single cell
pairs['vector1'] = pairs[['d1_x','d2_x','d3_x','d4_x','d5_x','d6_x']].values.tolist()
pairs['vector2'] = pairs[['d1_y','d2_y','d3_y','d4_y','d5_y','d6_y']].values.tolist()

cosinedf = pairs[['country1', 'country2', 'vector1', 'vector2']].copy()

#Calculate Cosine distance
cosinedf['cosine_dist'] = cosinedf.apply(lambda x: round(cosine(x['vector1'], x['vector2']),2), axis=1)

cosinedf = cosinedf[['country1', 'country2', 'cosine_dist']]
print(cosinedf)

The df dataframe contains 6 variables for each country. The pairs dataframe contains pairs of countries that we want to compare.

The two merges add the 6 variables (d1-d6) to each country of the dyad. We then convert the 6 columns into one column containing a list with the 6 values of variables d1-d6. Finally, we apply the cosine function from SciPy to each pair of countries and store the result in the new column called cosine_dist.

As a result, we get the following table:

country1, country2, cosine_dist
Argentina, Bolivia, 0.26
Chile, Venezuela, 1.93
Ecuador, Colombia, 0.35
Peru, Peru, 0.00

This piece of code works well when you already have vectors of the same length, indexed in the same way. However, when you have vectors with many elements equal to zero, you might have the data in a compressed (long) format. For those cases, we need to take a longer route to calculate the cosine distance, which I explain in the next post.

How To / Python: Combine multiple CSV files into one

If you have multiple CSV files with the same structure, you can append or combine them using a short Python script. Suppose you have several files whose names start with datayear, for instance datayear1980.csv, datayear1981.csv, datayear1982.csv.

import pandas as pd
import glob, os

os.chdir("C:/Folder")
results = pd.DataFrame([])

for file in glob.glob("datayear*"):
    namedf = pd.read_csv(file, skiprows=0, usecols=[1,2,3])
    results = results.append(namedf)

results.to_csv('C:/combinedfile.csv')

In the glob pattern you specify the structure of the file names. Then, with the usecols argument of pd.read_csv you can select which columns you want to combine.
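
As an aside, DataFrame.append has been removed from recent versions of pandas; an equivalent sketch using pd.concat (same hypothetical folder and file pattern) would be:

import glob, os
import pandas as pd

os.chdir("C:/Folder")

#Read every matching file and concatenate them in one go
files = glob.glob("datayear*")
results = pd.concat((pd.read_csv(f, usecols=[1, 2, 3]) for f in files), ignore_index=True)
results.to_csv('C:/combinedfile.csv', index=False)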

How To / Python: Calculate Mahalanobis Distance

Suppose we have some multi-dimensional data at the country level and we want to see the extent to which two countries are similar. One way to do this is by calculating the Mahalanobis distance between the countries. Here you can find a Python code to do just that.

In this code, I use the SciPy library to take advantage of the built-in function mahalanobis.
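
For reference, the Mahalanobis distance between two vectors $x$ and $y$, given the covariance matrix $S$ estimated from the data, is

$$d_M(x, y) = \sqrt{(x - y)^{\top} S^{-1} (x - y)}$$

which is why the code below needs the inverse of the covariance matrix.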


import pandas as pd
from scipy import linalg
from scipy.spatial.distance import mahalanobis

datadict = {
'country': ['Argentina', 'Bolivia', 'Brazil', 'Chile', 'Ecuador', 'Colombia', 'Paraguay', 'Peru', 'Venezuela'],
'd1': [0.34, -0.19, 0.37, 1.17, -0.31, -0.3, -0.48, -0.15, -0.61],
'd2': [-0.57, -0.69, -0.28, 0.68, -2.19, -0.83, -0.53, -1, -1.39],
'd3': [-0.02, -0.55, 0.07, 1.2, -0.14, -0.85, -0.9, -0.47, -1.02],
'd4': [-0.69, -0.18, 0.05, 1.43, -0.02, -0.7, -0.72, 0.23, -1.08],
'd5': [-0.83, -0.69, -0.39, 1.31, -0.7, -0.75, -1.04, -0.52, -1.22],
'd6': [-0.45, -0.77, 0.05, 1.37, -0.1, -0.67, -1.4, -0.35, -0.89]}

pairsdict = {
'country1': ['Argentina', 'Chile', 'Ecuador', 'Peru'],
'country2': ['Bolivia', 'Venezuela', 'Colombia', 'Peru']}

#DataFrame that contains the data for each country
df = pd.DataFrame(datadict)

#DataFrame that contains the pairs for which we calculate the Mahalanobis distance
pairs = pd.DataFrame(pairsdict)

#Add data to the country pairs
pairs = pairs.merge(df, how='left', left_on=['country1'], right_on=['country'])
pairs = pairs.merge(df, how='left', left_on=['country2'], right_on=['country'])

#Convert data columns to list in a single cell
pairs['vector1'] = pairs[['d1_x','d2_x','d3_x','d4_x','d5_x','d6_x']].values.tolist()
pairs['vector2'] = pairs[['d1_y','d2_y','d3_y','d4_y','d5_y','d6_y']].values.tolist()

mahala = pairs[['country1', 'country2', 'vector1', 'vector2']].copy()

#Calculate covariance matrix using the numeric columns only
covmx = df.drop(columns=['country']).cov()
invcovmx = linalg.inv(covmx)

#Calculate Mahalanobis distance
mahala['mahala_dist'] = mahala.apply(lambda x: (mahalanobis(x['vector1'], x['vector2'], invcovmx)), axis=1)

mahala = mahala[['country1', 'country2', 'mahala_dist']]

The df dataframe contains 6 variables for each country. The pairs dataframe contains pairs of countries that we want to compare.

The two merges add the 6 variables (d1-d6) to each country of the dyad. We then convert the 6 columns into one column containing a list with the 6 values of variables d1-d6. Next, we calculate the inverse of the covariance matrix, which is required to calculate the Mahalanobis distance. Finally, we apply the mahalanobis function from SciPy to each pair of countries and store the result in the new column called mahala_dist.

As a result, we get the following table:

country1, country2, mahala_dist
Argentina, Bolivia, 3.003186
Chile, Venezuela, 3.829020
Ecuador, Colombia, 3.915868
Peru, Peru, 0.000000


How To: Deal with Journal Rejections Thinking About Easter Island

Easter Island is a very enigmatic place. If you do a bit of research on the island, you will realize that there are many different views about its history. It has been well established, however, that the construction of the Moais stopped in the late 18th century. Why? I do not think we know.

I visited the island after my first year in the Ph.D. At that time, I already had a journal rejection (or maybe even more than one). Although short, this academic experience led me to look at some of the things you can see on the island with a surprisingly creative eye (maybe too creative).

After riding a bike for a couple of hours (see my post about biking around the island), I made it to the quarry where the Moais were carved. Although you see many pictures before landing on the island, it is still really interesting to walk among dozens of Moais lying on the ground in different positions. I walked around the external side of the Rano Raraku volcano and then into the crater. On my way back from the crater, I ran into some rats and horses, and I had to wait until the horses decided to move off the trail before I could continue back to the park entrance.

Rano Raraku Volcano, Easter Island.

Before getting to the entrance, I ran into a special Moai. It was (and probably still is) lying horizontally. The interesting part, however, was that it was no longer a single piece of rock: the statue had broken into several pieces.


Broken Moais.

Looking at that Moai, I got what I would describe as an interesting piece of wisdom from Easter Island. In a less elaborate form, I thought:

“The craftsmen who carved this statue spent a long time working on it. Unfortunately, this Moai became completely destroyed and useless after breaking off the wall of the volcano. Could they repair it? It does not seem so. You, as a researcher, can also spend a long time ‘carving’ a research paper. Fortunately, however, when it is rejected by a journal, the paper is not broken for good. You can actually repair it and try again.”

So, yes, fortunately, as a researcher you do not end up with a broken Moai when a paper is rejected by a journal. And if the Rapa Nui (the native Polynesian inhabitants of Easter Island) kept carving Moais, you definitely have to keep trying to get your paper published.

I think it works.

If you ever meet me and hear me saying “well, it is not a Moai”, then you will know what just happened.

How To / STATA: Draw a Random Sample from Panel Data

Assume we have a data set containing firm data across years. The variable id uniquely identifies a firm. The variable performance is some measure of the firm’s financial performance, and the variable year indicates when that performance occurred. Thus, we have a small panel where the firm-year is the unit of analysis.

If you want to draw a random sample from a data set like that, you shouldn’t apply the command -sample- directly. If you do, you will lose the panel structure of the data (or at the very least you are very likely to lose it!). What you should do instead is randomly select firm ids and then keep all the observations (all years) for each of the randomly selected ids. Below you can see an example of STATA code that performs this operation. Remember we have three variables: id, year, and performance.

use "yourdataset.dta", clear

tempfile paneldata
save `paneldata'

collapse (mean) performance, by(id)
keep id
sample 50

tempfile randomsampleid
save `randomsampleid'

use `paneldata'

merge m:1 id using `randomsampleid'

drop if _merge == 1
drop _merge

After opening the data set, we save it in a temporary file called paneldata. Then we get rid of the repeated ids using -collapse- and keep only the id variable. Next, we use the command -sample- so that STATA randomly selects, in this case, 50% of the total number of unique ids (type -help sample- to see other options, such as specifying the number of observations to draw instead of a percentage). We save this subset of ids in a temporary file called randomsampleid.

Finally, we return to the panel data and merge it with randomsampleid. It is an m:1 merge because in the panel data the id variable does not uniquely identify each observation, but it does in the using data. The observations that merge successfully are the ones STATA randomly chose for you, so we get rid of the rest (those with _merge == 1).