How To / STATA: Calculate Variables for Groups of Observations

In management research, we usually need to create a variable that measures the experience of firms. Firms accumulate experience as they make acquisitions or invest in companies in certain countries. Sometimes this experience has an effect in future decisions, so we calculate variables that measure the number of times a firm has made an acquisition or has invested in a certain industry or country. In STATA, this can be done using the command –bysort– and –gen– (i.e. –generate-) or –egen-. In this post I will calculate an experience variable using a fictitious dataset.

Consider the dataset shown in the figure below (Table 1). It has 4 variables: Firm, Country, Year, and Investments. The dataset describes the amount of investments a Firm has made in a Country each Year. The variable Investments is non-negative.

Table 1

In this case, firm’s experience is the number of investments a firm has made in a certain country until a certain year. Therefore, in order to create the variable ExperienceCountry (column 5 in Table 2) we have to sum the investments a firm has made in a country from the first to the current year. We do this using the –bysort– and –gen– commands.

```bysort  Firm Country (Year): gen ExperienceCountry = sum(Investments)
```

The –bysort– command sort the observations using the variables Firm, Country, and Year. The result of that order will be two groups of observations: Firm A and Firm B. Then, since we are sorting by Country, we will have two subgroups within each group: Brazil and Russia. Finally, within each country, the observations will be sorted from the first to the last year.

After sorting the observations we use –gen– to calculate the ExperienceCountry variable. What we do is that for each row (observation) we sum the number of investments previously made, including those of the current observation. Thus, we get a 4 in row 3 because Firm A invested one time in 2000, one time in 2001 and twice in 2002. But why we write the variable Year between parenthesis? This means that even though the –bysort– will sort the observations by the year, the operation performed after the –bysort– (-gen– and –sum– in this case) will be done by the groups formed by the values of the variables Firm and Country. In other words, if we write:

```bysort  Firm Country Year: gen NotExperienceCountry = sum(Investments)
```

We will be summing the investments made by each firm in each country during (only) each year. Given that we only have one observation for each year per each pair Firm-Country, we will only get the same value of the variable Investments.

What if we want to calculate the total number of investments a firm made in a Country across all years? We can do this using either the –gen– or –egen– commands. If we have already calculated the variable ExperienceCountry using –gen– command as shown above, then we can do the following:

```bysort  Firm Country (Year): egen  TotalInvestmentsCountry1 = max(ExperienceCountry)
```

In this case, we are calculating the maximum value of the variable ExperienceCountry for every pair Firm-Country across all years. If, however, we have not calculated the ExperienceCountry variable, then we can do just the following:

```bysort  Firm Country (Year): egen TotalInvestmentsCountry2 = sum(Investments)
```

In this case, we are summing all the investments made by a Firm in each Country across all years, and then we are creating the variable TotalInvestmentsCountry2 that has the same value for every row of each unique pair Firm-Country.

The final result is as follows:

Table 2

If you want to play with this dataset, you can download and MS Excel file (.xls) with the data from here.