March 31, 2011

Information Criteria in STATA

As illustrated by yesterday's exercise, you might find yourself wondering how many lags to use when you build an autoregression (AR) model. This is an important issue in economic modeling because, as much as we like to put more variables in a model to capture the behavior of the dependent variable more realistically, introducing more variables also introduces more estimation error. This modeling philosophy of parsimony was popularized by Box and Jenkins (1976, Time Series Analysis: Forecasting and Control, Holden-Day), who advocated using as few parameters as possible in modeling (the popularity of the parsimony philosophy at the time also coincided with the famous critique by Robert Lucas).

There are many lag-order selection statistics, or information criteria, out there. The four most famous are:

1. Final prediction error (FPE) created by Hirotsugu Akaike (1969, "Fitting autoregressive models for prediction," Annals of the Institute of Statistical Mathematics, 21:243-47):


2. Akaike information criterion (AIC), also created by Akaike (1974, "A new look at the statistical model identification," IEEE Transactions on Automatic Control, 19(6):716-23):

AIC = -2 ln(L) + 2(p + q)

3. Bayesian information criterion (BIC) created by Gideon E. Schwarz (1978, "Estimating the dimension of a model," Annals of Statistics, 6(2):461-64):

BIC = -2 ln(L) + (p + q) ln(N)

4. Hannan-Quinn information criterion (HQC) created by Edward J. Hannan and B. G. Quinn (1979, "The determination of the order of an autoregression," Journal of the Royal Statistical Society, Series B, 41(2):190-95):

HQC = -2 ln(L) + 2(p + q) ln(ln(N))

For all formulas above, p is the number of AR lags, q is the number of moving average (MA) lags (yes, these statistics are applicable to ARMA models), L is the maximized value of the likelihood function (so ln(L) is the log-likelihood), and N is the number of observations.

The FPE is used primarily for AR models whereas the last three are for general ARMA models. As you can see, these three are similar in the sense that they contain two terms: the first term captures the advantage of having more variables, in that the model's fit goes up; the second term captures the disadvantage, a penalty that grows with the number of parameters--this is where the philosophy of parsimony kicks in. In all four, the lowest value indicates the most appropriate number of lags.

As I've shown yesterday, you can easily calculate these statistics in STATA after estimation with the command VARSOC. But if you want to calculate these statistics directly in STATA, read on. To illustrate, suppose we simulate an ARMA(2,2) process exactly as we did before:


set seed 1
sim_arma y, arcoef(.66666667 -.11111111) macoef(.25 .25) et(e) nobs(600) sigma(1) time(time)


Since we simulated the data this way, we know that the correct model should have two AR lags and two MA lags. So let's check three variants of this model: the correct specification; a specification with only one MA lag; and a specification with three MA lags. Throughout, we use the correct number of AR lags (that is, two).

We first estimate using the ARIMA command, which uses the maximum likelihood method:


arima y, ar(1/2) ma(1/2) nocons


You just replace ma(1/2) with ma(1/1) for one lag or with ma(1/3) for three lags. Then, after each estimation, we calculate the information criteria. To calculate AIC:


di ((-2)*e(ll))+(2*(e(ar_max)+e(ma_max)))


To calculate BIC:


di ((-2)*e(ll))+((e(ar_max)+e(ma_max))*(ln(e(N))))


Finally, to calculate HQC:


di ((-2)*e(ll))+(2*(e(ar_max)+e(ma_max))*(ln(ln(e(N)))))


These commands simply display results stored by STATA after estimation: e(ll) is the maximized log-likelihood value, e(ar_max) is the number of AR lags, e(ma_max) is the number of MA lags, and e(N) is the number of observations. The resulting statistics are compiled in the table below:


Based on the results above, the model with two MA lags has the lowest value in all three criteria, which means that we should use two MA lags. This is not unexpected, as the data we simulated do follow a two-lag AR/two-lag MA process.
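
By the way, if you would rather not retype the di commands for each specification, a short loop along the following lines automates the whole exercise (a sketch; it assumes the simulated y from above is still in memory):

foreach q of numlist 1 2 3 {
    quietly arima y, ar(1/2) ma(1/`q') nocons
    local k = e(ar_max) + e(ma_max)
    di "MA lags `q': AIC = " ((-2)*e(ll))+(2*`k') "  BIC = " ((-2)*e(ll))+(`k'*ln(e(N))) "  HQC = " ((-2)*e(ll))+(2*`k'*ln(ln(e(N))))
}

Each pass estimates one of the three specifications and prints its three criteria on a single line.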

In closing, it's also easy to create an information criterion of your own, provided that your proposed formula captures two things: the advantage of having more variables; and a penalty for having more variables. I even created one of my own. I call it the Newey-Akaike information criterion, or NAIC. It's a cool name since the acronym also looks like an anagram of my last name. The other reason for the name is that the criterion I propose is what I think is a mix of the AIC and a formula for the lag-selection parameter that I adapted from Newey and West (1994)--so the "N" is for Newey and West and the "A" is for Akaike:

NAIC = -2 ln(L) + 2(p + q) ln(3(N/100)^(2/25))

NAIC captures the advantages of having more variables (the first term, which is no different from the others) and the disadvantage of doing so (the second term). We can go through the same process again, but we use the following command for calculating NAIC:


di ((-2)*e(ll))+(2*(e(ar_max)+e(ma_max))*ln(3*((e(N)/100)^(2/25))))


The result of using this proposed criterion is as follows:


As it turns out, the results above show that NAIC is also a valid criterion: it indicates that the appropriate number of MA lags is two, with that model having the lowest value--as it should be.

March 30, 2011

Impulse Response Function in STATA

Impulse response analysis is important in time series analysis for determining the effects of external shocks on the variables of the system. Simply put, an Impulse Response Function (IRF) shows how an unexpected change in one variable at the beginning affects another variable through time. It is so widely applicable that we can use it on our previous analysis of the relationship between GDP and oil prices.

It should be emphasized that we are not looking at how one variable (oil prices, for example) affects another variable (GDP, for example); we can easily look at the coefficients to know that. What we are looking at is how unexpected changes that directly affect oil prices go on to affect GDP. In a sense, we are looking at shocks coming from the error term related to oil prices, and how such shocks change GDP.

Now, we're not going to discuss impulse response functions the easy way. Before we go into using STATA to compute the impulse response functions, we're going to look at the econometrics behind it. The formula for an IRF is:

Ψ_i = Φ_i B^(-1) Λ^(1/2)

where B^(-1) is the inverse of the matrix of coefficients of all the variables at time t; Λ^(1/2) is the lower Cholesky decomposition of the variance-covariance matrix of e_t (both Λ and Λ^(1/2) are diagonal matrices); and Φ_i is another matrix that contains the effects of a one-unit increase in the innovation at date t (ε_t) on the values of the variables at time t+i:

Φ_i = ∂z_(t+i)/∂ε_t'

For example, if we have two variables, (y_t, x_t), and we're looking at how the error terms, (e_yt, e_xt), affect each of the two variables, the IRF can be summarized as:

Ψ_i = [ ∂y_(t+i)/∂e_yt   ∂y_(t+i)/∂e_xt ]
      [ ∂x_(t+i)/∂e_yt   ∂x_(t+i)/∂e_xt ]

Of course, the elements of the matrix are different at each point in time, as we will see shortly.
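
To make this concrete, consider the impact period i = 0, where Φ_0 = I (an innovation moves the system one-for-one on impact):

Ψ_0 = Φ_0 B^(-1) Λ^(1/2) = B^(-1) Λ^(1/2)

So the initial responses depend only on the contemporaneous coefficient matrix and the shock standard deviations; the later Ψ_i trace how these impact effects propagate through the lags.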

Now, let's put numbers behind the IRF. Suppose we are analyzing a vector autoregression (VAR) system. We are interested in a structural vector autoregression (SVAR) of the form:

B z_t = Γ z_(t-1) + e_t,   e_t ~ WN(0, Λ)

where z_t = (y_t, x_t)'.

But of course, if we want to estimate an SVAR, we instead estimate an observationally equivalent reduced-form vector autoregression (RFVAR), simply because it's easier to estimate:

z_t = A z_(t-1) + ε_t,   ε_t ~ WN(0, Σ)

where Σ is the variance-covariance matrix of the RFVAR error term (ε_t).

Now, since both forms of the VAR are equivalent, it should be that:

A = B^(-1)Γ,   ε_t = B^(-1)e_t,   and therefore Σ = B^(-1)Λ(B^(-1))'

Assuming invertibility is verified (the matrix of coefficients is nonsingular--its determinant is nonzero), we can derive the series of Φ_i by looking at the MA(∞) representation of z_t:

z_t = ε_t + Φ_1 ε_(t-1) + Φ_2 ε_(t-2) + ...,   with Φ_i = A^i

Finally, for the last element of the IRF (Λ^(1/2)), we make use of the following formula:

Λ = B Σ B'

then we apply a Cholesky decomposition to get Λ^(1/2). In STATA, we use the CHOLESKY function to derive the Cholesky decomposition of a matrix. For example, given that Λ is:

Λ = [ 4    0    ]
    [ 0    3.75 ]

To derive the Cholesky decomposition in STATA, we simply use the following commands:


matrix a=(4,0\0,3.75)
matrix b=cholesky(a)


The first line is where I input the 2x2 matrix and name it a; b is the resulting Cholesky decomposition.
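
You can then inspect the result with:

matrix list b

Since Λ here is diagonal, its Cholesky decomposition is just the square roots of the diagonal entries, so b should have 2 and about 1.9365 (the square root of 3.75) on its diagonal and zeros elsewhere.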

Alternatively, we can get Λ^(1/2) directly by applying another formula:

Λ^(1/2) = B Σ^(1/2)

Σ^(1/2) is the lower Cholesky decomposition of the variance-covariance matrix of the RFVAR error term (ε_t). Applying the formula, we get:


Now, we can apply the formula to get the IRF. Suppose we want to compute the responses for t = 0, 1, 2:


For example, if we want to know the responses of y_t to a one standard deviation shock in e_xt, we get 0 for the first period, -(1/2)√15 for the second period, and 0 for the third period.

Now that we have the algebra and the econometrics out of the way, let's look at implementing these in STATA. It's much simpler than the procedure above.

Let us use the data from our previous GDP-oil price analysis. Using that data (already in first-difference log form), we run the original VAR command with 4 lags:

var dlrgdp dlroil, lags(1/4)

Before we proceed, we can check whether we really need four lags by obtaining lag-order selection statistics. The STATA command VARSOC shows four information criteria (I'll discuss these more tomorrow) that indicate how many lags are the most appropriate:

varsoc dlrgdp dlroil

It seems that we only need a single lag (check the lags marked with the *). So, rerunning the VAR with only one lag, we get:

var dlrgdp dlroil, lags(1/1)

Then, as a post-estimation command, we run STATA's IRF command after the VAR estimation:

irf set results
irf create order1
irf table irf, noci

The first line is needed because STATA needs an active file where the results of the impulse response analyses are kept.

As the footnotes indicate, the first column displays the responses of GDP to a one standard deviation shock in e_GDP. The second column shows the response of oil to a shock in e_GDP. The third and fourth columns show the effects of a shock in e_oil on GDP and oil, respectively. The table shows up to nine time periods (quarters in this case).

The NOCI option is there to suppress reporting of the confidence interval. Of course, you can show the intervals by not including this option. Another option that might be useful for you is STDERROR, which shows the standard errors.

STATA provides a very convenient tool to do impulse response analysis. The IRF command can also create graphs, which is useful if you prefer a visual look instead of poring over the numbers.
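
For example, with the same active IRF file, a command along these lines (a sketch; the option mirrors the table version) plots all the impulse-response pairs instead of tabulating them:

irf graph irf, noci

Each panel corresponds to one impulse-response combination, matching the columns of the table above.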

March 29, 2011

Exclusion Test Using STATA

I haven't been traveling lately as I have been very, very busy. Let me make up for it by sharing some of the things that have kept me busy--STATA stuff. We started yesterday with how to simulate an ARMA sequence. Now for a more practical application: how to do an exclusion test using STATA. An exclusion test is basically an F-test to see whether one or more variables are significant in explaining the dependent variable.

For example, there's an issue of whether oil price shocks have a symmetric impact on GDP growth--that is, do we find that both oil price increases and oil price decreases affect real GDP growth, or is the relationship only significant with an increase in oil prices? Lee, Ni, and Ratti (1995) found that positive normalized shocks have a powerful effect on growth while negative normalized shocks do not. Their results, however, are based on the premise that an oil price change is likely to have a greater impact on real GNP in an environment where oil prices have been stable than in an environment where oil price movements have been frequent and erratic. On the other side of the spectrum, we find works such as Kilian and Vigfusson (2009). Using an alternative approach, they find that impulse responses are actually of roughly the same magnitude in either direction of the oil price change, a result consistent with their formal tests of symmetric responses.

We can do a simple test of the symmetry on our own with the use of STATA. All we need first is data, which I got from the excellent Economic Research Department of the Federal Reserve Bank of St. Louis. We just need quarterly data on real GDP, West Texas Intermediate (WTI) crude oil prices, and the producer price index (PPI). We calculate real crude oil prices by dividing WTI by the PPI. We then take the natural log and the first difference of these two variables to approximate growth rates (DLRGDP and DLROIL). Finally, to test the symmetry, we create one series consisting of only the positive elements of the oil price changes, with negative changes set to zero (DLROILP), and another series consisting of only the negative elements, with positive changes set to zero (DLROILN).
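
In STATA, these transformations can be done along the following lines (a sketch; rgdp, wti, ppi, and the quarterly time variable are hypothetical names for the raw series you downloaded):

tsset time
gen lrgdp = ln(rgdp)
gen lroil = ln(wti/ppi)
gen dlrgdp = D.lrgdp
gen dlroil = D.lroil
gen dlroilp = cond(dlroil > 0, dlroil, 0) if !missing(dlroil)
gen dlroiln = cond(dlroil < 0, dlroil, 0) if !missing(dlroil)

The cond() calls keep only the positive (respectively negative) changes and set everything else to zero, leaving the first observation missing because of the differencing.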

We use a bivariate VAR for our model with 4 quarterly lags. The exclusion test then proceeds as follows: (1) we estimate the whole model--DLRGDP on lags of itself, DLROILP, and DLROILN; (2) save the results in STATA; (3) we do a second estimation, this time excluding either DLROILP or DLROILN; (4) save the second results in STATA; and (5) run the F-test that the excluded lagged variables of DLROILP/DLROILN are indeed not significantly different from zero. The following commands in STATA are used:


reg dlrgdp l.dlrgdp l2.dlrgdp l3.dlrgdp l4.dlrgdp l.dlroilp l2.dlroilp l3.dlroilp l4.dlroilp l.dlroiln l2.dlroiln l3.dlroiln l4.dlroiln

est store a

reg dlrgdp l.dlrgdp l2.dlrgdp l3.dlrgdp l4.dlrgdp l.dlroiln l2.dlroiln l3.dlroiln l4.dlroiln

est store b

ftest a b


a and b are arbitrary names I assign to the two estimations. The result of the F-test is as follows:


The exclusion test of real oil price increases is very significant while that of real oil price decreases is not. These results confirm an effect of real oil price increases on real GDP but no effect of real oil price decreases. There is asymmetry in this case. Although I didn't present it here, the coefficients of the lagged variables of positive oil price shocks are all negative in the four quarters (significant in the second and fourth quarters), indicating that positive oil price shocks have negative effects on real GDP growth.

We could also check the symmetry of the effects of oil price shocks on the overall price level. Data on the CPI can also be obtained from the St. Louis Fed website. The hypothesis is that establishments are quick to increase the prices of the commodities they sell if they see that the world oil price has increased; but if there is a decrease in the world oil price, the adjustment in their prices is slow, if there is any change at all. We can check this empirically by going through the same procedure above--this time with the first difference of log CPI as the dependent variable (and using nominal oil prices instead of real oil prices). The results are:


Well, there's also asymmetry in the effects of oil price shocks on the consumer price index. Increases in oil prices are significant, but decreases in oil prices are not. I also did not show it here, but the coefficients of the lagged variables of positive oil price shocks are positive (except for the fourth quarter). This indicates that increases in oil prices are associated with increases in overall consumer prices.
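
For reference, the commands for this CPI variant mirror the earlier ones (a sketch; dlcpi, dloilp, and dloiln are hypothetical names for the log-differenced CPI and the positive/negative nominal oil price changes):

reg dlcpi l.dlcpi l2.dlcpi l3.dlcpi l4.dlcpi l.dloilp l2.dloilp l3.dloilp l4.dloilp l.dloiln l2.dloiln l3.dloiln l4.dloiln

est store c

reg dlcpi l.dlcpi l2.dlcpi l3.dlcpi l4.dlcpi l.dloiln l2.dloiln l3.dloiln l4.dloiln

est store d

ftest c d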

March 28, 2011

Simulating an ARMA Process using STATA

If you need to simulate an ARMA process (provided you already know the coefficients of both the AR component and the MA component), you can use STATA to do so. What you need is the SIM_ARMA program created by Jeff Pitblado. You can download this program from the Boston College STATA program repository.
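
From within STATA, findit is an easy way to locate and install it (ssc install sim_arma may also work, if the package is hosted under that name on SSC):

findit sim_arma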

For example, suppose you want to simulate an ARMA(2,2) process with:

α(z) = 1 - (2/3)z + (1/9)z^2
β(z) = 1 + (1/4)z + (1/4)z^2
ε_t ~ WN(0, 1)
n = 600

You use the following STATA command:

sim_arma y, arcoef(.66666667 -.11111111) macoef(.25 .25) et(e) nobs(600) sigma(1) time(time)

y is the resulting simulated series. The numbers inside the parentheses of arcoef and macoef assign the coefficients of the AR component and the MA component, respectively. Here, I use decimal numbers since sim_arma does not seem to read fractions (STATA seems to read numbers with the "/" sign as an interval). et assigns the name of the error term while time assigns the name of the time variable. Finally, sigma assigns the standard deviation of the error term.
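
Once the data are generated, a quick sanity check along these lines uses standard STATA commands to eyeball the simulated series and its autocorrelations:

tsset time
tsline y
corrgram y, lags(10)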