help for reghdfe.ado
Title
reghdfe -- Linear and instrumental-variable/GMM regression absorbing
multiple levels of fixed effects
Syntax
reghdfe depvar [indepvars] [(endogvars = iv_vars)] [if] [in] [weight]
, absorb(absvars) [options]
options                  Description
--------------------------------------------------------------------------
Model [+]
* absorb(absvars)        identifiers of the absorbed fixed effects; each
                           absvar represents one set of fixed effects
  absorb(..., savefe)    save all fixed effect estimates (__hdfe* prefix);
                           useful for a subsequent predict. However, see
                           also the resid option.
  residuals(newvar)      save residuals; more direct and much faster than
                           saving the fixed effects and then running
                           predict
  summarize(stats)       equivalent to estat summarize after the
                           regression, but more flexible, compatible with
                           the fast option, and saves results on
                           e(summarize)
  suboptions(...)        additional options that will be passed to the
                           regression command (either regress, ivreg2, or
                           ivregress)

SE/Robust [+]
+ vce(vcetype [,opt])    vcetype may be unadjusted (default), robust or
                           cluster fvvarlist (allowing two- and multi-way
                           clustering)
                         suboptions bw(#), kernel(str), dkraay(#) and
                           kiefer allow for AC/HAC estimates; see the avar
                           package

Instrumental-Variable/2SLS/GMM [+]
  estimator(str)         either 2sls (default), gmm2s (two-stage GMM), liml
                           (limited-information maximum likelihood) or cue
                           (which gives approximate results, see discussion
                           below)
  stages(list)           estimate additional regressions; choose any of
                           first ols reduced acid (or all)
  ffirst                 compute first-stage diagnostic and identification
                           statistics
  ivsuite(subcmd)        package used in the IV/GMM regressions; options
                           are ivreg2 (default; needs installing) and
                           ivregress

Diagnostic [+]
  verbose(#)             amount of debugging information to show (0=None,
                           1=Some, 2=More, 3=Parsing/convergence details,
                           4=Every iteration)
  timeit                 show elapsed times by stage of computation

Optimization [+]
+ tolerance(#)           criterion for convergence (default=1e-8)
  maxiterations(#)       maximum number of iterations (default=10,000); if
                           set to missing (.) it will run for as long as it
                           takes.
  poolsize(#)            apply the within algorithm in groups of #
                           variables (default 10). A large poolsize is
                           usually faster but uses more memory
  acceleration(str)      acceleration method; options are
                           conjugate_gradient (cg), steep_descent (sd),
                           aitken (a), and none (no)
  transform(str)         transform operation that defines the type of
                           alternating projection; options are Kaczmarz
                           (kac), Cimmino (cim), Symmetric Kaczmarz (sym)

Speedup Tricks [+]
  cache(save [,opt])     absorb all variables without regressing
                           (destructive; combine it with preserve/restore)
                         suboption keep(varlist) adds additional
                           untransformed variables to the resulting dataset
  cache(use)             run regressions on cached data; vce() must be the
                           same as with cache(save).
  cache(clear)           delete Mata objects to clear up memory; no more
                           regressions can be run after this
  fast                   will not create e(sample); disabled when saving
                           fixed effects or mobility groups

Degrees-of-Freedom Adjustments [+]
  dofadjustments(list)   allows selecting the desired adjustments for
                           degrees of freedom; rarely used
  groupvar(newvar)       unique identifier for the first mobility group

Reporting [+]
  version                reports the version number and date of reghdfe,
                           and saves it in e(version). standalone option
  level(#)               set confidence level; default is level(95)
  display_options        control column formats, row spacing, line width,
                           display of omitted variables and base and empty
                           cells, and factor-variable labeling

Undocumented
  keepsingletons         do not drop singleton groups
  old                    will call the latest 2.x version of reghdfe
                           instead (see the old help file)
--------------------------------------------------------------------------
* absorb(absvars) is required.
+ indicates a recommended or important option.
indepvars, endogvars and iv_vars may contain factor variables; see
fvvarlist.
all the regression variables may contain time-series operators; see
tsvarlist.
fweights, aweights and pweights are allowed; see weight.
Absvar Syntax
absvar                   Description
--------------------------------------------------------------------------
i.varname                categorical variable to be absorbed (the i. prefix
                           is tacit)
i.var1#i.var2            absorb the interactions of multiple categorical
                           variables
i.var1#c.var2            absorb heterogeneous slopes, where var2 has a
                           different slope coef. depending on the category
                           of var1
var1##c.var2             equivalent to "i.var1 i.var1#c.var2", but much
                           faster
var1##c.(var2 var3)      multiple heterogeneous slopes are allowed
                           together. Alternative syntax: var1##(c.var2
                           c.var3)
v1#v2#v3##c.(v4 v5)      factor operators can be combined
--------------------------------------------------------------------------
To save the estimates of specific absvars, write newvar=absvar.
Please be aware that in most cases these estimates are neither consistent
nor econometrically identified.
Using categorical interactions (e.g. x#z) is faster than running egen
group(...) beforehand.
Singleton obs. are dropped iteratively until no more singletons are found
(see ancillary article for details).
Slope-only absvars ("state#c.time") have poor numerical stability and slow
convergence. If you need those, either i) tighten the tolerance (i.e.
set a smaller value) or ii) use
slope-and-intercept absvars ("state##c.time"), even if the intercept is
redundant. For instance if absvar is "i.zipcode i.state##c.time" then
i.state is redundant given i.zipcode, but convergence will still be much
faster.
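As an illustration of the absvar syntax above, the following sketch absorbs
a categorical interaction, saves one set of fixed effects, and then uses a
slope-and-intercept absvar. It assumes the nlswork dataset used in the
Examples section; the choice of regressors and the new variable name
fe_year are illustrative only.

```stata
webuse nlswork, clear

* Absorb the idcode-by-occupation interaction directly (no egen group()
* step) and save the year fixed effects in fe_year (hypothetical name)
reghdfe ln_wage age tenure, absorb(idcode#occ_code fe_year=year)

* Slope-and-intercept absvar: occupation intercepts plus
* occupation-specific slopes on ttl_exp, next to individual fixed effects
reghdfe ln_wage age tenure, absorb(idcode occ_code##c.ttl_exp)
```

Keep in mind the caveat above: the saved estimates in fe_year are in most
cases neither consistent nor identified.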
Description
reghdfe is a generalization of areg (and xtreg,fe, xtivreg,fe) for
multiple levels of fixed effects (including heterogeneous slopes),
alternative estimators (2sls, gmm2s, liml), and additional robust standard
errors (multi-way clustering, HAC standard errors, etc).
Additional features include:
a) A novel and robust algorithm to efficiently absorb the fixed effects
(extending the work of Guimaraes and Portugal, 2010).
b) Coded in Mata, which in most scenarios makes it even faster than areg
and xtreg for a single fixed effect (see benchmarks on the GitHub
page).
c) Can save the point estimates of the fixed effects (caveat emptor: the
fixed effects may not be identified, see the references).
d) Calculates the degrees-of-freedom lost due to the fixed effects (note:
beyond two levels of fixed effects, this is still an open problem, but
we provide a conservative approximation).
e) Iteratively removes singleton groups by default, to avoid biasing the
standard errors (see ancillary document).
Options
+-----------------------+
----+ Model and Miscellanea +---------------------------------------------
absorb(absvars) list of categorical variables (or interactions)
representing the fixed effects to be absorbed. This is equivalent to
including an indicator/dummy variable for each category of each
absvar. absorb() is required.
To save a fixed effect, prefix the absvar with "newvar=". For
instance, the option absorb(firm_id worker_id year_coefs=year_id) will
include firm, worker and year fixed effects, but will only save the
estimates for the year fixed effects (in the new variable year_coefs).
If you want to predict afterwards but don't care about setting the
names of each fixed effect, use the savefe suboption. This will
delete all variables named __hdfe*__ and create new ones as required.
Example: reghdfe price weight, absorb(turn trunk, savefe)
residuals(newvar) will save the regression residuals in a new variable.
This is a superior alternative to running predict, resid afterwards
as it's faster and doesn't require saving the fixed effects.
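A minimal sketch of this option, assuming the auto dataset that ships with
Stata; the variable name resid_hdfe is illustrative only.

```stata
sysuse auto, clear

* Save the residuals during estimation instead of saving the fixed
* effects and calling predict afterwards
reghdfe price weight length, absorb(turn trunk) residuals(resid_hdfe)

* Inspect the saved residuals
summarize resid_hdfe
```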
summarize(stats) will report and save a table of summary statistics of
the regression variables (including the instruments, if applicable),
using the same sample as the regression.
summarize (without parentheses) saves the default set of statistics:
mean min max.
The complete list of accepted statistics is available in the tabstat
help. The most useful are count range sd median p##.
The summary table is saved in e(summarize).
To save the summary table silently (without showing it after the
regression table), use the quietly suboption. You can use it by itself
(summarize(,quietly)) or with custom statistics (summarize(mean,
quietly)).
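For instance, the sketch below requests custom statistics silently and then
inspects the saved table. It assumes the auto dataset and that
e(summarize) is stored as a matrix (the text above only states that the
table is saved there).

```stata
sysuse auto, clear

* Compute mean and sd on the estimation sample, without printing the table
reghdfe price weight length, absorb(rep78) summarize(mean sd, quietly)

* Show the saved summary table (assumed to be a matrix)
matrix list e(summarize)
```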
suboptions(...) options that will be passed directly to the regression
command (either regress, ivreg2, or ivregress)
+-----------+
----+ SE/Robust +---------------------------------------------------------
vce(vcetype, subopt) specifies the type of standard error reported. Note
that all the advanced estimators rely on asymptotic theory, and will
likely have poor performance with small samples (but again if you are
using reghdfe, that is probably not your case)
unadjusted/ols estimates conventional standard errors, valid even in
small samples under the assumptions of homoscedasticity and no
correlation between observations
robust estimates heteroscedasticity-consistent standard errors
(Huber/White/sandwich estimators), but still assuming independence
between observations
Warning: in a FE panel regression, using robust will lead to
inconsistent standard errors if for every fixed effect, the other
dimension is fixed. For instance, in a standard panel with
individual and time fixed effects, we require both the number of
individuals and time periods to grow asymptotically. If that is not
the case, an alternative may be to use clustered errors, which as
discussed below will still have their own asymptotic requirements.
For a discussion, see Stock and Watson, "Heteroskedasticity-robust
standard errors for fixed-effects panel-data regression," Econometrica
76 (2008): 155-174
cluster clustervars estimates consistent standard errors even when the
observations are correlated within groups.
Multi-way-clustering is allowed. Thus, you can indicate as many
clustervars as desired (e.g. allowing for intragroup correlation
across individuals, time, country, etc).
Each clustervar permits interactions of the type var1#var2 (this is
faster than using egen group() for a one-off regression).
Warning: The number of clusters, for all of the cluster variables,
must go off to infinity. A frequent rule of thumb is that each
cluster variable must have at least 50 different categories (the
number of categories for each clustervar appears on the header of the
regression table).
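A short sketch of two-way clustering and of the cluster-count check
mentioned in the warning above. It assumes the nlswork dataset; the
regressors are illustrative. The e(clust#) and e(N_clust) results are the
ones listed under Stored results.

```stata
webuse nlswork, clear

* Two-way clustering by individual and by year
reghdfe ln_wage age tenure ttl_exp, absorb(idcode year) ///
    vce(cluster idcode year)

* Compare the number of clusters against the rule of thumb of 50
display "clusters per variable: " e(clust1) " and " e(clust2)
display "smallest cluster count: " e(N_clust)
```

In this dataset year has far fewer than 50 categories, so the second
cluster dimension falls short of the rule of thumb and the resulting
standard errors should be read with caution.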
The following suboptions require either the ivreg2 or the avar package
from SSC. For a careful explanation, see the ivreg2 help file, from which
the comments below borrow.
unadjusted, bw(#) (or just , bw(#)) estimates
autocorrelation-consistent standard errors (Newey-West).
robust, bw(#) estimates autocorrelation-and-heteroscedasticity
consistent standard errors (HAC).
cluster clustervars, bw(#) estimates standard errors consistent to
common autocorrelated disturbances (Driscoll-Kraay). At most two
cluster variables can be used in this case.
, kiefer estimates standard errors consistent under arbitrary
intra-group autocorrelation (but not heteroskedasticity) (Kiefer).
kernel(str) is allowed in all the cases that allow bw(#). The default
kernel is bar (Bartlett). Valid kernels are Bartlett (bar); Truncated
(tru); Parzen (par); Tukey-Hanning (thann); Tukey-Hamming (thamm);
Daniell (dan); Tent (ten); and Quadratic-Spectral (qua or qs).
Advanced suboptions:
, suite(default|mwc|avar) overrides the package chosen by reghdfe to
estimate the VCE. default uses the default Stata computation (allows
unadjusted, robust, and at most one cluster variable). mwc allows
multi-way-clustering (any number of cluster variables), but without
the bw and kernel suboptions. avar uses the avar package from SSC. It is
the same package used by ivreg2, and allows the bw, kernel, dkraay and
kiefer suboptions. This is useful almost exclusively for debugging.
, twicerobust will compute robust standard errors not only on the
first but on the second step of the gmm2s estimation. Requires
ivsuite(ivregress), but will not give the exact same results as
ivregress.
Explanation: When running instrumental-variable regressions with the
ivregress package, robust standard errors, and a gmm2s estimator,
reghdfe will translate vce(robust) into wmatrix(robust)
vce(unadjusted). This maintains compatibility with ivreg2 and other
packages, but may be inadvisable as described in ivregress (technical
note). Specifying this option will instead use wmatrix(robust)
vce(robust).
However, computing the second-step vce matrix requires computing
updated estimates (including updated fixed effects). Since reghdfe
currently does not allow this, the resulting standard errors will not
be exactly the same as with ivregress. This issue is similar to
applying the CUE estimator, described further below.
Note: The above comments are also applicable to clustered standard
errors.
+-------------+
----+ IV/2SLS/GMM +-------------------------------------------------------
estimator(2sls|gmm2s|liml|cue) estimator used in the instrumental-variable
estimation
2sls (two-stage least squares, default), gmm2s (two-stage efficient
GMM), liml (limited-information maximum likelihood), and cue
("continuously-updated" GMM) are allowed.
Warning: cue will not give the same results as ivreg2. See the
discussion in Baum, Christopher F., Mark E. Schaffer, and Steven
Stillman. "Enhanced routines for instrumental variables/GMM estimation
and testing." Stata Journal 7.4 (2007): 465-506 (page 484). Note that
even if this is not exactly cue, it may still be a desirable/useful
alternative to standard cue, as explained in the article.
stages(list) adds and saves up to four auxiliary regressions useful when
running instrumental-variable regressions:
first all first-stage regressions
ols ols regression (between dependent variable and endogenous
variables; useful as a benchmark)
reduced reduced-form regression (ols regression with included and
excluded instruments as regressors)
acid an "acid" regression that includes both instruments and
endogenous variables as regressors; in this setup, excluded
instruments should not be significant.
You can pass suboptions not just to the iv command but to all stage
regressions with a comma after the list of stages. Example:
reghdfe price (weight=length), absorb(turn) subopt(nocollin)
stages(first, eform(exp(beta)) )
By default all stages are saved (see estimates dir). The suboption
,nosave will prevent that. However, future replays will only replay
the iv regression.
ffirst compute and report first stage statistics (details); requires the
ivreg2 package.
These statistics will be saved on the e(first) matrix. If the
first-stage estimates are also saved (with the stages() option), the
respective statistics will be copied to e(first_*) locals.
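A minimal sketch of ffirst, assuming the auto dataset, that the ivreg2
package is installed, and that headroom is used as an (illustrative, not
necessarily valid) instrument for length.

```stata
sysuse auto, clear

* IV regression reporting first-stage diagnostics (requires ivreg2)
reghdfe price weight (length = headroom), absorb(rep78) ffirst

* The first-stage statistics are saved in the e(first) matrix
matrix list e(first)
```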
ivsuite(subcmd) allows the IV/2SLS regression to be run either using
ivregress or ivreg2.
ivreg2 is the default, but needs to be installed for that option to
work.
+------------+
----+ Diagnostic +--------------------------------------------------------
verbose(#) orders the command to print debugging information.
Possible values are 0 (none), 1 (some information), 2 (even more), 3
(adds dots for each iteration, and reports parsing details), 4 (adds
details for every iteration step)
For debugging, the most useful value is 3. For simple status reports,
set verbose to 1.
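For a simple status report combined with timings, a call could look like
the sketch below (auto dataset assumed; the specification is illustrative).

```stata
sysuse auto, clear

* verbose(1) prints basic status information; timeit adds elapsed times
reghdfe price weight length, absorb(turn trunk) verbose(1) timeit
```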
timeit shows the elapsed time at different steps of the estimation. Most
time is usually spent on three steps: map_precompute(), map_solve()
and the regression step.
+--------------------------------+
----+ Degrees-of-Freedom Adjustments +------------------------------------
dofadjustments(doflist) selects how the degrees-of-freedom, as well as
e(df_a), are adjusted due to the absorbed fixed effects.
Without any adjustment, we would assume that the degrees-of-freedom
used by the fixed effects is equal to the count of all the fixed
effects (e.g. number of individuals + number of years in a typical
panel). However, in complex setups (e.g. fixed effects by individual,
firm, job position, and year), there may be a huge number of fixed
effects collinear with each other, so we want to adjust for that.
Note: changing the default option is rarely needed, except in
benchmarks, and to obtain a marginal speed-up by excluding the
pairwise option.
all is the default and almost always the best alternative. It is
equivalent to dof(pairwise clusters continuous)
none assumes no collinearity across the fixed effects (i.e. no
redundant fixed effects). This is overly conservative, although it is
the fastest method by virtue of not doing anything.
firstpair will exactly identify the number of collinear fixed effects
across the first two sets of fixed effects (i.e. the first absvar and
the second absvar). The algorithm used for this is described in Abowd
et al (1999), and relies on results from graph theory (finding the
number of connected sub-graphs in a bipartite graph). It will not do
anything for the third and subsequent sets of fixed effects.
For more than two sets of fixed effects, there are no known results
that provide exact degrees-of-freedom as in the case above. One
solution is to ignore subsequent fixed effects (and thus overestimate
e(df_a) and underestimate the degrees-of-freedom). Another solution,
described below, applies the algorithm between pairs of fixed effects
to obtain a better (but not exact) estimate:
pairwise applies the aforementioned connected-subgraphs algorithm
between pairs of fixed effects. For instance, if there are four sets
of FEs, the first dimension will usually have no redundant
coefficients (i.e. e(M1)==1), since we are running the model without a
constant. For the second FE, the number of connected subgraphs with
respect to the first FE will provide an exact estimate of the
degrees-of-freedom lost, e(M2).
For the third FE, we do not know exactly. However, we can compute the
number of connected subgraphs between the first and third G(1,3), and
second and third G(2,3) fixed effects, and choose the higher of those
as the closest estimate for e(M3). For the fourth FE, we compute
G(1,4), G(2,4) and G(3,4) and again choose the highest for e(M4).
Finally, we compute e(df_a) = e(K1) - e(M1) + e(K2) - e(M2) + e(K3) -
e(M3) + e(K4) - e(M4); where e(K#) is the number of levels or
dimensions for the #-th fixed effect (e.g. number of individuals or
years). Note that e(M3) and e(M4) are only conservative estimates and
thus we will usually be overestimating the standard errors. However,
given the sizes of the datasets typically used with reghdfe, the
difference should be small.
Since the gain from pairwise is usually minuscule for large datasets,
and the computation is expensive, it may be a good practice to exclude
this option for speedups.
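The e(df_a) formula above can be checked by hand after a regression. The
sketch below assumes the nlswork dataset, three illustrative absvars, and
no clustering, and relies on the e(N_hdfe), e(K#), e(M#) and e(df_a)
results listed under Stored results.

```stata
webuse nlswork, clear
reghdfe ln_wage age tenure, absorb(idcode year occ_code)

* Sum e(K#) - e(M#) over all absorbed fixed effects
local dfa = 0
forvalues g = 1/`e(N_hdfe)' {
    display "FE `g': K = " e(K`g') "  M = " e(M`g')
    local dfa = `dfa' + e(K`g') - e(M`g')
}

* Per the formula above, the two numbers are expected to agree
display "sum of K# - M# = `dfa'"
display "e(df_a)        = " e(df_a)
```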
clusters will check if a fixed effect is nested within a clustervar.
In that case, it will set e(K#)==e(M#) and no degrees-of-freedom will
be lost due to this fixed effect. The rationale is that we are
already assuming that the number of effective observations is the
number of cluster levels. This is the same adjustment that xtreg, fe
does, but areg does not use it.
continuous Fixed effects with continuous interactions (i.e. individual
slopes, instead of individual intercepts) are dealt with differently.
In an i.categorical#c.continuous interaction, we will do one check: we
count the number of categories where c.continuous is always zero. In
an i.categorical##c.continuous interaction, we do the above check but
replace zero for any particular constant. In the case where
continuous is constant for a level of categorical, we know it is
collinear with the intercept, so we adjust for it.
Additional methods, such as bootstrap, are also possible but not yet
implemented. Some preliminary simulations done by the author showed a
very poor convergence of this method.
groupvar(newvar) name of the new variable that will contain the first
mobility group. Requires pairwise, firstpair, or the default all.
+------------------------+
----+ Speeding Up Estimation +--------------------------------------------
reghdfe varlist [if] [in], absorb(absvars) cache(save) [options]
This will transform varlist, absorbing the fixed effects indicated by
absvars. It is useful when running a series of alternative
specifications with common variables, as the variables will only be
transformed once instead of every time a regression is run.
It replaces the current dataset, so it is a good idea to precede it
with a preserve command
To keep additional (untransformed) variables in the new dataset, use
the keep(varlist) suboption.
cache(use) is used when running reghdfe after a cache(save) operation.
Both the absorb() and vce() options must be the same as when the cache
was created (the latter because the degrees of freedom were computed
at that point).
cache(clear) will delete the Mata objects created by reghdfe and kept in
memory after the cache(save) operation. These objects may consume a
lot of memory, so it is a good idea to clean up the cache.
Additionally, if you previously specified preserve, it may be a good
time to restore.
Example:
. sysuse auto
. preserve
.
. * Save the cache
. reghdfe price weight length, a(turn rep) vce(cluster turn) cache(save,
keep(foreign))
.
. * Run regressions (absorb() and vce() must match the cache(save) call)
. reghdfe price weight, a(turn rep) vce(cluster turn) cache(use)
. reghdfe price length, a(turn rep) vce(cluster turn) cache(use)
.
. * Clean up
. reghdfe, cache(clear)
. restore
fast avoids saving e(sample) into the regression. Since saving the
variable only involves copying a Mata vector, the speedup is currently
quite small. Future versions of reghdfe may change this as features
are added.
Note that fast will be disabled when adding variables to the dataset
(i.e. when saving residuals, fixed effects, or mobility groups), and
is incompatible with most postestimation commands.
If you wish to use fast while reporting estat summarize, see the
summarize option.
+--------------+
----+ Optimization +------------------------------------------------------
tolerance(#) specifies the tolerance criterion for convergence; default is
tolerance(1e-8)
Note that for tolerances beyond 1e-14, the limits of the double
precision are reached and the results will most likely not converge.
At the other end, if the tolerance is not tight enough, the regression
may not identify
perfectly collinear regressors. However, those cases can be easily
spotted due to their extremely high standard errors.
Warning: when absorbing heterogeneous slopes without the accompanying
heterogeneous intercepts, convergence is quite poor and a tight
tolerance is strongly suggested (i.e. a smaller value than the default). In
other words, an absvar of var1##c.var2 converges easily, but an absvar
of var1#c.var2 will converge slowly and may require a tighter
tolerance.
maxiterations(#) specifies the maximum number of iterations; the default
is maxiterations(10000); set it to missing (.) to run forever until
convergence.
poolsize(#) Number of variables that are pooled together into a matrix
that will then be transformed. The default is to pool variables in
groups of 10. Larger groups are faster with more than one processor,
but may cause out-of-memory errors. In that case, set poolsize to 1.
Advanced options:
acceleration(str) allows for different acceleration techniques, from the
simplest case of no acceleration (none), to steepest descent
(steep_descent or sd), Aitken (aitken), and finally Conjugate Gradient
(conjugate_gradient or cg).
Note: Each acceleration is just a plug-in Mata function, so a larger
number of acceleration techniques are available, albeit undocumented
(and slower).
transform(str) allows for different "alternating projection" transforms.
The classical transform is Kaczmarz (kaczmarz), and more stable
alternatives are Cimmino (cimmino) and Symmetric Kaczmarz
(symmetric_kaczmarz)
Note: Each transform is just a plug-in Mata function, so a larger
number of transforms are available, albeit undocumented
(and slower).
Note: The default acceleration is Conjugate Gradient and the default
transform is Symmetric Kaczmarz. Be wary that different accelerations
often work better with certain transforms. For instance, do not use
conjugate gradient with plain Kaczmarz, as it will not converge.
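A sketch of how the optimization options can be combined, assuming the
nlswork dataset. The specific values are illustrative, not
recommendations; the abbreviations sd and kac are the ones listed in the
options table.

```stata
webuse nlswork, clear

* Tighter tolerance, a higher iteration cap, and a non-default
* acceleration/transform pair (steepest descent with plain Kaczmarz)
reghdfe ln_wage age tenure, absorb(idcode year) ///
    tolerance(1e-10) maxiterations(50000)       ///
    acceleration(sd) transform(kac)
```

Steepest descent is paired with Kaczmarz here because, as noted above,
conjugate gradient should not be used with the plain Kaczmarz transform.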
precondition (currently disabled)
+-----------+
----+ Reporting +---------------------------------------------------------
level(#) sets confidence level; default is level(95)
display_options: noomitted, vsquish, noemptycells, baselevels,
allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(%fmt),
pformat(%fmt), sformat(%fmt), and nolstretch; see [R] estimation
options.
Postestimation Syntax
Only estat summarize, predict and test are currently supported and tested.
estat summarize
Summarizes depvar and the variables described in _b (i.e. not the
excluded instruments)
predict newvar [if] [in] [, statistic]
Requires all sets of fixed effects to be previously saved by reghdfe
(except for option xb)
Equation: y = xb + d_absorbvars + e
statistic        Description
--------------------------------------------------------------------------
Main
  xb             xb fitted values; the default
  xbd            xb + d_absorbvars
  d              d_absorbvars
  residuals      residual
  score          score; equivalent to residuals
--------------------------------------------------------------------------
test
Performs significance tests on the parameters; see the Stata help
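The predict statistics above can be sketched as follows, assuming the auto
dataset and the savefe suboption so that all fixed effects are available.
The new variable names are illustrative only.

```stata
sysuse auto, clear

* Save all fixed effects so that predict can use them
reghdfe price weight length, absorb(turn trunk, savefe)

predict double xb_hat, xb          // xb only
predict double d_hat, d            // sum of the absorbed fixed effects
predict double xbd_hat, xbd        // xb + d_absorbvars
predict double e_hat, residuals    // residual

* Per the equation y = xb + d_absorbvars + e, this gap should be close
* to zero on the estimation sample
generate double gap = price - xbd_hat - e_hat
summarize gap
```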
suest
If you want to perform tests that are usually run with suest, such as
non-nested models, tests using alternative specifications of the
variables, or tests on different groups, you can replicate it manually, as
described here.
Note: do not use suest. It will run, but the results will be incorrect.
Possible Pitfalls and Common Mistakes
1. (note: as of version 2.1, the constant is no longer reported) Ignore
the constant; it doesn't tell you much. If you want to use descriptive
stats, that's what the summarize() and estat summ commands are for.
Even better, use noconstant to drop it (although it's not really
dropped as it never existed in the first place!)
2. Think twice before saving the fixed effects. They are probably
inconsistent / not identified and you will likely be using them wrong.
3. (note: as of version 3.0 singletons are dropped by default) It's good
practice to drop singletons. dropsingleton is your friend.
4. If you use vce(robust), be sure that your other dimension is not
"fixed" but grows with N, or your SEs will be wrong.
5. If you use vce(cluster ...), check that your number of clusters is
high enough (50+ is a rule of thumb). If not, you are making the SEs
even worse!
6. The panel variables (absvars) should probably be nested within the
clusters (clustervars) due to the within-panel correlation induced by
the FEs. (this is not the case for *all* the absvars, only those that
are treated as growing as N grows)
7. If you use analytic or probability weights, you are responsible for
ensuring that the weights stay constant within each unit of a fixed
effect (e.g. individual), or that it is correct to allow
varying-weights for that case.
8. Be aware that adding several HDFEs is not a panacea. The first
limitation is that it only uses within variation (more than acceptable
if you have a large enough dataset). The second and subtler
limitation occurs if the fixed effects are themselves outcomes of the
variable of interest (as crazy as it sounds). For instance, imagine a
regression where we study the effect of past corporate fraud on future
firm performance. We add firm, CEO and time fixed-effects (standard
practice). This introduces a serious flaw: whenever a fraud event is
discovered, i) future firm performance will suffer, and ii) a CEO
turnover will likely occur. Moreover, after fraud events, the new
CEOs are usually specialized in dealing with the aftershocks of such
events (and are usually accountants or lawyers). The fixed effects of
these CEOs will also tend to be quite low, as they tend to manage
firms with very risky outcomes. Therefore, the regressor (fraud)
affects the fixed effect (identity of the incoming CEO). Adding
particularly low CEO fixed effects will then overstate the performance
of the firm, and thus understate the negative effects of fraud on
future firm performance.
Missing Features
(If you are interested in discussing these or others, feel free to contact
me)
Code, medium term:
- Complete GT preconditioning (v4)
- Improve algorithm that recovers the fixed effects (v5)
- Improve statistics and tests related to the fixed effects (v5)
- Implement a -bootstrap- option in DoF estimation (v5)
Code, long term:
- The interaction with cont vars (i.a#c.b) may suffer from numerical
accuracy issues, as we are dividing by a sum of squares
- Calculate exact DoF adjustment for 3+ HDFEs (note: not a problem with
cluster VCE when one FE is nested within the cluster)
- More postestimation commands (lincom? margins?)
Theory:
- Add a more thorough discussion on the possible identification issues
- Find out a way to use reghdfe iteratively with CUE (right now only
OLS/2SLS/GMM2S/LIML give the exact same results)
- Not sure if I should add an F-test for the absvars in the vce(robust)
and vce(cluster) cases. Discussion on e.g. -areg- (methods and
formulas) and textbooks suggests not; on the other hand, there may be
alternatives: A Heteroskedasticity-Robust F-Test Statistic for
Individual Effects
Examples
--------------------------------------------------------------------------------
Setup
. sysuse auto
Simple case - one fixed effect
. reghdfe price weight length, absorb(rep78)
--------------------------------------------------------------------------------
As above, but also compute clustered standard errors
. reghdfe price weight length, absorb(rep78) vce(cluster rep78)
--------------------------------------------------------------------------------
Two and three sets of fixed effects
. webuse nlswork
. reghdfe ln_w grade age ttl_exp tenure not_smsa south , absorb(idcode
year)
. reghdfe ln_w grade age ttl_exp tenure not_smsa south , absorb(idcode
year occ)
--------------------------------------------------------------------------------
Advanced examples
Save the FEs as variables
. reghdfe ln_w grade age ttl_exp tenure not_smsa south ,
absorb(FE1=idcode FE2=year)
Report nested F-tests
. reghdfe ln_w grade age ttl_exp tenure not_smsa south , absorb(idcode
year) nested
Do AvgE instead of absorb() for one FE
. reghdfe ln_w grade age ttl_exp tenure not_smsa south , absorb(idcode
year) avge(occ)
. reghdfe ln_w grade age ttl_exp tenure not_smsa south , absorb(idcode
year) avge(AvgByOCC=occ)
Check that FE coefs are close to 1.0
. reghdfe ln_w grade age ttl_exp tenure not_smsa , absorb(idcode year)
check
Save first mobility group
. reghdfe ln_w grade age ttl_exp tenure not_smsa , absorb(idcode occ)
groupvar(mobility_occ)
Factor interactions in the independent variables
. reghdfe ln_w i.grade#i.age ttl_exp tenure not_smsa , absorb(idcode
occ)
Interactions in the absorbed variables (notice that only the # symbol is
allowed)
. reghdfe ln_w grade age ttl_exp tenure not_smsa , absorb(idcode#occ)
Interactions in both the absorbed and AvgE variables (again, only the #
symbol is allowed)
. reghdfe ln_w grade age ttl_exp not_smsa , absorb(idcode#occ)
avge(tenure#occ)
IV regression
. sysuse auto
. reghdfe price weight (length=head), absorb(rep78)
. reghdfe price weight (length=head), absorb(rep78) stages(first)
. reghdfe price weight (length=head), absorb(rep78) ivsuite(ivregress)
Factorial interactions
. reghdfe price weight (length=head), absorb(rep78)
. reghdfe price weight length, absorb(rep78 turn##c.price)
Stored results
reghdfe stores the following in e():
Note: it also keeps most e() results placed by the regression subcommands
(ivreg2, ivregress)
Scalars
e(N) number of observations
e(N_hdfe) number of absorbed fixed-effects
e(tss) total sum of squares
e(rss) residual sum of squares
e(r2) R-squared
e(r2_a) adjusted R-squared
e(r2_within) Within R-squared
e(r2_a_within) Adjusted Within R-squared
e(df_a) degrees of freedom lost due to the fixed effects
e(rmse) root mean squared error
e(ll) log-likelihood
e(ll_0) log-likelihood of fixed-effect-only regression
e(F) F statistic
e(F_absorb) F statistic for absorbed effect (note: currently
disabled)
e(rank) rank of e(V)
e(N_clustervars) number of cluster variables
e(clust#) number of clusters for the #th cluster variable
e(N_clust) number of clusters; minimum of e(clust#)
e(K#) Number of categories of the #th absorbed FE
e(M#) Number of redundant categories of the #th
absorbed FE
e(mobility) Sum of all e(M#)
e(df_m) model degrees of freedom
e(df_r) residual degrees of freedom
Macros
  e(cmd)            reghdfe
  e(subcmd)         either regress, ivreg2 or ivregress
  e(model)          ols, iv, gmm2s, liml or cue
  e(cmdline)        command as typed
  e(dofmethod)      dofmethod employed in the regression
  e(depvar)         name of dependent variable
  e(indepvars)      names of independent variables
  e(endogvars)      names of endogenous right-hand-side variables
  e(instruments)    names of excluded instruments
  e(absvars)        names of the absorbed variables or interactions
  e(title)          title in estimation output
  e(clustvar)       name of cluster variable
  e(clustvar#)      name of the #th cluster variable
  e(vce)            vcetype specified in vce()
  e(vcetype)        title used to label Std. Err.
  e(stage)          stage within an IV-regression; only if stages() was
                    used
  e(properties)     b V
Matrices
  e(b)              coefficient vector
  e(V)              variance-covariance matrix of the estimators
Functions
  e(sample)         marks estimation sample
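The e(M#) and e(mobility) entries count fixed-effect categories that are redundant. For the special case of exactly two sets of fixed effects, one well-known way to obtain such a count, described in the Abowd, Creecy and Kramarz paper listed under Additional References, is to count the connected groups (often called mobility groups) formed by the observed pairs of identifiers. The Python sketch below illustrates that idea only. It is not reghdfe's own code, and the function name count_mobility_groups and the toy worker/firm pairs are hypothetical.

```python
def count_mobility_groups(pairs):
    """Count connected groups in the graph linking two sets of identifiers.

    pairs: iterable of (first_id, second_id) tuples, one per observation.
    """
    parent = {}

    def find(node):
        # Walk up to the root of the node's group, shortening the path as we go.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for first_id, second_id in pairs:
        # Tag each identifier with its side so equal labels stay distinct.
        root_a = find(("fe1", first_id))
        root_b = find(("fe2", second_id))
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two groups

    return len({find(node) for node in list(parent)})


# Hypothetical data: workers 1 and 2 share firm A, worker 2 also appears in
# firm B, and worker 3 is only ever observed in firm C.
pairs = [(1, "A"), (2, "A"), (2, "B"), (3, "C")]
print(count_mobility_groups(pairs))  # prints 2
```

In this toy example the two groups are {1, 2, A, B} and {3, C}, so one level is redundant in each group.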
Author
Sergio Correia
Fuqua School of Business, Duke University
Email: sergio.correia@duke.edu
User Guide
A copy of this help file, as well as a more in-depth user guide, is in
development and will be available at http://scorreia.com/reghdfe.
Latest Updates
reghdfe is updated frequently, and upgrades or minor bug fixes may not be
immediately available in SSC. To check or contribute to the latest
version of reghdfe, explore the GitHub repository. Bugs or missing
features can be discussed through email or at the GitHub issue tracker.
To see your current version and installed dependencies, type reghdfe,
version
Acknowledgements
This package wouldn't have existed without the invaluable feedback and
contributions of Paulo Guimaraes, Amine Ouazad, Mark Schaffer and Kit
Baum. Also invaluable are the great bug-spotting abilities of many users.
In addition, reghdfe is built upon important contributions from the Stata
community:
reg2hdfe, from Paulo Guimaraes, and a2reg, from Amine Ouazad, were the
inspiration and building blocks on which reghdfe was built.
ivreg2, by Christopher F Baum, Mark E Schaffer and Steven Stillman, is the
package used by default for instrumental-variable regression.
avar, by Christopher F Baum and Mark E Schaffer, is the package used for
estimating the HAC-robust standard errors of OLS regressions.
tuples, by Joseph Luchman and Nicholas Cox, is used when computing
standard errors with multi-way clustering (two or more clustering
variables).
References
The algorithm underlying reghdfe is a generalization of the works by:
Paulo Guimaraes and Pedro Portugal. "A Simple Feasible Alternative
Procedure to Estimate Models with High-Dimensional Fixed Effects".
Stata Journal, 10(4), 628-649, 2010.
Simen Gaure. "OLS with Multiple High Dimensional Category Dummies".
Memorandum 14/2010, Oslo University, Department of Economics, 2010.
It addresses many of the limitations of previous works, such as possible lack
of convergence, arbitrarily slow convergence times, and being limited to only
two or three sets of fixed effects (for the first paper). The paper
explaining the specifics of the algorithm is a work in progress and is
available upon request.
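As a rough illustration of the kind of iterative procedure these works build on, in the spirit of the alternating-projection approach in Gaure's memorandum, the following Python sketch removes two sets of fixed effects from a variable by alternately subtracting group means until the values stop changing. It is a simplified sketch and not reghdfe's algorithm. The function names demean_by and absorb_two_way are hypothetical, the stopping rule is an assumption, and the default arguments merely mirror the tolerance() and maxiterations() defaults documented in this help file.

```python
def demean_by(values, groups):
    """Subtract from each value the mean of its group."""
    totals, counts = {}, {}
    for value, group in zip(values, groups):
        totals[group] = totals.get(group, 0.0) + value
        counts[group] = counts.get(group, 0) + 1
    return [value - totals[group] / counts[group]
            for value, group in zip(values, groups)]


def absorb_two_way(values, fe1, fe2, tolerance=1e-8, max_iterations=10000):
    """Alternately demean by fe1 and fe2 until updates fall below tolerance.

    The stopping rule (largest absolute change in any observation) is an
    assumption made for this sketch.
    """
    current = list(values)
    for _ in range(max_iterations):
        updated = demean_by(demean_by(current, fe1), fe2)
        change = max(abs(u - c) for u, c in zip(updated, current))
        current = updated
        if change < tolerance:
            return current
    raise RuntimeError("no convergence within max_iterations")


# Hypothetical data: each value is exactly a worker effect plus a year effect,
# so absorbing both sets of fixed effects should leave residuals of
# (numerically) zero.
worker = [1, 1, 2, 2]
year = [1, 2, 1, 2]
wage = [11.0, 21.0, 12.0, 22.0]
print(absorb_two_way(wage, worker, year))
```

With balanced data such as this toy panel a single pass is enough; unbalanced data generally needs several passes, which is where convergence speed becomes a concern.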
If you use this program in your research, please cite either the RePEc entry or
the aforementioned papers.
Additional References
For details on the Aitken acceleration technique employed, please see "method 3"
as described by:
Macleod, Allan J. "Acceleration of vector sequences by multi-dimensional
Delta-2 methods." Communications in Applied Numerical Methods 2.4
(1986): 385-392.
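For orientation only, the classic scalar form of Aitken's delta-squared extrapolation can be sketched in Python as below. This is the textbook scalar formula, not Macleod's multi-dimensional "method 3", and the function name aitken_delta2 is hypothetical.

```python
def aitken_delta2(x0, x1, x2):
    """Extrapolate the limit of a sequence from three consecutive iterates."""
    # Second difference of the three iterates.
    denominator = (x2 - x1) - (x1 - x0)
    if denominator == 0:
        # The iterates have stalled or move in equal steps; no extrapolation.
        return x2
    return x0 - (x1 - x0) ** 2 / denominator


# Hypothetical iterates 1 + 0.5**n for n = 0, 1, 2, which converge to 1.
print(aitken_delta2(2.0, 1.5, 1.25))  # prints 1.0
```

For a sequence whose error shrinks by a constant factor at every step, as in this example, the extrapolation recovers the limit exactly from three iterates.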
For the rationale behind interacting fixed effects with continuous variables,
see:
Duflo, Esther. "The medium run effects of educational expansion: Evidence
from a large school construction program in Indonesia." Journal of
Development Economics 74.1 (2004): 163-197.
Also see:
Abowd, J. M., R. H. Creecy, and F. Kramarz. "Computing person and firm
effects using linked longitudinal employer-employee data." Census
Bureau Technical Paper TP-2002-06 (2002).
Cameron, A. Colin, Jonah B. Gelbach, and Douglas L. Miller. "Robust
Inference With Multiway Clustering." Journal of Business & Economic
Statistics 29.2 (2011): 238-249.
Gormley, T., and D. Matsa. "Common errors: How to (and not to) control
for unobserved heterogeneity." The Review of Financial Studies 27.2
(2014): 617-661.
Mittag, N. "New methods to estimate models with large sets of fixed
effects with an application to matched employer-employee data from
Germany." FDZ-Methodenreport 02/2012 (2012).