Department of Applied Economics
University of Minnesota
So far, we have learned:

- the `data.table` package
- the `ggplot2` package

With the tools we have learned so far, you can do a lot of descriptive data analysis tasks!
Once you have a good understanding of the data, you can move on to the next step: econometric analysis!
Today's goals are to:

- learn how to use the `lm` function to estimate a regression model
- report the results in publication-ready summary tables with the `modelsummary` package [see here for the package documentation, important!]

Today's lecture is an introduction to basic regression analysis with R.
More advanced R functions, such as `feols()` from the `fixest` package for fixed-effects models and `glm()` for generalized linear models, will be covered in the Econometrics classes (APEC 8211-8214).
We will use the `CPS1988` dataset from the `AER` package. It is a cross-section dataset originating from the March 1988 Current Population Survey by the US Census Bureau. For further information, see `?CPS1988` after loading the package.
Run the following code:
The most basic function to estimate a linear regression model in R is the `lm` function from the `stats` package, which is a built-in R package.
Suppose we have a regression model of the form: \[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + e\]
With the `lm` function, we can estimate the above model as follows:

- In the `lm` function, you specify the formula of the regression model.
- `~` splits the left-hand side and the right-hand side of the formula.

Let's estimate the following model with the `CPS1988` data:
\[wage = \beta_0 + \beta_1 education + \beta_2 experience + e\]
To see a summary of the regression results, use the `summary` function.
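As a minimal, runnable sketch of this workflow (using the built-in `mtcars` dataset instead of `CPS1988`, so no extra package is needed; the variable choices are just illustrative):

```r
# Estimate a linear model with two regressors on the built-in mtcars data
reg <- lm(mpg ~ wt + hp, data = mtcars)

# Full regression summary: coefficients, standard errors, t-stats, R-squared
summary(reg)
```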
The results from `lm()` and `summary()` contain a lot of information (in RStudio, you can inspect them in the Environment pane).
You can access any information stored in an object via the `$` operator.
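For instance, assuming a model estimated on the built-in `mtcars` data, the stored pieces can be pulled out with `$`:

```r
reg <- lm(mpg ~ wt + hp, data = mtcars)

reg$coefficients          # named vector of estimated coefficients
reg$fitted.values[1:3]    # first few fitted values

reg_summary <- summary(reg)
reg_summary$r.squared     # R-squared
reg_summary$coefficients  # matrix: Estimate, Std. Error, t value, Pr(>|t|)
```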
Comparing the `lm()` object vs the `summary(lm())` object:

| Category | In `lm()` object (`reg`) | In `summary(lm())` object (`reg_summary`) |
|---|---|---|
| Coefficients | `coefficients`: estimated β-hats | `coefficients`: table of β-hat, Std. Error, t, p |
| Residuals | `residuals`: raw residuals | `residuals`: residuals (trimmed for summary) |
| Fitted Values | `fitted.values`: predicted ŷ | – |
| Model Info | `call`, `terms`, `model`, `assign`, `xlevels` | `call`, `terms`, plus degrees-of-freedom info |
| Diagnostics | – | `r.squared`, `adj.r.squared`, `sigma`, `fstatistic` |
| Variance–Covariance | `qr`, `effects`, `rank`, `df.residual` | `cov.unscaled`, `aliased` |
| Degrees of Freedom | `df.residual` | `df`: regression, residual, total |
Useful formula syntax for the `lm()` function:

- `*` = main effects + interactions.
- `:` = interaction only.
- To include transformed variables (square, product, etc.) in the `lm()` function, wrap the arithmetic in the `I()` function.
- To include log terms, use the `log()` function directly in the formula.
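A quick sketch of these formula operators on the built-in `mtcars` data (the variable choices are illustrative):

```r
# '*' expands to main effects plus their interaction
m1 <- lm(mpg ~ wt * hp, data = mtcars)   # wt + hp + wt:hp

# ':' includes the interaction term only
m2 <- lm(mpg ~ wt:hp, data = mtcars)

# I() protects arithmetic inside a formula; log() works directly
m3 <- lm(log(mpg) ~ hp + I(hp^2), data = mtcars)
```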
Example:
To estimate: \[\log(wage) = \beta_0 + \beta_1 education + \beta_2 experience + \beta_3 experience^2 + e\]
What if we want to include a categorical variable (e.g., `region`, `parttime`, `ethnicity`) in the regression model?

- The `lm()` function is smart enough to convert a categorical variable into dummy variables without any additional coding.
- Even if the variable is stored as character, the `lm()` function automatically coerces it into a factor variable.

What if we want to include a dummy variable that takes 1 if `parttime` is yes, otherwise 0?
The model is as follows: \[ log(wage) = \beta_0 + \beta_1 education + \beta_2 experience + \beta_3 experience^2 + \beta_4 d_{parttime, yes} + e \]
What if we want to include dummy variables for each `region`?
By default, R picks the first group in alphabetical order as the base group.
You can use the `relevel()` function (a built-in R function) to set the base group of a categorical variable.
Syntax:
Example:
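A small sketch with the built-in `mtcars` data (the factor and level choices are just illustrative):

```r
# Make cyl a factor; by default the first level in alphabetical/numeric
# order ("4") is the base group
mt <- mtcars
mt$cyl <- factor(mt$cyl)

# Set "8" as the base group instead
mt$cyl <- relevel(mt$cyl, ref = "8")

# Dummies are now created for "4" and "6", with "8" as the omitted base
m <- lm(mpg ~ cyl, data = mt)
```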
Let's compare the two regression results:

- `parttime == "no"` as the base group
- `parttime == "yes"` as the base group

To do prediction with the estimated model on a new dataset, you can use the `predict` function (a built-in R function).
Syntax
Example
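A minimal sketch with the built-in `mtcars` data (the new data frame here is made up for illustration):

```r
reg <- lm(mpg ~ wt + hp, data = mtcars)

# The new data must contain the same regressor names used in the model
new_cars <- data.frame(wt = c(2.5, 3.0), hp = c(100, 150))

predict(reg, newdata = new_cars)  # predicted mpg for the two new rows
```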
You should at least know these key points:

- how to run a regression with the `lm()` function and view the results with the `summary()` function
- what information is stored in the objects returned by the `lm()` and `summary()` functions
- how to include interaction and transformed terms in the `lm()` function
- how to include categorical variables in the `lm()` function, and how to set the base group

That's it!
modelsummary package

The `modelsummary` package lets you create nice summary tables to report descriptive statistics of the data and regression results.
We focus on two functions in the `modelsummary` package:

- `datasummary()`: creates a summary table of descriptive statistics of the data.
- `modelsummary()`: creates a summary table of regression results.

Check the documentation for more details.
| | Mean | SD | Min | Max |
|---|---|---|---|---|
| Wage | 603.73 | 453.55 | 50.05 | 18777.20 |
| Education | 13.07 | 2.90 | 0.00 | 18.00 |
| Experience | 18.20 | 13.08 | -4.00 | 63.00 |
| | OLS 1 | OLS 2 | OLS 3 |
|---|---|---|---|
| Education | 0.076*** | 0.087*** | 0.086*** |
| | (0.001) | (0.001) | (0.001) |
| Experience | | 0.078*** | 0.077*** |
| | | (0.001) | (0.001) |
| Experience squared | | -0.001*** | -0.001*** |
| | | (0.000) | (0.000) |
| White | | | -0.243*** |
| | | | (0.013) |
| Num.Obs. | 28155 | 28155 | 28155 |
| R2 | 0.095 | 0.326 | 0.335 |
| R2 Adj. | 0.095 | 0.326 | 0.335 |
| * p < 0.05, ** p < 0.01, *** p < 0.001 | | | |
| Std. Errors in parentheses | | | |
The most basic argument of the `modelsummary()` function is `models`, which takes a list of the regression models you want to report in the table.
Example
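A sketch of such a call (this assumes the `modelsummary` package is installed; the stand-in models below are estimated on the built-in `mtcars` data, not the `CPS1988` models from the lecture):

```r
library(modelsummary)

# Hypothetical models standing in for the lecture's regressions
reg1 <- lm(mpg ~ wt, data = mtcars)
reg2 <- lm(mpg ~ wt + I(wt^2), data = mtcars)

# The models argument takes a list of fitted models, one column per model
modelsummary(models = list(reg1, reg2))
```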
| | (1) | (2) |
|---|---|---|
| (Intercept) | 5.178 | 4.278 |
| | (0.019) | (0.019) |
| education | 0.076 | 0.087 |
| | (0.001) | (0.001) |
| experience | | 0.078 |
| | | (0.001) |
| I(experience^2) | | -0.001 |
| | | (0.000) |
| Num.Obs. | 28155 | 28155 |
| R2 | 0.095 | 0.326 |
| R2 Adj. | 0.095 | 0.326 |
| AIC | 405753.0 | 397432.7 |
| BIC | 405777.7 | 397473.9 |
| Log.Lik. | -29139.853 | -24977.715 |
| F | 2941.787 | 4545.929 |
| RMSE | 0.68 | 0.59 |
The default table is okay. But you can customize the appearance of the table. Here, I listed the bare minimum of options you might want to know (There are lots of other options!).
- `models`: change the names of the models
- `coef_map`: reorder coefficient rows and change their labels
- `stars`: change the significance stars
- `vcov`: replace the standard errors with robust ones (we will see this later)
- `gof_map`: define which model statistics to display
- `gof_omit`: define which model statistics to omit from the default selection of model statistics
- `notes`: add notes at the bottom of the table
- `fmt`: change the format of the numbers

Note
+ See `?modelsummary` for more details or see this.
+ Also check the vignette of the function from here.
By naming the models when you make a list of regression models, you can change the name of the models in the table.
Example
| | OLS 1 | OLS 2 |
|---|---|---|
| (Intercept) | 5.178 | 4.278 |
| | (0.019) | (0.019) |
| education | 0.076 | 0.087 |
| | (0.001) | (0.001) |
| experience | | 0.078 |
| | | (0.001) |
| I(experience^2) | | -0.001 |
| | | (0.000) |
| Num.Obs. | 28155 | 28155 |
| R2 | 0.095 | 0.326 |
| R2 Adj. | 0.095 | 0.326 |
| AIC | 405753.0 | 397432.7 |
| BIC | 405777.7 | 397473.9 |
| Log.Lik. | -29139.853 | -24977.715 |
| F | 2941.787 | 4545.929 |
| RMSE | 0.68 | 0.59 |
The `coef_map` argument helps you clean up your regression table.

Example

In this example, I renamed the variables and moved the intercept row to the bottom.
| | OLS 1 | OLS 2 |
|---|---|---|
| Education | 0.076 | 0.087 |
| | (0.001) | (0.001) |
| Experience | | 0.078 |
| | | (0.001) |
| Experience squared | | -0.001 |
| | | (0.000) |
| Intercept | 5.178 | 4.278 |
| | (0.019) | (0.019) |
| Num.Obs. | 28155 | 28155 |
| R2 | 0.095 | 0.326 |
| R2 Adj. | 0.095 | 0.326 |
| AIC | 405753.0 | 397432.7 |
| BIC | 405777.7 | 397473.9 |
| Log.Lik. | -29139.853 | -24977.715 |
| F | 2941.787 | 4545.929 |
| RMSE | 0.68 | 0.59 |
Setting `stars = TRUE` shows significance stars in the table (try it!).
If you don't like the defaults, you can modify the significance levels and markers by specifying a named numeric vector (e.g., `stars = c("*" = .05, "**" = .01, "***" = .001)`).
Example
| | OLS 1 | OLS 2 |
|---|---|---|
| Education | 0.076*** | 0.087*** |
| | (0.001) | (0.001) |
| Experience | | 0.078*** |
| | | (0.001) |
| Experience squared | | -0.001*** |
| | | (0.000) |
| Intercept | 5.178*** | 4.278*** |
| | (0.019) | (0.019) |
| Num.Obs. | 28155 | 28155 |
| R2 | 0.095 | 0.326 |
| R2 Adj. | 0.095 | 0.326 |
| AIC | 405753.0 | 397432.7 |
| BIC | 405777.7 | 397473.9 |
| Log.Lik. | -29139.853 | -24977.715 |
| F | 2941.787 | 4545.929 |
| RMSE | 0.68 | 0.59 |
| * p < 0.05, ** p < 0.01, *** p < 0.001 | | |
The `vcov` argument lets you replace the non-robust standard errors (the default) with robust ones. Here are some options for using the `vcov` argument (see this for more options).
Option 1: Supply a list of named variance-covariance matrices:
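A sketch under the assumption that the `sandwich` and `modelsummary` packages are installed (the model below is an illustrative stand-in estimated on the built-in `mtcars` data):

```r
library(sandwich)      # robust variance-covariance estimators
library(modelsummary)

reg1 <- lm(mpg ~ wt, data = mtcars)  # hypothetical stand-in model

# Heteroskedasticity-robust (HC1) variance-covariance matrix
vc1 <- vcovHC(reg1, type = "HC1")

# Supply one matrix per model; the table's SEs are rebuilt from them
modelsummary(models = list("OLS 1" = reg1), vcov = list(vc1))
```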
Note
By default, `modelsummary()` calculates the robust variance-covariance matrix using the `sandwich` package (`sandwich::vcovHC`, `sandwich::vcovCL`).
First, let's get the heteroskedasticity-robust variance-covariance matrices for the regression models.
Before
| | OLS 1 | OLS 2 |
|---|---|---|
| (Intercept) | 6.009*** | 6.274*** |
| | (0.013) | (0.004) |
| parttimeyes | -1.152*** | -1.157*** |
| | (0.013) | (0.013) |
| ethnicitycauc | 0.287*** | |
| | (0.014) | |
| Num.Obs. | 28155 | 28155 |
| R2 | 0.225 | 0.213 |
| R2 Adj. | 0.225 | 0.213 |
| AIC | 401377.8 | 401799.2 |
| BIC | 401410.8 | 401823.9 |
| Log.Lik. | -26951.287 | -27162.944 |
| F | 4085.846 | 7629.916 |
| RMSE | 0.63 | 0.63 |
| * p < 0.05, ** p < 0.01, *** p < 0.001 | | |
VCOV swapped
| | OLS 1 | OLS 2 |
|---|---|---|
| (Intercept) | 6.009*** | 6.274*** |
| | (0.013) | (0.004) |
| parttimeyes | -1.152*** | -1.157*** |
| | (0.015) | (0.015) |
| ethnicitycauc | 0.287*** | |
| | (0.013) | |
| Num.Obs. | 28155 | 28155 |
| R2 | 0.225 | 0.213 |
| R2 Adj. | 0.225 | 0.213 |
| AIC | 401377.8 | 401799.2 |
| BIC | 401410.8 | 401823.9 |
| Log.Lik. | -26951.287 | -27162.944 |
| RMSE | 0.63 | 0.63 |
| Std.Errors | Custom | Custom |
| * p < 0.05, ** p < 0.01, *** p < 0.001 | | |
The `coef_omit` argument lets you omit coefficient rows from the table. You specify a vector of names or row numbers of the variables you want to omit.

- For example, `coef_omit = c(2, 3, 5)` omits the second, third, and fifth coefficients.

Example
Let’s remove the intercept from the table.
By default, the `modelsummary()` function reports lots of model statistics (e.g., \(R^2\), \(AIC\), \(BIC\)). You can select or omit model statistics with the `gof_map` and `gof_omit` arguments.

- You can check the full list of model statistics recognized by `modelsummary()` by running `modelsummary::gof_map`.
Example
For example, let’s select only the number of observations, \(R^2\), and adjusted \(R^2\) using the gof_map
argument.
| | OLS 1 | OLS 2 |
|---|---|---|
| Education | 0.076*** | 0.087*** |
| | (0.001) | (0.001) |
| Experience | | 0.078*** |
| | | (0.001) |
| Experience squared | | -0.001*** |
| | | (0.000) |
| Num.Obs. | 28155 | 28155 |
| R2 | 0.095 | 0.326 |
| R2 Adj. | 0.095 | 0.326 |
| Log.Lik. | -29139.853 | -24977.715 |
| * p < 0.05, ** p < 0.01, *** p < 0.001 | | |
- `notes` lets you add notes at the bottom of the table.
- `fmt` lets you change the format of the numbers in the table.

Example

Let's add a note at the bottom of the table and report the numbers with two decimal points, in addition to the previous options.
```r
modelsummary(
  models = list("OLS 1" = reg1, "OLS 2" = reg2),
  coef_map = c(
    "education" = "Education",
    "experience" = "Experience",
    "I(experience^2)" = "Experience squared"
  ),
  stars = c("*" = .05, "**" = .01, "***" = .001),
  gof_map = c("nobs", "r.squared", "adj.r.squared"),
  notes = list("Std. Errors in parentheses"),
  fmt = 2 # report the numbers with 2 decimal points
)
```
| | OLS 1 | OLS 2 |
|---|---|---|
| Education | 0.08*** | 0.09*** |
| | (0.00) | (0.00) |
| Experience | | 0.08*** |
| | | (0.00) |
| Experience squared | | 0.00*** |
| | | (0.00) |
| Num.Obs. | 28155 | 28155 |
| R2 | 0.095 | 0.326 |
| R2 Adj. | 0.095 | 0.326 |
| * p < 0.05, ** p < 0.01, *** p < 0.001 | | |
| Std. Errors in parentheses | | |
The `modelsummary` package has another function, `datasummary()`, which creates a summary table of descriptive statistics of the data.
Example
| Metropolitan area | | afam: Mean | afam: SD | cauc: Mean | cauc: SD |
|---|---|---|---|---|---|
| no | wage | 337.42 | 214.57 | 527.31 | 419.65 |
| | education | 11.47 | 2.78 | 12.71 | 2.69 |
| | experience | 19.69 | 13.87 | 19.05 | 13.13 |
| yes | wage | 470.38 | 324.97 | 649.39 | 471.05 |
| | education | 12.51 | 2.73 | 13.28 | 2.96 |
| | experience | 18.54 | 13.43 | 17.83 | 12.99 |
The `datasummary()` function creates a summary table of descriptive statistics of the data.

Syntax

Note

- Like `lm`, the `formula` argument takes a two-sided formula divided by `~`.

Let's see how it works with an example.
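The table below can be reproduced with a call along these lines (assuming the `AER` and `modelsummary` packages are installed):

```r
library(AER)          # provides the CPS1988 data
library(modelsummary)
data("CPS1988")

# Rows: variables (left of ~); columns: statistics (right of ~)
datasummary(
  formula = wage + education + experience ~ Mean + SD,
  data = CPS1988
)
```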
| | Mean | SD |
|---|---|---|
| wage | 603.73 | 453.55 |
| education | 13.07 | 2.90 |
| experience | 18.20 | 13.08 |
Note

- Use `+` to include more rows and columns.
- The `modelsummary` package offers multiple summary functions of its own: `Mean`, `SD`, `Median`, `Min`, `Max`, `P0`, `P25`, `P50`, `P75`, `P100`, `Histogram`.
- `NA` values are automatically stripped before the computation proceeds, so you don't need to worry about them.

In the `formula` argument, you can use the `All()` function to create a summary table for all the numeric variables in the dataset.
| | Mean | SD |
|---|---|---|
| wage | 603.73 | 453.55 |
| education | 13.07 | 2.90 |
| experience | 18.20 | 13.08 |

| | mean | SD |
|---|---|---|
| wage | 603.73 | 453.55 |
| education | 13.07 | 2.90 |
| experience | 18.20 | 13.08 |
Play with the `datasummary()` function:

- Exchange the rows and columns in the formula and see how the table looks.
- Add other statistics or variables to the formula and see how the table looks.
`datasummary` can nest variables and statistics inside categorical variables using the `*` symbol. For example, you can display separate means and SDs for each value of `ethnicity`.
Example 1: Nested rows
| ethnicity | | mean | SD |
|---|---|---|---|
| afam | wage | 446.85 | 312.44 |
| | education | 12.33 | 2.77 |
| | experience | 18.74 | 13.51 |
| cauc | wage | 617.23 | 461.21 |
| | education | 13.13 | 2.90 |
| | experience | 18.15 | 13.04 |
Example 2: Nested columns
| | afam: mean | afam: SD | cauc: mean | cauc: SD |
|---|---|---|---|---|
| wage | 446.85 | 312.44 | 617.23 | 461.21 |
| education | 12.33 | 2.77 | 13.13 | 2.90 |
| experience | 18.74 | 13.51 | 18.15 | 13.04 |
You can nest variables and statistics inside multiple categorical variables using the *
symbol.
The order in which terms enter the formula determines the order in which labels are displayed.
Example
```r
datasummary(
  formula = wage + education + experience ~ region * ethnicity * (mean + SD),
  data = CPS1988
)
```
| | midwest afam: mean | midwest afam: SD | midwest cauc: mean | midwest cauc: SD | northeast afam: mean | northeast afam: SD | northeast cauc: mean | northeast cauc: SD | south afam: mean | south afam: SD | south cauc: mean | south cauc: SD | west afam: mean | west afam: SD | west cauc: mean | west cauc: SD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wage | 458.32 | 312.58 | 613.19 | 450.94 | 505.58 | 342.00 | 663.04 | 438.30 | 416.01 | 293.52 | 582.93 | 487.13 | 518.20 | 346.94 | 617.96 | 457.77 |
| education | 12.53 | 2.62 | 13.29 | 2.52 | 12.28 | 2.75 | 13.32 | 2.77 | 12.15 | 2.84 | 12.91 | 3.08 | 13.22 | 2.38 | 13.05 | 3.16 |
| experience | 18.53 | 13.92 | 17.78 | 12.90 | 20.73 | 14.32 | 18.69 | 13.49 | 18.52 | 13.22 | 18.41 | 13.20 | 16.89 | 12.72 | 17.70 | 12.48 |
By default, variable and statistic names are used as the labels in the table. You can rename the default labels with the following syntax: `(label = variable/statistic)`.
Example
Before renaming:
| | afam: mean | afam: SD | cauc: mean | cauc: SD |
|---|---|---|---|---|
| wage | 446.85 | 312.44 | 617.23 | 461.21 |
| education | 12.33 | 2.77 | 13.13 | 2.90 |
After renaming:
```r
datasummary(
  formula = (`Wage (in dollars per week)` = wage) + (`Years of Education` = education) ~ ethnicity * (mean + (`Std.Dev` = SD)),
  data = CPS1988
)
```
| | afam: mean | afam: Std.Dev | cauc: mean | cauc: Std.Dev |
|---|---|---|---|---|
| Wage (in dollars per week) | 446.85 | 312.44 | 617.23 | 461.21 |
| Years of Education | 12.33 | 2.77 | 13.13 | 2.90 |
Caution

- Backticks (`` ` ``) are used to define a variable name that contains spaces or special characters such as parentheses `()`.
For this exercise problem, we use the `CPSSW3` dataset from the `AER` package. The `CPSSW3` dataset provides trends (from 1992 to 2004) in hourly earnings of working college graduates aged 25-34 in the US (in 2004 USD).
Let’s create the following tables:
Table 1
| year | | male: Mean | male: SD | female: Mean | female: SD |
|---|---|---|---|---|---|
| 1992 | earnings | 12.38 | 5.88 | 10.61 | 4.92 |
| | age | 29.74 | 2.81 | 29.67 | 2.81 |
| 2004 | earnings | 17.77 | 9.30 | 15.36 | 7.71 |
| | age | 29.82 | 2.86 | 29.67 | 2.93 |
Table 2: Rename some labels in Table 1
| Year | | male: Mean | male: Std.Dev | female: Mean | female: Std.Dev |
|---|---|---|---|---|---|
| 1992 | Avg. hourly earnings | 12.38 | 5.88 | 10.61 | 4.92 |
| | Age | 29.74 | 2.81 | 29.67 | 2.81 |
| 2004 | Avg. hourly earnings | 17.77 | 9.30 | 15.36 | 7.71 |
| | Age | 29.82 | 2.86 | 29.67 | 2.93 |
Table 1
Table 2: Rename some labels in Table 1
`datasummary_skim()` with the `type = "categorical"` option might be helpful for quickly generating a summary table of categorical variables:
Syntax
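A sketch, assuming the `modelsummary` package is installed (the data preparation below is illustrative; `mtcars` has no factor columns by default, so one is created):

```r
library(modelsummary)

mt <- mtcars
mt$cyl <- factor(mt$cyl)  # give the data a categorical variable

# Summarize only the categorical (factor/character) variables
datasummary_skim(mt, type = "categorical")
```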
The `datasummary_balance()` function creates balance tables with summary statistics for different subsets of the data (e.g., control and treatment groups).
Syntax
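A sketch, assuming the `AER` and `modelsummary` packages are installed; the right-hand side of the one-sided formula is the grouping variable:

```r
library(AER)
library(modelsummary)
data("CPS1988")

# Compare summary statistics across the two ethnicity groups
datasummary_balance(formula = ~ ethnicity, data = CPS1988)
```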
afam (N=2232)
|
cauc (N=25923)
|
Diff. in Means | Std. Error | |||
---|---|---|---|---|---|---|
Mean | Std. Dev. | Mean | Std. Dev. | |||
wage | 446.9 | 312.4 | 617.2 | 461.2 | 170.4 | 7.2 |
education | 12.3 | 2.8 | 13.1 | 2.9 | 0.8 | 0.1 |
experience | 18.7 | 13.5 | 18.2 | 13.0 | -0.6 | 0.3 |
The `datasummary_correlation()` function creates a correlation table. It automatically identifies all the numeric variables and calculates the correlation between each pair of them (you don't need to select the numeric variables manually!).
Syntax
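A sketch, assuming the `modelsummary` package is installed (using the built-in `mtcars` data for illustration):

```r
library(modelsummary)

# Numeric variables are detected automatically; one row/column per variable
datasummary_correlation(mtcars)
```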
After running some regression models, ultimately you want to report the results in a neat table.
Usually, we report regression results in a formatted document like an R Markdown or Quarto document (HTML or PDF).
So, let's practice creating a summary table for your analysis results in a Quarto document!
From "materials" under "Data/Materials for Class Use" on my GitHub, download and open the document file "practice_modelsummary_html.qmd".
See details here - Documents
You can find today's after-class exercise problems under "Data/Materials for Class Use" on my GitHub.