[,1] [,2] [,3]
[1,] 2 3 4
[2,] 3 4 5
[3,] 4 5 6
Department of Applied Economics
University of Minnesota
In the previous lecture, we learned how to do regression analysis using R, which is a fundamental skill for econometric analysis.
Today, we will learn how to code Monte Carlo simulations in R. Monte Carlo simulations are a very important tool in learning econometrics and statistics. With Monte Carlo simulations, you can test any kind of statistical theory or property, which is very useful!!
Note
Before we start the Monte Carlo simulation, let’s review a few key R concepts:
for loop
function: how they work and when to use them.
Writing your own functions: we won’t need this for today’s simulation, but it’s an important skill to practice for future coding tasks.
You can define your own functions. The beauty of creating your own functions is that you can reuse the same code over and over again without having to rewrite it. A function is more useful when the task is longer and more complicated.
Example Situations
You can define your own functions using the function()
function.
General Syntax
Note
You need to define the function name (function_name
), what kind of inputs the function takes (arg1
, arg2
, etc.), and how the function processes using the given input objects.
The return()
function is used to return the output of the function. By default, the output defined in the last line of the function is returned.
1. A simple function
2. A function with multiple outputs
You can set default values for function arguments by argument = value
.
Example:
If you have multiple functions or a long function, you might want to save the function in a separate file and load it when you need it.
For example:
.R
file (.Rmd
, etc.).
File
-> Save As...
my_function.R
.source()
function.Let’s practice this on your Rstudio!
Write a function (you can name it whatever you want) to calculate the area of a circle with a given radius. The function should return the area of the circle. (Hint1: Area = π × r^2
; Hint2: Use pi
, which is a built-in constant for the value of \(\pi\) in R.)
Write a function to count the number of NA values in a given vector. (Hint: use the is.na()
function.)
Write a function called mean_no_na
that calculates the mean (average) of a numeric vector, but ignores any NA values.
Write a function (you can name it whatever you want) to calculate the area of a circle with a given radius. The function should return the area of the circle. (Hint1: Area = π × r^2
; Hint2: Use pi
, which is a built-in constant for the value of \(\pi\) in R.)
Write a function to count the number of NA values in a given vector. (Hint: use the is.na()
function.)
Write a function called mean_no_na
that calculates the mean (average) of a numeric vector, but ignores any NA values.
You’re a data expert at a store chain. The company needs to study its monthly sales growth to plan better. They expect sales to grow by a fixed percentage each month. Your job is to create an R function that shows sales growth over a year.
For sales growth, use the following formula:
\[S_t = S_0 \times (1 + g)^{t-1}\]
, where \(S_t\) is the sales in month \(t\) , \(S_0\) is the starting sales, and \(g\) is the growth rate.
Create a function called monthly_sales_growth
with the following three inputs:
initial_sales
: Starting sales (in thousands of dollars).growth_rate
: Monthly growth rate (as a decimal, like 0.03 for 3% growth).months
: How many months to predict (usually 12 for a year).The function should give back a vector of numbers (or it would be even better if you could show in a data.frame
or data.table
in which two columns, e.g., month and sales, show the expected sales for each month.)
You’re a data expert at a store chain. The company needs to study its monthly sales growth to plan better. They expect sales to grow by a fixed percentage each month. Your job is to create an R function that shows sales growth over a year.
For sales growth, use the following formula:
\[S_t = S_0 \times (1 + g)^{t-1}\]
, where \(S_t\) is the sales in month \(t\) , \(S_0\) is the starting sales, and \(g\) is the growth rate.
Create a function called monthly_sales_growth
with the following three inputs:
initial_sales
: Starting sales (in thousands of dollars).growth_rate
: Monthly growth rate (as a decimal, like 0.03 for 3% growth).months
: How many months to predict (usually 12 for a year).The function should give back a vector of numbers (or it would be even better if you could show in a data.frame
or data.table
in which two columns, e.g., month and sales, show the expected sales for each month.)
Using Loop is useful when you want to repeat the same task (but with a slight change in parameters) over and over again.
Examples
While there are several looping commands in R (e.g., foreach
, lapply
, etc.), we will use the for loop
function, as it is the most basic and widely used looping function in R.
The for loop
function is a built-in R function. The syntax of the for loop
is very simple.
Syntax:
Components
name
(loop variable): a placeholder that takes on one value from the collection each time the loop runs.collection
: the set of values you want to loop over (usually a vector or list).code block
: the instructions you want R to execute at each step.Note
On each iteration, the loop variable (name
) takes the next value from the collection.
The collection can be:
1:5
)c("apple", "banana")
)1. Print the numbers from 1 to 5.
Loop variable i
takes each number in the sequence 1:5
in order and print the value of i
.
2. Print characters in a list.
Loop variable x
takes each character in the list list("I", "like", "cats")
in order and print the value of x
.
3. Calculate the mean of each element in a list.
Can you tell me what’s going on in the following code?
In econometric class, we use the rnorm()
function a lot! It is a function that generates random numbers from a normal distribution. See ?rnorm
for more details.
The basic syntax is rnorm(n, mean = 0, sd = 1)
, where n
is the number of random numbers you want to generate, mean
is the mean of the normal distribution, and sd
is the standard deviation of the normal distribution. So rnorm(n, mean =0, sd = 1)
generates n
random numbers from a standard normal distribution.
Generate 1000 random numbers from a standard normal distribution and calculate the mean the numbers (use mean()
function), and print the results. Repeat this process 10 times using the for loop
.
In econometric class, we use the rnorm()
function a lot! It is a function that generates random numbers from a normal distribution. See ?rnorm
for more details.
The basic syntax is rnorm(n, mean = 0, sd = 1)
, where n
is the number of random numbers you want to generate, mean
is the mean of the normal distribution, and sd
is the standard deviation of the normal distribution. So rnorm(n, mean = 0, sd = 1)
generates n
random numbers from a standard normal distribution.
Generate 1000 random numbers from a standard normal distribution and calculate the mean the numbers (use mean()
function), and print the results. Repeat this process 10 times using the for loop
.
You can nest the for loop
inside another for loop
. For example,
Using the above code as a reference, fill in the following empty 3 x 3 matrix with the sum of the row and column indices.
The output should look like this:
[,1] [,2] [,3]
[1,] 2 3 4
[2,] 3 4 5
[3,] 4 5 6
You can nest the for loop
inside another for loop
. For example,
Using the above code as a reference, fill in the following empty 3 x 3 matrix with the sum of the row and column indices.
The output should look like this:
[,1] [,2] [,3]
[1,] 2 3 4
[2,] 3 4 5
[3,] 4 5 6
Unlike R functions we have seen so far, for loop
does not have a return value. It just iterates the process we defined in the loop.
Let’s do some experiments:
Every round of the loop, the variables defined inside the loop are updated.
x <- for (i in 1:3){print(i)}
does not work).Example
Suppose you want to cube each number in the sequence 1:5
.
Note
What if we want to have multiple outputs from the loop and combine them into a single dataset?
Example
Let’s generate 100 random numbers from a standard normal distribution and calculate the mean and the standard deviation of numbers. Repeat this process 10 times using the for loop
and save the results in a dataset.
Using the for loop
, calculate the sum of the first n
numbers for n = 1, 2, ..., 10
. For example, the sum of the first 3 numbers (n=3) is 1 + 2 + 3 = 6. Save the results in a vector object.
Fibonacci sequence is a series of numbers in which each number is the sum of the two preceding ones (e.g. 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, …). Write a function that generates the first n
numbers in the Fibonacci sequence. (You use the for loop
function inside the function.) For example, when n = 5
, the function should return c(0, 1, 1, 2, 3)
. For simplicity, let’s assume that \(n \ge 3\) (You don’t need to consider the case where n = 1 or 2).
Using the for loop
, calculate the sum of the first n
numbers for n = 1, 2, ..., 10
. For example, the sum of the first 3 numbers (n=3) is 1 + 2 + 3 = 6. Save the results in a vector object.
Fibonacci sequence is a series of numbers in which each number is the sum of the two preceding ones (e.g. 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, …). Write a function that generates the first n
numbers in the Fibonacci sequence. (You use the for loop
function inside the function.) For example, when n = 5
, the function should return c(0, 1, 1, 2, 3)
. For simplicity, let’s assume that \(n \ge 3\) (You don’t need to consider the case where n = 1 or 2).
NOTE: It’s okay if you cannot solve this problem! I will show two approaches to solve this problem, and do speed comparison between the two approaches.
Pythagorean triples are sets of three positive integers \((a, b, c)\) that satisfy the equation \(a^2 + b^2 = c^2\), who are named after the ancient Greek mathematician Pythagoras.
Let’s take this concept further. Suppose Pythagoras challenges you to find all possible Pythagorean triples where \(a\) and \(b\) are less than or equal to a given number \(n\). To address this problem, let’s create an R function that will produce all such triples.
n
, an integer, representing the maximum value for a
and b
. The function should return a data frame with columns a
, b
, and c
, containing all Pythagorean triples where \(b \leq a \leq n\) and \(a^2 + b^2 = c^2\).Hints:
Reference: Pythagorean Triples
Up to this point, as long as you understand the following points, you are good to go!
function()
to define a simple function yourself.for loop
(i.e., syntax, which argument you need to define).Imagine you want to know the chance of winning a dice game. Instead of trying to calculate the exact probability with formulas, you could just roll the dice thousands of times, record the outcomes, and then see how often you win.
That’s basically what Monte Carlo Simulation does: it uses repeated random experiments to approximate answers to problems that are hard to solve analytically.
Key ideas:
Monte Carlo simulation is a technique to approximate a likelihood of possible outcomes (e.g., predictions, estimates) from a model by iteratively running the model on artificially created datasets. In every iteration, the datasets are randomly sampled from the assumed data generating process, so it varies every iteration.
In econometrics, the Monte Carlo simulation is used to evaluate the performance of a statistical procedure or the validity of theories in a realistic setting.
For example
Suppose that a researcher came up with a new estimator to estimate the coefficients of a regression model.
Think about the following example.
Example
This kind of experiment is modeled by the binomial distribution. According to the theory, it is predicted that
Is it true? Let’s check this using a Monte Carlo simulation!
step 1: Specify the data generating process.
step 2: Repeat:
step 3: compare your estimates with the true parameter
Let’s start writing code for a single iteration to get an idea of the Monte Carlo simulation process in R.
Weak law of large number states that the sample mean converges (in probability) to the population mean as the sample size increases. In other words, the sample mean more accurately estimates the population mean as the sample size increases.
\[\bar{X}_n = \frac{1}{n}\sum_{n=1}^{n} X_i E[X] \xrightarrow{p} E[X]\]
Lets check this using Monte Carlo simulation! We compare the distribution of sample mean with different sample size. Let’s compare two sample sizes: \(n=100\) and \(n=1000\).
Process
Repeat 1 and 2 for \(B=1000\) times.
Using a normal distribution with mean \(\mu = 5\) and \(sd = 10\), generate random numbers for \(n=100\) and \(n=1000\). e.g. rnorm(n = 10, mean = 5, sd = 10)
.
Compute sample mean for each sample data, and save them.
Finally,
Suppose you’re interested in estimating the unknown population mean of men’s heights (i.e., \(\mu\)) in the US. We have randomly sampled data with the size of \(n=1000\). Let \(X_i\) denote the individual \(i\)’s height in the sample. How should we use the sample data to estimate the population mean?
Your friends suggested two different estimators:
Estimator 1. Use the sample mean: \(\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i\)
Estimator 2. Use only the first observation: \(X_1\)
Theoretically, both estimators are unbiased estimators (i.e., if we repeat the estimation process many times on a different sample data every time, the average of the estimates will be equal to the population mean):
\[\begin{align*} E[\bar{X}_n] &= E \left[\frac{1}{n} \sum_{i=1}^{n} \right] = \frac{1}{n} E \left[\sum_{i=1}^{n} X_i \right] = \frac{1}{n} \sum_{i=1}^{n} E[X_i] = \frac{1}{n} \cdot n \cdot \mu = \mu \\ E[X_1] &= \mu \end{align*}\]
Questions:
Using Monte Carlo simulation, let’s examine these questions!
step 1. Draw \(n = 1000\) random numbers with known mean \(\mu\) and standard deviation \(\sigma\). This will be the sample data.
step 2. Get (1) the mean of the sample and (2) the value of the first observation, and save the results.
If you could also visually compare the distribution of estimates from the two estimators, that would be great!
foreach
FunctionThe foreach
function is a function of the foreach
package. It is used to iterate the same process over and over again, similar to the for loop
function.
Basic Syntax
While there are some differences, the basic syntax of the foreach
function is pretty much similar to the for loop
function.
foreach(variable = collection_of_objects) %do% {
the code to be executed in each iteration
return(output)
}
Note
for loop
and foreach
function:
=
instead of in
.%do%
operator.foreach
function has a return value, while for loop
does not. By default, the output is returned as a list.foreach
function also supports parallel processing. (we will not cover this in this class.)Using the for loop
and foreach
function, let’s calculate the square of the numbers from 1 to 10, respectively.
foreach
for loop
By default, foreach
function returns each iteration’s result as a list. But you can choose the format of the output by using the .combine
argument.
.combine = c
combines each iteration’s result as as a vector (like c()
to create a vector)..combine = rbind
combines each iteration’s result by row..combine = cbind
combines each iteration’s result by column.The last two options are used when the output is a matrix
or a data.frame
(or data.table
).
Example
.combine
options in the following code.data.table
ggplot2
Coding rules & projects: file paths, working directories, using .Rproj
.
Data types: numeric, character, logical; handling NA
.
Core structures: vectors, matrices, data frames — creation, indexing, subsetting.
Base math & objects: assignment, arithmetic, functions; saving/loading data.
DT[i, j, by]
pattern and chaining.
Row filtering and column selection; create/modify columns with :=
.
Grouped summaries with by
; ordering and efficient aggregation.
Reshaping with melt()
and dcast()
(wide ↔︎ long).
Grammar of Graphics: data
+ aes()
+ geom_*
.
Common plots: scatter, line, bar, histogram, box, density; faceting.
Aesthetics: color/size/shape, grouping, collective geoms.
Labels/themes: axes, titles, legends; layering multiple datasets.
Linear models with lm()
.
modelsummary
for publish-ready regression tables; datasummary
for descriptives.
Customization: model labels, stats formatting, notes.
Quarto reporting: render to HTML/PDF for reproducible outputs.
User-defined functions function()
: arguments, returns, defaults; save in .R and source()
.
Loops function for()
: basics, saving outputs, multiple outputs.
Randomness & reproducibility: set.seed()
, rnorm()
, sample(..., replace=TRUE)
.
This exercise integrates all major concepts from our R course. We will use the built-in R datasets mtcars
provided.
Setup (Run this first!)
Show the structure of mtcars_dt
Calculate the mean mpg (miles per gallon) for all cars
Create a logical vector showing which cars have both mpg > 20 AND hp < 150
Extract the car names of cars with the highest mpg
Create a new column ‘power_ratio’ = hp/wt (horsepower per weight) and add it to mtcars_dt
Replace all mpg values less than 15 with NA
Filter mtcars_dt to show only cars with automatic transmission (am == 0)
Sort mtcars_dt by mpg in descending order and show top 5 cars
Calculate average mpg by number of cylinders (cyl)
For each transmission type (am), calculate: count of cars, mean mpg, mean hp
Create a scatter plot of wt vs mpg, with:
Different colors for transmission type (am).
Different shapes for number of cylinders (cyl).
A smooth trend line for each transmission type.
Run a regression: mpg ~ hp + wt + factor(cyl)
Extract the coefficient for hp from your model
Run a second model: mpg ~ hp + wt + factor(am), and create a comparison table using modelsummary() for both models
Show the structure of mtcars_dt
Calculate the mean mpg (miles per gallon) for all cars
Find how many cars have mpg > 20
Create a new column ‘power_ratio’ = hp/wt (horsepower per weight) and add it to mtcars_dt
Filter mtcars_dt to show only cars with automatic transmission (am == 0)
Sort mtcars_dt by mpg in descending order and show top 5 cars
For each transmission type (am), calculate: count of cars, mean mpg, mean hp
Create a scatter plot of wt vs mpg, with:
Different colors for transmission type (am).
A smooth trend line for each transmission type.
Run a regression: mpg ~ hp + wt + factor(cyl)
Extract the coefficient for hp from your model
Run a second model: mpg ~ hp + wt + factor(am), and create a comparison table using modelsummary() for both models