Day 5: Function, Loops, and Monte Carlo Simulations

Qingyin Cai

Department of Applied Economics
University of Minnesota

Introduction

In the previous lecture, we learned how to do regression analysis using R, which is a fundamental skill for econometric analysis.
Today, we will learn how to code Monte Carlo simulations in R. Monte Carlo simulations are a very important tool in learning econometrics and statistics. With Monte Carlo simulations, you can test any kind of statistical theory or property, which is very useful!!

Note

Before we start the Monte Carlo simulation, let’s review a few key R concepts:

for loop function: how they work and when to use them.
Writing your own functions: we won’t need this for today’s simulation, but it’s an important skill to practice for future coding tasks.

Learning Objectives

to be able to write code for your own R functions.
to be able to write code for a Monte Carlo simulation using the loop function.

Reference

Section 19 Functions
Section 20 Iteration in R for Data Science

Today’s Outline

User-Defined Functions
Loops
Introduction to Monte Carlo Simulations
- Demonstration
Overall Review

User-Defined Functions

You can define your own functions. The beauty of creating your own functions is that you can reuse the same code over and over again without having to rewrite it. A function is more useful when the task is longer and more complicated.

Example Situations

When you want to automate the process of data cleaning.
When you do a complicated simulation or resampling methods, such as bootstrapping or Monte Carlo simulations.

You can define your own functions using the function() function.

General Syntax

function_name <- function(arg1, arg2, ...){
  code to be executed

  return(output)
}

Note

You need to define the function name (function_name), what kind of inputs the function takes (arg1, arg2, etc.), and how the function processes using the given input objects.
The return() function is used to return the output of the function. By default, the output defined in the last line of the function is returned.

function_name <- function(arg1, arg2, ...){
  code to be executed

  return(output)
}

1. A simple function

2. A function with multiple outputs

You can set default values for function arguments by argument = value.

Example:

Load Functions from a File

If you have multiple functions or a long function, you might want to save the function in a separate file and load it when you need it.

For example:

Save the function code file in a .R file (.Rmd, etc.).
- Click File -> Save As...
- Choose a name, e.g., my_function.R.
Load the function using the source() function.

Let’s practice this on your Rstudio!

In-class Exercise

Exercise 1
Exercise 2 (optional)

Questions
Answers

Write a function (you can name it whatever you want) to calculate the area of a circle with a given radius. The function should return the area of the circle. (Hint1: Area = π × r^2; Hint2: Use pi, which is a built-in constant for the value of \(\pi\) in R.)
Write a function to count the number of NA values in a given vector. (Hint: use the is.na() function.)
Write a function called mean_no_na that calculates the mean (average) of a numeric vector, but ignores any NA values.

Write a function (you can name it whatever you want) to calculate the area of a circle with a given radius. The function should return the area of the circle. (Hint1: Area = π × r^2; Hint2: Use pi, which is a built-in constant for the value of \(\pi\) in R.)
Write a function to count the number of NA values in a given vector. (Hint: use the is.na() function.)
Write a function called mean_no_na that calculates the mean (average) of a numeric vector, but ignores any NA values.

Questions
Answers

You’re a data expert at a store chain. The company needs to study its monthly sales growth to plan better. They expect sales to grow by a fixed percentage each month. Your job is to create an R function that shows sales growth over a year.

For sales growth, use the following formula:

\[S_t = S_0 \times (1 + g)^{t-1}\]

, where \(S_t\) is the sales in month \(t\) , \(S_0\) is the starting sales, and \(g\) is the growth rate.

Create a function called monthly_sales_growth with the following three inputs:

initial_sales: Starting sales (in thousands of dollars).
growth_rate: Monthly growth rate (as a decimal, like 0.03 for 3% growth).
months: How many months to predict (usually 12 for a year).

The function should give back a vector of numbers (or it would be even better if you could show in a data.frame or data.table in which two columns, e.g., month and sales, show the expected sales for each month.)

For sales growth, use the following formula:

\[S_t = S_0 \times (1 + g)^{t-1}\]

, where \(S_t\) is the sales in month \(t\) , \(S_0\) is the starting sales, and \(g\) is the growth rate.

Create a function called monthly_sales_growth with the following three inputs:

initial_sales: Starting sales (in thousands of dollars).
growth_rate: Monthly growth rate (as a decimal, like 0.03 for 3% growth).
months: How many months to predict (usually 12 for a year).

Loops

Why loop?

Using Loop is useful when you want to repeat the same task (but with a slight change in parameters) over and over again.

Examples

Downloading the data from the web iteratively.
- When you want to download the ag-production data from USDA-NASS, you are limited to download 50,000 records per query. You need to repeatedly download the data until you get all the data you need.
- USDA crop scale data, NOAA weather data, etc.
Loading multiple data files in a folder.
Running the same regression analysis for multiple datasets.
Running simulations or resampling methods, such as bootstrapping or Monte Carlo simulations.

While there are several looping commands in R (e.g., foreach, lapply, etc.), we will use the for loop function, as it is the most basic and widely used looping function in R.

For loops

Basics
Examples

The for loop function is a built-in R function. The syntax of the for loop is very simple.

Syntax:

for (name in collection) {
  # code to run for each element
}

Components

name (loop variable): a placeholder that takes on one value from the collection each time the loop runs.
collection: the set of values you want to loop over (usually a vector or list).
code block: the instructions you want R to execute at each step.

Note

On each iteration, the loop variable (name) takes the next value from the collection.
The collection can be:
- A numeric sequence (e.g., 1:5)
- A character vector (e.g., c("apple", "banana"))
- A list of objects (e.g., datasets)

1. Print the numbers from 1 to 5.

Loop variable i takes each number in the sequence 1:5 in order and print the value of i.

2. Print characters in a list.

Loop variable x takes each character in the list list("I", "like", "cats") in order and print the value of x.

3. Calculate the mean of each element in a list.

Can you tell me what’s going on in the following code?

In-class Exercise

Exercise 1
Exercise 2 (nested loop)

Questions
Answers

In econometric class, we use the rnorm() function a lot! It is a function that generates random numbers from a normal distribution. See ?rnorm for more details.

The basic syntax is rnorm(n, mean = 0, sd = 1), where n is the number of random numbers you want to generate, mean is the mean of the normal distribution, and sd is the standard deviation of the normal distribution. So rnorm(n, mean =0, sd = 1) generates n random numbers from a standard normal distribution.

Generate 1000 random numbers from a standard normal distribution and calculate the mean the numbers (use mean() function), and print the results. Repeat this process 10 times using the for loop.

In econometric class, we use the rnorm() function a lot! It is a function that generates random numbers from a normal distribution. See ?rnorm for more details.

The basic syntax is rnorm(n, mean = 0, sd = 1), where n is the number of random numbers you want to generate, mean is the mean of the normal distribution, and sd is the standard deviation of the normal distribution. So rnorm(n, mean = 0, sd = 1) generates n random numbers from a standard normal distribution.

Generate 1000 random numbers from a standard normal distribution and calculate the mean the numbers (use mean() function), and print the results. Repeat this process 10 times using the for loop.

Questions
Answers

You can nest the for loop inside another for loop. For example,

Using the above code as a reference, fill in the following empty 3 x 3 matrix with the sum of the row and column indices.

The output should look like this:

     [,1] [,2] [,3]
[1,]    2    3    4
[2,]    3    4    5
[3,]    4    5    6

You can nest the for loop inside another for loop. For example,

Using the above code as a reference, fill in the following empty 3 x 3 matrix with the sum of the row and column indices.

The output should look like this:

     [,1] [,2] [,3]
[1,]    2    3    4
[2,]    3    4    5
[3,]    4    5    6

For loops: How to Save the loop output?

Introduction
Basics

Unlike R functions we have seen so far, for loop does not have a return value. It just iterates the process we defined in the loop.

Let’s do some experiments:

Experiment 1
Experiment 2

Every round of the loop, the variables defined inside the loop are updated.

You cannot assign loop to a variable directly (e.g x <- for (i in 1:3){print(i)} does not work).

To save the results of the loop, you need to create an empty object before the loop and save the output in the object in each iteration. (You did this in the exercise 2!)
- The object can be a vector, a list, a matrix, or a data frame (or data.table), depending on the type of the output you want to save.

Example

Suppose you want to cube each number in the sequence 1:5.

Note

Since the output of each iteration is a number, vector is a good choice for the storage object. (Alternatively you can use a list object.)

Multiple Outputs

What if we want to have multiple outputs from the loop and combine them into a single dataset?

Example

Let’s generate 100 random numbers from a standard normal distribution and calculate the mean and the standard deviation of numbers. Repeat this process 10 times using the for loop and save the results in a dataset.

In-class Exercise

Exercise 1
Exercise 2 (optional)

Questions
Answers

Using the for loop, calculate the sum of the first n numbers for n = 1, 2, ..., 10. For example, the sum of the first 3 numbers (n=3) is 1 + 2 + 3 = 6. Save the results in a vector object.
Fibonacci sequence is a series of numbers in which each number is the sum of the two preceding ones (e.g. 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, …). Write a function that generates the first n numbers in the Fibonacci sequence. (You use the for loop function inside the function.) For example, when n = 5, the function should return c(0, 1, 1, 2, 3). For simplicity, let’s assume that \(n \ge 3\) (You don’t need to consider the case where n = 1 or 2).

Using the for loop, calculate the sum of the first n numbers for n = 1, 2, ..., 10. For example, the sum of the first 3 numbers (n=3) is 1 + 2 + 3 = 6. Save the results in a vector object.
Fibonacci sequence is a series of numbers in which each number is the sum of the two preceding ones (e.g. 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, …). Write a function that generates the first n numbers in the Fibonacci sequence. (You use the for loop function inside the function.) For example, when n = 5, the function should return c(0, 1, 1, 2, 3). For simplicity, let’s assume that \(n \ge 3\) (You don’t need to consider the case where n = 1 or 2).

Questions
Answers

NOTE: It’s okay if you cannot solve this problem! I will show two approaches to solve this problem, and do speed comparison between the two approaches.

Pythagorean triples are sets of three positive integers \((a, b, c)\) that satisfy the equation \(a^2 + b^2 = c^2\), who are named after the ancient Greek mathematician Pythagoras.

Let’s take this concept further. Suppose Pythagoras challenges you to find all possible Pythagorean triples where \(a\) and \(b\) are less than or equal to a given number \(n\). To address this problem, let’s create an R function that will produce all such triples.

Write a function that takes one argument n, an integer, representing the maximum value for a and b. The function should return a data frame with columns a, b, and c, containing all Pythagorean triples where \(b \leq a \leq n\) and \(a^2 + b^2 = c^2\).

Hints:

Consider using nested loops to iterate through all possible values of a and b up to n.
Use the sqrt() function to calculate the potential value of c, and check if it’s an integer.
Use the floor() function to round down the value of c.

Reference: Pythagorean Triples

Check Point

Up to this point, as long as you understand the following points, you are good to go!

You know how to use function() to define a simple function yourself.
You know how to use for loop (i.e., syntax, which argument you need to define).
You know that you need to prepare an empty object to save the output of the loop.

Introduction to Monte Carlo Simulations

Analogy
What is it?
Monte Carlo simulation in Econometrics

Imagine you want to know the chance of winning a dice game. Instead of trying to calculate the exact probability with formulas, you could just roll the dice thousands of times, record the outcomes, and then see how often you win.
That’s basically what Monte Carlo Simulation does: it uses repeated random experiments to approximate answers to problems that are hard to solve analytically.
Key ideas:
- We model uncertainty with randomness.
- We simulate many scenarios (sometimes thousands or millions).
- We average the results to estimate probabilities, risks, or outcomes.

Monte Carlo simulation is a technique to approximate a likelihood of possible outcomes (e.g., predictions, estimates) from a model by iteratively running the model on artificially created datasets. In every iteration, the datasets are randomly sampled from the assumed data generating process, so it varies every iteration.

The incorporation of randomness in the simulation is the key feature of the Monte Carlo simulation. It is mimicking the randomness of real-world phenomena.

In econometrics, the Monte Carlo simulation is used to evaluate the performance of a statistical procedure or the validity of theories in a realistic setting.

For example

Suppose that a researcher came up with a new estimator to estimate the coefficients of a regression model.

An estimator (e.g, sample mean, standard error, OLS) is a function of a random variable, therefore it is also a random variable.
A random variable has its own probability distribution.
So, to understand the performance of the estimator (e.g., unbiasedness and efficiency), we need to examine the properties of the probability distribution of the estimator.
We use Monte Carlo simulation to approximate the probability distribution of the estimator!

Example: Binomial Distribution

Think about the following example.

Example

Suppose that we flip a coin \(n=10\) times and count the number of heads. Let’s denote the number of heads \(X\).
The coin is not fair, however. The probability of getting a head is \(p= Pr[heads] = 1/3\).
Suppose that you repeat this experiment \(1000\) times. What is the mean and the variance of \(X\)?

This kind of experiment is modeled by the binomial distribution. According to the theory, it is predicted that

Mean of \(X\) is \(E[X] = np = 10 \times 1/3 = 3.33\)
Variance of \(X\) is \(Var[X] = np(1-p) = 10 \times 1/3 \times 2/3 = 2.22\)

Is it true? Let’s check this using a Monte Carlo simulation!

Monte Carlo Simulation: Steps

step 1: Specify the data generating process.

You need to pick a specific probability distribution to generate a random number.

step 2: Repeat:

step 2.1: generate a (pseudo) random sample data based on the data generating process.
step 2.2: get an outcome you are interested in based on the generated data.

step 3: compare your estimates with the true parameter

Demonstration: Binomial Distribution

A Single Iteration
Multiple Iterations

Let’s start writing code for a single iteration to get an idea of the Monte Carlo simulation process in R.

We want to repeat this 1000 times.

In-class Exercise Problem

Exercise Problem: (Weak) Low of Large Number

Questions
Answers

Weak law of large number states that the sample mean converges (in probability) to the population mean as the sample size increases. In other words, the sample mean more accurately estimates the population mean as the sample size increases.

\[\bar{X}_n = \frac{1}{n}\sum_{n=1}^{n} X_i E[X] \xrightarrow{p} E[X]\]

Lets check this using Monte Carlo simulation! We compare the distribution of sample mean with different sample size. Let’s compare two sample sizes: \(n=100\) and \(n=1000\).

Process

Repeat 1 and 2 for \(B=1000\) times.

Using a normal distribution with mean \(\mu = 5\) and \(sd = 10\), generate random numbers for \(n=100\) and \(n=1000\). e.g. rnorm(n = 10, mean = 5, sd = 10).
Compute sample mean for each sample data, and save them.

Finally,

Plot histograms of the sample means obtained from the two samples.

Exercise Problem: Two estimators to estimate the population mean? (optional)

Questions
Hint
Answers

Suppose you’re interested in estimating the unknown population mean of men’s heights (i.e., \(\mu\)) in the US. We have randomly sampled data with the size of \(n=1000\). Let \(X_i\) denote the individual \(i\)’s height in the sample. How should we use the sample data to estimate the population mean?

Your friends suggested two different estimators:

Estimator 1. Use the sample mean: \(\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i\)

Estimator 2. Use only the first observation: \(X_1\)

Theoretically, both estimators are unbiased estimators (i.e., if we repeat the estimation process many times on a different sample data every time, the average of the estimates will be equal to the population mean):

\[\begin{align*} E[\bar{X}_n] &= E \left[\frac{1}{n} \sum_{i=1}^{n} \right] = \frac{1}{n} E \left[\sum_{i=1}^{n} X_i \right] = \frac{1}{n} \sum_{i=1}^{n} E[X_i] = \frac{1}{n} \cdot n \cdot \mu = \mu \\ E[X_1] &= \mu \end{align*}\]

Questions:

Is it true that both estimators are correctly estimating the population mean, on average?
Which one is more accurate in estimating the population mean?

Using Monte Carlo simulation, let’s examine these questions!

Repeat the following processes 1000 times:

step 1. Draw \(n = 1000\) random numbers with known mean \(\mu\) and standard deviation \(\sigma\). This will be the sample data.

step 2. Get (1) the mean of the sample and (2) the value of the first observation, and save the results.

The previous iterations produce 1000 estimates of the population mean for estimator 1 and estimator 2, respectively. Compute the means for each estimator. Are they both close to the true population mean? Compute the variance of the estimates. Which one has a smaller variance?

If you could also visually compare the distribution of estimates from the two estimators, that would be great!

Appendix

`foreach` Function

Basics
Example
Change the Output Format

The foreach function is a function of the foreach package. It is used to iterate the same process over and over again, similar to the for loop function.

Basic Syntax

While there are some differences, the basic syntax of the foreach function is pretty much similar to the for loop function.

foreach(variable = collection_of_objects) %do% {
  the code to be executed in each iteration

  return(output)
}

Note

Differences between for loop and foreach function:
- use = instead of in.
- You need to use %do% operator.
- foreach function has a return value, while for loop does not. By default, the output is returned as a list.
foreach function also supports parallel processing. (we will not cover this in this class.)

Using the for loop and foreach function, let’s calculate the square of the numbers from 1 to 10, respectively.

foreach

for loop

By default, foreach function returns each iteration’s result as a list. But you can choose the format of the output by using the .combine argument.

.combine = c combines each iteration’s result as as a vector (like c() to create a vector).
.combine = rbind combines each iteration’s result by row.
.combine = cbind combines each iteration’s result by column.

The last two options are used when the output is a matrix or a data.frame (or data.table).

Example

Try different .combine options in the following code.

Overall Review

Review

Basics of R
Data Manipulation with data.table
Visualization with ggplot2
Regression & Reporting
Functions, Loops, & Monte Carlo

Coding rules & projects: file paths, working directories, using .Rproj.
Data types: numeric, character, logical; handling NA.
Core structures: vectors, matrices, data frames — creation, indexing, subsetting.
Base math & objects: assignment, arithmetic, functions; saving/loading data.

DT[i, j, by] pattern and chaining.
Row filtering and column selection; create/modify columns with :=.
Grouped summaries with by; ordering and efficient aggregation.
Reshaping with melt() and dcast() (wide ↔︎ long).

Grammar of Graphics: data + aes() + geom_*.
Common plots: scatter, line, bar, histogram, box, density; faceting.
Aesthetics: color/size/shape, grouping, collective geoms.
Labels/themes: axes, titles, legends; layering multiple datasets.

Linear models with lm().
modelsummary for publish-ready regression tables; datasummary for descriptives.
Customization: model labels, stats formatting, notes.
Quarto reporting: render to HTML/PDF for reproducible outputs.

User-defined functions function(): arguments, returns, defaults; save in .R and source().
Loops function for(): basics, saving outputs, multiple outputs.
Randomness & reproducibility: set.seed(), rnorm(), sample(..., replace=TRUE).

Review Exercise

Instructions
Questions
Answers

This exercise integrates all major concepts from our R course. We will use the built-in R datasets mtcars provided.

Setup (Run this first!)

Data Structures and Basic Operations
data.table Operations
Data Visualization
Regression Analysis
Programming

Show the structure of mtcars_dt
Calculate the mean mpg (miles per gallon) for all cars
Create a logical vector showing which cars have both mpg > 20 AND hp < 150
Extract the car names of cars with the highest mpg

Create a new column ‘power_ratio’ = hp/wt (horsepower per weight) and add it to mtcars_dt
Replace all mpg values less than 15 with NA
Filter mtcars_dt to show only cars with automatic transmission (am == 0)
Sort mtcars_dt by mpg in descending order and show top 5 cars
Calculate average mpg by number of cylinders (cyl)
For each transmission type (am), calculate: count of cars, mean mpg, mean hp

Create a scatter plot of wt vs mpg, with:
- Different colors for transmission type (am).
- Different shapes for number of cylinders (cyl).
- A smooth trend line for each transmission type.

Run a regression: mpg ~ hp + wt + factor(cyl)
Extract the coefficient for hp from your model
Run a second model: mpg ~ hp + wt + factor(am), and create a comparison table using modelsummary() for both models

Write a function that calculates mpg per cylinder (mpg/cyl)

Use a for loop to calculate the mean mpg for cars with 4, 6, and 8 cylinders. Store results in a vector called ‘cyl_means’

Data Structures and Basic Operations
data.table Operations
Data Visualization
Regression Analysis
Programming

Show the structure of mtcars_dt
Calculate the mean mpg (miles per gallon) for all cars
Find how many cars have mpg > 20

Create a new column ‘power_ratio’ = hp/wt (horsepower per weight) and add it to mtcars_dt
Filter mtcars_dt to show only cars with automatic transmission (am == 0)
Sort mtcars_dt by mpg in descending order and show top 5 cars
For each transmission type (am), calculate: count of cars, mean mpg, mean hp

Create a scatter plot of wt vs mpg, with:
- Different colors for transmission type (am).
- A smooth trend line for each transmission type.

Run a regression: mpg ~ hp + wt + factor(cyl)
Extract the coefficient for hp from your model
Run a second model: mpg ~ hp + wt + factor(am), and create a comparison table using modelsummary() for both models

Write a function that calculates mpg per cylinder (mpg/cyl)

Use a for loop to calculate the mean mpg for cars with 4, 6, and 8 cylinders. Store results in a vector called ‘cyl_means’

Day 5: Function, Loops, and Monte Carlo Simulations

Introduction

Learning Objectives

Reference

Today’s Outline

User-Defined Functions

Introduction to User-Defined Functions

Load Functions from a File

In-class Exercise

Loops

Why loop?

For loops

In-class Exercise

For loops: How to Save the loop output?

Multiple Outputs

In-class Exercise

Check Point

Introduction to Monte Carlo Simulations

Introduction to Monte Carlo Simulations

Example: Binomial Distribution

Monte Carlo Simulation: Steps

Demonstration: Binomial Distribution

In-class Exercise Problem

Exercise Problem: (Weak) Low of Large Number

Exercise Problem: Two estimators to estimate the population mean? (optional)

Appendix

foreach Function

Overall Review

Review

Review Exercise

`foreach` Function