Day 1: Basics of R

Qingyin Cai

Department of Applied Economics
University of Minnesota

Slide Guide

  • Click on the three horizontally stacked lines at the bottom left corner of the slide, then you will see the table of contents, and you can jump to the section you want to see.
  • Press the letter “o” on your keyboard to get a panel view of all the slides.
  • You can directly write and run R code, and see the output on slides.
  • When you want to execute (run) code, hit command + enter (Mac) or Control + enter (Windows) on your keyboard. Alternatively, you can click the “Run Code” button on the top left corner of the code chunk.



Learning Objectives

  • To understand the R coding rules.
  • To understand the basic data types and data structures in R, and to be able to manipulate them.
  • To be able to use base R functions to do some mathematical calculations.
  • To be able to create R projects and save and load data in R.


Reference

Today’s outline

  1. General coding rules in R
  2. Basic data types in R
  3. Types of Data Structures in R
    1. Vector (one-dimensional array)
    2. Matrix (Two-dimensional array)
    3. Data Frame
    4. List
  4. Matrix/Linear Algebra in R
  5. Loading and Saving Data
  6. Exercise problems
  7. Appendix: Useful base-R functions

Before you start


  • We’ll cover many basic topics today.

  • You don’t need to memorize or completely understand everything in this lecture.

  • At the end of each section, I will include a summary of the key points you need to know. As long as you understand those key points, you are good to go.

General coding rules in R

General coding rules in R

  • R is object-oriented: Everything in R is an “object” that you can name and reuse.

  • Creating objects: Use <- or = to store information in objects.

    • Example: x <- 1 assigns 1 to an object called x.
  • Objects can be overwritten: If you use the same name twice, the new value replaces the old one.

  • View your objects: Simply type the object name to see what’s stored inside.

Example
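A minimal sketch of the rules above; the object names and values are made up for illustration:

```r
# Store values in objects with <- (the = operator also works here)
x <- 1
y = 2

# Reusing a name overwrites the old value
x <- x + 10

# Typing the object name prints what is stored inside
x
#> [1] 11
```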

  • Object names should start with a letter; they cannot start with a number or an underscore.
  • Use underscores _ or dots . to separate words in names.
  • Choose descriptive names that tell you what the object contains.
    • Good: student_age, exam_scores
    • Avoid: x, data1, thing

  • Packages provide extra functions beyond base R
  • Install once: install.packages("package_name")
  • Load every session: library(package_name)
  • Troubleshooting: See could not find function "xxxx"? → Load the package!

Basic data types in R

Overview

These are the basic data elements in R.

| Data Type | Description | Example |
| --- | --- | --- |
| numeric | General number, can be integer or decimal. | 5.2, 3L (the L makes it integer) |
| character | Text or string data. | "Hello, R!" |
| logical | Boolean values. | TRUE, FALSE |
| integer | Whole numbers. | 2L, 100L |
| complex | Numbers with real and imaginary parts. | 3 + 2i |
| raw | Raw bytes. | charToRaw("Hello") |
| factor | Categorical data. Can have ordered and unordered categories. | factor(c("low", "high", "medium")) |


  • Three main data types you’ll use most often: numeric, character, and logical.
  • Text must be in quotes:
    • Correct: "Hello" or 'Hello'
    • Wrong: Hello (without quotes)

Use class() or is.XXX() to examine the data types.
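For instance, a short sketch with made-up objects (the names a, b, d, and e are only for illustration):

```r
a <- 5.2
b <- "Hello, R!"
d <- TRUE
e <- 3L

class(a)  # "numeric"
class(b)  # "character"
class(d)  # "logical"
class(e)  # "integer"

# is.XXX() functions return TRUE or FALSE
is.numeric(e)    # TRUE: integers also count as numeric
is.character(a)  # FALSE
```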

Convert between data types using as.XXX() functions:
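A few illustrative conversions (the object names are hypothetical):

```r
num_from_chr <- as.numeric("3.14")   # 3.14
chr_from_num <- as.character(42)     # "42"
int_from_num <- as.integer(7.9)      # 7 (the decimal part is dropped)
num_from_lgl <- as.numeric(TRUE)     # 1

# Text that cannot be read as a number becomes NA (with a warning)
not_a_number <- suppressWarnings(as.numeric("abc"))
is.na(not_a_number)                  # TRUE
```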

Logical values (a.k.a. Boolean values)

  • Logical values are TRUE, FALSE, and NA (not available/undefined).

  • They are often generated by comparison operators: <, >, <=, >=, ==, !=.

  • Logical operators include & (and), | (or), and ! (not).

  • Every comparison evaluates to TRUE, FALSE, or NA.

  • When treated as numbers, TRUE equals 1 and FALSE equals 0.

  • Logical values can be used as indices to subset vectors or data.
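The points above can be sketched as follows; the scores vector is made up for illustration:

```r
scores <- c(55, 72, 90)

# Comparison operators return logical values
passed <- scores >= 60     # FALSE TRUE TRUE

# Logical operators combine or negate them
passed & scores < 80       # FALSE TRUE FALSE
passed | scores < 60       # TRUE TRUE TRUE
!passed                    # TRUE FALSE FALSE

# Comparing with a missing value gives NA
NA > 1                     # NA

# TRUE counts as 1 and FALSE as 0
sum(passed)                # 2
mean(passed)               # 0.6666667 (share of TRUE)

# Logical values as an index: keep elements where passed is TRUE
scores[passed]             # 72 90
```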

Summary


Key points

  • R defines several basic data types, including numeric, character, and logical.

  • Use the class() function to check the data type of an object.

  • Use as.XXX() functions to convert an object from one type to another.

  • Logical values play an important role in many R operations.

Types of Data Structures in R

Types of Data Structures in R

R provides several types of data structures for storing data.


| Data Structure | Description | Creation Function | Example |
| --- | --- | --- | --- |
| Vector | One-dimensional; holds elements of the same type. | c() | c(1, 2, 3, 4) |
| Matrix | Two-dimensional; holds elements of the same type. | matrix() | matrix(1:4, ncol=2) |
| Array | Multi-dimensional; holds elements of the same type. | array() | array(c(1:12), dim = c(2, 3, 2)) |
| List | Can hold elements of different types. | list() | list(name="John", age=30, scores=c(85, 90, 92)) |
| Data Frame | Like a table; each column can hold a different data type. This is the most common data structure. | data.frame() | data.frame(name=c("John", "Jane"), age=c(30, 25)) |


Vector (one-dimensional array)

  • A vector object is a collection of elements of the same type.
  • Vectors can contain numbers, characters, or logical values.
  • Use c() to create a vector or to combine vectors (c stands for combine).


Basic syntax

c(element1, element2, element3, ...)


You can name each element in a vector:

c(x1 = element1, x2 = element2, x3 = element3, ...)
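As a small illustration of this syntax (all names and values are made up):

```r
ages       <- c(21, 25, 30)                     # numeric vector
first_name <- c("Ann", "Bob", "Cat")            # character vector
named_ages <- c(Ann = 21, Bob = 25, Cat = 30)   # named vector

# c() also combines existing vectors
combined <- c(ages, 40, 45)
length(combined)   # 5

# Mixing types forces everything into a single type
mixed <- c(1, "a", TRUE)
class(mixed)       # "character"
```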

Vector: How to manipulate?

Basics

  • Use square brackets [] to extract one or more elements from a vector by their position.

  • If a vector has names, you can extract elements using their names.

  • To update an element, assign a new value to the position (or name) you want to change.

Example
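A minimal sketch of extracting and updating elements, using a made-up named vector v:

```r
v <- c(a = 10, b = 20, c = 30, d = 40)

v[2]          # second element (b = 20)
v[c(1, 3)]    # first and third elements
v["d"]        # element named "d"
v[-1]         # everything except the first element

# Update by position or by name
v[2]   <- 99
v["d"] <- 0
v
```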

  • A logical vector contains only logical values (TRUE and FALSE).
  • Logical vectors can be used as index vectors: only elements matching TRUE are returned.

Example
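For instance, with a made-up vector of temperatures:

```r
temps <- c(12, 25, 31, 18, 27)

# A logical vector created by a comparison
is_warm <- temps > 20      # FALSE TRUE TRUE FALSE TRUE

# Only the positions with TRUE are returned
temps[is_warm]             # 25 31 27

# Conditions can be combined inside the brackets
temps[temps > 20 & temps < 30]   # 25 27

# which() reports the positions that are TRUE
which(is_warm)             # 2 3 5
```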

In-class Exercise

The following code randomly samples 30 numbers from a uniform distribution between 0 and 1, and stores the result in x.
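A sketch of code matching this description; the seed value is an assumption added here only so that the draw is reproducible:

```r
set.seed(123)                      # assumed seed, any value works
x <- runif(30, min = 0, max = 1)   # 30 draws from Uniform(0, 1)
length(x)                          # 30
```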

Questions

Matrix (Two-dimensional array)

  • A matrix is a collection of elements of the same type arranged in rows and columns (essentially a vector with an added dimension attribute).

  • In practice, matrices are less common for real-world data storage and are used mainly for linear algebra operations.

  • Use the matrix() function to create a matrix.


Syntax

matrix(data = vector_data, nrow = number_of_rows, ncol = number_of_columns, byrow = FALSE)


  • You need to specify vector_data and at least one of number_of_rows and number_of_columns.

  • If you give only one of them and the length of vector_data is a multiple of it, R infers the other dimension automatically.

  • By default, values are filled by column. Use byrow = TRUE to fill by row.
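A short sketch of these rules, with made-up object names:

```r
# Only nrow is given; ncol = 3 is inferred from the 6 values
m_col <- matrix(1:6, nrow = 2)
m_col
#>      [,1] [,2] [,3]
#> [1,]    1    3    5
#> [2,]    2    4    6

# Same data, filled row by row
m_row <- matrix(1:6, nrow = 2, byrow = TRUE)
m_row
#>      [,1] [,2] [,3]
#> [1,]    1    2    3
#> [2,]    4    5    6

dim(m_col)   # 2 3
```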

You can also create a matrix by combining multiple vectors using cbind() or rbind() functions.

  • rbind() function combines vectors by row.

  • cbind() function combines vectors by column.
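For example, with two made-up vectors a and b:

```r
a <- c(1, 2, 3)
b <- c(4, 5, 6)

by_row <- rbind(a, b)   # 2 x 3 matrix; each vector becomes a row
by_col <- cbind(a, b)   # 3 x 2 matrix; each vector becomes a column

dim(by_row)   # 2 3
dim(by_col)   # 3 2
```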

Matrix: How to manipulate

  • You can access matrix elements with [].

  • Specify the row index and column index: [row, col].

  • Leave one index blank to select an entire row or column.

Example
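A minimal sketch of matrix indexing, using a made-up 2 x 3 matrix m:

```r
m <- matrix(1:6, nrow = 2)   # columns are (1,2), (3,4), (5,6)

m[1, 2]          # row 1, column 2: 3
m[2, ]           # entire second row: 2 4 6
m[, 3]           # entire third column: 5 6
m[1, c(1, 3)]    # row 1, columns 1 and 3: 1 5

# Assign to an index to update a value
m[2, 2] <- 0L
```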

You can add column names and row names to a matrix using colnames() and rownames() functions. If a matrix has column names and row names, you can use the names as the index.
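As an illustration (the row and column names below are made up):

```r
m <- matrix(1:6, nrow = 2)
rownames(m) <- c("r1", "r2")
colnames(m) <- c("A", "B", "C")

m["r2", "B"]   # 4
m[, "C"]       # the whole column C, as a named vector
```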

Matrix: Exercise Problem (Optional)

Use the following matrix:

Questions

Data Frame

  • A data.frame class object is similar to a matrix, but each column can store a different data type.

  • It is designed for tabular data, which makes it the most common structure in real-world datasets.

Syntax

data.frame(column_1 = vector_1, column_2 = vector_2)

Example
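A small sketch; df_student and its values are hypothetical and serve only to illustrate the syntax:

```r
df_student <- data.frame(
  Name = c("Ann", "Bob", "Cat"),
  Age  = c(19, 22, 25),
  GPA  = c(3.6, 3.1, 3.9)
)

df_student        # print the table
str(df_student)   # each column keeps its own data type
dim(df_student)   # 3 rows, 3 columns
```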

  • If you do not provide column names, R generates default names automatically (e.g., X1, X2, X3 when the data come from an unnamed matrix).
  • You can access elements of a data.frame using square brackets [].

  • Specify the row and column index, similar to a matrix.

  • Indexing options include:

    • Positional index (e.g., df[1, 2])
    • Column names (e.g., df[ , "Age"])
    • Logical vectors (e.g., df[df$Age > 20, ])
  • You can extract a single column from a data.frame using the $ or [[ ]] operator.

  • $ and [[ ]] can only return one column at a time as a vector, while [] can select multiple columns.

  • Type ?"$", ?"[", and ?"[[" in the Console for details.

  • Inside [[ ]], provide the column name as a character (e.g., df[["Age"]]).

  • Why this matters: many R functions (mean(), sum(), sqrt(), etc.) work on vectors, and $ / [[ ]] are the fastest way to extract a vector for calculations.
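The indexing options above can be sketched as follows, using a made-up data frame df:

```r
df <- data.frame(
  Name = c("Ann", "Bob", "Cat"),
  Age  = c(19, 22, 25),
  GPA  = c(3.6, 3.1, 3.9)
)

df[1, 2]                 # positional index: 19
df[, "Age"]              # by column name: 19 22 25
df[df$Age > 20, ]        # logical index: rows for Bob and Cat

# $ and [[ ]] return a single column as a vector
df$Age
df[["Age"]]

# [] can select several columns and returns a data frame
df[, c("Name", "GPA")]

# Vector functions work directly on an extracted column
mean(df$Age)             # 22
```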


You can add a new column to a data.frame object using the $ operator.

Syntax

data_frame$new_column <- vector_data
  • A new column added to a data.frame should have the same length as the number of rows.

  • A shorter vector is recycled only when the number of rows is an exact multiple of its length (e.g., a single value fills the whole column); otherwise R returns an error.
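A short sketch of how this behaves in practice, using a made-up data frame with four rows. Note that `$<-` recycles a shorter vector only when the number of rows is an exact multiple of its length:

```r
df <- data.frame(id = 1:4, score = c(80, 65, 90, 70))

df$passed <- df$score >= 70   # same length as the number of rows
df$cohort <- 2025             # length 1: repeated for every row
df$group  <- c("a", "b")      # length 2: recycled to a, b, a, b

# Length 3 does not divide 4 rows evenly, so this assignment fails
result <- tryCatch({
  df$bad <- 1:3
  "ok"
}, error = function(e) "error")
result   # "error"
```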


In-class Exercise

We will use the built-in dataset mtcars for this exercise. Run the following code to load the data.

Questions

with() and within()

  • The with() function evaluates an expression inside a data frame.
    • Example: with(df_student, mean(Age)) instead of mean(df_student$Age)
  • The within() function is similar, but it allows you to modify the data frame directly.
    • Example: df_student <- within(df_student, { GPA2 <- GPA^2 })
  • Using these functions helps avoid repeatedly typing the data frame name and $.

Example
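A minimal sketch; the contents of df_student are made up so that the two calls from the bullets above can run:

```r
df_student <- data.frame(
  Name = c("Ann", "Bob", "Cat"),
  Age  = c(19, 22, 25),
  GPA  = c(3.6, 3.1, 3.9)
)

# These two lines compute the same thing
mean(df_student$Age)
with(df_student, mean(Age))   # 22

# within() returns a modified copy, so assign the result back
df_student <- within(df_student, {
  GPA2 <- GPA^2
})
names(df_student)             # now includes "GPA2"
```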

List

  • A list in R can store elements of different types and sizes: numbers, characters, vectors, matrices, data frames, or even other lists.

  • A list is a flexible container that can hold any combination of data structures.

  • Use the list() function to create a list.

  • You can access list elements using $, [], or [[ ]].

  • [] returns a list containing the selected elements.

  • [[ ]] returns a single element itself (not wrapped in a list).

  • $ is shorthand for [[ ]], but it only works if the list elements are named.
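The difference between the three operators can be sketched with a small made-up list:

```r
person <- list(name = "John", age = 30, scores = c(85, 90, 92))

# [] keeps the list wrapper
class(person["scores"])      # "list"

# [[ ]] and $ return the element itself
class(person[["scores"]])    # "numeric"
person$scores                # 85 90 92

# So vector functions need [[ ]] or $, not []
mean(person$scores)          # 89
```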

Summary


Key points

  • Know how to create the main data structures in R: vector, matrix, data.frame, and list.
    • Vectors and matrices store one data type.
    • Data frames and lists can store different data types.

  • Learn how to access, subset, and modify elements using indexing.
    • Indexing can be positional, logical, or by name.
    • Operators include [], $, and [[ ]].

Matrix/Linear Algebra in R

  • You do not need to memorize the operators for remainder and quotient.
  • Arithmetic operations on vectors are performed element-wise.

  • This means the operation is applied to elements in the same position of each vector.

  • By default, * does element-wise multiplication.

  • To perform true matrix multiplication, use the %*% operator.
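A brief sketch of these operations with made-up vectors and matrices:

```r
u <- c(1, 2, 3)
v <- c(4, 5, 6)

u + v      # 5 7 9   (element-wise addition)
u * v      # 4 10 18 (element-wise multiplication)

7 %% 3     # 1 (remainder)
7 %/% 3    # 2 (quotient)

A <- matrix(1:4, nrow = 2)   # rows: (1, 3) and (2, 4)
B <- matrix(5:8, nrow = 2)   # rows: (5, 7) and (6, 8)

A * B      # element-wise product, rows: (5, 21) and (12, 32)
A %*% B    # matrix product,       rows: (23, 31) and (34, 46)
```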

Loading and Saving Data in R

R base functions for data import and export

  • Like other software (e.g., Stata, Excel), R has its own native data formats. There are two: .Rdata (or .rdata) and .Rds (or .rds).
    • .Rdata is used to save multiple R objects.
    • .Rds is used to save a single R object.


.Rdata format

  • Load data:

load("path_to_Rdata_file")

  • Save data:

save(object_name1, object_name2, file = "path_to_Rdata_file")

.Rds format

  • Load data:

object_name <- readRDS("path_to_Rds_file")

  • Save data:

saveRDS(object_name, file = "path_to_Rds_file")
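A self-contained sketch of both formats. The objects are made up, and tempfile() is used here only so the example runs anywhere; in practice you would give a path inside your own folder:

```r
df_small <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))
note     <- "a second object"

# .rds: one object per file; readRDS() returns it, so assign the result
rds_path <- tempfile(fileext = ".rds")
saveRDS(df_small, file = rds_path)
df_back <- readRDS(rds_path)
all.equal(df_small, df_back)   # TRUE

# .RData: several objects per file; load() restores them under their original names
rdata_path <- tempfile(fileext = ".RData")
save(df_small, note, file = rdata_path)
rm(df_small, note)
load(rdata_path)
exists("df_small")             # TRUE
```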

Setting the working directory

To access a data file, you need to provide the path to the file (the location of the data file).


Example

Suppose that I want to load data_example.rds in the Data folder. On my computer, the full path (i.e., absolute path) to the file is /Users/qingyin/Dropbox/Teaching/R_Review_2025/Data/data_example.rds.

# this code only works on my local machine
df_example <- readRDS(file = "/Users/qingyin/Dropbox/Teaching/R_Review_2025/Data/data_example.rds")


Why avoid hard-coding full paths?

  • Typing the full file path every time is cumbersome and slows you down.

  • Hard-coded paths make your code less portable:

    • Team members may have different folder structures.
    • Code that works on your computer might fail on theirs.
  • The working directory is the folder where R looks for files to load and saves files you create.
  • Check the current working directory with getwd().
  • By default, R uses your home directory (or the project folder if you’re in an R Project).
  • If you often import or save data in a specific folder, it helps to set that folder as the working directory.

  • Use setwd() to change the working directory:

Example

In my case, I set the working directory to the R_Review_2025 folder.

setwd("/Users/qingyin/Dropbox/Teaching/R_Review_2025")

Now, R will look for the data file in the R_Review_2025 folder by default. So, I can load the data using a relative path instead of an absolute path.

df_example <- readRDS(file = "Data/data_example.rds")


Problems

  • setwd() still relies on an absolute path, which can vary across people.
    • e.g., one person saves files in Dropbox, another in Google Drive.
  • This means setwd() does not fully solve the collaboration problem.
    • Code may still break if teammates have different folder structures.

“R experts keep all the files associated with a project together — input data, R scripts, analytical results, figures. This is such a wise and common practice that RStudio has built-in support for this via projects.” - R for Data Science Ch 8.4


RStudio Projects

  • An RStudio project is a way to organize your work.

  • When you open a Project, R automatically sets the working directory to the folder containing the .Rproj file — no need for setwd().

  • As long as the folder structure inside the Project is consistent, you can share code with teammates and relative paths will work for everyone.

Follow the steps illustrated in R for Data Science Ch 8.4.

  • In RStudio, check the top-right corner of the window to see the active Project name.

  • Alternatively, open a Project by double-clicking the .Rproj file in Finder (Mac) or File Explorer (Windows).

  • Use getwd() to confirm the current working directory — it should be the Project folder.

  • Load the data_example.rds data file with readRDS().

Loading data other than .Rds (.rds) format

  • R can load data from various formats, including .csv, .xls(x), and .dta.
  • Many functions are available to help you load data:
    • read.csv() to read a .csv file.
    • read_excel() from the readxl package to read data sheets from an .xls(x) file.
    • read.dta13() from the readstata13 package to read a Stata data file (.dta).
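As a base-R illustration, the sketch below writes a small made-up table to a temporary .csv file and reads it back with read.csv(). The file and its contents are hypothetical:

```r
# Create a small .csv file to read (temporary location)
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(id = 1:3, wage = c(15.5, 20, 18.25)),
  file = csv_path,
  row.names = FALSE
)

# Read it back into a data frame
df_csv <- read.csv(csv_path)
str(df_csv)
```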

Use the import() function from the rio package

  • The import() function from the rio package may be the most convenient way to load data in various formats.
    • Unlike read.csv() and read.dta13(), which each specialize in one type of file, import() can load data from many different file types.

In the Data folder, the data_example data is saved in three different formats: data_example.csv, data_example.dta, and data_example.xlsx. Let’s load the data using the import() function in RStudio.
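A hedged sketch of the idea. It assumes the rio package is installed and falls back to read.csv() when it is not. The temporary file is made up so the code runs anywhere; with the course files you would instead pass a path such as "Data/data_example.csv" or "Data/data_example.dta":

```r
# A small made-up .csv file in a temporary location
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, wage = c(15.5, 20, 18.25)),
          file = csv_path, row.names = FALSE)

if (requireNamespace("rio", quietly = TRUE)) {
  # import() picks the reader based on the file extension
  df_any <- rio::import(csv_path)
} else {
  # Fallback when rio is not installed
  df_any <- read.csv(csv_path)
}

nrow(df_any)   # 3
```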

Saving the data

  • You can save data in many formats (.csv, .dta, .xlsx, etc.).
  • But unless you need compatibility with other software, it’s best to save data in .rds.
    • How: saveRDS(object_name, path_to_save)


Why prefer .rds?

  • Designed for R — no reason to use another format if you work only in R.
  • Faster and more efficient for saving and loading.
  • Produces smaller file sizes compared to .csv or .xlsx when data gets larger.
  • (Try it! Check the size of the data_example dataset saved in different formats.)
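One way to check this yourself is sketched below. It uses simulated data rather than data_example, so the exact sizes will differ from what you see with the course files:

```r
set.seed(1)   # assumed seed, for reproducibility only
df_big <- data.frame(x = rnorm(10000), y = rnorm(10000))

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write.csv(df_big, file = csv_path, row.names = FALSE)
saveRDS(df_big, file = rds_path)

# File sizes in bytes
file.size(csv_path)
file.size(rds_path)
```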


Let’s try!

  • Load the data_example data in the Data folder.

Summary


Key points

  • An RStudio Project (.Rproj) is a useful tool to organize your work. As long as the folder structure under the .Rproj file is the same, you can share code that loads data with your team members.

  • To load data:
    • use the readRDS() function for the .Rds (.rds) format.
    • use the import() function from the rio package for various other formats.


  • To save data, the .rds format and the saveRDS() function are recommended.

After-class Exercise Problems

Exercise Problems 1: Vector

  1. Create a sequence of numbers from 20 to 50 and name it x. Let’s change the numbers that are multiples of 3 to 0.

  2. sample() is commonly used in Monte Carlo simulation in econometrics. Run the following code to create r. What does it do? Use ?sample to find out what the function does.

  1. Find the value of mean and SD of vector r without using mean() and sd().

  2. Figure out which position contains the maximum value of vector r. (Use the which() function; run ?which to find out what it does.)

  3. Extract the values of r that are larger than 50.

  4. Extract the values of r that are larger than 40 and smaller than 60.

  5. Extract the values of r that are smaller than 20 or larger than 70.

Exercise Problem 2: Data Frame

  1. Load the file nscg17tiny.dta. You can find the data in the Data folder.

    • This data is a subset of the National Survey of College Graduates (NSCG) 2017, which collects data on the educational and occupational characteristics of college graduates in the United States.
  2. Each row corresponds to a unique respondent. Let’s create a new column called “ID”. There are various ways to create an ID column. Here, let’s create an ID column that starts from 1 and increments by 1 for each row.

  3. The summary() function is useful for taking a quick look at the summary statistics of a specific column. Use summary() to create a table of descriptive statistics for hrswk, passing the hrswk column to summary() as a vector.

  4. Create a new variable in your data that represents the z-score of the hours worked (use hrswk variable).
    \(Z = (x - \mu)/\sigma\), where \(Z = \text{standard score}\), \(x =\text{observed value}\), \(\mu = \text{mean of sample}\), and \(\sigma = \text{standard deviation of the sample}\).

  5. Calculate the share of observations in your data sample with above average hours worked.

Appendix: Useful base-R functions

Appendix: A List of Useful R Built-in Functions

| Function | Description |
| --- | --- |
| length() | Get the length of a vector or list object |
| nrow(), ncol() | Get the number of rows or columns |
| dim() | Get the dimensions of the data |
| rbind(), cbind() | Combine R objects by rows or columns |
| colMeans(), rowMeans() | Calculate the mean of each column or row |
| with(), within() | Avoid repeatedly typing the data frame name and $ |


| Function | Description |
| --- | --- |
| sum(), mean(), var(), sd(), cov(), cor(), max(), min(), abs(), round() | Sum, mean, variance, standard deviation, covariance, correlation, maximum, minimum, absolute value, and rounding |
| log(), exp() | Logarithms and exponentials |
| sqrt() | Compute the square root of a numeric value |
| seq() | Generate a sequence of numbers |
| sample() | Randomly sample from a vector |
| rnorm() | Generate random numbers from a normal distribution |
| runif() | Generate random numbers from a uniform distribution |