Department of Applied Economics
University of Minnesota
To run a code chunk, press Command + Enter (Mac) or Control + Enter (Windows) on your keyboard. Alternatively, you can click the “Run Code” button in the top-left corner of the code chunk.
We’ll cover many basic topics today.
You don’t need to memorize or completely understand everything in this lecture.
At the end of each section, I will include a summary of the key points you need to know. As long as you understand those key points, you are good to go.
R is object-oriented: Everything in R is an “object” that you can name and reuse.
Creating objects: Use `<-` or `=` to store information in objects. `x <- 1` assigns 1 to an object called `x`.
Objects can be overwritten: If you use the same name twice, the new value replaces the old one.
View your objects: Simply type the object name to see what’s stored inside.
Example
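A minimal sketch of the three points above (creating, overwriting, and viewing objects). The object names and values are made up for illustration.

```r
# Create an object called x that stores the value 1
x <- 1
x            # typing the name prints what is stored inside

# Reusing the same name replaces the old value
x <- 10
x

# `=` also assigns a value
student_age = 21
student_age
```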
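Use underscores `_` or dots `.` to separate words in names.
Descriptive names such as `student_age` and `exam_scores` are easier to read than vague names such as `x`, `data1`, and `thing`.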
Install a package: `install.packages("package_name")`
Load a package: `library(package_name)`
Do you see the error `could not find function "xxxx"`? → Load the package!
These are the basic data elements in R.
Data Type | Description | Example |
---|---|---|
numeric | General number, can be integer or decimal. | `5.2`, `3L` (the `L` makes it integer) |
character | Text or string data. | `"Hello, R!"` |
logical | Boolean values. | `TRUE`, `FALSE` |
integer | Whole numbers. | `2L`, `100L` |
complex | Numbers with real and imaginary parts. | `3 + 2i` |
raw | Raw bytes. | `charToRaw("Hello")` |
factor | Categorical data. Can have ordered and unordered categories. | `factor(c("low", "high", "medium"))` |
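The types you will use most often are `numeric`, `character`, and `logical`.
Character values must be wrapped in quotes: write `"Hello"` or `'Hello'`, not `Hello` (without quotes).
Use `class()` or `is.XXX()` to examine the data types.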
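A short sketch of checking data types with `class()` and the `is.XXX()` family. The object names are hypothetical.

```r
score    <- 5.2
count    <- 3L
greeting <- "Hello, R!"
passed   <- TRUE

class(score)      # "numeric"
class(count)      # "integer"
class(greeting)   # "character"
class(passed)     # "logical"

is.numeric(count)     # TRUE: integers also count as numeric
is.character(score)   # FALSE
```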
Convert between data types using `as.XXX()` functions:
`as.numeric()` → converts to numbers
`as.character()` → converts to text
`as.factor()` → converts to categories
Logical values are `TRUE`, `FALSE`, and `NA` (not available/undefined).
They are often generated by comparison operators: `<`, `>`, `<=`, `>=`, `==`, `!=`.
Logical operators include `&` (and), `|` (or), and `!` (not).
Every comparison evaluates to `TRUE`, `FALSE`, or `NA`.
When treated as numbers, `TRUE` equals `1` and `FALSE` equals `0`.
Logical values can be used as indices to subset vectors or data.
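The properties of logical values listed above can be sketched as follows. The vector `ages` and its values are invented for illustration.

```r
ages <- c(18, 25, 31, 17, 40)

# A comparison returns one logical value per element
is_adult <- ages >= 18
is_adult                  # TRUE TRUE TRUE FALSE TRUE

# Combine conditions with & (and), | (or), ! (not)
is_adult & ages < 30      # TRUE TRUE FALSE FALSE FALSE
!is_adult                 # FALSE FALSE FALSE TRUE FALSE

# TRUE counts as 1 and FALSE as 0
sum(is_adult)             # 4
mean(is_adult)            # 0.8 (share of TRUE values)

# Logical values as an index: keep elements where the index is TRUE
ages[is_adult]            # 18 25 31 40

# A comparison involving NA evaluates to NA
NA > 1
```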
Key points
R defines several basic data types, including `numeric`, `character`, and `logical`.
Use the `class()` function to check the data type of an object.
Use `as.XXX()` functions to convert an object from one type to another.
Logical values play an important role in many R operations.
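As a recap of the conversion key point, here is a small sketch of the `as.XXX()` functions. The input values are arbitrary examples.

```r
as.numeric("3.14")     # text to number: 3.14
as.character(100)      # number to text: "100"
as.numeric(TRUE)       # logical to number: 1

# Text to categories; levels are sorted alphabetically by default
sizes <- as.factor(c("low", "high", "low"))
levels(sizes)          # "high" "low"

# Text that is not a number becomes NA (R also prints a warning)
not_a_number <- suppressWarnings(as.numeric("abc"))
not_a_number           # NA
```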
R provides several types of data structures for storing data.
Data Structure | Description | Creation Function | Example |
---|---|---|---|
Vector | One-dimensional; holds elements of the same type. | `c()` | `c(1, 2, 3, 4)` |
Matrix | Two-dimensional; holds elements of the same type. | `matrix()` | `matrix(1:4, ncol=2)` |
Array | Multi-dimensional; holds elements of the same type. | `array()` | `array(c(1:12), dim = c(2, 3, 2))` |
List | Can hold elements of different types. | `list()` | `list(name="John", age=30, scores=c(85, 90, 92))` |
Data Frame | Like a table; each column can hold a different data type. This is the most common data structure. | `data.frame()` | `data.frame(name=c("John", "Jane"), age=c(30, 25))` |
Use `c()` to create a vector or to combine vectors (`c` stands for combine).
Basic syntax
You can name each element in a vector:
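A minimal sketch of creating, combining, and naming vectors with `c()`. The vector names and values are hypothetical.

```r
# Create a vector
scores <- c(85, 90, 92)

# Combine an existing vector with new values
more_scores <- c(scores, 78, 88)
more_scores            # 85 90 92 78 88

# Name the elements when creating the vector ...
named_scores <- c(math = 85, english = 90, science = 92)

# ... or add names afterwards with names()
names(scores) <- c("math", "english", "science")
scores
```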
Basics
Use square brackets `[]` to extract one or more elements from a vector by their position.
If a vector has names, you can extract elements using their names.
To update an element, assign a new value to the position (or name) you want to change.
Example
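The indexing rules above can be sketched as follows, assuming a small named vector (names and values are made up).

```r
scores <- c(math = 85, english = 90, science = 92)

scores[1]              # first element
scores[c(1, 3)]        # first and third elements
scores["english"]      # extract by name

# Update elements by position or by name
scores[2] <- 95
scores["science"] <- 100
scores                 # math 85, english 95, science 100
```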
Logical vectors (`TRUE` and `FALSE`) can also be used as an index.
Only the elements whose index is `TRUE` are returned.
Example
The following code randomly samples 30 numbers from a uniform distribution between 0 and 1, and stores the result in `x`.
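A minimal sketch of such code, assuming `runif()` is used for the sampling. The seed and the 0.5 threshold are arbitrary choices for illustration.

```r
set.seed(123)                        # assumed seed, only for reproducibility
x <- runif(30, min = 0, max = 1)     # 30 draws from Uniform(0, 1)

# Logical index: TRUE where the value is larger than 0.5
is_large <- x > 0.5

x[is_large]        # only the elements where the index is TRUE
sum(is_large)      # how many elements are larger than 0.5
```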
Questions
A matrix is a collection of elements of the same type arranged in rows and columns (essentially a vector with an added dimension attribute).
In practice, matrices are less common for real-world data storage and are used mainly for linear algebra operations.
Use the `matrix()` function to create a matrix.
Syntax
You need to specify the `vector_data`, the `number_of_rows`, and the `number_of_columns`.
If the length of `vector_data` is a multiple of `number_of_columns` (or `number_of_rows`), R fills in the other dimension automatically.
By default, values are filled by column. Use `byrow = TRUE` to fill by row.
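A short sketch of `matrix()` with the options described above. The object names are hypothetical.

```r
# Specify both dimensions; values are filled column by column
m_col <- matrix(1:6, nrow = 2, ncol = 3)
m_col
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

# Give only the number of columns; R works out the number of rows
m_auto <- matrix(1:6, ncol = 3)
dim(m_auto)            # 2 3

# Fill row by row instead
m_row <- matrix(1:6, nrow = 2, byrow = TRUE)
m_row
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6
```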
You can access matrix elements with `[]`.
Specify the row index and column index: `[row, col]`.
Leave one index blank to select an entire row or column.
Example
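A minimal sketch of `[row, col]` indexing, assuming a small 2 x 3 matrix filled by column.

```r
m <- matrix(1:6, nrow = 2)

m[2, 3]      # element in row 2, column 3: 6
m[1, ]       # entire first row: 1 3 5
m[, 2]       # entire second column: 3 4
```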
You can add column names and row names to a matrix using the `colnames()` and `rownames()` functions. If a matrix has column names and row names, you can use the names as the index.
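A sketch of naming rows and columns and then indexing by name. The names `a`, `b`, `c`, `r1`, and `r2` are invented.

```r
m <- matrix(1:6, nrow = 2)

colnames(m) <- c("a", "b", "c")
rownames(m) <- c("r1", "r2")
m

m["r2", "c"]     # element in row "r2", column "c": 6
m[, "b"]         # entire column "b" (values 3 and 4)
```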
Use the following matrix:
Questions
A `data.frame` class object is similar to a matrix, but each column can store a different data type.
It is designed for tabular data, which makes it the most common structure in real-world datasets.
Syntax
Example
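A minimal sketch of `data.frame()`. The name `df_student` and the columns `Age` and `GPA` appear later in this lecture, while the `Name` column and all values are assumptions made for illustration.

```r
df_student <- data.frame(
  Name = c("John", "Jane", "Alex"),   # character column
  Age  = c(30, 25, 19),               # numeric column
  GPA  = c(3.2, 3.8, 3.5)             # numeric column
)

df_student
class(df_student)    # "data.frame"
nrow(df_student)     # 3
ncol(df_student)     # 3
```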
X1, X2, X3).
You can access elements of a `data.frame` using square brackets `[]`.
Specify the row and column index, similar to a matrix.
Indexing options include:
By position (e.g., `df[1, 2]`)
By name (e.g., `df[ , "Age"]`)
By logical condition (e.g., `df[df$Age > 20, ]`)
You can extract a single column from a `data.frame` using the `$` or `[[ ]]` operator.
`$` and `[[ ]]` can only return one column at a time as a vector, while `[]` can select multiple columns.
Inside `[[ ]]`, provide the column name as a character (e.g., `df[["Age"]]`).
Why this matters: many R functions (`mean()`, `sum()`, `sqrt()`, etc.) work on vectors, and `$` / `[[ ]]` are the most direct way to extract a column as a vector for calculations.
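The indexing options and the `$` / `[[ ]]` operators can be sketched as follows. The data frame and its values are hypothetical.

```r
df_student <- data.frame(
  Name = c("John", "Jane", "Alex"),
  Age  = c(30, 25, 19),
  GPA  = c(3.2, 3.8, 3.5)
)

df_student[1, 2]                      # by position: row 1, column 2 -> 30
df_student[, "Age"]                   # by name: 30 25 19
df_student[df_student$Age > 20, ]     # by condition: rows for John and Jane
df_student[, c("Name", "GPA")]        # [] can select several columns

# $ and [[ ]] return a single column as a vector
df_student$Age
df_student[["Age"]]

# Vectors can go straight into functions such as mean()
mean(df_student$Age)                  # about 24.67
```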
You can add a new column to a `data.frame` object using the `$` operator.
Syntax
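A new column added to a `data.frame` should have the same length as the number of rows.
If the new column is shorter, R recycles its values only when the number of rows is an exact multiple of its length (a single value is repeated for every row); otherwise, R returns an error.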
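A short sketch of adding columns with `$`, including the single-value case. The column names `GPA2` and `Enrolled` and all values are assumptions.

```r
df_student <- data.frame(
  Name = c("John", "Jane", "Alex"),
  Age  = c(30, 25, 19),
  GPA  = c(3.2, 3.8, 3.5)
)

# New column with one value per row
df_student$GPA2 <- df_student$GPA^2

# A single value is repeated for every row
df_student$Enrolled <- TRUE

df_student
```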
We will use the built-in dataset `mtcars` for this exercise. Run the following code to load the data.
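One common way to load a built-in dataset is `data()`. This is a sketch and may differ from the code used in the lecture.

```r
# Load the built-in mtcars dataset into the workspace
data(mtcars)

head(mtcars)     # first six rows
dim(mtcars)      # 32 rows and 11 columns
```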
Questions
The `with()` function evaluates an expression inside a data frame.
For example, write `with(df_student, mean(Age))` instead of `mean(df_student$Age)`.
The `within()` function is similar, but it returns a modified copy of the data frame, so you can add or change columns.
For example: `df_student <- within(df_student, { GPA2 <- GPA^2 })`
With both functions, you don’t need to use `$`.
Example
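A minimal sketch of `with()` and `within()`, assuming a small `df_student` data frame with made-up values.

```r
df_student <- data.frame(
  Name = c("John", "Jane", "Alex"),
  Age  = c(30, 25, 19),
  GPA  = c(3.2, 3.8, 3.5)
)

# with(): refer to columns by name inside the expression
mean_age <- with(df_student, mean(Age))
mean_age

# within(): returns a modified copy, so assign the result back
df_student <- within(df_student, {
  GPA2 <- GPA^2
})
names(df_student)
```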
A list in R can store elements of different types and sizes: numbers, characters, vectors, matrices, data frames, or even other lists.
A list is a flexible container that can hold any combination of data structures.
Use the `list()` function to create a list.
You can access list elements using `$`, `[]`, or `[[ ]]`.
`[]` returns a list containing the selected elements.
`[[ ]]` returns a single element itself (not wrapped in a list).
`$` is shorthand for `[[ ]]`, but it only works if the list elements are named.
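The difference between `[]`, `[[ ]]`, and `$` can be sketched with a small named list. The list contents are taken from the example in the data structure table.

```r
person <- list(name = "John", age = 30, scores = c(85, 90, 92))

# [] keeps the result wrapped in a list
class(person["scores"])       # "list"

# [[ ]] returns the element itself
class(person[["scores"]])     # "numeric"

# $ is shorthand for [[ ]] when the element has a name
person$scores                 # 85 90 92
mean(person$scores)           # 89
```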
Key points
R provides several data structures, including `vector`, `matrix`, `data.frame`, and `list`.
You can access their elements with `[]`, `$`, and `[[ ]]`.
Arithmetic operations on vectors are performed element-wise.
This means the operation is applied to elements in the same position of each vector.
By default, `*` does element-wise multiplication.
To perform true matrix multiplication, use the `%*%` operator.
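A short sketch contrasting element-wise arithmetic with matrix multiplication. The vectors and matrices are arbitrary examples.

```r
a <- c(1, 2, 3)
b <- c(10, 20, 30)

a + b      # 11 22 33
a * b      # 10 40 90

A <- matrix(1:4, nrow = 2)   # columns: (1, 2) and (3, 4)
B <- matrix(5:8, nrow = 2)   # columns: (5, 6) and (7, 8)

A * B      # element-wise product
#      [,1] [,2]
# [1,]    5   21
# [2,]   12   32

A %*% B    # matrix product
#      [,1] [,2]
# [1,]   23   31
# [2,]   34   46
```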
R has two data formats of its own: `.Rdata` (or `.rdata`) and `.Rds` (or `.rds`).
`.Rdata` is used to save multiple R objects.
`.Rds` is used to save a single R object.
.Rdata format
Load: `load("path_to_Rdata_file")`
Save: `save(object_name1, object_name2, file = "path_to_Rdata_file")`
.Rds format
Load: `readRDS("path_to_Rds_file")`
Save: `saveRDS(object_name, file = "path_to_Rds_file")`
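A minimal round-trip sketch of both formats. The object names and file names are hypothetical, and the files are written to a temporary folder so nothing on your computer is overwritten.

```r
rdata_path <- file.path(tempdir(), "objects_example.Rdata")
rds_path   <- file.path(tempdir(), "object_example.rds")

obj_a <- 1:5
obj_b <- c("x", "y")

# .Rdata: save several objects, then restore them under their original names
save(obj_a, obj_b, file = rdata_path)
rm(obj_a, obj_b)
load(rdata_path)
obj_a

# .Rds: save one object, then read it back under any name you like
saveRDS(obj_a, file = rds_path)
restored <- readRDS(rds_path)
restored
```

Note the difference in usage: `load()` recreates the objects by itself, while the result of `readRDS()` has to be assigned to a name.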
To access a data file, you need to provide the path to the file (the location of the data file).
Example
Suppose that I want to load `data_example.rds` in the `Data` folder. On my computer, the full path (i.e., absolute path) to the file is `/Users/qingyin/Dropbox/Teaching/R_Review_2025/Data/data_example.rds`.
Why avoid hard-coding full paths?
Typing the full file path every time is cumbersome and slows you down.
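Hard-coded paths make your code less portable: a full path that exists on my computer will not exist on yours.
You can check the current working directory with `getwd()`.
If you often import or save data in a specific folder, it helps to set that folder as the working directory.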
Use `setwd()` to change the working directory:
Example
In my case, I set the working directory to the `R_Review_2025` folder.
Now, R will look for the data file in the `R_Review_2025` folder by default. So, I can load the data using a relative path instead of an absolute path.
Problems
“R experts keep all the files associated with a project together — input data, R scripts, analytical results, figures. This is such a wise and common practice that RStudio has built-in support for this via projects.” - R for Data Science Ch 8.4
RStudio Projects
An RStudio project is a way to organize your work.
When you open a Project, R automatically sets the working directory to the folder containing the `.Rproj` file, so there is no need for `setwd()`.
As long as the folder structure inside the Project is consistent, you can share code with teammates and relative paths will work for everyone.
Follow the steps illustrated in this document: R for Data Science Ch 8.4
In RStudio, check the top-right corner of the window to see the active Project name.
Alternatively, open a Project by double-clicking the `.Rproj` file in Finder (Mac) or File Explorer (Windows).
Use `getwd()` to confirm the current working directory; it should be the Project folder.
Load the `data_example.rds` data file with `readRDS()`.
Other common data formats include `.csv`, `.xls(x)`, and `.dta`.
Use `read.csv()` to read a `.csv` file.
Use `read_excel()` from the `readxl` package to read data sheets from an `.xls(x)` file.
Use the `read.dta13()` function from the `readstata13` package to read a Stata data file (`.dta`).
Use the `import()` function of the `rio` package
The `import()` function from the `rio` package might be the most convenient way to load data in various formats.
Unlike `read.csv()` and `read.dta13()`, which each specialize in reading a specific type of file, `import()` can load data from various sources.
In the `Data` folder, the `data_example` data is saved in three different formats: `data_example.csv`, `data_example.dta`, and `data_example.xlsx`. Let’s load the data using the `import()` function in RStudio.
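A minimal sketch of `import()` on a `.csv` file. The demo file and its columns are made up and written to a temporary folder; in class you would point to the files in the `Data` folder instead. The code falls back to `read.csv()` if the `rio` package is not installed.

```r
# Write a small demo .csv file (hypothetical data)
csv_path <- file.path(tempdir(), "data_example_demo.csv")
write.csv(data.frame(id = 1:3, wage = c(10, 12, 15)),
          csv_path, row.names = FALSE)

if (requireNamespace("rio", quietly = TRUE)) {
  # import() picks the reader based on the file extension
  dat <- rio::import(csv_path)
} else {
  # Fallback when rio is not available
  dat <- read.csv(csv_path)
}

dat
```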
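Data can be saved in many formats (`.csv`, `.dta`, `.xlsx`, etc.). The recommended format here is `.rds`.
`saveRDS(object_name, path_to_save)`
Why prefer `.rds`?
It tends to be more efficient than `.csv` or `.xlsx` when data gets larger. (Compare the `data_example` dataset saved in different formats.)
Let’s try!
Use the `data_example` data in the `Data` folder.
Key points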
An RStudio Project (`.Rproj`) is a useful tool to organize your work. As long as the folder structure under the `.Rproj` is the same, you can share code that loads data with your team members.
Use the `readRDS()` function for the `.Rds` (`.rds`) format.
Use the `import()` function from the `rio` package for various formats.
To save data, use the `.rds` format and the `saveRDS()` function.
Create a sequence of numbers from 20 to 50 and name it `x`. Let’s change the numbers that are multiples of 3 to 0.
`sample()` is commonly used in Monte Carlo simulation in econometrics. Run the following code to create `r`. What does it do? Use `?sample` to find out what the function does.
Find the mean and SD of vector `r` without using `mean()` and `sd()`.
Figure out which position contains the maximum value of vector `r`. (Use the `which()` function. Run `?which` to find out what the function does.)
Extract the values of `r` that are larger than 50.
Extract the values of `r` that are larger than 40 and smaller than 60.
Extract the values of `r` that are smaller than 20 or larger than 70.
Load the file `nscg17tiny.dta`. You can find the data in the `Data` folder.
Each row corresponds to a unique respondent. Let’s create a new column called “ID”. There are various ways to create an ID column. Here, let’s create an ID column that starts from 1 and increments by 1 for each row.
To take a quick look at the summary statistics of a specific column, the `summary()` function is useful. Use `summary()` to create a table of descriptive statistics for `hrswk`. You’ll provide the `hrswk` column to `summary()` as a vector.
Create a new variable in your data that represents the z-score of the hours worked (use the `hrswk` variable).
\(Z = (x - \mu)/\sigma\), where \(Z = \text{standard score}\), \(x =\text{observed value}\), \(\mu = \text{mean of sample}\), and \(\sigma = \text{standard deviation of the sample}\).
Calculate the share of observations in your data sample with above-average hours worked.
Function | Description |
---|---|
`length()` | Get the length of a vector or list object |
`nrow()`, `ncol()` | Get the number of rows or columns |
`dim()` | Get the dimensions of the data |
`rbind()`, `cbind()` | Combine R objects by rows or columns |
`colMeans()`, `rowMeans()` | Calculate the mean of each column or row |
`with()`, `within()` | You don’t need to use `$` |

Function | Description |
---|---|
`sum()`, `mean()`, `var()`, `sd()`, `cov()`, `cor()`, `max()`, `min()`, `abs()`, `round()` | |
`log()`, `exp()` | Logarithms and exponentials |
`sqrt()` | Compute the square root of a number |
`seq()` | Generate a sequence of numbers |
`sample()` | Randomly sample from a vector |
`rnorm()` | Generate random numbers from a normal distribution |
`runif()` | Generate random numbers from a uniform distribution |