Department of Applied Economics
University of Minnesota
data.table
package%>%
operator from the magrittr
package (Optional)data.table
?data.table
is a package in R that provides an enhanced version of data.frame
.
dplyr
that is also popular for data wrangling. But data.table
is much faster than dplyr
particularly for large-scale data manipulation tasks.
dplyr
syntax, this website would be helpful to understand data.table
syntax.flights
data, which is obtained from nycflights13
.data.table
.data.table
package, the data must be in the data.table
class.data.frame
(or tibble) into a data.table
by using the setDT()
function.The general form of data.table
syntax is
i
: choose rows (filtering or subsetting)j
: choose or transform columns (summaries, calculations, or selecting variables)by
: group by variables (do the calculation in j
separately for each group)Simply put,
Start with a data.table DT
. First pick rows using i
, then work on columns with j
, and if needed, repeat that operation for each group defined by by.
Using data.table syntax, we will see how to:
data.table
syntax: DT[i, j, by]
i
.
DT[colA == "value", ]
selects rows where column colA
equals "value"
.Example
Subset rows where carrier
is "AA"
(American Airlines):
i
: selects rows where carrier == "AA"
j
: no action (all columns)by
: no action (no grouping)
The key idea: all tasks related to rows are done inside i
.
Example
Select flights where carrier
is "AA"
:
Return the first 5 rows:
Exclude rows 1 to 10:
Sort by month
(ascending) and then day
(descending):
data.table
syntax: DT[i, j, by]
j
argumentExample:
Suppose we want to select dep_time
column. Since we are not subsetting rows, we leave the i
argument blank.
list()
, the result will be returned as a data.table
..()
is simply shorthand for list()
in data.table
syntax.data.table
, each column is internally stored as a list. When you use .()
(or list()
) in the j
expression, each element of that list becomes a column in the resulting data.table
.You can select multiple columns just like you did to select a single column.
data.table
syntax: DT[i, j, by]
j
not only allows you to select columns but also to compute on columnsExample
Let’s count the number of trips which have had total delay < 0 (i.e., total day = dep_delay
+ arr_delay
).
What happens in this code?
i
: no action (all rows are used)j
: takes the sum of the logical vector arr_delay + dep_delay < 0
by
: no action (no grouping)Note: Since we skip the i
expression, we must include a comma before the j
expression.
data.table
syntax: DT[i, j, by]
i
and j
expressions together, you can perform calculations on the selected columns of the subsetted rows.Example
How many flights departed from “JFK” airport in the month of June?
What happens in this code?
i
: to select rows where origin airport equals “JFK”, and month equals 6.j
: to count the number of rows in the subsetted data.by
: no action (no grouping)j
..()
is a shorthand for list()
in data.table syntax. You can name each element inside .()
just like naming elements in a regular list.Example
Count how many flights departed from JFK airport in June. For those flights, calculate the average departure delay (dep_delay).
origin
, month
, arr_delay
, dep_delay
mean()
function to calculate averagesdata.table
syntax: DT[i, j, by]
j
expression, you can add or update a column in the data table using the :=
operator.
:=
as a special assignment operator inside data.table
. It modifies the data table by reference (changes the original table without making a copy).Syntax
Using the dataset below, create a new column c
that is the sum of columns a
and b
.
Important Rule
The operator :=
creates new columns by updating the data in place (by reference). This means the original data table is directly modified.
Here is how you define multiple variables at the same time.
The :=
operator in data.table
does not allow you to reference newly created or modified columns within the same [
expression.
If you want to use a new column in another calculation, you need a second [
step.
Example
c
by adding a
and b
, and (2) d
by dividing c
by a
.i
and j
expressions together, you can change the column values for rows that satisfy certain conditions.Example:
Keeping the original data:
If you want to keep the original dataset unchanged, use the data.table::copy()
function to create a duplicate.
The object created with copy()
is completely independent: changes to one will not affect the other.
data.table
syntax: DT[i, j, by]
by
argument.Syntax
Example: Let’s find the number of flights by origin
.
What happens in this code?
i
: no action (all rows)j
: count the number of rows in each group defined by by
argumentby
: group the data by origin
Nothing special. Just provide multiple columns to by
argument.
Example: Find the average time of departure delay and arrival delay by carrier
and origin
.
By combining the i
argument with by, you can perform grouped operations on a subset of rows.
Example 1: Get the number of flights for each origin airport for carrier code “AA” (American Airlines).
What happens in this code? - i
: subset rows where carrier
is “AA” - j
: count the number of rows in each group defined by by
argument - by
: group the data by origin
Example 2: Find the number of flights by origin
and month
for carrier code “AA” (American Airlines).
month
and each carrier
, calculate the total number of flights, average departure delay, and average arrival delay.Note: I used fcase()
function of data.table
package to define seasons. It is useful when you want to define a variable that takes different values based on conditions.
So far, we have covered the basic operations in the data.table
package.
Focus on these key ideas:
DT[i, j, by]
:
i
→ rowsj
→ columnsby
→ groupsi
for anything related to rows.
j
for anything related to columns.
.()
), or add/update columns with :=
.by
for anything related to grouped operations.
With just these three pieces (i
, j
, and by
), you can handle most data manipulation tasks in data.table
.
Next, we will see a few advanced topics:
%>%
operator if we have time).Data often comes in two formats: long or wide.
Example:
Long data:
Each student appears in multiple rows (one per year).
student year math reading
<char> <num> <num> <num>
1: Alice 2021 78 82
2: Alice 2022 85 88
3: Bob 2021 92 90
4: Bob 2022 95 93
5: Charlie 2021 88 85
6: Charlie 2022 90 87
7: Diana 2021 70 75
8: Diana 2022 80 83
Wide data
Each student appears in one row, with columns for each year’s scores.
student math_2021 math_2022 reading_2021 reading_2022
<char> <num> <num> <num> <num>
1: Alice 78 85 82 88
2: Bob 92 95 90 93
3: Charlie 88 90 85 87
4: Diana 70 80 75 83
dcast()
and melt()
functions of data.table
package.dcast()
function converts long form to wide formBasic Syntax:
LHS
: set of id variables (variables (columns) that you don’t want change).RHS
: set of variables to be used as the column index.value.var
: set of variables whose values will be filled to cast.Example:
Suppose that we want to organize the data so that each student’s math and reading scores appear in the same row.
Tips
LHS
, RHS
, and value.var
.melt()
function to convert wide form to long formBasic Syntax:
id.vars
: the set of id variables (variables (columns) that you don’t want change).measure.vars
: the set of columns you want to collapse (or combine) together.value.name
: (optional) the name of the new column that will store the values of the variables in measure.vars
, the default is value
Example:
Let’s get back to the original data format student_long
from student_wide
.
Summarizing is easier in long form.
Visualization is easier in long form.
ggplot
later).You can use the merge()
function from the data.table
package to merge two datasets.
Basic Syntax:
x
, y
: data tables.by
: specifies variables that let you merge two datasets.all.x = TRUE
means that all rows from x
are maintained in the merged dataset, and only matching rows from y
are included.Note: The order of the datasets matter.
Let’s play around with the merge()
function using the following small data.
Data 1
Data2
math_data
and reading_data
are slightly different. This is to show how the merge()
function works when there are unmatched rows in the two datasets.student
works because it is the key column. (1) Merge reading_data
to math_data
, keeping all rows from math_data
.
(2) Merge math_data
to reading_data
, keeping all rows from reading_data
.
(3) If you want to keep all rows from both datasets, set all = TRUE
.
flights
data, the carrier
column contains two-letter codes for airlines. Let’s translate these codes into the full name of the airline.Airline data from nycflights13
package contains the full name of the airline corresponding to the two-letter code. The following code loads the airline data.
Merge flights
and airlines
data, keeping all rows from the flights
data. Which variable should be used as a key column?
Obviously, flights
data is the main data, so we should keep all rows from the flights
data.
The key column should be carrier
because it is the common variable in both datasets, and it gives one-to-one correspondence between the two datasets.
%>%
operatorExample
Let’s first create flights_mini
data from flights
data of nycflights13
package in the data.table
format.
The first three lines yield intermediate results to make the final flight_mini
, and you don’t need to keep those.
You can create flights_mini
without using those intermediate steps with the chaining operation in data.table
package, but it’s hard to read!
%>%
a special symbol in R, called a pipe operator. It comes from the magrittr
package.Note: When you load the dplyr
package, magrittr
package is automatically loaded as well. So, you don’t need to load the magrittr
package separately to use %>%
.
%>%
takes the output of the code on its left and feeds it as the first argument to the function on its right.
Example 1
is the same as
Example 2
is the same as
What if you want to use the object defined before %>%
as the second or third argument of the subsequent function?
You can refer the preceding object by .
in the subsequent function.
Example
Tip - Whenever you use %>%
, I recommend you always use .
in the subsequent function to explicitly denote the destination of the object defined before %>%
even if it is the first argument.
Without %>%
With %>%
Note that the order of execution is the same as the order in which the functions are written.
The topics in the second part of this lecture were more advanced, so you don’t need to memorize every function right away.
What I want you to remember are the following key ideas:
You can reshape data using the functions dcast()
and melt()
. Depending on your goal, one format (wide or long) may be easier to analyze than the other.
You can merge datasets using the merge()
function, but you must have at least one common key column between the datasets.
You don’t need to use %>%
operator, unless you thinks it would be more convenient.
Find the flight company with the longest departure delay. (Hint: use max()
function to find the maximum value of dep_delay
column)
Subset the information of flights that headed to MSP (Minneapolis-St Paul International Airport) in February. Let’s name it “msp_feb_flights”. How many flights are there?
Calculate the median, interquartile range (\(IQR = Q3 − Q1\)) for arr_delays
of flights in in the msp_feb_flights
dataset and the number of flights, grouped by carrier
. Which carrier has the most variable arrival delays?
quantile()
function to calculate the quantiles.If you were selecting an airport simply based on on-time departure percentage, which NYC airport would you choose to fly out of? - To address this question, first, define a new variable which indicates on-time departure. On-time-departure can be defined as a departure delay of less than or equal to 0. Then, calculate the on-time departure rate for each airport.
For this exercise problem, we will use journal
data from the AER
package.
- First, load the data and convert it to data.table object using setDT
function (or. as.data.table()
). Take a look at the data.
- Also, type ?journal
to see the description of the data.
Calculate the average number of pages and price for the entire dataset.
Show the title
, citations
, price
, and subs
columns for the top 5 journals (title
) with the highest number of citations (citations
). (Hint: use order()
function to sort the data by citations
in descending order.).
This dataset is created in the year 2000. Calculate the age (age
) of each journal by subtracting the start year (foundingyear
) of the journal from 2000. Select the columns, price
, subs
, citations
, and pages
, and age
. Use that data to create a correlation matrix between those variables using the cor()
function. (Hint: use this syntax: cor(data)
). Can you find anything interesting from the correlation matrix?
.N
copy()
setnames()
order()
shift()
duplicated()
: find duplicatesunique()
: find unique observationsfcase()
fcase()
function is useful when you want to define a variable that takes different values based on conditions.
fcase()
function returns the first value for which the corresponding condition is TRUE
. If no condition is TRUE
, it returns the default value.
Example: Define seasons (Winter: Dec-Feb, Spring: Mar-May, Summer: Jun-Aug, Fall: Sep-Nov)