Department of Applied Economics
University of Minnesota
data.table package%>% operator from the magrittr package (Optional)data.table?data.table is a package in R that provides an enhanced version of data.frame.
dplyr that is also popular for data wrangling. But data.table is much faster than dplyr particularly for large-scale data manipulation tasks.
dplyr syntax, this website would be helpful to understand data.table syntax.flights data, which is obtained from nycflights13.data.table.data.table package, the data must be in the data.table class.data.frame (or tibble) into a data.table by using the setDT() function.The general form of data.table syntax is
i: choose rows (filtering or subsetting)j: choose or transform columns (summaries, calculations, or selecting variables)by: group by variables (do the calculation in j separately for each group)Simply put,
Start with a data.table DT. First pick rows using i, then work on columns with j, and if needed, repeat that operation for each group defined by by.
Using data.table syntax, we will see how to:
data.table syntax: DT[i, j, by]
i.
DT[colA == "value", ] selects rows where column colA equals "value".Example
Subset rows where carrier is "AA" (American Airlines):
i: selects rows where carrier == "AA"
j: no action (all columns)by: no action (no grouping)
The key idea: all tasks related to rows are done inside i.
Example
Select flights where carrier is "AA":
Return the first 5 rows:
Exclude rows 1 to 10:
Sort by month (ascending) and then day (descending):
data.table syntax: DT[i, j, by]
j argumentExample:
Suppose we want to select dep_time column. Since we are not subsetting rows, we leave the i argument blank.
list(), the result will be returned as a data.table..() is simply shorthand for list() in data.table syntax.data.table, each column is internally stored as a list. When you use .() (or list()) in the j expression, each element of that list becomes a column in the resulting data.table.You can select multiple columns just like you did to select a single column.
data.table syntax: DT[i, j, by]
j not only allows you to select columns but also to compute on columnsExample
Let’s count the number of trips which have had total delay < 0 (i.e., total day = dep_delay + arr_delay).
What happens in this code?
i: no action (all rows are used)j: takes the sum of the logical vector arr_delay + dep_delay < 0by: no action (no grouping)Note: Since we skip the i expression, we must include a comma before the j expression.
data.table syntax: DT[i, j, by]
i and j expressions together, you can perform calculations on the selected columns of the subsetted rows.Example
How many flights departed from “JFK” airport in the month of June?
What happens in this code?
i : to select rows where origin airport equals “JFK”, and month equals 6.j : to count the number of rows in the subsetted data.by : no action (no grouping)j..() is a shorthand for list() in data.table syntax. You can name each element inside .() just like naming elements in a regular list.Example
Count how many flights departed from JFK airport in June. For those flights, calculate the average departure delay (dep_delay).
origin, month, arr_delay, dep_delay
mean() function to calculate averagesdata.table syntax: DT[i, j, by]
j expression, you can add or update a column in the data table using the := operator.
:= as a special assignment operator inside data.table. It modifies the data table by reference (changes the original table without making a copy).Syntax
Using the dataset below, create a new column c that is the sum of columns a and b.
Important Rule
The operator := creates new columns by updating the data in place (by reference). This means the original data table is directly modified.
Here is how you define multiple variables at the same time.
The := operator in data.table does not allow you to reference newly created or modified columns within the same [ expression.
If you want to use a new column in another calculation, you need a second [ step.
Example
c by adding a and b, and (2) d by dividing c by a.i and j expressions together, you can change the column values for rows that satisfy certain conditions.Example:
Keeping the original data:
If you want to keep the original dataset unchanged, use the data.table::copy() function to create a duplicate.
The object created with copy() is completely independent: changes to one will not affect the other.
data.table syntax: DT[i, j, by]
by argument.Syntax
Example: Let’s find the number of flights by origin.
What happens in this code?
i: no action (all rows)j: count the number of rows in each group defined by by argumentby: group the data by origin
Nothing special. Just provide multiple columns to by argument.
Example: Find the average time of departure delay and arrival delay by carrier and origin.
By combining the i argument with by, you can perform grouped operations on a subset of rows.
Example 1: Get the number of flights for each origin airport for carrier code “AA” (American Airlines).
What happens in this code? - i: subset rows where carrier is “AA” - j: count the number of rows in each group defined by by argument - by: group the data by origin
Example 2: Find the number of flights by origin and month for carrier code “AA” (American Airlines).
month and each carrier, calculate the total number of flights, average departure delay, and average arrival delay.Note: I used fcase() function of data.table package to define seasons. It is useful when you want to define a variable that takes different values based on conditions.
So far, we have covered the basic operations in the data.table package.
Focus on these key ideas:
DT[i, j, by]:
i → rowsj → columnsby → groupsi for anything related to rows.
j for anything related to columns.
.()), or add/update columns with :=.by for anything related to grouped operations.
With just these three pieces (i, j, and by), you can handle most data manipulation tasks in data.table.
Next, we will see a few advanced topics:
%>% operator if we have time).Data often comes in two formats: long or wide.
Example:
Long data:
Each student appears in multiple rows (one per year).
student year math reading
<char> <num> <num> <num>
1: Alice 2021 78 82
2: Alice 2022 85 88
3: Bob 2021 92 90
4: Bob 2022 95 93
5: Charlie 2021 88 85
6: Charlie 2022 90 87
7: Diana 2021 70 75
8: Diana 2022 80 83
Wide data
Each student appears in one row, with columns for each year’s scores.
student math_2021 math_2022 reading_2021 reading_2022
<char> <num> <num> <num> <num>
1: Alice 78 85 82 88
2: Bob 92 95 90 93
3: Charlie 88 90 85 87
4: Diana 70 80 75 83
dcast() and melt() functions of data.table package.dcast() function converts long form to wide formBasic Syntax:
LHS: set of id variables (variables (columns) that you don’t want change).RHS: set of variables to be used as the column index.value.var: set of variables whose values will be filled to cast.Example:
Suppose that we want to organize the data so that each student’s math and reading scores appear in the same row.
Tips
LHS, RHS, and value.var.melt() function to convert wide form to long formBasic Syntax:
id.vars: the set of id variables (variables (columns) that you don’t want change).measure.vars: the set of columns you want to collapse (or combine) together.value.name: (optional) the name of the new column that will store the values of the variables in measure.vars, the default is value
Example:
Let’s get back to the original data format student_long from student_wide.
Summarizing is easier in long form.
Visualization is easier in long form.
ggplot later).You can use the merge() function from the data.table package to merge two datasets.
Basic Syntax:
x, y: data tables.by: specifies variables that let you merge two datasets.all.x = TRUE means that all rows from x are maintained in the merged dataset, and only matching rows from y are included.Note: The order of the datasets matter.
Let’s play around with the merge() function using the following small data.
Data 1
Data2
math_data and reading_data are slightly different. This is to show how the merge() function works when there are unmatched rows in the two datasets.student works because it is the key column. (1) Merge reading_data to math_data, keeping all rows from math_data.
(2) Merge math_data to reading_data, keeping all rows from reading_data.
(3) If you want to keep all rows from both datasets, set all = TRUE.
flights data, the carrier column contains two-letter codes for airlines. Let’s translate these codes into the full name of the airline.Airline data from nycflights13 package contains the full name of the airline corresponding to the two-letter code. The following code loads the airline data.
Merge flights and airlines data, keeping all rows from the flights data. Which variable should be used as a key column?
Obviously, flights data is the main data, so we should keep all rows from the flights data.
The key column should be carrier because it is the common variable in both datasets, and it gives one-to-one correspondence between the two datasets.
%>% operatorExample
Let’s first create flights_mini data from flights data of nycflights13 package in the data.table format.
The first three lines yield intermediate results to make the final flight_mini, and you don’t need to keep those.
You can create flights_mini without using those intermediate steps with the chaining operation in data.table package, but it’s hard to read!
%>% a special symbol in R, called a pipe operator. It comes from the magrittr package.Note: When you load the dplyr package, magrittr package is automatically loaded as well. So, you don’t need to load the magrittr package separately to use %>%.
%>% takes the output of the code on its left and feeds it as the first argument to the function on its right.
Example 1
is the same as
Example 2
is the same as
What if you want to use the object defined before %>% as the second or third argument of the subsequent function?
You can refer the preceding object by . in the subsequent function.
Example
Tip - Whenever you use %>%, I recommend you always use . in the subsequent function to explicitly denote the destination of the object defined before %>% even if it is the first argument.
Without %>%
With %>%
Note that the order of execution is the same as the order in which the functions are written.
The topics in the second part of this lecture were more advanced, so you don’t need to memorize every function right away.
What I want you to remember are the following key ideas:
You can reshape data using the functions dcast() and melt(). Depending on your goal, one format (wide or long) may be easier to analyze than the other.
You can merge datasets using the merge() function, but you must have at least one common key column between the datasets.
You don’t need to use %>% operator, unless you thinks it would be more convenient.
Find the flight company with the longest departure delay. (Hint: use max() function to find the maximum value of dep_delay column)
Subset the information of flights that headed to MSP (Minneapolis-St Paul International Airport) in February. Let’s name it “msp_feb_flights”. How many flights are there?
Calculate the median, interquartile range (\(IQR = Q3 − Q1\)) for arr_delays of flights in in the msp_feb_flights dataset and the number of flights, grouped by carrier. Which carrier has the most variable arrival delays?
quantile() function to calculate the quantiles.If you were selecting an airport simply based on on-time departure percentage, which NYC airport would you choose to fly out of? - To address this question, first, define a new variable which indicates on-time departure. On-time-departure can be defined as a departure delay of less than or equal to 0. Then, calculate the on-time departure rate for each airport.
For this exercise problem, we will use journal data from the AER package.
- First, load the data and convert it to data.table object using setDT function (or. as.data.table()). Take a look at the data.
- Also, type ?journal to see the description of the data.
Calculate the average number of pages and price for the entire dataset.
Show the title, citations, price, and subs columns for the top 5 journals (title) with the highest number of citations (citations). (Hint: use order() function to sort the data by citations in descending order.).
This dataset is created in the year 2000. Calculate the age (age) of each journal by subtracting the start year (foundingyear) of the journal from 2000. Select the columns, price, subs, citations, and pages, and age. Use that data to create a correlation matrix between those variables using the cor() function. (Hint: use this syntax: cor(data)). Can you find anything interesting from the correlation matrix?
.Ncopy()setnames()order()shift()duplicated(): find duplicatesunique(): find unique observationsfcase()fcase() function is useful when you want to define a variable that takes different values based on conditions.
fcase() function returns the first value for which the corresponding condition is TRUE. If no condition is TRUE, it returns the default value.
Example: Define seasons (Winter: Dec-Feb, Spring: Mar-May, Summer: Jun-Aug, Fall: Sep-Nov)