Day 2: Data wrangling with data.table

Qingyin Cai

Department of Applied Economics
University of Minnesota

Learning Objectives

  • Use basic data wrangling skills with the data.table package
  • Learn how to use the %>% operator from the magrittr package (Optional)


Reference

Today’s outline:

  1. Data manipulation with data.table
  2. %>% operator (optional)
  3. After-class Exercise Problems
  4. Appendix

Introduction to data.table

What is data.table?

  • data.table is a package in R that provides an enhanced version of data.frame.
    • It is designed to be fast and memory efficient.


  • There is another package called dplyr that is also popular for data wrangling. But data.table is much faster than dplyr particularly for large-scale data manipulation tasks.
    • See this for the speed comparison of dplyr and data.table.
    • This website compares dplyr vs data.table side by side. If you already know dplyr syntax, this website would be helpful to understand data.table syntax.
  1. Let’s use flights data, which is obtained from nycflights13.


  1. Converting to data.table.
  • To use the special features of the data.table package, the data must be in the data.table class.
  • You can convert a data.frame (or tibble) into a data.table by using the setDT() function.

data.table syntax

The general form of data.table syntax is

  • i: choose rows (filtering or subsetting)
  • j: choose or transform columns (summaries, calculations, or selecting variables)
  • by: group by variables (do the calculation in j separately for each group)


Simply put,

Start with a data.table DT. First pick rows using i, then work on columns with j, and if needed, repeat that operation for each group defined by by.

Using data.table syntax, we will see how to:

  • subset rows
  • select columns, compute on the selected columns, create a new column
  • perform aggregations by group

1. Subset Rows

  • data.table syntax: DT[i, j, by]
  • To subset rows, put a condition on a column inside i.
    • Example: DT[colA == "value", ] selects rows where column colA equals "value".

Example

Subset rows where carrier is "AA" (American Airlines):


  • What happens here?
    • i: selects rows where carrier == "AA"
    • j: no action (all columns)
    • by: no action (no grouping)

2. Select Columns

  • data.table syntax: DT[i, j, by]
  • To select columns, use the j argument

Example:

Suppose we want to select dep_time column. Since we are not subsetting rows, we leave the i argument blank.

  • If we wrap variables (column names) in list(), the result will be returned as a data.table.
  • .() is simply shorthand for list() in data.table syntax.
  • Important: In data.table, each column is internally stored as a list. When you use .() (or list()) in the j expression, each element of that list becomes a column in the resulting data.table.


You can select multiple columns just like you did to select a single column.

3. Compute on Columns


Basics

  • data.table syntax: DT[i, j, by]
  • j not only allows you to select columns but also to compute on columns

Example

Let’s count the number of trips which have had total delay < 0 (i.e., total day = dep_delay + arr_delay).

What happens in this code?

  • i: no action (all rows are used)
  • j: takes the sum of the logical vector arr_delay + dep_delay < 0
  • by: no action (no grouping)

Note: Since we skip the i expression, we must include a comma before the j expression.

3. Compute on Columns of the Subsetted Rows

  • data.table syntax: DT[i, j, by]
  • Using i and j expressions together, you can perform calculations on the selected columns of the subsetted rows.


Example

How many flights departed from “JFK” airport in the month of June?


What happens in this code?

  • i : to select rows where origin airport equals “JFK”, and month equals 6.
  • j : to count the number of rows in the subsetted data.
  • by : no action (no grouping)
  • You can assign names to the values you calculate in j.
  • Recall that .() is a shorthand for list() in data.table syntax. You can name each element inside .() just like naming elements in a regular list.


Example

Count how many flights departed from JFK airport in June. For those flights, calculate the average departure delay (dep_delay).

  1. Find the average arrival delay and the average departure delay for flights that departed from JFK in August.
  • Hint:
    • Use the columns: origin, month, arr_delay, dep_delay
    • Use the mean() function to calculate averages
  1. Find the average arrival delay and the average departure delay for flights that departed from JFK in August.

4. Create a New Column

  • data.table syntax: DT[i, j, by]
  • In j expression, you can add or update a column in the data table using the := operator.
    • Think of := as a special assignment operator inside data.table. It modifies the data table by reference (changes the original table without making a copy).


Syntax

# === Add one column === #
DT[, "new_column_name" := .(valueA)]

# or you can drop the quotes and `.()` for convenience
DT[, new_column_name := valueA]

Using the dataset below, create a new column c that is the sum of columns a and b.


Important Rule

The operator := creates new columns by updating the data in place (by reference). This means the original data table is directly modified.

Here is how you define multiple variables at the same time.

  • The := operator in data.table does not allow you to reference newly created or modified columns within the same [ expression.

  • If you want to use a new column in another calculation, you need a second [ step.

Example

  • Let’s create two new columns: (1) c by adding a and b, and (2) d by dividing c by a.
  • Using i and j expressions together, you can change the column values for rows that satisfy certain conditions.

Example:


Keeping the original data:

  • If you want to keep the original dataset unchanged, use the data.table::copy() function to create a duplicate.

  • The object created with copy() is completely independent: changes to one will not affect the other.

Create two new columns in the flights data:

  • total_delay: the sum of dep_delay and arr_delay.
  • speed: the ratio of distance to air_time (i.e, distance/air_time.)

5. Perform Aggregations by Group (Grouped Operations)

  • data.table syntax: DT[i, j, by]
  • To perform grouped operations, use by argument.

Syntax

DT[, .(new_column = function(column)), by = .(group_variable)]


Example: Let’s find the number of flights by origin.


What happens in this code?

  • i: no action (all rows)
  • j: count the number of rows in each group defined by by argument
  • by: group the data by origin

Nothing special. Just provide multiple columns to by argument.


Example: Find the average time of departure delay and arrival delay by carrier and origin.

By combining the i argument with by, you can perform grouped operations on a subset of rows.


Example 1: Get the number of flights for each origin airport for carrier code “AA” (American Airlines).

What happens in this code? - i: subset rows where carrier is “AA” - j: count the number of rows in each group defined by by argument - by: group the data by origin


Example 2: Find the number of flights by origin and month for carrier code “AA” (American Airlines).

  1. For each month and each carrier, calculate the total number of flights, average departure delay, and average arrival delay.


  1. (Optional) Define seasons (Winter: Dec-Feb, Spring: Mar-May, Summer: Jun-Aug, Fall: Sep-Nov) and summarize the total number of flights, average departure delay, and average arrival delay for each season and each carrier.
  1. For each month and each carrier, calculate the total number of flights, average departure delay, and average arrival delay.


  1. (Optional) Define seasons (Winter: Dec-Feb, Spring: Mar-May, Summer: Jun-Aug, Fall: Sep-Nov) and summarize the total number of flights, average departure delay, and average arrival delay for each season and each carrier.

Note: I used fcase() function of data.table package to define seasons. It is useful when you want to define a variable that takes different values based on conditions.

Summary

So far, we have covered the basic operations in the data.table package.

Focus on these key ideas:

  • The general syntax is DT[i, j, by]:
    • i → rows
    • j → columns
    • by → groups
  • Use i for anything related to rows.
    • Example: filter rows with conditions.
  • Use j for anything related to columns.
    • Example: select columns, compute new values (use .()), or add/update columns with :=.
  • Use by for anything related to grouped operations.
    • Example: calculate summaries by group.


With just these three pieces (i, j, and by), you can handle most data manipulation tasks in data.table.

Next, we will see a few advanced topics:

  • Reshaping Data
  • Merging Multiple Datasets
  • (and the %>% operator if we have time).

6. Reshape Data

Data often comes in two formats: long or wide.

Example:

Long data:

Each student appears in multiple rows (one per year).

   student  year  math reading
    <char> <num> <num>   <num>
1:   Alice  2021    78      82
2:   Alice  2022    85      88
3:     Bob  2021    92      90
4:     Bob  2022    95      93
5: Charlie  2021    88      85
6: Charlie  2022    90      87
7:   Diana  2021    70      75
8:   Diana  2022    80      83

Wide data

Each student appears in one row, with columns for each year’s scores.

   student math_2021 math_2022 reading_2021 reading_2022
    <char>     <num>     <num>        <num>        <num>
1:   Alice        78        85           82           88
2:     Bob        92        95           90           93
3: Charlie        88        90           85           87
4:   Diana        70        80           75           83


  • We can convert one format to another using dcast() and melt() functions of data.table package.
  • Use dcast() function converts long form to wide form

Basic Syntax:

dcast(data, LHS ~ RHS , value.var = c("var1", "var2"))
  • LHS: set of id variables (variables (columns) that you don’t want change).
  • RHS: set of variables to be used as the column index.
  • value.var: set of variables whose values will be filled to cast.


Example:

Suppose that we want to organize the data so that each student’s math and reading scores appear in the same row.


Tips

  • Before coding a reshape, first visualize the format you want the data to take.
  • I often sketch a small example table.
    • This helps me to understand what variables I need to use as LHS, RHS, and value.var.
  • Use melt() function to convert wide form to long form

Basic Syntax:

melt(data, id.var = c("id_var1", "id_var2"), measure.vars = c("var1", "var2"))
  • id.vars: the set of id variables (variables (columns) that you don’t want change).
  • measure.vars: the set of columns you want to collapse (or combine) together.
  • value.name: (optional) the name of the new column that will store the values of the variables in measure.vars, the default is value


Example:

Let’s get back to the original data format student_long from student_wide.

  • Notice that the year info is stored as variable (1, 2).

Summarizing is easier in long form.

  • Example: average math/reading score by year.


Visualization is easier in long form.

  • Example: plotting scores by year (when we use ggplot later).

Using the following long-form data named long_data, can you get back student_long?


7. Merge Multiple Datasets

You can use the merge() function from the data.table package to merge two datasets.

Basic Syntax:

# Merge data y to data x keeping all rows from data x
merge(x, y, by = "key_column", all.x = TRUE)
  • x, y: data tables.
  • by: specifies variables that let you merge two datasets.
  • all.x = TRUE means that all rows from x are maintained in the merged dataset, and only matching rows from y are included.

Note: The order of the datasets matter.

Let’s play around with the merge() function using the following small data.

Data 1

Data2

  • Note that the students included inmath_data and reading_data are slightly different. This is to show how the merge() function works when there are unmatched rows in the two datasets.
  • To merge these two datasets, student works because it is the key column.

(1) Merge reading_data to math_data, keeping all rows from math_data.


(2) Merge math_data to reading_data, keeping all rows from reading_data.


(3) If you want to keep all rows from both datasets, set all = TRUE.

  1. In the flights data, the carrier column contains two-letter codes for airlines. Let’s translate these codes into the full name of the airline.

Airline data from nycflights13 package contains the full name of the airline corresponding to the two-letter code. The following code loads the airline data.


Merge flights and airlines data, keeping all rows from the flights data. Which variable should be used as a key column?

  • Obviously, flights data is the main data, so we should keep all rows from the flights data.

  • The key column should be carrier because it is the common variable in both datasets, and it gives one-to-one correspondence between the two datasets.

Run the following code to create two datasets: math_data and reading_data.


Merge these two datasets, keeping all rows from math_data. Which variable(s) should be used as key columns?

  • Here you should use both student and year as key columns.

%>% operator

Motivation

  • In R, you need to assign the result of each operation to a new object if you want to use the result in the subsequent process.
  • But sometimes, some objects are just intermediate results that you don’t need to keep.

Example

Let’s first create flights_mini data from flights data of nycflights13 package in the data.table format.

The first three lines yield intermediate results to make the final flight_mini, and you don’t need to keep those.


You can create flights_mini without using those intermediate steps with the chaining operation in data.table package, but it’s hard to read!

Introduction

  • %>% a special symbol in R, called a pipe operator. It comes from the magrittr package.
  • It’s a powerful tool to write linear sequence of operations in a more readable way.


Note: When you load the dplyr package, magrittr package is automatically loaded as well. So, you don’t need to load the magrittr package separately to use %>%.

%>% takes the output of the code on its left and feeds it as the first argument to the function on its right.


Example 1

fun1(input1) 

is the same as

input1 %>% fun1()


Example 2

output1 <- fun1(input1)
output2 <- fun2(output1)

is the same as

output2 <- fun1(input1) %>% fun2()

What if you want to use the object defined before %>% as the second or third argument of the subsequent function?

You can refer the preceding object by . in the subsequent function.

Example

Tip - Whenever you use %>%, I recommend you always use . in the subsequent function to explicitly denote the destination of the object defined before %>% even if it is the first argument.

Example

Without %>%

With %>%


Note that the order of execution is the same as the order in which the functions are written.

Summary


The topics in the second part of this lecture were more advanced, so you don’t need to memorize every function right away.

What I want you to remember are the following key ideas:

  1. You can reshape data using the functions dcast() and melt(). Depending on your goal, one format (wide or long) may be easier to analyze than the other.

  2. You can merge datasets using the merge() function, but you must have at least one common key column between the datasets.

You don’t need to use %>% operator, unless you thinks it would be more convenient.

After-class Exercise Problems

Exercise 1

  1. Find the flight company with the longest departure delay. (Hint: use max() function to find the maximum value of dep_delay column)

  2. Subset the information of flights that headed to MSP (Minneapolis-St Paul International Airport) in February. Let’s name it “msp_feb_flights”. How many flights are there?

  3. Calculate the median, interquartile range (\(IQR = Q3 − Q1\)) for arr_delays of flights in in the msp_feb_flights dataset and the number of flights, grouped by carrier. Which carrier has the most variable arrival delays?

  • Hints: IQR = Q3 − Q1 (the difference between the 75th percentile and the 25th percentile.) Use quantile() function to calculate the quantiles.

Exercise 2

If you were selecting an airport simply based on on-time departure percentage, which NYC airport would you choose to fly out of? - To address this question, first, define a new variable which indicates on-time departure. On-time-departure can be defined as a departure delay of less than or equal to 0. Then, calculate the on-time departure rate for each airport.


Exercise 3

For this exercise problem, we will use journal data from the AER package.
- First, load the data and convert it to data.table object using setDT function (or. as.data.table()). Take a look at the data.
- Also, type ?journal to see the description of the data.


  1. Calculate the average number of pages and price for the entire dataset.

  2. Show the title, citations, price, and subs columns for the top 5 journals (title) with the highest number of citations (citations). (Hint: use order() function to sort the data by citations in descending order.).

  3. This dataset is created in the year 2000. Calculate the age (age) of each journal by subtracting the start year (foundingyear) of the journal from 2000. Select the columns, price, subs, citations, and pages, and age. Use that data to create a correlation matrix between those variables using the cor() function. (Hint: use this syntax: cor(data)). Can you find anything interesting from the correlation matrix?

Appendix

Useful functions

fcase()

  • fcase() function is useful when you want to define a variable that takes different values based on conditions.

  • fcase() function returns the first value for which the corresponding condition is TRUE. If no condition is TRUE, it returns the default value.


Example: Define seasons (Winter: Dec-Feb, Spring: Mar-May, Summer: Jun-Aug, Fall: Sep-Nov)