Fast Data with data.table
data.table is a high-performance R package that supercharges the data frame, using a compact dt[i, j, by] bracket syntax to filter, compute, and group huge datasets quickly and with very little memory.
Part of the free R course at LearnCodingFast — hands-on lessons with examples you run in your browser, plus practice exercises and a quick quiz.
By the end of this lesson you'll read the dt[i, j, by] grammar, aggregate by group, update columns in place with :=, use .N and .SD, load files fast with fread, and know when data.table beats dplyr.
What You'll Learn in This Lesson
1️⃣ The dt[i, j, by] Grammar
Everything in data.table happens inside square brackets with up to three slots. The first, i, filters rows. The second, j, selects or computes columns. The third, by, groups the computation. Read dt[i, j, by] as "take rows i, do j, for each by."
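As a minimal sketch of the three slots, assuming a small made-up table (the city and sales values below are invented for illustration and are not from the lesson's dataset):

```r
library(data.table)

# Hypothetical sales table, used only for illustration
dt <- data.table(
  city  = c("Oslo", "Oslo", "Rome", "Rome", "Lima"),
  sales = c(100, 250, 80, 120, 300)
)

# i only: keep the rows where sales exceed 100
big <- dt[sales > 100]

# j only: compute a single value over all rows
total <- dt[, sum(sales)]

# i, j and by together: among rows with sales > 90, total sales per city
per_city <- dt[sales > 90, .(total = sum(sales)), by = city]
print(per_city)
```

Leaving a slot empty means "all rows" for i and "no grouping" for by, which is why the leading comma in dt[, sum(sales)] matters.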
2️⃣ Aggregating with by, .N, and .SD
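Put a summary function in j and a grouping column in by to aggregate. Wrap results in .(), which is shorthand for list(), to name them and to return more than one column. The special symbol .N counts rows per group, and .SD (the "Subset of Data" for the current group) lets you apply a function across many columns at once.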
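A short sketch of these aggregation idioms, again on an invented table (the cost column and all values are assumptions added for the example):

```r
library(data.table)

# Hypothetical table with two numeric columns
dt <- data.table(
  city  = c("Oslo", "Oslo", "Rome", "Rome", "Lima"),
  sales = c(100, 250, 80, 120, 300),
  cost  = c(60, 150, 50, 70, 200)
)

# Named aggregates plus a row count per group via .N
summary_dt <- dt[, .(total = sum(sales), n = .N), by = city]

# .SD holds the current group's columns; .SDcols restricts which ones,
# so mean() is applied to sales and cost separately within each city
means <- dt[, lapply(.SD, mean), by = city, .SDcols = c("sales", "cost")]

print(summary_dt)
print(means)
```

Using .SDcols keeps the call explicit about which columns are summarised, which helps once a table also carries non-numeric columns.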
3️⃣ In-Place Updates and Fast I/O
The := operator adds or changes columns by reference: it modifies the table in place without making a copy, which is a large part of what makes data.table memory-efficient on large data. For getting big files in and out, fread() and fwrite() are typically much faster than base R's read.csv() and write.csv().
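The following sketch shows := and the two I/O functions together. The table, the 25% tax rate, and the temporary file are all assumptions made for the example:

```r
library(data.table)

# Hypothetical two-row table
dt <- data.table(city = c("Oslo", "Rome"), sales = c(100, 80))

# Add a column in place; no assignment with <- is needed
dt[, tax := sales * 0.25]   # the 25% rate is an arbitrary illustration

# Update existing values for a subset of rows (i combined with :=)
dt[city == "Rome", sales := sales + 20]

# Remove a column in place by assigning NULL
dt[, tax := NULL]

# Round-trip through a temporary CSV with fwrite() and fread()
path <- tempfile(fileext = ".csv")
fwrite(dt, path)
back <- fread(path)
print(back)
```

fread() guesses the separator and column types on its own, so a plain call with just the path is often enough for well-formed CSV files.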
Your turn. Build a small end-to-end pipeline: add a computed column in place, aggregate by group, filter, and count with .N, all in data.table's bracket grammar.
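One possible shape for such a pipeline is sketched below. The orders table, its column names, and the revenue threshold of 100 are hypothetical and are not the lesson's original exercise data:

```r
library(data.table)

# Hypothetical orders table
orders <- data.table(
  region = c("N", "N", "S", "S", "S", "W"),
  units  = c(3, 5, 2, 8, 4, 1),
  price  = c(10, 10, 20, 20, 20, 50)
)

# 1. Add a computed column in place
orders[, revenue := units * price]

# 2. Aggregate by group and count rows with .N
by_region <- orders[, .(revenue = sum(revenue), n_orders = .N), by = region]

# 3. Filter the aggregated table with a condition in i
top <- by_region[revenue > 100]
print(top)
```

Steps 2 and 3 can also be chained as orders[, ..., by = region][revenue > 100], since each bracket returns a data.table.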
📋 Quick Reference — data.table
Practice quiz
In the dt[i, j, by] grammar, what does the i slot do?
- Computes columns
- Groups the result
- Selects rows (filters)
- Sets the key
Answer: Selects rows (filters). i filters rows, j computes columns, by groups the computation.
Which slot of dt[i, j, by] performs grouping?
- by
- i
- j
- with
Answer: by. The third slot, by, defines the grouping variable(s).
What does the := operator do in data.table?
- Compares two values
- Defines a function
- Sorts the table
- Updates columns in place by reference
Answer: Updates columns in place by reference. := modifies the table in place without copying it.
Because := updates by reference, you should NOT write...
Answer: A reassignment such as dt <- dt[, x := 1], where x stands for any column name. The table is already changed; reassigning with <- is redundant and confusing.
What does the special symbol .N give you?
- The number of rows in the current group
- The names of columns
- The sum of a column
- The number of groups
Answer: The number of rows in the current group. .N counts the rows in the current group, or in the whole table when no by is given.
Why wrap aggregation results in .() as in .(total = sum(sales))?
- To sort the output
- To convert to a data.frame
- To name the resulting column(s)
- To remove NAs
Answer: To name the resulting column(s). Without .() and a name, the column is auto-named V1.
What does dt[, sum(sales), by = city] compute?
- Sales for the first city only
- Total sales for each city
- A single grand total
- The number of cities
Answer: Total sales for each city. Putting sum(sales) in j with by = city aggregates per city.
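To see how this result is named with and without .(), here is a minimal sketch on made-up data (the city and sales values are assumptions):

```r
library(data.table)

# Hypothetical table
dt <- data.table(
  city  = c("Oslo", "Oslo", "Rome"),
  sales = c(100, 250, 80)
)

# Unnamed expression in j: the result column is auto-named V1
unnamed <- dt[, sum(sales), by = city]

# Wrapped in .() with a name: the result column is called total
named <- dt[, .(total = sum(sales)), by = city]

print(names(unnamed))
print(names(named))
```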
How do you remove a column called tax from a data.table?
Answer: dt[, tax := NULL]. Assigning NULL with := removes the column in place.
Which function reads a large CSV faster than read.csv?
- scan()
- fread()
- readLines()
- load()
Answer: fread(). fread() is data.table's fast, auto-detecting file reader.
When is data.table typically preferred over dplyr?
- When you only have 3 rows
- When you need ggplot2 charts
- On large data where speed and low memory use matter
- When the data is already sorted
Answer: On large data where speed and low memory use matter. data.table shines on big data thanks to in-place updates and a fast C backend.