Fast Data with data.table

data.table is a high-performance R package that supercharges the data frame, using a compact dt[i, j, by] bracket syntax to filter, compute, and group huge datasets quickly and with very little memory.


Part of the free R course at LearnCodingFast — hands-on lessons with examples you run in your browser, plus practice exercises and a quick quiz.

By the end of this lesson you'll read the dt[i, j, by] grammar, aggregate by group, update columns in place with :=, use .N and .SD, load files fast with fread, and know when data.table beats dplyr.

What You'll Learn in This Lesson

1️⃣ The dt[i, j, by] Grammar

Everything in data.table happens inside square brackets with up to three slots. The first, i, filters rows. The second, j, selects or computes columns. The third, by, groups the computation. Read dt[i, j, by] as "take rows i, do j, for each by."
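To make the three slots concrete, here is a minimal sketch. The table, its column names (city, sales), and its values are invented for illustration and are not part of the lesson's own dataset.

```r
library(data.table)

# Hypothetical sales data, made up for this example
dt <- data.table(
  city  = c("Oslo", "Oslo", "Bergen", "Bergen", "Bergen"),
  sales = c(100, 250, 80, 120, 300)
)

# i only: keep the rows where sales exceed 100
big <- dt[sales > 100]

# j only: compute one value over all rows
total <- dt[, sum(sales)]

# i, j, and by together: among rows with sales > 100,
# sum the sales for each city
by_city <- dt[sales > 100, .(total = sum(sales)), by = city]
print(by_city)
```

Leaving a slot empty means "everything": dt[, sum(sales)] has no i, so it uses all rows.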

2️⃣ Aggregating with by, .N, and .SD

Put a summary function in j and a grouping column in by to aggregate. Wrap results in .(), data.table's shorthand for list(), to name them. The special symbol .N counts rows per group, and .SD (the subset of data for the current group) lets you apply a function across many columns at once.
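A short sketch of these three pieces working together. The data and the column names (city, sales, units) are assumptions made up for the example.

```r
library(data.table)

# Hypothetical data, made up for this example
dt <- data.table(
  city  = c("Oslo", "Oslo", "Bergen", "Bergen", "Bergen"),
  sales = c(100, 250, 80, 120, 300),
  units = c(1, 3, 1, 2, 4)
)

# Named summary plus a row count for each city
summary_dt <- dt[, .(total = sum(sales), n = .N), by = city]
print(summary_dt)

# .SD holds the current group's columns; .SDcols picks which ones,
# so mean() is applied to sales and units within each city
means <- dt[, lapply(.SD, mean), by = city, .SDcols = c("sales", "units")]
print(means)
```

Without .SDcols, .SD would contain every column except the grouping column, which matters once the table has non-numeric columns.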

3️⃣ In-Place Updates and Fast I/O

The := operator adds or changes columns by reference: it modifies the table in place without copying it, which keeps memory use low on large data. For getting big files in and out, fread() and fwrite() are typically much faster than base R's read.csv() and write.csv().
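As a rough illustration, the sketch below adds a column with := and then round-trips the table through a temporary CSV file. The data and the 25% tax rate are invented for the example.

```r
library(data.table)

# Hypothetical data, made up for this example
dt <- data.table(city = c("Oslo", "Bergen"), sales = c(100, 80))

# Add a column by reference; no assignment with <- is needed
dt[, tax := sales * 0.25]   # 0.25 is an assumed rate

# Write the table to a temporary CSV file and read it back
path <- tempfile(fileext = ".csv")
fwrite(dt, path)
back <- fread(path)
print(back)
```

fread() guesses the separator and the column types on its own, so a plain CSV like this one usually needs no extra arguments.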

Your turn. Fill in the # TODO blank and run it.

Build a small end-to-end pipeline: add a computed column in place, aggregate by group, filter, and count with .N — all in data.table's bracket grammar.
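One possible shape for such a pipeline is sketched below. It is not the exercise's official solution: the orders table and its columns (city, price, qty) are hypothetical.

```r
library(data.table)

# Hypothetical orders, made up for this example
orders <- data.table(
  city  = c("Oslo", "Oslo", "Bergen", "Bergen", "Bergen", "Tromso"),
  price = c(10, 20, 5, 15, 30, 8),
  qty   = c(2, 1, 4, 2, 1, 1)
)

# 1. Add a computed column in place
orders[, revenue := price * qty]

# 2. Aggregate by group and count the rows in each group with .N
per_city <- orders[, .(total_revenue = sum(revenue), n_orders = .N), by = city]

# 3. Filter the aggregated table in the i slot
top <- per_city[total_revenue > 30]
print(top)
```

Steps 2 and 3 can also be chained as orders[...][total_revenue > 30], because each bracket call returns a data.table.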

📋 Quick Reference — data.table

Practice quiz

In the dt[i, j, by] grammar, what does the i slot do?

Computes columns
Groups the result
Selects rows (filters)
Sets the key

Answer: Selects rows (filters). i filters rows, j computes columns, by groups the computation.

Which slot of dt[i, j, by] performs grouping?

by
i
j
with

Answer: by. The third slot, by, defines the grouping variable(s).

What does the := operator do in data.table?

Compares two values
Defines a function
Sorts the table
Updates columns in place by reference

Answer: Updates columns in place by reference. := modifies the table in place without copying it.

Because := updates by reference, you should NOT write...

Answer: a reassignment of the form dt <- dt[, col := value]. The table is already changed; reassigning with <- is redundant and confusing.
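A small sketch of what "by reference" means in practice, using a made-up one-column table. A plain <- creates a second name for the same table, so an in-place update is visible through both names; copy() is the data.table function for making an independent table.

```r
library(data.table)

dt <- data.table(x = 1:3)

# A second name for the same table, not a copy
alias <- dt

# Update in place; no `dt <- ...` is needed
dt[, y := x * 2L]
print(names(alias))      # alias now has the y column as well

# copy() makes an independent table that later updates do not touch
snapshot <- copy(dt)
dt[, z := 0L]
print(names(snapshot))   # no z column here
```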

What does the special symbol .N give you?

The number of rows in the current group
The names of columns
The sum of a column
The number of groups

Answer: The number of rows in the current group. .N counts the rows in the current group, or in the whole table when no by is given.

Why wrap aggregation results in .() as in .(total = sum(sales))?

To sort the output
To convert to a data.frame
To name the resulting column(s)
To remove NAs

Answer: To name the resulting column(s). Without .() and a name, the column is auto-named V1.
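The difference is easy to see side by side. The data below is invented for the example.

```r
library(data.table)

# Hypothetical data, made up for this example
dt <- data.table(
  city  = c("Oslo", "Oslo", "Bergen"),
  sales = c(100, 250, 80)
)

# No .() and no name: the result column is auto-named V1
unnamed <- dt[, sum(sales), by = city]
print(names(unnamed))

# Wrapped in .() with a name: the result column is called total
named <- dt[, .(total = sum(sales)), by = city]
print(names(named))
```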

What does dt[, sum(sales), by = city] compute?

Sales for the first city only
Total sales for each city
A single grand total
The number of cities

Answer: Total sales for each city. Putting sum(sales) in j with by = city aggregates per city.

How do you remove a column called tax from a data.table?

Answer: dt[, tax := NULL]. Assigning NULL with := removes the column in place.
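A minimal sketch of the removal, on a made-up table with a tax column:

```r
library(data.table)

# Hypothetical data, made up for this example
dt <- data.table(
  item  = c("a", "b"),
  price = c(10, 20),
  tax   = c(2.5, 5)
)

# Assigning NULL drops the column in place
dt[, tax := NULL]
print(names(dt))
```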

Which function reads a large CSV faster than read.csv?

scan()
fread()
readLines()
load()

Answer: fread(). fread() is data.table's fast, auto-detecting file reader.

When is data.table typically preferred over dplyr?

When you only have 3 rows
When you need ggplot2 charts
On large data where speed and low memory use matter
When the data is already sorted

Answer: On large data where speed and low memory use matter. data.table shines on big data thanks to in-place updates and a fast C backend.