“EXPLAIN / profiling” support for [i, j, by] to diagnose performance bottlenecks

Problem

A common difficulty for users is understanding why a particular data.table query is slow.

When writing expressions like:

`DT[ i , j , by]`


there is currently no way to determine:

Whether time is spent in i filtering, j computation, or by grouping

Whether a key / index is actually being used

Whether grouping is triggering expensive materialization

Whether unexpected memory copies are occurring

Whether a join is using a fast path or falling back to a slower path

Most users rely on:

`system.time(DT[i, j, by])`


which measures only the total time and gives no insight into the internal bottleneck.

This makes optimization and debugging of complex workflows difficult, especially for users familiar with SQL-style tools like EXPLAIN.

Proposed Idea

Introduce an optional profiling / explain mode for data.table operations, conceptually similar to SQL’s EXPLAIN.

For example:

`explain(
  DT[x > 5, .(m = mean(y)), by = z]
)
`

or

`options(datatable.explain = TRUE)
DT[x > 5, .(m = mean(y)), by = z]`


This could output structured diagnostics such as:

data.table EXPLAIN
------------------
Rows scanned:           5,000,000
Rows matched in i:      1,240,532
Groups formed by by:  2,134

Time spent:
  i (filter):           120 ms
  by (grouping):        340 ms
  j (compute):          90 ms

Keys / indices used:    YES (key: z)
Materialization:        NO
Memory copies:          1 shallow copy

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

“EXPLAIN / profiling” support for [i, j, by] to diagnose performance bottlenecks #7620

data.table EXPLAIN

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

“EXPLAIN / profiling” support for [i, j, by] to diagnose performance bottlenecks #7620

Description

data.table EXPLAIN

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions