Skip to content

“EXPLAIN / profiling” support for [i, j, by] to diagnose performance bottlenecks #7620

@aksh08022006

Description

@aksh08022006

Problem

A common difficulty for users is understanding why a particular data.table query is slow.

When writing expressions like:

DT[ i , j , by]

there is currently no way to determine:

Whether time is spent in i filtering, j computation, or by grouping

Whether a key / index is actually being used

Whether grouping is triggering expensive materialization

Whether unexpected memory copies are occurring

Whether a join is using a fast path or falling back to a slower path

Most users rely on:

system.time(DT[i, j, by])

which measures only the total time and gives no insight into the internal bottleneck.

This makes optimization and debugging of complex workflows difficult, especially for users familiar with SQL-style tools like EXPLAIN.

Proposed Idea

Introduce an optional profiling / explain mode for data.table operations, conceptually similar to SQL’s EXPLAIN.

For example:

explain( DT[x > 5, .(m = mean(y)), by = z] )

or

options(datatable.explain = TRUE) DT[x > 5, .(m = mean(y)), by = z]

This could output structured diagnostics such as:

data.table EXPLAIN

Rows scanned: 5,000,000
Rows matched in i: 1,240,532
Groups formed by by: 2,134

Time spent:
i (filter): 120 ms
by (grouping): 340 ms
j (compute): 90 ms

Keys / indices used: YES (key: z)
Materialization: NO
Memory copies: 1 shallow copy

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions