Description
Problem
A common difficulty for users is understanding why a particular data.table query is slow.
When writing expressions like:
DT[i, j, by]
there is currently no way to determine:
Whether time is spent in i filtering, j computation, or by grouping
Whether a key / index is actually being used
Whether grouping is triggering expensive materialization
Whether unexpected memory copies are occurring
Whether a join is using a fast path or falling back to a slower path
Most users rely on:
system.time(DT[i, j, by])
which measures only the total time and gives no insight into the internal bottleneck.
This makes optimization and debugging of complex workflows difficult, especially for users familiar with SQL-style tools like EXPLAIN.
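For context, a sketch of what is possible today: `system.time()` reports only aggregate wall time, while data.table's existing `verbose = TRUE` argument (and the `datatable.verbose` option) prints an internal trace covering things like index use, GForce optimization, and grouping timings. That trace is unstructured console text, though, not the per-stage, machine-readable breakdown proposed here. The data values below are illustrative.

```r
library(data.table)
DT <- data.table(x = runif(1e6), y = rnorm(1e6),
                 z = sample(letters, 1e6, replace = TRUE))

# Today: total elapsed time only, no per-stage insight
system.time(DT[x > 0.5, .(m = mean(y)), by = z])

# Today: existing verbose trace -- prints internal diagnostics
# (optimization of j, grouping strategy, timings) as free-form text
DT[x > 0.5, .(m = mean(y)), by = z, verbose = TRUE]
```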
Proposed Idea
Introduce an optional profiling / explain mode for data.table operations, conceptually similar to SQL’s EXPLAIN.
For example:
explain( DT[x > 5, .(m = mean(y)), by = z] )
or
options(datatable.explain = TRUE)
DT[x > 5, .(m = mean(y)), by = z]
This could output structured diagnostics such as:
data.table EXPLAIN
Rows scanned: 5,000,000
Rows matched in i: 1,240,532
Groups formed by by: 2,134
Time spent:
i (filter): 120 ms
by (grouping): 340 ms
j (compute): 90 ms
Keys / indices used: YES (key: z)
Materialization: NO
Memory copies: 1 shallow copy
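To illustrate feasibility at the user level, here is a rough, hypothetical sketch of the proposed `explain()` as a wrapper (the name `explain_sketch` and its interface are invented for illustration, not an existing API). It approximates the per-stage breakdown by timing the i filter and the grouped j computation as separate steps, using the real `which = TRUE` argument and the standard `eval(quote(...))` idiom for i and j. A genuine implementation would instead hook data.table's internals to report index use, materialization, and copies, which this sketch cannot observe.

```r
library(data.table)

# Hypothetical user-level approximation of the proposed explain().
# Times i (filter) separately from by+j (grouped compute); the two
# cannot be fully separated without internal hooks.
explain_sketch <- function(DT, i_expr, j_expr, by) {
  t_i <- system.time(idx <- DT[eval(i_expr), which = TRUE])  # filter only
  sub <- DT[idx]                                             # materialize subset
  t_j <- system.time(res <- sub[, eval(j_expr), by = by])    # group + compute
  cat(sprintf("Rows scanned:      %d\n", nrow(DT)))
  cat(sprintf("Rows matched in i: %d\n", length(idx)))
  cat(sprintf("i (filter):   %.0f ms\n", t_i[["elapsed"]] * 1000))
  cat(sprintf("by + j:       %.0f ms\n", t_j[["elapsed"]] * 1000))
  invisible(res)
}

DT <- data.table(x = runif(1e6) * 10, y = rnorm(1e6),
                 z = sample(letters, 1e6, replace = TRUE))
explain_sketch(DT, quote(x > 5), quote(.(m = mean(y))), by = "z")
```

Note the limitation this exposes: timing `by` separately from `j` is not possible from user code, which is precisely why internal instrumentation would be needed.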