Run-level Graph
A run-level graph represents the relationships between dataset and run metadata. A run-level graph is directed and consists of three node types: dataset version, job version, and run (see Figure 1). A run node may have one or more versioned inputs and versioned outputs as edges. An edge from a run node to a job version node is also maintained and represents the version of the job (that is, a link to the source code) at the time of execution.
Figure 1: Run-level graph relationships between dataset versions, job versions, and runs.
Note that a dataset is assumed to be modified as the result of a successful run. For a run to be marked successful, the run must transition from a RUNNING state to a COMPLETED state. A run-level graph dynamically captures all modifications made to a given dataset from run to run.
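The success rule above can be sketched as a small check over a run's recorded state transitions. This is a minimal illustration, not the project's implementation: the `RunState` enum, the `FAILED` member, and the `is_successful` helper are assumptions introduced here; only RUNNING and COMPLETED come from the text.

```python
from enum import Enum


class RunState(Enum):
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"  # assumption: not named in the description above


def is_successful(transitions):
    """Return True if the run moved directly from RUNNING to COMPLETED."""
    for previous, current in zip(transitions, transitions[1:]):
        if previous is RunState.RUNNING and current is RunState.COMPLETED:
            return True
    return False
```

Under this sketch, only runs for which `is_successful` returns True would be treated as having modified their output datasets.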
Introduction
A run-level graph is fundamental to troubleshooting data issues. For example, the data type of a column within a table may change, resulting in unanticipated downstream job failures.
Often, it's both challenging and time-consuming to determine why a given job is failing. Using the run-level graph, you can observe the upstream lineage of the failing job, which simplifies troubleshooting by highlighting that, for example, the data type of a column is now a STRING upstream, though the failing job was processing the column as an INT downstream.
Graph Data Model
A run-level graph consists of the following nodes:
- Dataset Version: A read-only, immutable version of a dataset.
- Job Version: A read-only, immutable version of a job, with a unique, referenceable link to the code that preserves the reproducibility of builds from source.
- Run: A discrete instantiation of a job version, with a unique run ID used to update each stage of execution.
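As a rough sketch, the three node types could be modeled as Python dataclasses, with the two version types frozen to reflect their immutability. The field names, the `code_location` attribute, and the default `state` value are assumptions made for illustration and are not taken from the proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetVersion:
    """Read-only, immutable version of a dataset."""
    namespace: str
    dataset: str
    version: str


@dataclass(frozen=True)
class JobVersion:
    """Read-only, immutable version of a job."""
    namespace: str
    job: str
    version: str
    code_location: str  # assumption: a link or reference to the source code


@dataclass
class Run:
    """One execution of a job version; mutable so its state can be updated."""
    id: str
    state: str = "RUNNING"  # assumption: updated at each stage of execution
```

Freezing the version types means an accidental write raises an error, while a `Run` remains updatable as it moves through its stages.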
Nodes
| Node | ID | Example |
|---|---|---|
| Dataset Version | dataset:{namespace}:{dataset}#{version} | dataset:food_delivery:public.top_delivery_times#947c0388.. |
| Job Version | job:{namespace}:{job}#{version} | job:food_delivery:orders_popular_day_of_week#947c0388.. |
| Run | run:{id} | run:a03422cf.. |
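The ID formats above can be built and parsed with a few helpers. This is a sketch only: the function names are hypothetical, the parser assumes that a namespace contains no `:` character, and the version values used with it are the truncated prefixes from the examples.

```python
def dataset_node_id(namespace, dataset, version):
    return f"dataset:{namespace}:{dataset}#{version}"


def job_node_id(namespace, job, version):
    return f"job:{namespace}:{job}#{version}"


def run_node_id(run_id):
    return f"run:{run_id}"


def parse_node_id(node_id):
    """Split a node ID into its parts; raise ValueError if it is malformed."""
    kind, _, rest = node_id.partition(":")
    if kind == "run" and rest:
        return {"type": "run", "id": rest}
    if kind in ("dataset", "job"):
        scoped, hash_sep, version = rest.rpartition("#")
        # Assumption: the namespace itself never contains ':'.
        namespace, colon_sep, name = scoped.partition(":")
        if hash_sep and colon_sep and namespace and name and version:
            return {
                "type": kind,
                "namespace": namespace,
                "name": name,
                "version": version,
            }
    raise ValueError(f"malformed node ID: {node_id!r}")
```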
Edges
- {dataset:*,TO,run:*}
- {run:*,TO,dataset:*}
- {run:*,IS_VERSION_OF,job:*}
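One way to enforce these three edge patterns is to check the node-type prefix of each endpoint against an allow-list. The sketch below assumes edges are stored as (source, label, target) triples of node ID strings; `ALLOWED_EDGES`, `node_type`, and `is_valid_edge` are hypothetical names.

```python
# The three edge patterns listed above, as (source type, label, target type).
ALLOWED_EDGES = {
    ("dataset", "TO", "run"),
    ("run", "TO", "dataset"),
    ("run", "IS_VERSION_OF", "job"),
}


def node_type(node_id):
    """Return the prefix before the first ':' (dataset, job, or run)."""
    return node_id.split(":", 1)[0]


def is_valid_edge(source, label, target):
    """Check an edge triple against the allowed patterns."""
    return (node_type(source), label, node_type(target)) in ALLOWED_EDGES
```

With this check, an edge that points directly from one dataset version to another is rejected, since dataset versions are only connected through runs.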
Example
Run a03422cf
First, we create the run a03422cf for orders_popular_day_of_week that consumes the input version 695888e2 and produces the output version a03422cf:
Figure 2:
Run ec6abf8b
Then, we create another run ec6abf8b that consumes the same input version 695888e2, but produces a new output version ec44fed4:
Figure 3:
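The two runs above can be written down as edge triples and queried for their inputs and outputs. This is an illustrative sketch: the dataset names `input_ds` and `output_ds` are placeholders, since the example does not name the datasets, and the version values are taken as given in the text.

```python
# Placeholder dataset names; only the version values come from the example.
INPUT = "dataset:food_delivery:input_ds#695888e2"

edges = [
    (INPUT, "TO", "run:a03422cf"),
    ("run:a03422cf", "TO", "dataset:food_delivery:output_ds#a03422cf"),
    (INPUT, "TO", "run:ec6abf8b"),
    ("run:ec6abf8b", "TO", "dataset:food_delivery:output_ds#ec44fed4"),
]


def inputs_of(run, edges):
    """Dataset versions with a TO edge pointing into the run."""
    return sorted(src for src, label, dst in edges if dst == run and label == "TO")


def outputs_of(run, edges):
    """Dataset versions the run points to with a TO edge."""
    return sorted(dst for src, label, dst in edges if src == run and label == "TO")
```

Here both runs share the same input version, while `outputs_of` returns a different dataset version for each run.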
Run diff from a03422cf to ec6abf8b
A diff graph represents the changes between two run nodes of a run-level graph. The graph compares changes starting at a given run node A, up to a given run node B (inclusive). Below we show a run-based comparison for the job orders_popular_day_of_week between runs a03422cf and ec6abf8b:
Figure 4: Diff from a03422cf to ec6abf8b
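A simplified version of such a diff can be computed by comparing the input and output versions of the two runs. The sketch below compares only the two endpoint runs and ignores any runs in between, which is a narrower reading than the inclusive range described above; the `diff_runs` function and the dictionary layout are assumptions.

```python
def diff_runs(run_a, run_b):
    """Compare the dataset versions read and written by two runs."""
    diff = {}
    for key in ("inputs", "outputs"):
        a, b = set(run_a[key]), set(run_b[key])
        diff[key] = {
            "unchanged": sorted(a & b),
            "removed": sorted(a - b),  # present in run A only
            "added": sorted(b - a),    # present in run B only
        }
    return diff


# Version values from the example; the dictionary layout is hypothetical.
run_a03422cf = {"inputs": ["695888e2"], "outputs": ["a03422cf"]}
run_ec6abf8b = {"inputs": ["695888e2"], "outputs": ["ec44fed4"]}
result = diff_runs(run_a03422cf, run_ec6abf8b)
```

For the two example runs, the input version appears as unchanged, while the output version of the first run is reported as removed and that of the second as added.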



