Conversation
    ): SCollection[Example] = {
      val job = Job.getInstance(conf)
      GcsConnectorUtil.setInputPaths(sc, job, path)
      val filePattern = ScioUtil.filePattern(path, params.suffix)
I am surprised suffix was not used initially. Or was it intentional?
Codecov Report

    @@            Coverage Diff             @@
    ##             main    #5850      +/-   ##
    ==========================================
    + Coverage   61.49%   61.56%   +0.06%
    ==========================================
      Files         317      318       +1
      Lines       11650    11678      +28
      Branches      845      834      -11
    ==========================================
    + Hits         7164     7189      +25
    - Misses       4486     4489       +3
Resolved review threads (outdated) on scio-parquet/src/main/scala/com/spotify/scio/parquet/HadoopParquet.scala
        Some(projectionFn),
        None
      )
      .parDo(new LineageReportDoFn(filePattern))
Isn't this going to result in a new node in the graph? Why are we doing this in sequence with the read if it's not actually using any of the read elements? We should be doing it like the Scio init metrics, which is just its own distinct graph: create impulse -> submit parquet lineage.
I am trying to follow Beam conventions and keep this metric associated with the actual read transform. This way we keep transform-level lineage, which is supported in Beam.
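A minimal sketch of the idea in plain Scala, with hypothetical names (`LineageReportFn`, `report`; the real `LineageReportDoFn` is a Beam `DoFn`): a pass-through step chained after the read that reports the file pattern once per instance, so the lineage metric stays attached to the read transform rather than living in a separate impulse branch.

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical sketch: a pass-through step that reports the source file
// pattern exactly once per instance. Chaining it after the read keeps the
// lineage metric associated with the read transform itself.
final class LineageReportFn[T](filePattern: String, report: String => Unit) {
  private val reported = new AtomicBoolean(false)

  def processElement(element: T): T = {
    // compareAndSet ensures the pattern is reported only once,
    // even if processElement is invoked concurrently
    if (reported.compareAndSet(false, true)) report(filePattern)
    element // elements pass through unchanged
  }
}
```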
      tracker.currentRestriction.getFrom,
      if (splitGranularity == SplitGranularity.File) "end" else tracker.currentRestriction().getTo
    )
    FileSystems.reportSourceLineage(file.getMetadata.resourceId())
This is different from the Hadoop one insofar as we report every file here, right? That seems bad/annoying for using the lineage for anything.
Actually, file-level lineage is the default approach in Beam, though we might not need it directly. Both Lineage metric implementations (the legacy one and the new one) handle many files well:
- StringSet truncates internally at 100 entries
- BoundedTrie is a data structure that stores hierarchical data very well
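A rough illustration of the truncation behaviour in plain Scala (`BoundedStringSet` is a made-up name, not Beam's implementation; Beam's actual StringSet metric caps at 100 entries): once the cap is hit, further distinct values are dropped and a truncation flag is set.

```scala
import scala.collection.mutable

// Illustrative sketch (not Beam's code): a string-set metric that caps
// the number of stored entries, as described for StringSet above.
final class BoundedStringSet(maxSize: Int = 100) {
  private val values = mutable.LinkedHashSet.empty[String]
  private var truncated = false

  def add(value: String): Unit =
    if (values.contains(value)) ()           // already tracked, nothing to do
    else if (values.size < maxSize) values += value
    else truncated = true                    // cap reached: drop value, flag truncation

  def size: Int = values.size
  def isTruncated: Boolean = truncated
}
```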
    override def apply(input: Void): java.lang.Boolean = true
    })

    val withSkipClone = skipValueClone.fold(hadoop)(skip => hadoop.withSkipValueClone(skip))
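The `fold` here is the usual Option-driven configuration pattern; a self-contained sketch (`ReadBuilder` and its `withSkipValueClone` are stand-ins for the real Beam builder): `None` leaves the builder untouched, `Some` applies the setting.

```scala
// Stand-in for the real builder: withSkipValueClone returns an updated copy.
final case class ReadBuilder(skipValueClone: Boolean = false) {
  def withSkipValueClone(skip: Boolean): ReadBuilder = copy(skipValueClone = skip)
}

// Option.fold: keep the base builder when the option is None,
// otherwise apply the optional setting to it.
def configure(base: ReadBuilder, skipValueClone: Option[Boolean]): ReadBuilder =
  skipValueClone.fold(base)(skip => base.withSkipValueClone(skip))
```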
    import java.util.concurrent.atomic.AtomicBoolean
    import scala.reflect.ClassTag

    private[parquet] object HadoopParquet {
This is just to reduce duplication or is there a functional change here?
Just to reduce duplication, no new functionality. Except I noticed that in some cases Scio's derived coder was not set on the HadoopFormatIO transform. Beam probably auto-derives the same coder, but it is better to set it explicitly anyway.