Add SparkSession-based overloads for DataFrame and Dataset generators#466

Draft
Copilot wants to merge 11 commits into main from copilot/replace-sqlcontext-with-sparksession

Conversation

Copilot AI commented Nov 16, 2025

Replace deprecated SQLContext with SparkSession for DataFrame / DataSet / RDD generators

Status: ✅ Complete, Tested, and Rebased

Plan:

  • Explore the repository structure and understand the codebase
  • Identify all files using SQLContext in generator APIs
  • Add new overloaded methods accepting SparkSession parameter in DataFrameGenerator
  • Add new overloaded methods accepting SparkSession parameter in DatasetGenerator
  • Keep existing SQLContext methods for backward compatibility
  • Add tests for SparkSession-based methods
  • Build and test the changes
  • Verify tests pass across Spark versions (2.4.8, 3.5.0, 4.0.0)
  • Run code review and security checks
  • Add documentation with examples
  • Fix documentation inconsistency
  • Refactor to reduce code duplication
  • Rebase onto main branch

Key Changes Made:

DataFrameGenerator.scala

  • Added arbitraryDataFrame(spark: SparkSession, schema: StructType, minPartitions: Int)
  • Added arbitraryDataFrameWithCustomFields(spark: SparkSession, schema: StructType, minPartitions: Int)(userGenerators: ColumnGeneratorBase*)
  • SparkSession methods delegate to SQLContext methods to reduce code duplication
  • Kept existing SQLContext methods with default parameters for backward compatibility
  • Fixed documentation: removed "defaults to 1" from SparkSession method docs
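As a sketch of how the new DataFrameGenerator overload is meant to be used in a ScalaCheck property (the `local[2]` master and the `DataframeGenerator` object name are assumptions for illustration, not taken from this PR's diff):

```scala
import com.holdenkarau.spark.testing.DataframeGenerator // object name assumed
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.scalacheck.Prop.forAll

val spark = SparkSession.builder().master("local[2]").getOrCreate()
val schema = StructType(List(
  StructField("name", StringType),
  StructField("age", IntegerType)))

// minPartitions is explicit on the SparkSession overload (no default value).
val dataFrameGen = DataframeGenerator.arbitraryDataFrame(spark, schema, 1)

val property = forAll(dataFrameGen.arbitrary) { df =>
  df.schema == schema
}
```

The only change from the SQLContext-based call is the first argument and the now-explicit `minPartitions`.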

DatasetGenerator.scala

  • Added genDataset[T](spark: SparkSession, minPartitions: Int)(generator: => Gen[T])
  • Added genSizedDataset[T](spark: SparkSession, minPartitions: Int)(generator: Int => Gen[T])
  • Added arbitraryDataset[T](spark: SparkSession, minPartitions: Int)(generator: => Gen[T])
  • Added arbitrarySizedDataset[T](spark: SparkSession, minPartitions: Int)(generator: Int => Gen[T])
  • SparkSession methods delegate to SQLContext methods to reduce code duplication
  • Kept existing SQLContext methods with default parameters for backward compatibility
  • Fixed documentation: removed "defaults to 1" from SparkSession method docs
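A corresponding sketch for the DatasetGenerator overloads (names and setup assumed for illustration; an implicit `Encoder[String]` comes from `spark.implicits._`):

```scala
import com.holdenkarau.spark.testing.DatasetGenerator // object name assumed
import org.apache.spark.sql.SparkSession
import org.scalacheck.{Gen, Prop}

val spark = SparkSession.builder().master("local[2]").getOrCreate()
import spark.implicits._ // provides the Encoder[String] the generator needs

// Plain variant: each element drawn from Gen.alphaStr.
val datasetGen = DatasetGenerator.genDataset[String](spark, minPartitions = 1)(Gen.alphaStr)

val property = Prop.forAll(datasetGen) { ds =>
  ds.count() >= 0
}
```

The sized variants (`genSizedDataset`, `arbitrarySizedDataset`) differ only in taking an `Int => Gen[T]` so the generator can react to ScalaCheck's size parameter.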

Tests Added

  • Added test in SampleScalaCheckTest.scala: "dataframe generator with SparkSession should work"
  • Added tests in SampleDatasetGeneratorTest.scala:
    • "test generating Datasets[String] with SparkSession"
    • "test generating sized Datasets[String] with SparkSession"

Documentation

  • Created SPARKSESSION_API_EXAMPLES.md with comprehensive examples showing:
    • How to use DataFrameGenerator with SparkSession
    • How to use DatasetGenerator with SparkSession
    • Migration guide from SQLContext to SparkSession
    • Custom field generators with SparkSession
    • Sized dataset generation with SparkSession
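A minimal before/after migration sketch in the spirit of the examples document described above (identifiers assumed, not quoted from the file):

```scala
// Before: deprecated SQLContext entry point (minPartitions defaults to 1)
val sqlContext = spark.sqlContext
val oldGen = DataframeGenerator.arbitraryDataFrame(sqlContext, schema)

// After: SparkSession overload added by this PR;
// minPartitions must now be passed explicitly
val newGen = DataframeGenerator.arbitraryDataFrame(spark, schema, minPartitions = 1)
```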

Test Results:

  • ✅ All tests pass after rebase with Spark 3.5.0
  • ✅ CodeQL security check: No issues found
  • ✅ Successfully rebased onto main branch

Note: RDDGenerator already uses SparkContext (not SQLContext), so no changes were needed there.

Backward Compatibility: All existing SQLContext-based methods remain functional with their default parameters, ensuring no breaking changes for existing code. SparkSession-based methods extract the SQLContext and delegate to the existing methods, making it easy to remove SQLContext support in the future when appropriate.
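The delegation described above presumably amounts to a thin forwarding overload, along these lines (a sketch of the pattern, not the literal diff):

```scala
def arbitraryDataFrame(
    spark: SparkSession, schema: StructType, minPartitions: Int): Arbitrary[DataFrame] =
  arbitraryDataFrame(spark.sqlContext, schema, minPartitions)
```

Dropping SQLContext later then only requires inlining the old method body here and deleting the old overload.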

Rebase: This PR has been rebased onto the main branch to incorporate the latest changes including Scala 2.11 drop and version upgrades.

Original prompt

This section details the original issue to resolve.

<issue_title>Replace deprecated SQLContext with SparkSession for DataFrame / DataSet / RDD generators</issue_title>
<issue_description>E.g. the DataframeGenerator.arbitraryDataFrame() method takes a SQLContext as parameter. According to the Spark docs [1], this has been deprecated since Spark 2.0 in favor of SparkSession. Would be nice to replace this.

[1] https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/SQLContext.html</issue_description>

Comments on the Issue (you are @copilot in this section)

@holdenk Good call, since we don't support 1.X and will probably drop 2.4 before too long this seems like a good cleanup issue. Ideally we'd have both entry points for a while for code compatibility.


Copilot AI self-assigned this Nov 16, 2025
Copilot AI and others added 2 commits November 16, 2025 02:52
- Added SparkSession parameter overloads to DataFrameGenerator.arbitraryDataFrame and arbitraryDataFrameWithCustomFields
- Added SparkSession parameter overloads to DatasetGenerator.genDataset, genSizedDataset, arbitraryDataset, and arbitrarySizedDataset
- Kept existing SQLContext-based methods for backward compatibility
- Added tests for SparkSession-based methods in SampleScalaCheckTest and SampleDatasetGeneratorTest
- All tests pass successfully

Co-authored-by: holdenk <59893+holdenk@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Replace deprecated SQLContext with SparkSession in DataFrame generator" to "Add SparkSession-based overloads for DataFrame and Dataset generators" on Nov 16, 2025
Copilot AI requested a review from holdenk November 16, 2025 03:04
```scala
 *
 * @param spark Spark Session.
 * @param schema The required Schema.
 * @param minPartitions minimum number of partitions, defaults to 1.
```
holdenk (Owner):
@copilot except it doesn't default to 1 since it's an explicit parameter

Copilot AI (Author):

Fixed in commit 3b4f1c1. Removed the misleading "defaults to 1" text from the documentation for all SparkSession-based methods since they require an explicit minPartitions parameter.

Comment on lines +63 to +76
```scala
def arbitraryDataFrameWithCustomFields(
    spark: SparkSession, schema: StructType, minPartitions: Int)
    (userGenerators: ColumnGeneratorBase*): Arbitrary[DataFrame] = {
  val sqlContext = spark.sqlContext

  val arbitraryRDDs = RDDGenerator.genRDD(
    spark.sparkContext, minPartitions)(
    getRowGenerator(schema, userGenerators))
  Arbitrary {
    arbitraryRDDs.map { r =>
      sqlContext.createDataFrame(r, schema)
    }
  }
}
```
holdenk (Owner):

@copilot rather than duplicate this code, call the sqlContext based one (and we remove it later)

Copilot AI (Author):

Refactored in commit e9bb4e8. The SparkSession method now calls the SQLContext-based method to eliminate code duplication.
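Based on the snippet quoted above, the refactored version presumably collapses to a single delegating call, along these lines (a sketch of the pattern, not the literal commit contents):

```scala
def arbitraryDataFrameWithCustomFields(
    spark: SparkSession, schema: StructType, minPartitions: Int)
    (userGenerators: ColumnGeneratorBase*): Arbitrary[DataFrame] =
  arbitraryDataFrameWithCustomFields(
    spark.sqlContext, schema, minPartitions)(userGenerators: _*)
```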

Comment on lines +99 to +108
(generator: => Gen[T]): Arbitrary[Dataset[T]] = {

val rddGen: Gen[RDD[T]] =
RDDGenerator.genRDD[T](spark.sparkContext, minPartitions)(generator)
val datasetGen: Gen[Dataset[T]] =
rddGen.map(rdd => spark.createDataset(rdd))

Arbitrary {
datasetGen
}
holdenk (Owner):

Same comment on reducing copy pasta code @copilot call the sqlcontext one

Copilot AI (Author):

Refactored in commit e9bb4e8. The SparkSession method now calls the SQLContext-based method to eliminate code duplication.

Comment on lines +151 to +160
(generator: Int => Gen[T]): Arbitrary[Dataset[T]] = {

val rddGen: Gen[RDD[T]] =
RDDGenerator.genSizedRDD[T](spark.sparkContext, minPartitions)(generator)
val datasetGen: Gen[Dataset[T]] =
rddGen.map(rdd => spark.createDataset(rdd))

Arbitrary {
datasetGen
}
holdenk (Owner):

call sqlcontext one @copilot

Copilot AI (Author):

Refactored in commit e9bb4e8. The SparkSession method now calls the SQLContext-based method to eliminate code duplication.

Remove "defaults to 1" from minPartitions documentation for SparkSession-based methods since they don't have default parameters (Scala overload limitation)

Co-authored-by: holdenk <59893+holdenk@users.noreply.github.com>
Reduced code duplication by having SparkSession-based methods delegate to SQLContext-based methods instead of duplicating the implementation. This makes the code easier to maintain and allows for future removal of SQLContext methods.

Co-authored-by: holdenk <59893+holdenk@users.noreply.github.com>

holdenk commented Nov 20, 2025

@copilot rebase on main.

holdenk and others added 6 commits November 20, 2025 19:15
fix GHA build (#468) 

* Don't share caches across Java versions; it's bad cache magic.

* Fix the check for futures timing out.

* Drop Scala 2.11 support at this point.
…467)

Co-authored-by: MrPowers <matthewkevinpowers@gmail.com>
Rebased the PR branch onto the latest main branch to incorporate recent changes including:
- Drop Scala 2.11 support
- Version upgrade
- Column equality check feature

Co-authored-by: holdenk <59893+holdenk@users.noreply.github.com>

Copilot AI commented Nov 20, 2025

@copilot rebase on main.

Completed in commit ea9679e. The branch has been successfully rebased onto main, incorporating the latest changes including Scala 2.11 drop, version upgrades, and the column equality check feature. All tests pass after the rebase.


Development

Successfully merging this pull request may close these issues.

Replace deprecated SQLContext with SparkSession for DataFrame / DataSet / RDD generators

2 participants