Add SparkSession-based overloads for DataFrame and Dataset generators#466

Draft
Copilot wants to merge 11 commits into main from copilot/replace-sqlcontext-with-sparksession

Conversation

Copilot AI commented Nov 16, 2025

Replace deprecated SQLContext with SparkSession for DataFrame / DataSet / RDD generators

Status: ✅ Complete, Tested, and Rebased

Plan:

  • Explore the repository structure and understand the codebase
  • Identify all files using SQLContext in generator APIs
  • Add new overloaded methods accepting SparkSession parameter in DataFrameGenerator
  • Add new overloaded methods accepting SparkSession parameter in DatasetGenerator
  • Keep existing SQLContext methods for backward compatibility
  • Add tests for SparkSession-based methods
  • Build and test the changes
  • Verify tests pass across Spark versions (2.4.8, 3.5.0, 4.0.0)
  • Run code review and security checks
  • Add documentation with examples
  • Fix documentation inconsistency
  • Refactor to reduce code duplication
  • Rebase onto main branch

Key Changes Made:

DataFrameGenerator.scala

  • Added arbitraryDataFrame(spark: SparkSession, schema: StructType, minPartitions: Int)
  • Added arbitraryDataFrameWithCustomFields(spark: SparkSession, schema: StructType, minPartitions: Int)(userGenerators: ColumnGeneratorBase*)
  • SparkSession methods delegate to SQLContext methods to reduce code duplication
  • Kept existing SQLContext methods with default parameters for backward compatibility
  • Fixed documentation: removed "defaults to 1" from SparkSession method docs
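As a sketch of how the new DataFrameGenerator overload is meant to be used in a ScalaCheck property (the `local[2]` master and the `DataframeGenerator` object name are assumptions for illustration, not taken from this PR's diff):

```scala
import com.holdenkarau.spark.testing.DataframeGenerator // object name assumed
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.scalacheck.Prop.forAll

val spark = SparkSession.builder().master("local[2]").getOrCreate()
val schema = StructType(List(
  StructField("name", StringType),
  StructField("age", IntegerType)))

// minPartitions is explicit on the SparkSession overload (no default value).
val dataFrameGen = DataframeGenerator.arbitraryDataFrame(spark, schema, 1)

val property = forAll(dataFrameGen.arbitrary) { df =>
  df.schema == schema
}
```

The only change from the SQLContext-based call is the first argument and the now-explicit `minPartitions`.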

DatasetGenerator.scala

  • Added genDataset[T](spark: SparkSession, minPartitions: Int)(generator: => Gen[T])
  • Added genSizedDataset[T](spark: SparkSession, minPartitions: Int)(generator: Int => Gen[T])
  • Added arbitraryDataset[T](spark: SparkSession, minPartitions: Int)(generator: => Gen[T])
  • Added arbitrarySizedDataset[T](spark: SparkSession, minPartitions: Int)(generator: Int => Gen[T])
  • SparkSession methods delegate to SQLContext methods to reduce code duplication
  • Kept existing SQLContext methods with default parameters for backward compatibility
  • Fixed documentation: removed "defaults to 1" from SparkSession method docs
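A corresponding sketch for the DatasetGenerator overloads (names and setup assumed for illustration; an implicit `Encoder[String]` comes from `spark.implicits._`):

```scala
import com.holdenkarau.spark.testing.DatasetGenerator // object name assumed
import org.apache.spark.sql.SparkSession
import org.scalacheck.{Gen, Prop}

val spark = SparkSession.builder().master("local[2]").getOrCreate()
import spark.implicits._ // provides the Encoder[String] the generator needs

// Plain variant: each element drawn from Gen.alphaStr.
val datasetGen = DatasetGenerator.genDataset[String](spark, minPartitions = 1)(Gen.alphaStr)

val property = Prop.forAll(datasetGen) { ds =>
  ds.count() >= 0
}
```

The sized variants (`genSizedDataset`, `arbitrarySizedDataset`) differ only in taking an `Int => Gen[T]` so the generator can react to ScalaCheck's size parameter.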

Tests Added

  • Added test in SampleScalaCheckTest.scala: "dataframe generator with SparkSession should work"
  • Added tests in SampleDatasetGeneratorTest.scala:
    • "test generating Datasets[String] with SparkSession"
    • "test generating sized Datasets[String] with SparkSession"

Documentation

  • Created SPARKSESSION_API_EXAMPLES.md with comprehensive examples showing:
    • How to use DataFrameGenerator with SparkSession
    • How to use DatasetGenerator with SparkSession
    • Migration guide from SQLContext to SparkSession
    • Custom field generators with SparkSession
    • Sized dataset generation with SparkSession
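A minimal before/after migration sketch in the spirit of the examples document described above (identifiers assumed, not quoted from the file):

```scala
// Before: deprecated SQLContext entry point (minPartitions defaults to 1)
val sqlContext = spark.sqlContext
val oldGen = DataframeGenerator.arbitraryDataFrame(sqlContext, schema)

// After: SparkSession overload added by this PR;
// minPartitions must now be passed explicitly
val newGen = DataframeGenerator.arbitraryDataFrame(spark, schema, minPartitions = 1)
```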

Test Results:

  • ✅ All tests pass after rebase with Spark 3.5.0
  • ✅ CodeQL security check: No issues found
  • ✅ Successfully rebased onto main branch

Note: RDDGenerator already uses SparkContext (not SQLContext), so no changes were needed there.

Backward Compatibility: All existing SQLContext-based methods remain functional with their default parameters, ensuring no breaking changes for existing code. SparkSession-based methods extract the SQLContext and delegate to the existing methods, making it easy to remove SQLContext support in the future when appropriate.
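The delegation described above presumably amounts to a thin forwarding overload, along these lines (a sketch of the pattern, not the literal diff):

```scala
def arbitraryDataFrame(
    spark: SparkSession, schema: StructType, minPartitions: Int): Arbitrary[DataFrame] =
  arbitraryDataFrame(spark.sqlContext, schema, minPartitions)
```

Dropping SQLContext later then only requires inlining the old method body here and deleting the old overload.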

Rebase: This PR has been rebased onto the main branch to incorporate the latest changes including Scala 2.11 drop and version upgrades.

Original prompt

This section details the original issue to resolve.

<issue_title>Replace deprecated SQLContext with SparkSession for DataFrame / DataSet / RDD generators</issue_title>
<issue_description>E.g. the DataframeGenerator.arbitraryDataFrame() method takes a SQLContext as parameter. According to the Spark docs [1], this has been deprecated since Spark 2.0 in favor of SparkSession. Would be nice to replace this.

[1] https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/SQLContext.html</issue_description>

Comments on the Issue (you are @copilot in this section)

@holdenk Good call, since we don't support 1.X and will probably drop 2.4 before too long this seems like a good cleanup issue. Ideally we'd have both entry points for a while for code compatibility.


Copilot AI self-assigned this Nov 16, 2025
Copilot AI and others added 2 commits November 16, 2025 02:52
- Added SparkSession parameter overloads to DataFrameGenerator.arbitraryDataFrame and arbitraryDataFrameWithCustomFields
- Added SparkSession parameter overloads to DatasetGenerator.genDataset, genSizedDataset, arbitraryDataset, and arbitrarySizedDataset
- Kept existing SQLContext-based methods for backward compatibility
- Added tests for SparkSession-based methods in SampleScalaCheckTest and SampleDatasetGeneratorTest
- All tests pass successfully

Co-authored-by: holdenk <59893+holdenk@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Replace deprecated SQLContext with SparkSession in DataFrame generator" to "Add SparkSession-based overloads for DataFrame and Dataset generators" on Nov 16, 2025
Copilot AI requested a review from holdenk November 16, 2025 03:04
```scala
 *
 * @param spark Spark Session.
 * @param schema The required Schema.
 * @param minPartitions minimum number of partitions, defaults to 1.
```
holdenk (Owner):
@copilot except it doesn't default to 1 since it's an explicit parameter

Copilot AI (Author):

Fixed in commit 3b4f1c1. Removed the misleading "defaults to 1" text from the documentation for all SparkSession-based methods since they require an explicit minPartitions parameter.

Comment on lines +63 to +76
```scala
def arbitraryDataFrameWithCustomFields(
    spark: SparkSession, schema: StructType, minPartitions: Int)
    (userGenerators: ColumnGeneratorBase*): Arbitrary[DataFrame] = {
  val sqlContext = spark.sqlContext

  val arbitraryRDDs = RDDGenerator.genRDD(
    spark.sparkContext, minPartitions)(
    getRowGenerator(schema, userGenerators))
  Arbitrary {
    arbitraryRDDs.map { r =>
      sqlContext.createDataFrame(r, schema)
    }
  }
}
```
holdenk (Owner):

@copilot rather than duplicate this code, call the sqlContext based one (and we remove it later)

Copilot AI (Author):

Refactored in commit e9bb4e8. The SparkSession method now calls the SQLContext-based method to eliminate code duplication.
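Based on the snippet quoted above, the refactored version presumably collapses to a single delegating call, along these lines (a sketch of the pattern, not the literal commit contents):

```scala
def arbitraryDataFrameWithCustomFields(
    spark: SparkSession, schema: StructType, minPartitions: Int)
    (userGenerators: ColumnGeneratorBase*): Arbitrary[DataFrame] =
  arbitraryDataFrameWithCustomFields(
    spark.sqlContext, schema, minPartitions)(userGenerators: _*)
```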

Comment on lines +99 to +108
(generator: => Gen[T]): Arbitrary[Dataset[T]] = {

val rddGen: Gen[RDD[T]] =
RDDGenerator.genRDD[T](spark.sparkContext, minPartitions)(generator)
val datasetGen: Gen[Dataset[T]] =
rddGen.map(rdd => spark.createDataset(rdd))

Arbitrary {
datasetGen
}
holdenk (Owner):

Same comment on reducing copy pasta code @copilot call the sqlcontext one

Copilot AI (Author):

Refactored in commit e9bb4e8. The SparkSession method now calls the SQLContext-based method to eliminate code duplication.

Comment on lines +151 to +160
(generator: Int => Gen[T]): Arbitrary[Dataset[T]] = {

val rddGen: Gen[RDD[T]] =
RDDGenerator.genSizedRDD[T](spark.sparkContext, minPartitions)(generator)
val datasetGen: Gen[Dataset[T]] =
rddGen.map(rdd => spark.createDataset(rdd))

Arbitrary {
datasetGen
}
holdenk (Owner):

call sqlcontext one @copilot

Copilot AI (Author):

Refactored in commit e9bb4e8. The SparkSession method now calls the SQLContext-based method to eliminate code duplication.

Remove "defaults to 1" from minPartitions documentation for SparkSession-based methods since they don't have default parameters (Scala overload limitation)

Co-authored-by: holdenk <59893+holdenk@users.noreply.github.com>
Reduced code duplication by having SparkSession-based methods delegate to SQLContext-based methods instead of duplicating the implementation. This makes the code easier to maintain and allows for future removal of SQLContext methods.

Co-authored-by: holdenk <59893+holdenk@users.noreply.github.com>

holdenk commented Nov 20, 2025

@copilot rebase on main.

holdenk and others added 6 commits November 20, 2025 19:15
fix GHA build (#468) 

* Don't share caches across Java versions; it's bad cache magic.

* Fix the check for futures timing out.

* Drop Scala 2.11 support at this point.
…467)

Co-authored-by: MrPowers <matthewkevinpowers@gmail.com>
Rebased the PR branch onto the latest main branch to incorporate recent changes including:
- Drop Scala 2.11 support
- Version upgrade
- Column equality check feature

Co-authored-by: holdenk <59893+holdenk@users.noreply.github.com>

Copilot AI commented Nov 20, 2025

@copilot rebase on main.

Completed in commit ea9679e. The branch has been successfully rebased onto main, incorporating the latest changes including Scala 2.11 drop, version upgrades, and the column equality check feature. All tests pass after the rebase.


Development

Successfully merging this pull request may close these issues.

Replace deprecated SQLContext with SparkSession for DataFrame / DataSet / RDD generators

2 participants