Add SparkSession-based overloads for DataFrame and Dataset generators #466
Conversation
- Added SparkSession parameter overloads to DataFrameGenerator.arbitraryDataFrame and arbitraryDataFrameWithCustomFields
- Added SparkSession parameter overloads to DatasetGenerator.genDataset, genSizedDataset, arbitraryDataset, and arbitrarySizedDataset
- Kept existing SQLContext-based methods for backward compatibility
- Added tests for SparkSession-based methods in SampleScalaCheckTest and SampleDatasetGeneratorTest
- All tests pass successfully

Co-authored-by: holdenk <59893+holdenk@users.noreply.github.com>
```scala
 *
 * @param spark Spark Session.
 * @param schema The required Schema.
 * @param minPartitions minimum number of partitions, defaults to 1.
```
@copilot except it doesn't default to 1 since it's an explicit parameter
Fixed in commit 3b4f1c1. Removed the misleading "defaults to 1" text from the documentation for all SparkSession-based methods since they require an explicit minPartitions parameter.
```scala
def arbitraryDataFrameWithCustomFields(
  spark: SparkSession, schema: StructType, minPartitions: Int)
  (userGenerators: ColumnGeneratorBase*): Arbitrary[DataFrame] = {
  val sqlContext = spark.sqlContext

  val arbitraryRDDs = RDDGenerator.genRDD(
    spark.sparkContext, minPartitions)(
    getRowGenerator(schema, userGenerators))
  Arbitrary {
    arbitraryRDDs.map { r =>
      sqlContext.createDataFrame(r, schema)
    }
  }
}
```
@copilot rather than duplicate this code, call the sqlContext based one (and we remove it later)
Refactored in commit e9bb4e8. The SparkSession method now calls the SQLContext-based method to eliminate code duplication.
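The delegation refactor requested above can be sketched without any Spark dependency. This is a minimal, hypothetical model — `FakeSqlContext`, `FakeSession`, and `FakeArbitrary` are stand-ins for `SQLContext`, `SparkSession`, and `Arbitrary[DataFrame]`; only the overload-delegation shape mirrors the actual PR:

```scala
// Sketch of the overload-delegation pattern from the PR, using stub
// types in place of Spark/ScalaCheck classes. All names here are
// hypothetical; only the delegation structure reflects the real code.
object DelegationSketch {
  final case class FakeSqlContext(name: String)
  final case class FakeSession(sqlContext: FakeSqlContext)
  final case class FakeArbitrary(describedBy: String)

  // Existing SQLContext-based method: keeps the real implementation.
  def arbitraryDataFrame(sqlContext: FakeSqlContext, minPartitions: Int): FakeArbitrary =
    FakeArbitrary(s"generator over ${sqlContext.name} with $minPartitions partitions")

  // New SparkSession overload: extracts the SQLContext and delegates,
  // so there is a single implementation to maintain (and the SQLContext
  // variant can be removed later without touching the generator logic).
  def arbitraryDataFrame(spark: FakeSession, minPartitions: Int): FakeArbitrary =
    arbitraryDataFrame(spark.sqlContext, minPartitions)
}
```

Calling either overload with equivalent inputs yields the same result, which is the point of the refactor: one implementation, two entry points.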
```scala
  (generator: => Gen[T]): Arbitrary[Dataset[T]] = {

    val rddGen: Gen[RDD[T]] =
      RDDGenerator.genRDD[T](spark.sparkContext, minPartitions)(generator)
    val datasetGen: Gen[Dataset[T]] =
      rddGen.map(rdd => spark.createDataset(rdd))

    Arbitrary {
      datasetGen
    }
```
```scala
  (generator: Int => Gen[T]): Arbitrary[Dataset[T]] = {

    val rddGen: Gen[RDD[T]] =
      RDDGenerator.genSizedRDD[T](spark.sparkContext, minPartitions)(generator)
    val datasetGen: Gen[Dataset[T]] =
      rddGen.map(rdd => spark.createDataset(rdd))

    Arbitrary {
      datasetGen
    }
```
Refactored in commit e9bb4e8. The SparkSession method now calls the SQLContext-based method to eliminate code duplication.
Remove "defaults to 1" from minPartitions documentation for SparkSession-based methods since they don't have default parameters (Scala overload limitation) Co-authored-by: holdenk <59893+holdenk@users.noreply.github.com>
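The "Scala overload limitation" mentioned here is that at most one overloaded alternative of a method may declare default arguments; giving both overloads a `minPartitions: Int = 1` default would not compile. The sketch below illustrates the allowed shape with hypothetical stub types (`Ctx`, `Session`) standing in for `SQLContext` and `SparkSession`:

```scala
// Why the SparkSession overloads take an explicit minPartitions:
// Scala permits default arguments on at most one overloaded alternative.
// If both methods below declared a default for minPartitions, the
// compiler would reject them ("multiple overloaded alternatives of
// method genData define default arguments").
object OverloadDefaults {
  final case class Ctx(label: String)
  final case class Session(ctx: Ctx)

  // Only the pre-existing SQLContext-style method keeps its default.
  def genData(ctx: Ctx, minPartitions: Int = 1): String =
    s"${ctx.label}:$minPartitions"

  // The new overload must require minPartitions explicitly.
  def genData(session: Session, minPartitions: Int): String =
    genData(session.ctx, minPartitions)
}
```

This is why the scaladoc for the new methods must not claim a default of 1: callers of the SparkSession overload always pass the value themselves.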
Reduced code duplication by having SparkSession-based methods delegate to SQLContext-based methods instead of duplicating the implementation. This makes the code easier to maintain and allows for future removal of SQLContext methods. Co-authored-by: holdenk <59893+holdenk@users.noreply.github.com>
@copilot rebase on main.
fix GHA build (#468) * Don't share cross java versions its bad cache magic. * Fix the check for future timed out. * Drop scala 2.11 support at this point.
Rebased the PR branch onto the latest main branch to incorporate recent changes including:
- Drop Scala 2.11 support
- Version upgrade
- Column equality check feature

Co-authored-by: holdenk <59893+holdenk@users.noreply.github.com>
Replace deprecated SQLContext with SparkSession for DataFrame / DataSet / RDD generators
Status: ✅ Complete, Tested, and Rebased
Plan:
Key Changes Made:
DataFrameGenerator.scala
- `arbitraryDataFrame(spark: SparkSession, schema: StructType, minPartitions: Int)`
- `arbitraryDataFrameWithCustomFields(spark: SparkSession, schema: StructType, minPartitions: Int)(userGenerators: ColumnGeneratorBase*)`

DatasetGenerator.scala
- `genDataset[T](spark: SparkSession, minPartitions: Int)(generator: => Gen[T])`
- `genSizedDataset[T](spark: SparkSession, minPartitions: Int)(generator: Int => Gen[T])`
- `arbitraryDataset[T](spark: SparkSession, minPartitions: Int)(generator: => Gen[T])`
- `arbitrarySizedDataset[T](spark: SparkSession, minPartitions: Int)(generator: Int => Gen[T])`

Tests Added
Documentation
Test Results:
Note: RDDGenerator already uses SparkContext (not SQLContext), so no changes were needed there.
Backward Compatibility: All existing SQLContext-based methods remain functional with their default parameters, ensuring no breaking changes for existing code. SparkSession-based methods extract the SQLContext and delegate to the existing methods, making it easy to remove SQLContext support in the future when appropriate.
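The planned migration path — keep the SQLContext methods working now, remove them later — can be sketched with a deprecation marker. This is a hypothetical illustration (`LegacyCtx` and `NewSession` are made-up stand-ins, and the PR itself does not yet deprecate anything):

```scala
// Sketch of the future migration path described above: once callers
// move to the SparkSession overload, the SQLContext variant can be
// marked deprecated and eventually removed. Stub types are hypothetical.
object MigrationSketch {
  final case class LegacyCtx(id: Int)
  final case class NewSession(legacy: LegacyCtx)

  // Legacy entry point: still functional, but flagged for removal.
  @deprecated("Use the NewSession overload instead", "2.x")
  def genDataset(ctx: LegacyCtx, minPartitions: Int): Vector[Int] =
    Vector.fill(minPartitions)(ctx.id)

  // New entry point delegates, so existing call sites keep compiling
  // (with a warning) while new code targets the session-based API.
  def genDataset(session: NewSession, minPartitions: Int): Vector[Int] =
    genDataset(session.legacy, minPartitions)
}
```

Existing call sites continue to work unchanged; the deprecation warning gives downstream users a release cycle to migrate before the legacy overload is dropped.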
Rebase: This PR has been rebased onto the main branch to incorporate the latest changes including Scala 2.11 drop and version upgrades.