Add an assertColumnEquality method to allow for tests with less code #255

Closed
MrPowers wants to merge 1 commit into main from add_assert_column_equality

MrPowers (Collaborator) commented Jul 28, 2018

Hi @holdenk 😄

I've been using an assertColumnEquality method for most of my Spark testing needs and have found that it allows for tests that require less code and run faster. I'd like to add this method to spark-testing-base, so more Spark users have a better testing experience!

Here's an example test with assertDataFrameEquals (uses createDF from spark-daria):

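import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType
// createDF is added to SparkSession by spark-daria
import com.github.mrpowers.spark.daria.sql.SparkSessionExt._

// Function under test: adds two numeric columns
def myAddFunction(colName1: String, colName2: String): Column = {
  col(colName1) + col(colName2)
}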

val actualDF = spark.createDF(
  List(
    (1, 3),
    (5, 3)
  ), List(
    ("num1", IntegerType, true),
    ("num2", IntegerType, true)
  )
).withColumn(
  "the_sum",
  myAddFunction("num1", "num2")
)

val expectedDF = spark.createDF(
  List(
    (1, 3, 4),
    (5, 3, 8)
  ), List(
    ("num1", IntegerType, true),
    ("num2", IntegerType, true),
    ("the_sum", IntegerType, true)
  )
)

assertDataFrameEquals(actualDF, expectedDF)

Here's the same test with assertColumnEquality:

val df = spark.createDF(
  List(
    (1, 3, 4),
    (5, 3, 8)
  ), List(
    ("num1", IntegerType, true),
    ("num2", IntegerType, true),
    ("expected", IntegerType, true)
  )
).withColumn(
  "the_sum",
  myAddFunction("num1", "num2")
)

assertColumnEquality(df, "expected", "the_sum")

assertColumnEquality lets us reduce the test code from 25 lines to 15 lines.
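For readers who want to see the idea in code, here is a minimal sketch of a collect-based column comparison. It is not necessarily the implementation in this PR: the `ColumnComparer` object and the `ColumnMismatch` exception are hypothetical names chosen for illustration, and the sketch assumes both columns hold values for which `==` is meaningful (so not binary/array columns).

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper object; the names here are illustrative only.
object ColumnComparer {

  // Hypothetical exception type raised when the two columns differ.
  case class ColumnMismatch(message: String) extends Exception(message)

  def assertColumnEquality(df: DataFrame, colName1: String, colName2: String): Unit = {
    // Bring only the two columns of interest back to the driver.
    val rows = df.select(colName1, colName2).collect()

    // Row.get returns Any (null for SQL NULL). Scala's != is null-safe,
    // so two nulls count as equal and a null against a value counts as different.
    val mismatches = rows.filter(r => r.get(0) != r.get(1))

    if (mismatches.nonEmpty) {
      throw ColumnMismatch(
        s"Columns '$colName1' and '$colName2' differ in ${mismatches.length} row(s); " +
          s"first mismatches: ${mismatches.take(10).mkString(", ")}"
      )
    }
  }
}
```

Because everything is collected to the driver, a sketch like this only suits the small DataFrames typically built inside unit tests.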

I think assertColumnEquality runs faster for the following reasons:

  • Creating one DataFrame is faster than creating two DataFrames
  • The collect() method runs faster than zipWithIndex()
  • It doesn't cache DataFrames the way expected.rdd.cache and result.rdd.cache do

assertDataFrameEquals will still be better for large DataFrame comparisons or multi-column comparisons.

This PR just contains an initial implementation. If you like this idea, we can merge it in and then work on making the error message pretty. It's hard to spot the row differences in the following error message:

+-------+-------------+
|   name|expected_name|
+-------+-------------+
|   phil|         phil|
| rashid|       rashid|
|matthew|        mateo|
|   sami|         sami|
|     li|         feng|
|   null|         null|
+-------+-------------+

We will be able to add a pretty error message like this so it's easy for users to spot the rows that are causing their tests to fail:

[Image: assertcolumnequality_error_message]
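The screenshot is not reproduced here, so the following is only one possible way to make mismatched rows stand out, not the format shown in the image. It is a plain-Scala sketch that works on already-collected value pairs; `MismatchReport`, `render`, and the `-> ` marker are all hypothetical choices.

```scala
// Hypothetical formatter: prints two columns side by side and
// prefixes every row whose values differ with "-> ".
object MismatchReport {

  def render(name1: String, name2: String, rows: Seq[(Any, Any)]): String = {
    def show(v: Any): String = if (v == null) "null" else v.toString

    // Pre-compute the printable cells and whether each pair matches.
    val cells = rows.map { case (a, b) => (show(a), show(b), a == b) }

    // Column widths: the widest of the header and all values.
    val w1 = (name1 +: cells.map(_._1)).map(_.length).max
    val w2 = (name2 +: cells.map(_._2)).map(_.length).max

    // Right-align a string within the given width.
    def pad(s: String, w: Int): String = (" " * (w - s.length)) + s

    val sep    = "   +" + ("-" * w1) + "+" + ("-" * w2) + "+"
    val header = "   |" + pad(name1, w1) + "|" + pad(name2, w2) + "|"
    val body = cells.map { case (a, b, same) =>
      (if (same) "   " else "-> ") + "|" + pad(a, w1) + "|" + pad(b, w2) + "|"
    }

    ((Seq(sep, header, sep) ++ body) :+ sep).mkString("\n")
  }
}
```

Calling `MismatchReport.render("name", "expected_name", pairs)` on the rows from the table above would flag the matthew/mateo and li/feng rows while leaving matching rows, including the null/null row, unmarked.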

Thanks!

@holdensmagicalunicorn

@MrPowers, thanks! I am a bot who has found some folks who might be able to help with the review: @holdenk and @mahmoudhanafy

holdenk (Owner) commented Sep 1, 2025

I know this is from forever ago :p But the downside of the collect here is that we then can't assert equality on truly large DataFrames. I think you're very much onto something -- perhaps we could update the code to use collect() when running in local mode, where the records must clearly fit into memory. WDYT?
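A rough sketch of that local-mode idea, under stated assumptions: `AdaptiveColumnCheck` and `countMismatches` are hypothetical names, and falling back to a distributed count outside local mode is an illustrative choice rather than anything decided in this thread. `SparkContext.isLocal` and the null-safe `<=>` operator on `Column` are existing Spark APIs.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: pick a comparison strategy based on the run mode.
object AdaptiveColumnCheck {

  def countMismatches(df: DataFrame, colName1: String, colName2: String): Long =
    if (df.sparkSession.sparkContext.isLocal) {
      // Local mode: the data already lives in this JVM, so collecting is cheap.
      df.select(colName1, colName2)
        .collect()
        .count(r => r.get(0) != r.get(1))
        .toLong
    } else {
      // Cluster mode: keep the comparison distributed and bring back only a count.
      // <=> is null-safe equality, so NULL <=> NULL is true.
      df.filter(!(col(colName1) <=> col(colName2))).count()
    }
}
```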

holdenk (Owner) commented Nov 16, 2025

So I think if we do the filter in advance and then the collect, it should be "fine" for column equality.
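A minimal sketch of the filter-then-collect approach, not the code from the follow-up PR: `FilteredColumnCheck`, `mismatchSample`, and the `maxRows` cap are hypothetical, and the cap is an added assumption to bound what reaches the driver.

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.col

// Hypothetical helper: filter on the executors first, collect only mismatches.
object FilteredColumnCheck {

  // Returns at most maxRows rows where the two columns differ.
  def mismatchSample(df: DataFrame, colName1: String, colName2: String, maxRows: Int = 20): Array[Row] =
    df.filter(!(col(colName1) <=> col(colName2))) // null-safe inequality
      .select(colName1, colName2)
      .limit(maxRows)
      .collect()

  def assertColumnEquality(df: DataFrame, colName1: String, colName2: String): Unit = {
    val sample = mismatchSample(df, colName1, colName2)
    assert(
      sample.isEmpty,
      s"Columns '$colName1' and '$colName2' differ, e.g. in: ${sample.mkString(", ")}"
    )
  }
}
```

When the columns are equal, the filter leaves nothing to collect, so the driver-side cost no longer depends on the size of the DataFrame.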

holdenk added a commit that referenced this pull request Nov 16, 2025
Co-authored-by: MrPowers <matthewkevinpowers@gmail.com>

holdenk (Owner) commented Nov 16, 2025

I've made an updated version in a new PR

holdenk closed this Nov 16, 2025
holdenk added a commit that referenced this pull request Nov 18, 2025
Co-authored-by: MrPowers <matthewkevinpowers@gmail.com>
holdenk added a commit that referenced this pull request Nov 20, 2025
…467)

Co-authored-by: MrPowers <matthewkevinpowers@gmail.com>
Copilot AI pushed a commit that referenced this pull request Nov 20, 2025
…467)

Co-authored-by: MrPowers <matthewkevinpowers@gmail.com>