Improve dbt seed times by removing type lookup and casting during INSERT #493

Closed
nrichards17 wants to merge 1 commit into databricks:main from tailorcare:faster-dbt-seeds

@nrichards17
Contributor

Resolves #476

Description

Several users have noticed slow run times for loading dbt seeds with >1k records when using the dbt-databricks adapter, with run times becoming prohibitively slow for seeds with >10k records. This PR speeds up that run time substantially by removing unnecessary type lookup and casting during the INSERT statement.

dbt seeds are essentially built in two steps:

  1. The table is created with the appropriate column types (explicit or inferred) with CREATE TABLE AS ...
  2. The values are loaded into that table with INSERT OVERWRITE INTO schema.table VALUES ...

For some reason, the second step performs both a type lookup and a subsequent CAST(x AS type) for every single value (rows x columns) in the seed. This is redundant, since the type information was already used to define the column types during table creation.

Removing these steps significantly speeds up the seed run times. For example, I was able to load a seed with 47k records and 9 columns in about 1 minute with this change, whereas previously the seed hadn't even finished after 10+ minutes of loading.
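To make the difference concrete, here is a minimal Python sketch of the two shapes the VALUES clause can take. It is an illustration only, not the adapter's actual seed macro: the function name `values_clause`, the `%s` placeholder style, and the three-column schema are all assumptions made for this example.

```python
from typing import Sequence


def values_clause(num_rows: int, column_types: Sequence[str], cast_each_value: bool) -> str:
    """Build the VALUES part of a seed INSERT with one placeholder per value.

    Hypothetical helper for illustration; not part of dbt-databricks.
    """
    if cast_each_value:
        # One CAST per value, so rows x columns cast expressions in total.
        row = ", ".join(f"cast(%s as {col_type})" for col_type in column_types)
    else:
        # Plain placeholders; the PR's argument is that the column types
        # declared at table creation make the per-value cast unnecessary.
        row = ", ".join("%s" for _ in column_types)
    return ", ".join(f"({row})" for _ in range(num_rows))


column_types = ["bigint", "string", "date"]  # hypothetical seed schema
with_casts = values_clause(2, column_types, cast_each_value=True)
without_casts = values_clause(2, column_types, cast_each_value=False)

print(with_casts)
print(without_casts)
print(with_casts.count("cast("))  # number of cast expressions: rows x columns
```

For the seed mentioned above (roughly 47k records and 9 columns), the cast-per-value form would contain on the order of 423,000 cast expressions, which is the overhead this PR removes.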

Checklist

  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt-databricks next" section.

@benc-db
Collaborator

benc-db commented Oct 27, 2023

@nrichards17 I'm on vacation next week, but I'll run this change against our test suite when I get back, and assuming it passes, will try to sneak it in before I release 1.7.0.

@benc-db
Collaborator

benc-db commented Oct 27, 2023

Thanks for the PR, appreciate it.

@nrichards17
Contributor Author

Happy to help, thank you @benc-db!

@benc-db
Collaborator

benc-db commented Nov 7, 2023

Closing in favor of #498, which can run against our infra. Thanks @nrichards17!

Development

Successfully merging this pull request may close these issues.

dbt seed never completing
