Describe the bug
Hi, I'm experiencing an issue when running python models from terminated clusters. Feels related to #306 (comment) and #306 in general. Tagging for visibility since you've responded on the other thread.
I use python models in conjunction with SQL models (no surprise there), but I configure each python model with its own http_path so that I can run the python models alongside the SQL models from a single dbt run --target <sql_warehouse> command. So the top level "run" points at the SQL warehouse and each python model points at an all purpose cluster.
Some of the python models run, others experience an error like:
{'error': 'ClusterNotReadyException: Cluster <cluster_id> not currently ready for driver client (currently Pending)'}
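For context, a python model with its own http_path might look roughly like the sketch below. This is illustrative only: the upstream model name and the http_path value are placeholders, and the same setting could just as well live in a YAML config instead of in the model file.

```python
# models/my_python_model.py (hypothetical file name)

def model(dbt, session):
    # Per-model override: send this model to an all purpose cluster
    # instead of the SQL warehouse that the target points at.
    # The http_path below is a placeholder, not a real endpoint.
    dbt.config(
        materialized="table",
        http_path="sql/protocol/v1/o/<workspace_id>/<cluster_id>",
    )

    # "some_upstream_sql_model" is a made-up name for a SQL model built earlier in the DAG.
    df = dbt.ref("some_upstream_sql_model")
    return df
```

With several such models sharing one http_path, they all hit the same cluster as soon as their upstream SQL models finish.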
Steps To Reproduce
Not 100% sure, but I would say:
- Run dbt run against a SQL warehouse with python-model level http_paths pointing to an all purpose cluster.
- Ensure some SQL models run first.
- Ensure 5+ python models all get hit in the DAG around the same time (within the terminated -> running window, say 2 minutes).
Expected behavior
I expect the all purpose cluster to boot when it encounters the first python model, and all other simultaneous models to realize it's booting and wait until the cluster is ready.
Screenshots and log output
See bug description.
System information
The output of dbt --version:
core is 1.5.0
spark is 1.5.0
databricks is 1.5.0
The operating system you're using:
Linux
The output of python --version:
3.10.4
Additional context
All the python models point to the same http_path. This feels like a recipe for the race condition @susodapop described in #306, whereby:
- the first python model to run gets a TERMINATED cluster and then runs start_cluster as intended, then
- the second one sees that the cluster is neither TERMINATED nor TERMINATING, skips the if statement, stumbles into the create context call and dies a death a la: Cluster <cluster_id> not currently ready for driver client (currently Pending).
- This seems to be reinforced by the fact that a quick search shows the "pending" status does not appear in any of the python code, so I'm guessing it's unhandled at the moment?
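To make the suspected interleaving concrete, here is a minimal, single-threaded sketch of it. FakeCluster and create_context_current are stand-ins written for this issue; they mimic the check as I understand it and are not the adapter's actual code.

```python
class FakeCluster:
    """Simplified stand-in for the cluster API (hypothetical, for illustration)."""

    def __init__(self):
        self.state = "TERMINATED"

    def start(self):
        # A start request returns while the cluster is still booting.
        self.state = "PENDING"

    def create_context(self):
        if self.state != "RUNNING":
            raise RuntimeError(
                "ClusterNotReadyException: not currently ready for driver client "
                f"(currently {self.state.capitalize()})"
            )
        return "context-id"


def create_context_current(cluster):
    """The check as described above: only TERMINATED/TERMINATING trigger a start."""
    if cluster.state in ("TERMINATED", "TERMINATING"):
        cluster.start()
        return "started"  # the first model would now sit and wait for RUNNING
    # Any other state, including PENDING, falls straight through to here.
    return cluster.create_context()


cluster = FakeCluster()

# Model 1 sees TERMINATED and starts the cluster.
first = create_context_current(cluster)

# Model 2 arrives while the cluster is PENDING and skips the if statement.
try:
    create_context_current(cluster)
    second_error = None
except RuntimeError as exc:
    second_error = str(exc)

print(first)
print(second_error)
```

In this sketch the second call fails with a "currently Pending" error, which matches the shape of the error reported above.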
I think an elegant solution might be to shift the waiting part (seen here) out of DBContext.start_cluster into its own function DBContext.wait_for_running_cluster and then call this in DBContext.create directly after (but outside) the TERMINATED conditional:
- This adds at least one get_cluster_status call to each context creation, but that feels right since we need to know it's RUNNING before a context can be created.
- This also delegates the "shared state" of the separate threads (is my cluster ready or not) to the API, which is easy to share amongst the threads.
- That said, I've only skimmed the code, so this could be the wrong approach.
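A rough sketch of the proposed shape, run against a fake API with five threads. Everything here is an assumption for illustration: FakeClusterApi, the timeout and poll interval values, and the free functions standing in for the DBContext methods are all made up, and the real implementation may need to look quite different.

```python
import threading
import time


class FakeClusterApi:
    """Thread-safe stand-in for the cluster API; "boots" after a short delay."""

    def __init__(self, boot_seconds=0.2):
        self._lock = threading.Lock()
        self._state = "TERMINATED"
        self._ready_at = None
        self._boot_seconds = boot_seconds

    def get_cluster_status(self):
        with self._lock:
            if self._state == "PENDING" and time.monotonic() >= self._ready_at:
                self._state = "RUNNING"
            return self._state

    def start_cluster(self):
        with self._lock:
            # Only the first start request has any effect in this fake.
            if self._state == "TERMINATED":
                self._state = "PENDING"
                self._ready_at = time.monotonic() + self._boot_seconds

    def create_context(self):
        status = self.get_cluster_status()
        if status != "RUNNING":
            raise RuntimeError(f"cluster not ready (currently {status})")
        return "context-id"


def wait_for_running_cluster(api, timeout=5.0, poll_interval=0.05):
    """Poll until the cluster reports RUNNING, or raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        status = api.get_cluster_status()
        if status == "RUNNING":
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"cluster still {status} after {timeout}s")
        time.sleep(poll_interval)


def create(api):
    if api.get_cluster_status() in ("TERMINATED", "TERMINATING"):
        api.start_cluster()
    # Outside the conditional: every caller waits, whoever triggered the start.
    wait_for_running_cluster(api)
    return api.create_context()


api = FakeClusterApi()
results = []
results_lock = threading.Lock()


def run_model():
    context_id = create(api)
    with results_lock:
        results.append(context_id)


threads = [threading.Thread(target=run_model) for _ in range(5)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(len(results), "contexts created")
```

The point of the sketch is that threads arriving while the cluster is PENDING no longer fall through to context creation. They poll the API, which acts as the shared state, until it reports RUNNING.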
My workaround for now is to boot the cluster at the start of the job in a separate API call. This assumes the cluster doesn't auto-terminate before the python models run, which isn't great.
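For anyone wanting to try the same workaround, a pre-step along these lines may do. It is a sketch, assuming the Databricks Clusters REST API endpoints /api/2.0/clusters/get and /api/2.0/clusters/start with a bearer token. The helper names, timeout values and environment variable names are my own choices, not anything dbt provides.

```python
import json
import time
import urllib.parse
import urllib.request


def call_api(host, token, method, path, payload=None):
    """Minimal JSON call against the workspace REST API (standard library only)."""
    url = f"{host}{path}"
    data = None
    if method == "GET" and payload:
        url += "?" + urllib.parse.urlencode(payload)
    elif payload is not None:
        data = json.dumps(payload).encode()
    request = urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        body = response.read()
    return json.loads(body) if body else {}


def ensure_cluster_running(cluster_id, call, timeout=900, poll_interval=10, sleep=time.sleep):
    """Start the cluster if it is TERMINATED, then block until it reports RUNNING.

    `call(method, path, payload)` is injected so the logic can be tested offline.
    """
    query = {"cluster_id": cluster_id}
    state = call("GET", "/api/2.0/clusters/get", query)["state"]
    if state == "TERMINATED":
        call("POST", "/api/2.0/clusters/start", query)

    waited = 0
    while state != "RUNNING":
        if waited >= timeout:
            raise TimeoutError(f"cluster {cluster_id} still {state} after {timeout}s")
        sleep(poll_interval)
        waited += poll_interval
        state = call("GET", "/api/2.0/clusters/get", query)["state"]
    return state


# Example wiring (not executed here; the environment variable names are assumptions):
#
#   import functools, os
#   call = functools.partial(
#       call_api, os.environ["DATABRICKS_HOST"], os.environ["DATABRICKS_TOKEN"]
#   )
#   ensure_cluster_running("<cluster_id>", call)
```

Running this before dbt run only narrows the window. If the cluster auto-terminates between this step and the first python model, the original race is back.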