Python Clusters not starting correctly: race conditions? #347

@ewengillies

Describe the bug

Hi, I'm experiencing an issue when running python models from terminated clusters. Feels related to #306 (comment) and #306 in general. Tagging for visibility since you've responded on the other thread.

I use python models alongside SQL models (no surprise there), but I configure each python model with its own http_path so that I can run both kinds of model from a single dbt run --target <sql_warehouse> command. The top-level run points at the SQL warehouse, and each python model points at an all-purpose cluster.
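For context, a per-model override of this kind might look roughly like the sketch below. It assumes the override is set through dbt.config inside the model file; the http_path value and the upstream model name are placeholders, not values from my project.

```python
# Sketch of a dbt python model that overrides http_path at the model level.
# ASSUMPTION: the path below is a placeholder, not a real cluster path.
ALL_PURPOSE_HTTP_PATH = "sql/protocolv1/o/<workspace_id>/<cluster_id>"


def model(dbt, session):
    # Send this model to the all-purpose cluster while the run's target
    # still points at the SQL warehouse.
    dbt.config(materialized="table", http_path=ALL_PURPOSE_HTTP_PATH)

    # "upstream_sql_model" is a hypothetical SQL model built earlier in the DAG.
    df = dbt.ref("upstream_sql_model")
    return df
```

Every python model in the project carries the same http_path, so they all share one cluster.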

Some of the python models run, others experience an error like:

{'error': 'ClusterNotReadyException: Cluster <cluster_id> not currently ready for driver client (currently Pending)'}

Steps To Reproduce

Not 100% sure, but I would say:

  • Run dbt run against a SQL warehouse, with model-level http_paths on the python models pointing to an all-purpose cluster that is currently terminated.
  • Ensure some SQL models run first.
  • Ensure 5+ python models all get hit in the DAG around the same time (within the window it takes the cluster to go from terminated to running, say 2 minutes).

Expected behavior

I expect the all-purpose cluster to boot when the run reaches the first python model, and all other simultaneous python models to recognize that it's booting and wait until the cluster is ready.

Screenshots and log output

See bug description.

System information

The output of dbt --version:
core is 1.5.0
spark is 1.5.0
databricks is 1.5.0

The operating system you're using:

Linux

The output of python --version:
3.10.4

Additional context

All the python models point to the same http_path. This feels like a recipe for the race condition @susodapop described in #306, whereby:

  1. The first python model to run finds a TERMINATED cluster and calls start_cluster as intended.
  2. The second one sees that the cluster is neither TERMINATED nor TERMINATING, so it skips the if statement, stumbles into the create-context call, and dies with: Cluster <cluster_id> not currently ready for driver client (currently Pending).
  3. This seems to be reinforced by the fact that a quick search shows the "pending" status does not appear anywhere in the python code, so I'm guessing it is unhandled at the moment?
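To make the interleaving concrete, here is a small self-contained simulation of the sequence above. It does not use any dbt-databricks code: FakeCluster, ClusterNotReady, and run_demo are hypothetical stand-ins, and threading events force the second caller to arrive while the cluster is still pending.

```python
import threading


class ClusterNotReady(Exception):
    pass


class FakeCluster:
    """Hypothetical stand-in for an all-purpose cluster."""

    def __init__(self):
        self.state = "TERMINATED"

    def start(self):
        self.state = "PENDING"  # a real boot takes minutes

    def create_context(self):
        if self.state != "RUNNING":
            raise ClusterNotReady(
                f"not currently ready for driver client (currently {self.state.title()})"
            )
        return "context-id"


def run_demo():
    cluster = FakeCluster()
    started = threading.Event()      # set once the first caller has started the cluster
    second_done = threading.Event()  # set once the second caller has finished
    results = {}

    def first_model():
        if cluster.state in ("TERMINATED", "TERMINATING"):
            cluster.start()
            started.set()
            second_done.wait(timeout=5)  # stands in for the wait inside start_cluster
            cluster.state = "RUNNING"
        results["first"] = cluster.create_context()

    def second_model():
        started.wait(timeout=5)  # arrives while the cluster is PENDING
        try:
            if cluster.state in ("TERMINATED", "TERMINATING"):
                cluster.start()  # skipped: the state is PENDING
            results["second"] = cluster.create_context()
        except ClusterNotReady as exc:
            results["second"] = f"error: {exc}"
        finally:
            second_done.set()

    threads = [threading.Thread(target=first_model), threading.Thread(target=second_model)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

In this simulation the first caller gets a context, while the second fails with the same "currently Pending" message as in the bug description.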

I think an elegant solution might be to shift the waiting part (seen here) out of DBContext.start_cluster into its own function DBContext.wait_for_running_cluster and then call this in DBContext.create directly after (but outside) the TERMINATED conditional:

  1. This adds at least one get_cluster_status call to each context creation, but that feels right since we need to know it's RUNNING before a context can be created.
  2. This also delegates the "shared state" of the separate threads (is my cluster ready or not?) to the API, which is easy to share among the threads.
  3. That said, I've only skimmed the code, so this could be the wrong approach.
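A rough sketch of the shape I have in mind follows. It is not the adapter's actual code: the method names mirror the ones mentioned above, but their bodies, the FakeClusterApi stand-in, and the poll_interval and timeout parameters are all assumptions made for illustration.

```python
import time


class FakeClusterApi:
    """Hypothetical stand-in for the cluster API; turns RUNNING after a few polls."""

    def __init__(self, state="TERMINATED", polls_until_running=3):
        self.state = state
        self._polls_left = polls_until_running
        self.start_calls = 0

    def status(self):
        if self.state == "PENDING":
            self._polls_left -= 1
            if self._polls_left <= 0:
                self.state = "RUNNING"
        return self.state

    def start(self):
        self.start_calls += 1
        self.state = "PENDING"

    def create_context(self):
        if self.state != "RUNNING":
            raise RuntimeError(f"cluster not ready (currently {self.state})")
        return "context-id"


class DBContext:
    def __init__(self, api, poll_interval=0.0, timeout=900.0):
        self.api = api
        self.poll_interval = poll_interval
        self.timeout = timeout

    def get_cluster_status(self):
        return self.api.status()

    def start_cluster(self):
        # Only requests the start; the waiting now lives in its own method.
        self.api.start()

    def wait_for_running_cluster(self):
        deadline = time.monotonic() + self.timeout
        while True:
            state = self.get_cluster_status()
            if state == "RUNNING":
                return
            if state == "ERROR":
                raise RuntimeError("cluster entered ERROR state")
            if time.monotonic() >= deadline:
                raise TimeoutError(f"cluster still {state} after {self.timeout}s")
            time.sleep(self.poll_interval)

    def create(self):
        if self.get_cluster_status() in ("TERMINATED", "TERMINATING"):
            self.start_cluster()
        # Outside the conditional: callers that find a PENDING cluster wait too.
        self.wait_for_running_cluster()
        return self.api.create_context()
```

With this shape, a caller that finds the cluster already PENDING never calls start_cluster but still blocks until the API reports RUNNING.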

My workaround for now is to boot the cluster at the start of the job with a separate API call. This assumes the cluster doesn't auto-terminate before the python models run, which isn't great.
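For anyone wanting to try the same workaround, a pre-run start call could be built along these lines. This is a sketch assuming the Databricks Clusters API 2.0 start endpoint and a personal access token; build_start_request is a hypothetical helper, and the host, token, and cluster ID are placeholders.

```python
import json
import urllib.request


def build_start_request(host, token, cluster_id):
    """Build (but do not send) a POST to the Clusters API 2.0 start endpoint."""
    body = json.dumps({"cluster_id": cluster_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"https://{host}/api/2.0/clusters/start",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# To actually send it before invoking dbt run (placeholders shown):
# urllib.request.urlopen(build_start_request("<workspace-host>", "<token>", "<cluster_id>"))
```

The call only requests the start, so the job still has to wait for the cluster to reach RUNNING before the python models are scheduled.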

Labels: bug, needs more info, python
