Distributed scheduler #2002

Merged
DiegoTavares merged 98 commits into AcademySoftwareFoundation:new-scheduler from DiegoTavares:distributed_scheduler_2
Dec 12, 2025

@DiegoTavares
Collaborator

@DiegoTavares DiegoTavares commented Sep 27, 2025

This PR introduces a new module called "scheduler." This module is responsible for the booking aspect of Cuebot and is designed to offload this feature from the central module.

Rationale: Cuebot's booking logic responds to each HostReport with a new task that searches for layers to dispatch to the reporting host. Consequently, each report generates a BookingQuery, which puts significant load on the database. As a result, scaling Cuebot is limited by how much database capacity is available to handle these complex queries. This new module takes the booking workload off Cuebot.

Booking on the Scheduler is not triggered by host reports; instead, it operates through an internal loop that searches for pending jobs and seeks suitable matches from a cached view of the hosts in the database. The scheduler organizes layers and hosts into clusters, with each cluster representing a group of show and allocation combinations. This structure allows multiple instances of the scheduler to share the load without competing for work, which is a significant issue in Cuebot.
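
To make the cluster idea concrete, here is a minimal sketch in Python (for illustration only, it is not code from this PR). The names `Host`, `Layer`, `assign_clusters`, and `book_once` are hypothetical, and the round-robin assignment of clusters to scheduler instances is an assumption; the PR description only says that clusters let instances share the load without competing.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    allocation: str
    idle_cores: int


@dataclass
class Layer:
    name: str
    show: str
    allocation: str
    cores_needed: int


def assign_clusters(clusters, instance_count):
    """Split (show, allocation) clusters across instances (assumed round-robin)."""
    assignment = {i: [] for i in range(instance_count)}
    for index, cluster in enumerate(sorted(clusters)):
        assignment[index % instance_count].append(cluster)
    return assignment


def book_once(pending_layers, cached_hosts, my_clusters):
    """One pass of the loop: match pending layers against the cached host view."""
    hosts_by_allocation = defaultdict(list)
    for host in cached_hosts:
        hosts_by_allocation[host.allocation].append(host)

    bookings = []
    for layer in pending_layers:
        if (layer.show, layer.allocation) not in my_clusters:
            continue  # another scheduler instance owns this cluster
        for host in hosts_by_allocation[layer.allocation]:
            if host.idle_cores >= layer.cores_needed:
                host.idle_cores -= layer.cores_needed  # update the cached view
                bookings.append((layer.name, host.name))
                break
    return bookings
```

Because each instance only looks at layers in its own clusters, no host report is needed to trigger a booking and two instances never search for the same work in this sketch.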

To enable Cuebot and the Scheduler to run concurrently without competing for work, a new feature was added to Cuebot, as detailed in #2087. This feature adds an exclusion list of show and allocation combinations that Cuebot should not book, and it can also halt Cuebot booking for all shows altogether.
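
As a rough illustration of how such an exclusion check could behave, here is a small Python sketch. The function name, the set of `(show, allocation)` pairs, and the `booking_disabled` flag are all hypothetical; the actual configuration format is defined in #2087 and is not reproduced here.

```python
def cuebot_should_book(show, allocation, excluded, booking_disabled=False):
    """Return True if Cuebot (not the Scheduler) should book this combination.

    `excluded` is assumed to be a set of (show, allocation) pairs handed
    over to the Scheduler; `booking_disabled` models halting all booking.
    """
    if booking_disabled:
        return False
    return (show, allocation) not in excluded
```

With `excluded = {("showA", "farm")}`, Cuebot would skip `showA` on `farm` and keep booking every other combination, which leaves that cluster to the Scheduler.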

- Implement FrameRange and FrameSet structs to parse and represent complex frame range syntaxes including stepped, inverse stepped, negative steps, and interleaved ranges
- Support chunking FrameSets into compact sub-ranges for dispatching
- Integrate FrameSet chunking in RqdDispatcher for precise frame chunking
- Improve dispatch error handling with distinct error types
- Update host DAO and models to include allocation info for resource checks
- Add .gitignore entry for /sandbox/kafka*

Signed-off-by: Diego Tavares <dtavares@imageworks.com>
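
The frame range parsing and chunking described in this commit can be sketched roughly as follows. This is an illustrative Python sketch, not the FrameRange/FrameSet implementation from the PR. It assumes the `x` (stepped) and `y` (inverse stepped) markers of OpenCue's frame range syntax, leaves out interleaved ranges, and the function names `parse_range`, `parse_frame_set`, `compact`, and `chunks` are hypothetical.

```python
import re

# start[-end[(x|y)step]], where every number may be negative
_RANGE = re.compile(r"^(-?\d+)(?:-(-?\d+)(?:([xy])(-?\d+))?)?$")


def parse_range(spec):
    """Expand one range such as '1-10', '1-10x3', '1-10y3' or '10-1x-2'."""
    match = _RANGE.match(spec.strip())
    if match is None:
        raise ValueError(f"unsupported frame range: {spec!r}")
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) is not None else start
    marker = match.group(3) or "x"
    step = int(match.group(4)) if match.group(4) is not None else 1
    if step == 0:
        raise ValueError("step cannot be zero")
    if (end - start) * step < 0:
        raise ValueError("step sign must match the range direction")
    direction = 1 if step > 0 else -1
    frames = list(range(start, end + direction, direction))
    if marker == "x":
        return frames[:: abs(step)]
    # Inverse step: keep the frames a stepped range would skip.
    return [f for i, f in enumerate(frames) if i % abs(step) != 0]


def parse_frame_set(spec):
    """Expand a comma-separated list of ranges, preserving order."""
    frames = []
    for part in spec.split(","):
        frames.extend(parse_range(part))
    return frames


def compact(frames):
    """Render frames as comma-separated ranges, merging constant-step runs."""
    parts = []
    i = 0
    while i < len(frames):
        j = i
        step = frames[i + 1] - frames[i] if i + 1 < len(frames) else 0
        # Extend the run while the gap between neighbours stays the same.
        while step != 0 and j + 1 < len(frames) and frames[j + 1] - frames[j] == step:
            j += 1
        if j - i >= 2 or (j - i == 1 and abs(step) == 1):
            suffix = "" if step == 1 else f"x{step}"
            parts.append(f"{frames[i]}-{frames[j]}{suffix}")
            i = j + 1
        else:
            parts.append(str(frames[i]))
            i += 1
    return ",".join(parts)


def chunks(frames, chunk_size):
    """Split frames into chunks and describe each one as a compact sub-range."""
    return [compact(frames[i:i + chunk_size]) for i in range(0, len(frames), chunk_size)]
```

For example, `parse_range("1-10y3")` yields `[2, 3, 5, 6, 8, 9]`, and `chunks(parse_frame_set("1-10"), 4)` yields `["1-4", "5-8", "9-10"]`, so each dispatched chunk carries a short range string instead of a list of individual frames.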

The producer module produces events on Kafka for each pending job. The consumer modules consume events and book jobs on hosts, still relying on the database.

This version still has an issue when multiple tests run at the same time: the tests share a database instance and rely on it existing to work.

Optimized async + pgpool interaction, but still far from perfect.

Last commit before giving up on dashmap

There is a protection against processing multiple bookings on a single host at the same time on
HostDao that uses a database lock. This protection is intended for multiple instances of the
scheduler running at the same time. However, this logic was also being triggered by a single
instance, which indicated there was a race condition in place.

The race condition happens because hosts can belong to multiple groups at the same time.
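
The commit message does not say how the race was resolved. Purely as an illustration of the problem, the sketch below shows one way a single process could avoid booking the same host from two groups at once, using an in-memory reservation set. The class and method names are hypothetical and this is not the HostDao logic from the PR.

```python
import threading


class HostReservations:
    """In-process guard so a host is only booked by one group at a time."""

    def __init__(self):
        self._lock = threading.Lock()
        self._busy = set()

    def try_reserve(self, host_id):
        """Return True if the caller now owns the host, False if it is taken."""
        with self._lock:
            if host_id in self._busy:
                return False
            self._busy.add(host_id)
            return True

    def release(self, host_id):
        """Give the host back once the booking attempt has finished."""
        with self._lock:
            self._busy.discard(host_id)
```

A worker handling one group would call `try_reserve(host_id)` before booking and skip the host when it returns `False`, leaving the database lock to cover only the case of multiple scheduler instances.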
@DiegoTavares DiegoTavares force-pushed the distributed_scheduler_2 branch from f0a4ffc to 06ad353 Compare December 3, 2025 01:56
@DiegoTavares DiegoTavares marked this pull request as ready for review December 8, 2025 23:39
@DiegoTavares
Collaborator Author

Sorry for the huge PR. Making incremental stacked PRs starting now until this is merged.

@DiegoTavares DiegoTavares changed the title [POC] Distributed scheduler Distributed scheduler Dec 11, 2025
@DiegoTavares
Collaborator Author

A new PR will be created to handle documenting this new module. Adding docs to this PR would make it too big to be reviewed.

@DiegoTavares DiegoTavares changed the base branch from master to new-scheduler December 12, 2025 17:42
@DiegoTavares DiegoTavares merged commit b02b608 into AcademySoftwareFoundation:new-scheduler Dec 12, 2025
22 checks passed
DiegoTavares added a commit that referenced this pull request Dec 16, 2025
This PR introduces a new module called "scheduler." This module is
responsible for the booking aspect of Cuebot and is designed to offload
this feature from the central module.

Rationale: Cuebot's booking logic depends on responding to each
HostReport with a new task that searches for layers to dispatch to the
reporting host. Consequently, each request generates a
[BookingQuery](https://github.com/AcademySoftwareFoundation/OpenCue/blob/master/cuebot/src/main/java/com/imageworks/spcue/dao/postgres/DispatchQuery.java),
which significantly impacts the database. As a result, scaling Cuebot is
limited by the need to optimize database capacity to handle complex
queries. This new module alleviates the booking workload from Cuebot.

Booking on the Scheduler is not triggered by host reports; instead, it
operates through an internal loop that searches for pending jobs and
seeks suitable matches from a cached view of the hosts in the database.
The scheduler organizes layers and hosts into clusters, with each
cluster representing a group of show and allocation combinations. This
structure allows multiple instances of the scheduler to share the load
without competing for work, which is a significant issue in Cuebot.

To enable Cuebot and the Scheduler to run concurrently without competing
for work, a new feature was added to Cuebot, as detailed in
#2087. This
feature allows for the addition of an exclusion list containing show and
allocations that should not be booked, or it can halt booking for all
shows altogether.

---------

Signed-off-by: Diego Tavares <dtavares@imageworks.com>