Distributed scheduler by DiegoTavares · Pull Request #2002 · AcademySoftwareFoundation/OpenCue

DiegoTavares · 2025-09-27T01:19:05Z

This PR introduces a new module called "scheduler." This module is responsible for the booking aspect of Cuebot and is designed to offload this feature from the central module.

Rationale: Cuebot's booking logic responds to each HostReport by creating a new task that searches for layers to dispatch to the reporting host. Consequently, each request generates a [BookingQuery](https://github.com/AcademySoftwareFoundation/OpenCue/blob/master/cuebot/src/main/java/com/imageworks/spcue/dao/postgres/DispatchQuery.java), which places a significant load on the database. As a result, scaling Cuebot is limited by the need to optimize database capacity to handle complex queries. This new module takes the booking workload off Cuebot.

Booking on the Scheduler is not triggered by host reports; instead, it operates through an internal loop that searches for pending jobs and seeks suitable matches from a cached view of the hosts in the database. The scheduler organizes layers and hosts into clusters, with each cluster representing a group of show and allocation combinations. This structure allows multiple instances of the scheduler to share the load without competing for work, which is a significant issue in Cuebot.
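The clustering idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Cluster` tuple, the `assign_clusters` and `match_layers` helpers, and the round-robin assignment are hypothetical and are not taken from the scheduler's code. It only shows how disjoint (show, allocation) clusters let several instances run the same loop without overlapping.

```python
from collections import defaultdict
from typing import NamedTuple


class Cluster(NamedTuple):
    """Hypothetical cluster key: one show paired with one allocation."""
    show: str
    allocation: str


def assign_clusters(clusters, instance_count):
    """Assign each cluster to exactly one scheduler instance (round-robin).

    Sorting first makes the assignment deterministic, so every instance
    could compute the same mapping independently.
    """
    assignment = defaultdict(list)
    for index, cluster in enumerate(sorted(set(clusters))):
        assignment[index % instance_count].append(cluster)
    return dict(assignment)


def match_layers(pending_layers, cached_hosts, my_clusters):
    """One pass of the booking loop for a single instance.

    pending_layers: list of (layer_name, Cluster)
    cached_hosts:   list of (host_name, Cluster), a cached view of hosts
    Returns (layer_name, host_name) pairs; each host is used at most once.
    """
    owned = set(my_clusters)
    free_hosts = defaultdict(list)
    for host_name, cluster in cached_hosts:
        if cluster in owned:
            free_hosts[cluster].append(host_name)

    bookings = []
    for layer_name, cluster in pending_layers:
        if cluster in owned and free_hosts[cluster]:
            bookings.append((layer_name, free_hosts[cluster].pop(0)))
    return bookings


clusters = [
    Cluster("show_a", "local.general"),
    Cluster("show_a", "cloud.gpu"),
    Cluster("show_b", "local.general"),
]
assignment = assign_clusters(clusters, instance_count=2)
```

Because every cluster is owned by exactly one instance, two instances never consider the same layer or host in this sketch, so no cross-instance coordination is needed for matching.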

To enable Cuebot and the Scheduler to run concurrently without competing for work, a new feature was added to Cuebot, as detailed in #2087. This feature adds an exclusion list of shows and allocations that Cuebot should not book, and it can also halt booking for all shows altogether.
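As a rough sketch of the behaviour described above, the check on the Cuebot side could look like the following. The function name, its parameters, and the data shape are hypothetical; the actual configuration is defined in #2087 and is not reproduced here.

```python
def should_book(show, allocation, excluded, booking_disabled=False):
    """Return True if Cuebot should book this show/allocation pair.

    excluded: set of (show, allocation) pairs left to the scheduler.
    booking_disabled: when True, booking is halted for all shows.
    """
    if booking_disabled:
        return False
    return (show, allocation) not in excluded


# Hypothetical exclusion list: the scheduler owns show_a on local.general.
excluded_pairs = {("show_a", "local.general")}
```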

- Implement FrameRange and FrameSet structs to parse and represent complex frame range syntaxes including stepped, inverse stepped, negative steps, and interleaved ranges
- Support chunking FrameSets into compact sub-ranges for dispatching
- Integrate FrameSet chunking in RqdDispatcher for precise frame chunking
- Improve dispatch error handling with distinct error types
- Update host DAO and models to include allocation info for resource checks
- Add .gitignore entry for /sandbox/kafka*
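To make the frame range items concrete, here is a small sketch of parsing and chunking. It is not the PR's FrameRange/FrameSet code. It assumes the syntax `start-end`, `start-endxN` (stepped) and `start-endyN` (inverse stepped), with negative steps for descending ranges; interleaved ranges are left out for brevity. All function names are hypothetical.

```python
import re

# start, optional "-end", optional "x" or "y" separator with a step
_RANGE = re.compile(r"^(-?\d+)(?:-(-?\d+)(?:([xy])(-?\d+))?)?$")


def parse_frame_range(spec):
    """Expand a single range such as '1-10x3' into a list of frames."""
    match = _RANGE.match(spec.strip())
    if match is None:
        raise ValueError(f"invalid frame range: {spec!r}")
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) is not None else start
    step = int(match.group(4)) if match.group(4) is not None else 1
    # Assumption: the sign of the step must agree with the range direction.
    if step == 0 or (step > 0 and end < start) or (step < 0 and end > start):
        raise ValueError(f"step does not match range direction: {spec!r}")
    direction = 1 if step > 0 else -1
    stepped = list(range(start, end + direction, step))
    if match.group(3) == "y":
        # Inverse stepped: every frame in the range that the step skips.
        skipped = set(stepped)
        return [f for f in range(start, end + direction, direction) if f not in skipped]
    return stepped


def parse_frame_set(spec):
    """Expand a comma-separated list of ranges, dropping repeated frames."""
    frames, seen = [], set()
    for part in spec.split(","):
        for frame in parse_frame_range(part):
            if frame not in seen:
                seen.add(frame)
                frames.append(frame)
    return frames


def compact(frames):
    """Render frames as a compact spec, merging runs that increase by 1."""
    parts, i = [], 0
    while i < len(frames):
        j = i
        while j + 1 < len(frames) and frames[j + 1] == frames[j] + 1:
            j += 1
        parts.append(str(frames[i]) if i == j else f"{frames[i]}-{frames[j]}")
        i = j + 1
    return ",".join(parts)


def chunk_frames(frames, chunk_size):
    """Split frames into groups of chunk_size, each as a compact sub-range."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [compact(frames[i:i + chunk_size]) for i in range(0, len(frames), chunk_size)]
```

For example, `chunk_frames(parse_frame_set("1-5,8"), 2)` yields one compact spec per dispatch chunk, so a dispatcher could send each chunk as a short string instead of an explicit frame list.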

Signed-off-by: Diego Tavares <dtavares@imageworks.com>

HostDao uses a database lock to prevent multiple bookings from being processed on a single host at the same time. This protection is intended for multiple scheduler instances running concurrently. However, the lock was also being triggered within a single instance, which indicated a race condition. The race occurs because a host can belong to multiple groups at the same time.
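The source does not show how the race was resolved. Purely as an illustration, one common way to keep two workers in the same process from booking the same host is an in-memory "in flight" registry with try-acquire semantics. The `HostBookingGuard` class below is hypothetical and is not taken from the PR; the database lock would still be needed across instances.

```python
import threading
from contextlib import contextmanager


class HostBookingGuard:
    """Tracks hosts that are currently being booked inside this process."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = set()

    @contextmanager
    def try_book(self, host_id):
        """Yield True if the caller won the host, False if it is already taken."""
        with self._lock:
            acquired = host_id not in self._in_flight
            if acquired:
                self._in_flight.add(host_id)
        try:
            yield acquired
        finally:
            # Only the winner releases the host.
            if acquired:
                with self._lock:
                    self._in_flight.discard(host_id)


guard = HostBookingGuard()
# Two groups that both contain host-1 try to book it at the same time.
with guard.try_book("host-1") as first:
    with guard.try_book("host-1") as second:
        results = (first, second)
```

In this sketch the second attempt sees `False` and can skip the host instead of reaching the database lock.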

DiegoTavares · 2025-12-08T23:42:10Z

Sorry for the huge PR. Making incremental stacked PRs starting now until this is merged.

DiegoTavares · 2025-12-11T22:48:45Z

A new PR will be created to handle documenting this new module. Adding docs to this PR would make it too big to be reviewed.

DiegoTavares added 24 commits July 10, 2025 10:33

Setup new module for a job queue service

93c8317

Initial version of the distributed job-scheduler

425a393

[draft] dispatcher

1068970

Add job_resource cores limits to host_dao query

71aa58b

Merge branch 'master' into distributed_scheduler

378dde3

Signed-off-by: Diego Tavares <dtavares@imageworks.com>

Implement scheduler using kafka

c0a6193

The producer module produces events on Kafka for each pending job. The consumer modules consume events and book jobs on hosts, still relying on the database.

Compiles

afb000c

Make all memory fields bytesize

75ac963

Fix database memory values from bytes to kb

1596d02

Implement cluster logic using facility+show+tag

dd3d41a

Fix layer host candidate loop

40d8a2b

Remove dead files and old TODOs

b147b7f

Add integration tests

94d7a6c

This version still has an issue when multiple tests run at the same time, because the tests share a database instance and rely on it existing to work.

Rename and refactor integration_tests to smoke_tests

d04eaa9

WIP: Add scheduler stress tests

804f2ae

Minor fixes

78ecb0e

First working stress tests

623fcc3

Update job fetcher to use fetch_all and stream processing

fd6305e

Optimized async + pgpool interaction, but still far from perfect.

Refactor modules

d959d16

Batch layer and frame queries

1b7a9ad

Fixed several dashmap related deadlocks

2b97f12

Last commit before giving up on dashmap

Convert host_cache to scc

3216258

Wrap HostCache in an Actor System using actix

2e9f651

DiegoTavares mentioned this pull request Sep 27, 2025

[POC] Distributed scheduler #1809

Closed

DiegoTavares added 5 commits September 26, 2025 18:27

Remove unecessary debug statements

55f5e41

Migrate Dispatcher interface into an Actor

733f366

Migrate Dispatcher interface into an Actor

2dee076

Clean up warnings

2af6c54

DiegoTavares added 11 commits November 20, 2025 16:34

Add debug log when host is considered stale in cache

ac35982

Change Id types to Uuid

178ba52

Replace host.ts_last_updated by host_stat.ts_ping

11aab71

Besides that, use host_stats for up-to-date memory information when updating the host cache.

Set PostgreSQL connection to UTC timezone

0e9bc63

Remove custom timestamp layers from tracing logs

e5d60bb

Fix case issue on facility pk on hos_dao

c3581e6

Revert cuebot changes

8d35a31

To simplify testing, these changes are being migrated to a new PR

Remove unused opencue.properties entries

fafef7e

Entries were migrated to a new PR isolating the feature they were related to

Add retry count tracking for dispatched frames

00c318c

Add option to ignore a list of tags

9ffcb65

The new option is defined as: ```yaml ```

Make host_dao facility query case-insensitive

06ad353

DiegoTavares force-pushed the distributed_scheduler_2 branch from f0a4ffc to 06ad353 December 3, 2025 01:56

DiegoTavares added 3 commits December 4, 2025 11:38

Add configurable frame memory soft and hard limits

d11b51d

Update Cluster type to include facility ID

2faa2cf

Merge branch 'master' into distributed_scheduler_2

0b62d3b

Signed-off-by: Diego Tavares <dtavares@imageworks.com>

DiegoTavares marked this pull request as ready for review December 8, 2025 23:39

DiegoTavares requested review from lithorus and ramonfigueiredo as code owners December 8, 2025 23:39

Remove local files

5699d77

DiegoTavares changed the title ~~[POC] Distributed scheduler~~ Distributed scheduler Dec 11, 2025

DiegoTavares mentioned this pull request Dec 11, 2025

[scheduler/cuebot/pycue/rqd/pyoutline] Booking by slot limit #2101

Closed

6 tasks

Merge branch 'master' into distributed_scheduler_2

738ce31

Fix unit test

628c878

DiegoTavares changed the base branch from master to new-scheduler December 12, 2025 17:42

DiegoTavares merged commit b02b608 into AcademySoftwareFoundation:new-scheduler Dec 12, 2025
22 checks passed

DiegoTavares mentioned this pull request Dec 12, 2025

[scheduler/cuebot/pycue/rqd/pyoutline] Booking by Slot #2105

Closed

6 tasks

DiegoTavares mentioned this pull request Dec 17, 2025

[scheduler/cuebot/pycue/rqd/pyoutline] Booking by Slot #2115

Open

8 tasks

DiegoTavares merged 98 commits into AcademySoftwareFoundation:new-scheduler from DiegoTavares:distributed_scheduler_2
