Add watchdog functionalities #600
AI-Memory left a comment
Some of the commits in this PR look like intermediate patches, so it would be better to squash some of them and give them more meaningful commit messages. Thanks.
- executor now sends heartbeats to report its current status and receive commands from the scheduler
- scheduler can now detect lost executor(s), monitor their status, and fail the corresponding task(s) when necessary
- users can cancel a task; a new API has been added
- new tests for task cancelation and dangling task detection
  ./teaclave_authentication_service &
  ./teaclave_storage_service &
- sleep 3 # wait for authentication and storage service
+ sleep 10 # wait for authentication and storage service
IMO, this suggests a new feature similar to the readiness probe in k8s.
Agreed. I think probing can be added in later commits.
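To illustrate the readiness-probe idea discussed above, here is a minimal sketch of what could replace a fixed `sleep`: poll a TCP port until the service accepts connections. The helper name `wait_for_port` and its parameters are hypothetical; nothing in this PR defines them, and the ports the services listen on are not stated here.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 10.0,
                  interval: float = 0.1) -> bool:
    """Return True once a TCP connection to (host, port) succeeds.

    Returns False if no connection succeeds before `timeout` seconds elapse.
    This is a hypothetical helper, not part of Teaclave.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not ready yet (refused or timed out); wait briefly and retry.
            time.sleep(interval)
    return False
```

A startup script could call this once per service and abort if it returns False, instead of guessing a sleep duration. Note that an open port only shows the process is listening, not that it is fully initialized.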
# or something unexpected happens,
# you may uncomment the following lines to cancel the task
# time.sleep(3)
# print("[+] canceling task")
This code comment could be removed if it is incomplete.
The code can be uncommented to cancel an executing task. It is meant to demonstrate calling the new cancel_task API and give users an example. This example runs for a long time, so users may want to know how to cancel a task that has been running for a long period without responding.
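For readers who want to see the shape of that flow without a running deployment, here is a self-contained sketch. `FakeClient` is a hypothetical stand-in, not the Teaclave client; only the method name `cancel_task` comes from this PR, and its real signature and the task id format are assumptions.

```python
import threading
import time


class FakeClient:
    """Hypothetical stand-in for a client; NOT the real Teaclave API."""

    def __init__(self) -> None:
        self._cancelled = threading.Event()
        self.status = "running"

    def run_task(self) -> None:
        # Simulate a long-running task that polls for a cancel request.
        while not self._cancelled.wait(timeout=0.05):
            pass
        self.status = "canceled"

    def cancel_task(self, task_id: str) -> None:
        # Assumed signature: the real API may take different arguments.
        self._cancelled.set()


client = FakeClient()
worker = threading.Thread(target=client.run_task)
worker.start()

time.sleep(0.2)  # let the task run for a moment
print("[+] canceling task")
client.cancel_task("task-1")  # "task-1" is a made-up id

worker.join(timeout=2)
print(client.status)
```

The point of the sketch is the ordering: start the task, wait, request cancelation, then confirm the task has actually stopped before moving on.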
Description
Clients can now cancel a task, even while the task is executing.
Teaclave can now detect dangling tasks, because the executor sends heartbeat packets to report its liveness.
Currently, the Client Rust API changes are missing.
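The dangling-task detection described above can be sketched as a heartbeat table with a timeout. This is an illustrative model only, not the scheduler code in this PR: the class `Watchdog`, its method names, and the timeout value are all hypothetical.

```python
import time
from typing import Dict, List, Optional


class Watchdog:
    """Hypothetical model of heartbeat-based lost-executor detection."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_seen: Dict[str, float] = {}  # executor id -> last heartbeat
        self.running: Dict[str, str] = {}      # executor id -> task id

    def heartbeat(self, executor_id: str, task_id: Optional[str] = None,
                  now: Optional[float] = None) -> None:
        """Record a heartbeat and the task the executor reports, if any."""
        now = time.monotonic() if now is None else now
        self.last_seen[executor_id] = now
        if task_id is None:
            self.running.pop(executor_id, None)
        else:
            self.running[executor_id] = task_id

    def reap(self, now: Optional[float] = None) -> List[str]:
        """Drop executors whose heartbeat is overdue.

        Returns the ids of tasks that were running on them, so the caller
        can mark those tasks as failed.
        """
        now = time.monotonic() if now is None else now
        lost = [e for e, t in self.last_seen.items() if now - t > self.timeout]
        failed: List[str] = []
        for executor_id in lost:
            del self.last_seen[executor_id]
            task_id = self.running.pop(executor_id, None)
            if task_id is not None:
                failed.append(task_id)
        return failed
```

In this model the scheduler would call `reap` periodically. The `now` parameter exists only so the logic can be exercised deterministically without real waiting.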
Fixes # (issue)
Type of change (select or add applied and delete the others)
How has this been tested?
Checklist
master.