Multi tier database start/stop operations#4081

Merged
arbulu89 merged 7 commits into main from multi-tier-database-operations
Mar 30, 2026

Contributor

@arbulu89 arbulu89 commented Mar 16, 2026

Description

Enable multi-tier database start/stop operations.

  • Send the system_replication_tier value to wanda in the operation request. Only one instance per site needs to be sent, hence the uniq_by
  • Update backend policy to consider full database start/stop
  • Update backend policy heartbeat usage to check the heartbeat per site, instead of for the complete database
  • Update frontend to enable operation button based on heartbeat

Did you add the right label?

How was this tested?

UT and manual testing
@abravosuse I did a bunch of tests with a real scale up cluster, but feel free to use the env image to test it yourself (update wanda to latest version)

Did you update the documentation?

User-facing docs are still pending an update

@arbulu89 arbulu89 added enhancement New feature or request env Create an ephemeral environment for the pr branch labels Mar 16, 2026
@arbulu89 arbulu89 force-pushed the multi-tier-database-operations branch 2 times, most recently from 64b44e9 to 6a9ff90 Compare March 23, 2026 14:14
@arbulu89 arbulu89 force-pushed the multi-tier-database-operations branch 4 times, most recently from 8277be8 to 70fdb28 Compare March 26, 2026 14:29
@arbulu89 arbulu89 force-pushed the multi-tier-database-operations branch from 70fdb28 to cf93a8b Compare March 26, 2026 15:33
@arbulu89 arbulu89 requested a review from nelsonkopliku March 26, 2026 15:54
@arbulu89 arbulu89 marked this pull request as ready for review March 26, 2026 15:55
@abravosuse
Contributor

@arbulu89 I am not going to be able to test this until after Easter. If you are confident that things are working fine overall, go ahead and merge it, and I will run a battery of tests from main when I get back.

Member

@nelsonkopliku nelsonkopliku left a comment

Hey Xabi, I just have a couple of questions.

The rest looks good.

runningOperations,
database.id,
DATABASE_START,
matchesSite(null)
Member

question: does matchesSite(null) mean "a database instance that doesn't seem to be on any site"?

Contributor Author

Partially true XD.
Basically, we need this for the new stop/start of the full database, as we are not providing the site (better said, we are giving null as the site). This way the spinning icon shows only on the top operations button.
The database instance can belong to a site, but we don't care, as the operation applies to all sites.
Anyway, I said partially because it also covers the scenario you mentioned, where system replication is not enabled.
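
To make the role of `matchesSite(null)` concrete, here is a minimal JavaScript sketch. It assumes a running operation carries the targeted site in `params.site` and that a full-database operation carries none. The operation shape, the `isOperationRunning` helper, the `DATABASE_START` value, and this body of `matchesSite` are hypothetical stand-ins for illustration, not the code in this PR.

```javascript
// Hypothetical stand-ins: the real helpers and operation shape may differ.
const DATABASE_START = 'database_start'; // placeholder value

// Build a predicate matching operations that target `site`.
// A missing site is normalised to null, so matchesSite(null) matches
// full-database operations and databases without system replication.
const matchesSite = (site) => (operation) =>
  (operation.params?.site ?? null) === site;

// True when some running operation on the resource has the given type
// and also satisfies the extra predicate.
const isOperationRunning = (runningOperations, groupID, operation, predicate) =>
  runningOperations.some(
    (op) =>
      op.groupID === groupID && op.operation === operation && predicate(op)
  );

// A full-database start: no site in the params.
const runningOperations = [
  { groupID: 'db-1', operation: DATABASE_START, params: {} },
];

// Top-level button: spins, because the operation has no site.
const topButtonRunning = isOperationRunning(
  runningOperations,
  'db-1',
  DATABASE_START,
  matchesSite(null)
);

// Site table button: does not spin, no operation targets "Site1".
const siteButtonRunning = isOperationRunning(
  runningOperations,
  'db-1',
  DATABASE_START,
  matchesSite('Site1')
);
```

Under these assumptions, only the top-level button reports a running operation, which is the behaviour described above.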

Comment on lines +297 to +298
disabled={!someHostHeartbeatPassing}
disabledTooltip={operationNotAllowedMsg}
Member

question: help me remember, why are we removing hasSystemReplication from the check?
Is it because it would be caught by the API and eventually forbid the operation anyway?

Contributor Author

Until now, in the databases view we had 2 ways of doing operations.

  • Site operations: the operation applies to a specific site and the button is in the site table
  • Operations for databases without system replication: since there are no site tables here, starting/stopping a database required the button affected by this code change, but it only made sense in scenarios where system replication was not configured. That's why it was also disabled when system replication was present.

Now we can start/stop complete databases with system replication, so the condition must go away, as we reuse this button.

Comment on lines +31 to +39
|> Enum.group_by(& &1.system_replication_site)
|> Enum.reject(fn {_site, grouped_instances} ->
  Enum.any?(grouped_instances, fn %DatabaseInstanceReadModel{
                                    host: %HostReadModel{heartbeat: heartbeat}
                                  } ->
    heartbeat == :passing
  end)
end)
|> Enum.map(fn {site, _} -> site end)
Member

thought: at the end we want a unique list of sites with all the instances in that site not passing.

Would it work if we swapped to the following?

instances
|> filter_by_site(params)
|> reject all instances with passing heartbeat
|> uniq by site

it might avoid us grouping by site and ungrouping later.

Contributor Author

@arbulu89 arbulu89 Mar 30, 2026

> thought: at the end we want a unique list of sites with all the instances in that site not passing.

Correct.

> Would it work if we swapped to the following?

But if I reject before grouping, how do I know whether there is some passing host in that site?
For example, I have:

  • Site1: host1 passing, host2 critical
  • Site2: host3 passing, host4 critical

If I reject all passing instances, I would have [host2, host4], and the unique sites would be Site1 and Site2. But in my case I should have no sites, as each of them has a passing host as well.

I think doing some grouping is mandatory at some point
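
The difference between the two orderings can be sketched as follows. This is a JavaScript illustration only (the backend policy itself is Elixir), and the `host`, `site`, and `heartbeat` field names are assumptions chosen to mirror the example above.

```javascript
// Illustrative data mirroring the Site1/Site2 example (field names assumed).
const instances = [
  { host: 'host1', site: 'Site1', heartbeat: 'passing' },
  { host: 'host2', site: 'Site1', heartbeat: 'critical' },
  { host: 'host3', site: 'Site2', heartbeat: 'passing' },
  { host: 'host4', site: 'Site2', heartbeat: 'critical' },
];

// Proposed order: drop passing instances first, then take the unique sites.
// This loses the information that a site also had a passing host.
const rejectThenUniq = (list) => [
  ...new Set(
    list.filter((i) => i.heartbeat !== 'passing').map((i) => i.site)
  ),
];

// Grouping order: group by site first, then keep only the sites
// where no instance has a passing heartbeat.
const groupThenReject = (list) => {
  const bySite = new Map();
  for (const instance of list) {
    const group = bySite.get(instance.site) ?? [];
    group.push(instance);
    bySite.set(instance.site, group);
  }
  return [...bySite]
    .filter(([, group]) => !group.some((i) => i.heartbeat === 'passing'))
    .map(([site]) => site);
};

console.log(rejectThenUniq(instances)); // [ 'Site1', 'Site2' ]
console.log(groupThenReject(instances)); // []
```

With this data the reject-first version reports both sites as failing, while the grouped version reports none, which is the expected result.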

PS:
In the end, this is the reverse of what a "positive" filter would be, but as I wanted the list of sites without a passing heartbeat, we have the reject.
Initially I had:

instances
|> filter_by_site
|> group_by_site
# look if every site has at least one passing host
|> Enum.all?(fn {_site, grouped_instances} ->
   Enum.any?(grouped_instances, fn %DatabaseInstanceReadModel{
                                          host: %HostReadModel{heartbeat: heartbeat}
                                        } ->
          heartbeat == :passing
   end)
end)

Member

I think I got you, thanks!

Member

@nelsonkopliku nelsonkopliku left a comment

Thanks!

@arbulu89 arbulu89 merged commit 9e9bd87 into main Mar 30, 2026
59 checks passed
@arbulu89 arbulu89 deleted the multi-tier-database-operations branch March 30, 2026 11:30

Labels

enhancement New feature or request env Create an ephemeral environment for the pr branch

3 participants