Description
Depends on #7614
Depends on #9310
The issue
DCR as a service in our infrastructure design (AWS) is a single point of failure, as it serves content to multiple other swimlaned micro-services (the frontend article, facia and applications apps in the diagram below), all of which are served via DCR's one service.
This creates a host of issues, and we are starting to see real-world examples of them, illustrated below. We should address this before the issues become more impactful as we start to serve more traffic to the service via the apps rendering work.
Current request flow
---
title: Current request flow
---
graph LR
Request-->Fastly
Fastly-->Router
Router-->FEArticleLB
Router-->FEFaciaLB
Router-->FEApplicationsLB
subgraph Frontend
FEArticleLB-->FEArticleApp
FEFaciaLB-->FEFaciaApp
FEApplicationsLB-->FEApplicationsApp
end
FEArticleApp-->DcrLB
FEFaciaApp-->DcrLB
FEApplicationsApp-->DcrLB
DcrLB-->DCR
Solutions
Co-Authored-By: @AshCorr
Co-Authored-By: @arelra
While we could create completely new apps for each service, this would create a load of upfront work to address the immediate issue of DCR being a bottleneck.
We have suggested that we stick to splitting the infrastructure first to mitigate the risks there. This would include:
## Tasklist
- [ ] https://github.com/guardian/dotcom-rendering/issues/9310
- [ ] https://github.com/guardian/dotcom-rendering/issues/9323
- [ ] https://github.com/guardian/dotcom-rendering/issues/9324
- [ ] https://github.com/guardian/dotcom-rendering/issues/9325
- [x] Loading the config of those values into each service's SSM for [rendering.baseURL](https://github.com/guardian/frontend/blob/main/common/app/common/configuration.scala#L152)
- [ ] Testing testing
As mentioned above - @AshCorr has suggested that we move to CDK (#7614) first to make this easier and more seamless.
The ongoing work on making apps webviews available via DCR is already uncovering how we might architect the application itself towards a micro-frontend structure, but that is out of scope for this issue.
Suggested request flow
Option 1: One LB per app
---
title: Suggested request flow
---
graph LR
Request-->Fastly
Fastly-->Router
Router-->FEArticleLB
Router-->FEFaciaLB
Router-->FEApplicationsLB
subgraph Frontend
FEArticleLB-->FEArticleApp
FEFaciaLB-->FEFaciaApp
FEApplicationsLB-->FEApplicationsApp
end
FEArticleApp--/Article-->DcrArticleLB-->DcrApp1
FEFaciaApp--/Front-->DcrFaciaLB-->DcrApp2
FEApplicationsApp--/Interactives-->DcrApplicationLB-->DcrApp3
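As a rough illustration of Option 1, each frontend app would be configured with the base URL of its own DCR load balancer and call only that one. The sketch below is not the actual frontend or DCR config: the host names, the `FrontendApp` type and the `renderingUrl` helper are hypothetical, and only the `/Article`, `/Front` and `/Interactives` paths come from the diagram.

```typescript
// Hypothetical sketch of Option 1: one DCR load balancer per frontend app.
// All host names and identifiers here are assumptions for illustration.
type FrontendApp = "article" | "facia" | "applications";

// Each frontend app gets its own rendering base URL (one LB per app).
const renderingBaseUrls: Record<FrontendApp, string> = {
  article: "http://dcr-article-lb.example.internal",
  facia: "http://dcr-facia-lb.example.internal",
  applications: "http://dcr-applications-lb.example.internal",
};

// The DCR endpoint each frontend app calls, as labelled in the diagram.
const renderingPaths: Record<FrontendApp, string> = {
  article: "/Article",
  facia: "/Front",
  applications: "/Interactives",
};

// Build the URL a given frontend app would use to reach its own DCR stack.
function renderingUrl(app: FrontendApp): string {
  return `${renderingBaseUrls[app]}${renderingPaths[app]}`;
}

console.log(renderingUrl("facia"));
```

With this shape, a problem on one load balancer or app only affects the frontend app that points at it.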
Option 2: One LB for all DCR apps
---
title: Suggested request flow
---
graph LR
Request-->Fastly
Fastly-->Router
Router-->FEArticleLB
Router-->FEFaciaLB
Router-->FEApplicationsLB
subgraph Frontend
FEArticleLB-->FEArticleApp
FEFaciaLB-->FEFaciaApp
FEApplicationsLB-->FEApplicationsApp
end
FEArticleApp--/Article-->DcrLB-->DcrArticleApp1
FEFaciaApp--/Front-->DcrLB-->DcrFaciaApp2
FEApplicationsApp--/Interactives-->DcrLB-->DcrApplicationsApp3
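Option 2 relies on the single load balancer choosing a target app based on the request path. A minimal sketch of that routing decision is below. It is a model of the idea rather than real load balancer configuration: the `routingRules` table and `routeRequest` function are hypothetical, and the target names are taken from the diagram.

```typescript
// Hypothetical model of Option 2: one load balancer, path-based routing.
type DcrTarget = "DcrArticleApp1" | "DcrFaciaApp2" | "DcrApplicationsApp3";

interface RoutingRule {
  pathPrefix: string;
  target: DcrTarget;
}

// Path prefixes as labelled in the diagram; rules are checked in order.
const routingRules: ReadonlyArray<RoutingRule> = [
  { pathPrefix: "/Article", target: "DcrArticleApp1" },
  { pathPrefix: "/Front", target: "DcrFaciaApp2" },
  { pathPrefix: "/Interactives", target: "DcrApplicationsApp3" },
];

// Return the target for a path, or undefined when no rule matches.
// Note this is simple prefix matching, so "/Frontier" would also match "/Front".
function routeRequest(path: string): DcrTarget | undefined {
  const rule = routingRules.find((r) => path.startsWith(r.pathPrefix));
  return rule?.target;
}

console.log(routeRequest("/Article"));
```

The apps are still isolated from each other here, but the load balancer itself remains shared, which is the trade-off between the two options.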
An example of where else we do this is in MAPI via microservice CDK (thanks @JamieB-gu)
We had a chat with @akash1810 and they suggested it would be good to talk to someone from AWS to discuss which option would be better to go with considering the amount of traffic the load balancer(s) would get.
Examples
Performance of one app affects another
- We receive a blast of traffic to interactives e.g.
- This locks up the threads in DCR due to the size of the JSON being parsed in those articles
- DCR slows in terms of performance
⚠️ Point of failure: DCR articles and fronts are also served slowly
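The scenario above can be modelled with a deliberately simplified capacity calculation. The sketch assumes that an overloaded pool degrades all of its traffic proportionally, which is an assumption and not a measurement of DCR. All names and numbers are hypothetical.

```typescript
// Toy model: fraction of requests served when capacity is shared vs split.
// Assumption: an overloaded pool serves every app at the same reduced fraction.
type App = "article" | "front" | "interactive";
type PerApp = Record<App, number>;

// Fraction of demand a pool can serve (1 means everything is served).
function servedFraction(demand: number, capacity: number): number {
  return demand <= capacity ? 1 : capacity / demand;
}

// One shared pool: every app sees the same degradation.
function sharedPool(load: PerApp, capacity: number): PerApp {
  const total = load.article + load.front + load.interactive;
  const fraction = servedFraction(total, capacity);
  return { article: fraction, front: fraction, interactive: fraction };
}

// One pool per app: an overloaded app only degrades itself.
function splitPools(load: PerApp, capacity: PerApp): PerApp {
  return {
    article: servedFraction(load.article, capacity.article),
    front: servedFraction(load.front, capacity.front),
    interactive: servedFraction(load.interactive, capacity.interactive),
  };
}

// Hypothetical burst: interactives receive four times their usual traffic.
const burst: PerApp = { article: 100, front: 100, interactive: 400 };
console.log(sharedPool(burst, 300));
console.log(splitPools(burst, { article: 100, front: 100, interactive: 100 }));
```

In this model the same total capacity (300) serves half of all traffic when shared, whereas with split pools articles and fronts are fully served and only interactives degrade.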
Traffic to one app affects another
- We receive a blast of traffic to fronts
- This goes through `router` ➡️ `frontend/facia` ➡️ DCR
- DCR slows in terms of performance
⚠️ Point of failure: DCR articles are also served slowly
Unnecessary scaling of services
- Traffic to articles increases
- `frontend/article` scales up
- DCR in turn scales up
⚠️ Point of failure: `frontend/facia` is now running at the new scaled-up version
Error handling
- We bork something on the `/Article` endpoint
- This pushes 500s to `frontend/article`
- This bubbles through our request pipeline
- Cache will catch a lot of this, but we will see a larger % of traffic to origin to try to get a valid response
⚠️ Point of failure: `/Front` endpoint slows down due to the massive increase in traffic