
Spin up multiple AWS stacks to stop DCR being a single source of failure in our request path #8351

@jamesgorrie

Description

Depends on #7614
Depends on #9310

The issue

DCR as a service in our infrastructure design (AWS) is a single point of failure, as it serves content to multiple otherwise swimlaned micro-services: frontend/article, frontend/facia, and frontend/applications are all served via DCR's one service.

This creates a host of issues, of which we are already seeing real-world examples (illustrated below), and we should look to address them before they become more impactful as the apps rendering work sends more traffic to the service.

Current request flow

```mermaid
---
title: Current request flow
---
graph LR
    Request-->Fastly
    Fastly-->Router
    Router-->FEArticleLB
    Router-->FEFaciaLB
    Router-->FEApplicationsLB
    subgraph Frontend
        FEArticleLB-->FEArticleApp
        FEFaciaLB-->FEFaciaApp
        FEApplicationsLB-->FEApplicationsApp
    end
    FEArticleApp-->DcrLB
    FEFaciaApp-->DcrLB
    FEApplicationsApp-->DcrLB
    DcrLB-->DCR
```

Solutions

Co-Authored-By: @AshCorr
Co-Authored-By: @arelra

While we could create completely new apps for each service, that would mean a lot of upfront work before addressing the immediate issue of DCR being a bottleneck.

We have suggested that we stick to splitting the infrastructure first to mitigate the risks there. This would include:

## Tasklist
- [ ] https://github.com/guardian/dotcom-rendering/issues/9310
- [ ] https://github.com/guardian/dotcom-rendering/issues/9323
- [ ] https://github.com/guardian/dotcom-rendering/issues/9324
- [ ] https://github.com/guardian/dotcom-rendering/issues/9325
- [ ] Loading the config of those values into each service's SSM parameters for [rendering.baseURL](https://github.com/guardian/frontend/blob/main/common/app/common/configuration.scala#L152)
- [ ] Testing
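The SSM task above can be sketched as a small naming helper: each service would read its own `rendering.baseURL` from a per-service SSM path, so pointing frontend/article at its own DCR stack becomes a config change. The path scheme and stage names here are hypothetical, not the real Frontend layout:

```typescript
// Sketch only: hypothetical SSM parameter paths for per-service
// rendering.baseURL values, assuming one DCR stack per swimlane.
type Service = "article" | "facia" | "applications";

function renderingBaseUrlParam(stage: string, service: Service): string {
  // e.g. renderingBaseUrlParam("PROD", "article")
  //      -> "/frontend/PROD/article/rendering.baseURL"
  return `/frontend/${stage}/${service}/rendering.baseURL`;
}
```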

As mentioned above, @AshCorr has suggested that we move to CDK (#7614) first to make this easier and more seamless.

The ongoing work to make apps webviews available via DCR is already uncovering how we might architect the application itself towards a more micro-frontend structure, but that is out of scope for this issue.

Suggested request flow

Option 1: One LB per app

```mermaid
---
title: Suggested request flow
---
graph LR
    Request-->Fastly
    Fastly-->Router
    Router-->FEArticleLB
    Router-->FEFaciaLB
    Router-->FEApplicationsLB
    subgraph Frontend
        FEArticleLB-->FEArticleApp
        FEFaciaLB-->FEFaciaApp
        FEApplicationsLB-->FEApplicationsApp
    end
    FEArticleApp--/Article-->DcrArticleLB-->DcrApp1
    FEFaciaApp--/Front-->DcrFaciaLB-->DcrApp2
    FEApplicationsApp--/Interactives-->DcrApplicationLB-->DcrApp3
```
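One way to read Option 1: each frontend app is configured with the origin of its own dedicated DCR load balancer, so a slow or failing DCR stack is isolated to its own product. A minimal TypeScript sketch of that mapping, with hypothetical hostnames:

```typescript
// Option 1 sketch: one dedicated DCR load balancer per product.
// Hostnames are illustrative placeholders, not real origins.
const dcrLoadBalancers: Record<string, string> = {
  "/Article": "dcr-article-lb.example.internal",
  "/Front": "dcr-facia-lb.example.internal",
  "/Interactives": "dcr-applications-lb.example.internal",
};

function dcrOrigin(endpoint: string): string {
  const lb = dcrLoadBalancers[endpoint];
  if (!lb) throw new Error(`No DCR stack configured for ${endpoint}`);
  // An outage of one LB only affects requests for its own endpoint.
  return lb;
}
```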

Option 2: One LB for all DCR apps

```mermaid
---
title: Suggested request flow
---
graph LR
    Request-->Fastly
    Fastly-->Router
    Router-->FEArticleLB
    Router-->FEFaciaLB
    Router-->FEApplicationsLB
    subgraph Frontend
        FEArticleLB-->FEArticleApp
        FEFaciaLB-->FEFaciaApp
        FEApplicationsLB-->FEApplicationsApp
    end
    FEArticleApp--/Article-->DcrLB-->DcrArticleApp1
    FEFaciaApp--/Front-->DcrLB-->DcrFaciaApp2
    FEApplicationsApp--/Interactives-->DcrLB-->DcrApplicationsApp3
```
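Option 2 keeps a single shared load balancer and fans requests out to per-product target groups by path, in the style of ALB listener rules. A sketch of that matching logic; the target group names, patterns, and priorities are assumptions for illustration:

```typescript
// Option 2 sketch: one shared LB, ALB-style listener rules routing
// by path pattern to per-product target groups.
interface ListenerRule {
  priority: number; // lower number wins, as with ALB rules
  pathPattern: string; // trailing * treated as a prefix wildcard
  targetGroup: string;
}

const rules: ListenerRule[] = [
  { priority: 10, pathPattern: "/Article*", targetGroup: "dcr-article-tg" },
  { priority: 20, pathPattern: "/Front*", targetGroup: "dcr-facia-tg" },
  { priority: 30, pathPattern: "/Interactives*", targetGroup: "dcr-applications-tg" },
];

function route(path: string): string | undefined {
  const match = rules
    .filter((r) => path.startsWith(r.pathPattern.replace(/\*$/, "")))
    .sort((a, b) => a.priority - b.priority)[0];
  return match?.targetGroup; // undefined -> the LB's default action
}
```

The trade-off against Option 1 is that all products now share one load balancer's capacity and limits, which is part of why the traffic question matters.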

An example of where else we do this is in MAPI, via microservice CDK (thanks @JamieB-gu).

We had a chat with @akash1810, who suggested it would be good to talk to someone from AWS about which option would be better, considering the amount of traffic the load balancer(s) would receive.


Examples

Performance of one app affects another

  • We receive a blast of traffic to interactives
  • This locks up DCR's threads due to the size of the JSON being parsed for those articles
  • DCR's performance degrades
  • ⚠️ Point of failure: DCR articles and fronts are also served slowly

Traffic to one app affects another

  • We receive a blast of traffic to fronts
  • This goes through router ➡️ frontend/facia ➡️ DCR
  • DCR's performance degrades
  • ⚠️ Point of failure: DCR articles are also served slowly

Unnecessary scaling of services

  • Traffic to articles increases
  • frontend/article scales up
  • DCR in turn scales up
  • ⚠️ Point of failure: frontend/facia is now served by DCR capacity scaled up for article traffic

Error handling

  • We bork something on the /Article endpoint
  • This pushes 500s to frontend/article
  • These bubble through our request pipeline
  • The cache will catch a lot of this, but we will see a larger % of traffic to origin trying to get a valid response
  • ⚠️ Point of failure: the /Front endpoint slows down due to the massive increase in traffic
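Some rough arithmetic for the cache point above, assuming errored responses are not cacheable: the share of requests reaching origin jumps from the normal miss rate to the miss rate plus the errored fraction of traffic that would otherwise have been cache hits. The hit ratio and error rate below are illustrative assumptions, not measured values:

```typescript
// Sketch: fraction of total requests hitting origin when 500s
// poison cacheability. Inputs are illustrative assumptions.
function originRequestRatio(hitRatio: number, errorRate: number): number {
  const misses = 1 - hitRatio; // misses always go to origin
  // Requests that would have been hits, but whose response could not
  // be cached because origin returned a 500.
  const uncachedErrors = hitRatio * errorRate;
  return misses + uncachedErrors;
}

// With a 95% hit ratio and half of origin responses erroring,
// origin traffic grows from 5% to 52.5% of all requests.
```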
