
Spin up multiple AWS stacks to stop DCR being a single source of failure in our request path #8351

@jamesgorrie

Description

Depends on #7614
Depends on #9310

The issue

DCR as a service in our infrastructure design (AWS) is a single point of failure, as it serves content to multiple otherwise swimlaned micro-services: frontend/article, frontend/facia, and frontend/applications are all served via DCR's one service.

This creates a host of issues, of which we are already seeing real-world examples (illustrated below), and we should look to address them before they become more impactful as the apps rendering work sends more traffic to the service.

Current request flow

```mermaid
---
title: Current request flow
---
graph LR
    Request-->Fastly
    Fastly-->Router
    Router-->FEArticleLB
    Router-->FEFaciaLB
    Router-->FEApplicationsLB
    subgraph Frontend
        FEArticleLB-->FEArticleApp
        FEFaciaLB-->FEFaciaApp
        FEApplicationsLB-->FEApplicationsApp
    end
    FEArticleApp-->DcrLB
    FEFaciaApp-->DcrLB
    FEApplicationsApp-->DcrLB
    DcrLB-->DCR
```

Solutions

Co-Authored-By: @AshCorr
Co-Authored-By: @arelra

While we could create completely new apps for each service, that would mean a lot of upfront work before addressing the immediate issue of DCR being a bottleneck.

We have suggested that we stick to splitting the infrastructure first to mitigate the risks there. This would include:

## Tasklist
- [ ] https://github.com/guardian/dotcom-rendering/issues/9310
- [ ] https://github.com/guardian/dotcom-rendering/issues/9323
- [ ] https://github.com/guardian/dotcom-rendering/issues/9324
- [ ] https://github.com/guardian/dotcom-rendering/issues/9325
- [ ] Loading the config of those values into each service's SSM parameters for [rendering.baseURL](https://github.com/guardian/frontend/blob/main/common/app/common/configuration.scala#L152)
- [ ] Testing
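The SSM task above can be sketched as a small naming helper: each service would read its own `rendering.baseURL` from a per-service SSM path, so pointing frontend/article at its own DCR stack becomes a config change. The path scheme and stage names here are hypothetical, not the real Frontend layout:

```typescript
// Sketch only: hypothetical SSM parameter paths for per-service
// rendering.baseURL values, assuming one DCR stack per swimlane.
type Service = "article" | "facia" | "applications";

function renderingBaseUrlParam(stage: string, service: Service): string {
  // e.g. renderingBaseUrlParam("PROD", "article")
  //      -> "/frontend/PROD/article/rendering.baseURL"
  return `/frontend/${stage}/${service}/rendering.baseURL`;
}
```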

As mentioned above, @AshCorr has suggested that we move to CDK (#7614) first to make this easier and more seamless.

The ongoing work to make apps webviews available via DCR is already uncovering how we might architect the application itself towards a more micro-frontend structure, but that is out of scope for this issue.

Suggested request flow

Option 1: One LB per app

```mermaid
---
title: Suggested request flow
---
graph LR
    Request-->Fastly
    Fastly-->Router
    Router-->FEArticleLB
    Router-->FEFaciaLB
    Router-->FEApplicationsLB
    subgraph Frontend
        FEArticleLB-->FEArticleApp
        FEFaciaLB-->FEFaciaApp
        FEApplicationsLB-->FEApplicationsApp
    end
    FEArticleApp--/Article-->DcrArticleLB-->DcrApp1
    FEFaciaApp--/Front-->DcrFaciaLB-->DcrApp2
    FEApplicationsApp--/Interactives-->DcrApplicationLB-->DcrApp3
```
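One way to read Option 1: each frontend app is configured with the origin of its own dedicated DCR load balancer, so a slow or failing DCR stack is isolated to its own product. A minimal TypeScript sketch of that mapping, with hypothetical hostnames:

```typescript
// Option 1 sketch: one dedicated DCR load balancer per product.
// Hostnames are illustrative placeholders, not real origins.
const dcrLoadBalancers: Record<string, string> = {
  "/Article": "dcr-article-lb.example.internal",
  "/Front": "dcr-facia-lb.example.internal",
  "/Interactives": "dcr-applications-lb.example.internal",
};

function dcrOrigin(endpoint: string): string {
  const lb = dcrLoadBalancers[endpoint];
  if (!lb) throw new Error(`No DCR stack configured for ${endpoint}`);
  // An outage of one LB only affects requests for its own endpoint.
  return lb;
}
```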

Option 2: One LB for all DCR apps

```mermaid
---
title: Suggested request flow
---
graph LR
    Request-->Fastly
    Fastly-->Router
    Router-->FEArticleLB
    Router-->FEFaciaLB
    Router-->FEApplicationsLB
    subgraph Frontend
        FEArticleLB-->FEArticleApp
        FEFaciaLB-->FEFaciaApp
        FEApplicationsLB-->FEApplicationsApp
    end
    FEArticleApp--/Article-->DcrLB-->DcrArticleApp1
    FEFaciaApp--/Front-->DcrLB-->DcrFaciaApp2
    FEApplicationsApp--/Interactives-->DcrLB-->DcrApplicationsApp3
```
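Option 2 keeps a single shared load balancer and fans requests out to per-product target groups by path, in the style of ALB listener rules. A sketch of that matching logic; the target group names, patterns, and priorities are assumptions for illustration:

```typescript
// Option 2 sketch: one shared LB, ALB-style listener rules routing
// by path pattern to per-product target groups.
interface ListenerRule {
  priority: number; // lower number wins, as with ALB rules
  pathPattern: string; // trailing * treated as a prefix wildcard
  targetGroup: string;
}

const rules: ListenerRule[] = [
  { priority: 10, pathPattern: "/Article*", targetGroup: "dcr-article-tg" },
  { priority: 20, pathPattern: "/Front*", targetGroup: "dcr-facia-tg" },
  { priority: 30, pathPattern: "/Interactives*", targetGroup: "dcr-applications-tg" },
];

function route(path: string): string | undefined {
  const match = rules
    .filter((r) => path.startsWith(r.pathPattern.replace(/\*$/, "")))
    .sort((a, b) => a.priority - b.priority)[0];
  return match?.targetGroup; // undefined -> the LB's default action
}
```

The trade-off against Option 1 is that all products now share one load balancer's capacity and limits, which is part of why the traffic question matters.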

An example of where else we do this is in MAPI, via microservice CDK (thanks @JamieB-gu).

We had a chat with @akash1810, who suggested it would be good to talk to someone from AWS about which option would be better, considering the amount of traffic the load balancer(s) would receive.


Examples

Performance of one app affects another

  • We receive a blast of traffic to interactives
  • This locks up DCR's threads due to the size of the JSON being parsed for those articles
  • DCR's performance degrades
  • ⚠️ Point of failure: DCR articles and fronts are also served slowly

Traffic to one app affects another

  • We receive a blast of traffic to fronts
  • This goes through router ➡️ frontend/facia ➡️ DCR
  • DCR's performance degrades
  • ⚠️ Point of failure: DCR articles are also served slowly

Unnecessary scaling of services

  • Traffic to articles increases
  • frontend/article scales up
  • DCR in turn scales up
  • ⚠️ Point of failure: frontend/facia is now served by DCR capacity scaled up for article traffic

Error handling

  • We bork something on the /Article endpoint
  • This pushes 500s to frontend/article
  • These bubble through our request pipeline
  • The cache will catch a lot of this, but we will see a larger % of traffic to origin trying to get a valid response
  • ⚠️ Point of failure: the /Front endpoint slows down due to the massive increase in traffic
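Some rough arithmetic for the cache point above, assuming errored responses are not cacheable: the share of requests reaching origin jumps from the normal miss rate to the miss rate plus the errored fraction of traffic that would otherwise have been cache hits. The hit ratio and error rate below are illustrative assumptions, not measured values:

```typescript
// Sketch: fraction of total requests hitting origin when 500s
// poison cacheability. Inputs are illustrative assumptions.
function originRequestRatio(hitRatio: number, errorRate: number): number {
  const misses = 1 - hitRatio; // misses always go to origin
  // Requests that would have been hits, but whose response could not
  // be cached because origin returned a 500.
  const uncachedErrors = hitRatio * errorRate;
  return misses + uncachedErrors;
}

// With a 95% hit ratio and half of origin responses erroring,
// origin traffic grows from 5% to 52.5% of all requests.
```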
