Currently we use the customer retry policy specified in the function configuration for alloc failures caused by customer errors.
We don't increment retry counters on infra failures (such as internal errors). This is the right behavior, but it results in infinite retries during prolonged infra issues.
We need a separate retry policy with a separate alloc retry counter for infra failures. We first need to decide on the policy. This is a product decision with trade-offs around the UX we want for customers during, e.g., prolonged infra issues.
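
To make the discussion concrete, here is a minimal sketch of what two independent counters could look like. Every name in it (`FailureKind`, `RetryPolicy`, `AllocRetryState`, `should_retry`) is hypothetical and not taken from the codebase, and the simple `max_retries` cap for infra failures is only a placeholder, since the actual infra policy is exactly what still needs to be decided.

```python
from dataclasses import dataclass
from enum import Enum


class FailureKind(Enum):
    """Hypothetical classification of an alloc failure."""
    CUSTOMER = "customer"  # e.g. an error in the customer's function
    INFRA = "infra"        # e.g. an internal error


@dataclass
class RetryPolicy:
    """Hypothetical policy: only a retry cap, for illustration."""
    max_retries: int


@dataclass
class AllocRetryState:
    """Separate counters so infra failures never consume customer retries."""
    customer_retries: int = 0
    infra_retries: int = 0


def should_retry(
    state: AllocRetryState,
    kind: FailureKind,
    customer_policy: RetryPolicy,
    infra_policy: RetryPolicy,
) -> bool:
    """Return True and bump the matching counter if a retry is still allowed."""
    if kind is FailureKind.CUSTOMER:
        if state.customer_retries >= customer_policy.max_retries:
            return False
        state.customer_retries += 1
        return True

    # Infra failure: bounded by its own policy instead of retrying forever.
    if state.infra_retries >= infra_policy.max_retries:
        return False
    state.infra_retries += 1
    return True
```

With this shape, an infra outage exhausts only the infra counter, so the customer's configured retries are still fully available once the infrastructure recovers. Whether the infra limit should be a count, a time budget, or something else is part of the open product decision.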