Resiliency to a single degraded Availability Zone #939

@46bit

This issue is not a bug in cf-deployment, but it's to discuss solving a common incident for CF operators.

What is this issue about?

Several Cloud Foundry users have had outages when a single Availability Zone (AZ) experienced a partial failure. Partial failures such as degraded networking are far more common in the cloud than complete AZ outages.

Cloud Foundry is engineered to run in multiple AZs, but not to handle a single degraded AZ. When an AZ is only degraded, Cloud Foundry keeps directing new app instances and new web requests into it. These requests will be slow or fail. This makes Cloud Foundry partly down for its users, and right now operators have few good options.

What can CF operators do right now?

Neither of the options we're aware of is very good.

Very slow: You can edit the CF manifest and do a new BOSH deploy that doesn't have VMs in the affected AZ. This is far too slow as the BOSH deploy could take an entire day for the largest CF platforms.

Slow/manual: You can choose to manually block all network traffic into the degraded Availability Zone, for instance using firewall rules. This is the approach being used by GOV.UK PaaS. This has the advantage of being very simple, but it's not seen as automated or fast enough for SAP's needs.

What do you propose?

At SAP, we think the best solution is for each VM to monitor its own health. For instance, an operator could configure a list of network checks. If too many of the checks fail, the VM would drain itself and kill the BOSH agent. This could also be part of Diego, and trigger a call to Rep's evacuate endpoint.
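
To make the idea concrete, here is a rough sketch of the per-VM decision, not a proposed implementation. Everything named here is hypothetical: the `TcpCheck` shape, the `max_failure_ratio` threshold, and the `drain` callback (which in practice might stop the BOSH agent or call Rep's evacuate endpoint) are assumptions for illustration only.

```python
import socket
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class TcpCheck:
    """One operator-configured network check (hypothetical shape)."""
    host: str
    port: int
    timeout_seconds: float = 2.0

    def passes(self) -> bool:
        # The check passes if a TCP connection opens within the timeout.
        try:
            with socket.create_connection(
                (self.host, self.port), timeout=self.timeout_seconds
            ):
                return True
        except OSError:
            # Covers refused connections, timeouts and DNS failures.
            return False


def should_drain(results: Iterable[bool], max_failure_ratio: float) -> bool:
    """Return True when the share of failed checks exceeds the threshold."""
    outcomes = list(results)
    if not outcomes:
        # No checks configured: never drain without evidence.
        return False
    failures = sum(1 for ok in outcomes if not ok)
    return failures / len(outcomes) > max_failure_ratio


def run_once(checks, max_failure_ratio: float, drain: Callable[[], None]) -> bool:
    """Run every check once and call the (assumed) drain hook if needed."""
    unhealthy = should_drain((c.passes() for c in checks), max_failure_ratio)
    if unhealthy:
        drain()
    return unhealthy
```

With a threshold of 0.5, a VM where two of three checks fail would drain, while a VM where one of three fails would keep serving.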

This solution can cope with more than just degraded AZs, as it would drain individual degraded servers (e.g. failing racks).

Badly chosen checks could make cells drain themselves wrongly and lead to CF downtime. This would probably be an optional feature, so the CF operator would be able to choose good network resources to check (e.g. a combination of the CF API, S3, etc.).
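
One possible way to reduce the risk of wrongful drains, which is an assumption on top of the proposal rather than part of it, is to require several consecutive unhealthy check rounds before draining. The `DrainGuard` name and its default of three rounds are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DrainGuard:
    """Only signal a drain after N unhealthy rounds in a row (hypothetical)."""
    required_consecutive_failures: int = 3
    _streak: int = field(default=0, init=False)

    def observe(self, round_unhealthy: bool) -> bool:
        # A single healthy round resets the streak, so brief blips
        # in a check target do not accumulate towards a drain.
        if round_unhealthy:
            self._streak += 1
        else:
            self._streak = 0
        return self._streak >= self.required_consecutive_failures
```

The trade-off is slower reaction: with three required rounds and a 10-second interval, a genuinely degraded VM would keep receiving traffic for roughly 30 seconds before it drains.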

Tag your pair, your PM, and/or team!

Working on this with @h0nlg at SAP. Briefly talked about this with @rkoster and @AP-Hunt.
