βββββββββββββββββββββββββββ
β β β β
β βββββ βββββ βββββ β
β ββ΄ββββ΄βββ΄ββββ΄βββ΄ββββ΄β β
β β β‘οΈ ββ β‘οΈ ββ β‘οΈ β β
β β βββ βββββββββ βββ β β
β β β β ββββ¬ββ¬βββ β β β β
β βββ΄ββ΄βββ β β ββββ΄ββ΄ββ β
β ββββββββ βββ ββββββββ β
β ββ¬ββββ¬βββ¬ββββ¬βββ¬ββββ¬β β
β βββββ βββββ βββββ β
β β β β
ββββββ¬ββ¬βββββ¬ββ¬βββββ¬ββ¬βββββ
ββ¬β ββ¬β ββ¬β
ββ β² ββ
ββββββββ΄ββββββ΄ββββββ΄βββββββ
β Breaker Box β
βββββββββββββββββββββββββββ
A collection of the circuit breaker pattern implemented in different programming languages.
A circuit breaker automatically detects failures in dependencies of a system. When a dependency is detected as failing, the circuit breaker prevents traffic from being sent to the service, while periodically sampling a small percentage of traffic to verify if the dependent service has recovered. This prevents cascading failures that can be caused by continuing to send traffic to a failing service.
A circuit breaker can help protect systems against:
- retry storms caused by services that retry failing requests
- continuing to add load on a service that is exhausted of resources
It also provides other benefits such as:
- avoiding unnecessary latency by failing fast
- signalling to dependent services that the service is unavailable so they can respond appropriately
βββββββββββββββββββββ βββββββββββββββββββββ βββββββββββββββββββββ
β β β β β β
β Request β β Circuit Breaker β β Service β
β β β β β β
βββββββββββββββββββββ βββββββββββββββββββββ βββββββββββββββββββββ
β β β
β β β
Normal Closed State β β β β β β βββ β β β β β β β β β β βββ β β β β β
β β β
β β β βββββββββββββββ β β
β β β State = β β
β ββββββββββββββββββββββββΆββ ββ Closed β β β
β β βββββββββββββββ β
β β β β β
β ββββββββββββββββββββββββΆβ
β β β β β
β β Response β
β β βββββββββββββββββββββββββ€ β
β β β
β β Response β β β
βββββββββββββββββββββββββ€ β
β β β β β
β β β
β β β β β
β β β β β¬ β β β β β β β β β β β β¬ β β β β β β β β β β β β¬ β β β β β
β β β
Failing Closed Stateβ β β β β β βββ β β β β β β β β β β βββ β β β β β
β β β
β β β βββββββββββββββ β β
β β β State = β β
β ββββββββββββββββββββββββΆββ ββ Closed β β β
β β βββββββββββββββ β
β β β β β
β ββββββββββββββββββββββββΆβ
β β β β β
β β Error β
β β βββββββββββββββββββββββββ€ β
β ββ β
β β Error β β βββββββββββββββ β β
βββββββββββββββββββββββββ€ β Increment β β
β β β β ββ Error Count β β β
β β βββββββββββββββ β
β β β β β
β β β β β¬ β β β β β β β β β β β β¬ β β β β β β β β β β β β¬ β β β β β
β β β
Transition Open State β β β β β βββ β β β β β β β β β β βββ β β β β β
β β β
β β β βββββββββββββββ β β
β β β State = β β
β ββββββββββββββββββββββββΆββ ββ Closed β β β
β β βββββββββββββββ β
β β β β β
β ββββββββββββββββββββββββΆβ
β β β β β
β β Error β
β β βββββββββββββββββββββββββ€ β
β ββ β
β β Unavailable β β ββββββββββββββββββββ΄ββββββ β
βββββββββββββββββββββββββ€ βError Count > Threshold?β
β β β β ββ Set State = Open β β
β β β Set Timeout β
β β β ββββββββββββββββββββ¬ββββββ β
β β β β β¬ β β β β β β β β β β β β¬ β β β β β β β β β β β β¬ β β β β β
β β β
Open Stateβ β β β β β β β β β β βββ β β β β β β β β β β βββ β β β β β
β β β
β β β βββββββββββββββ β β
β β β State = β β
β ββββββββββββββββββββββββΆββ ββ Open β β β
β β βββββββββββββββ β
β β Unavailable β β β
βββββββββββββββββββββββββ€ β
β β β β β
β β β
β β β β β
β β β β β¬ β β β β β β β β β β β β¬ β β β β β β β β β β β β¬ β β β β β
β β β
Transition Half Open Stateβ β β βββ β β β β β β β β β β βββ β β β β β
β β β
β β β βββββββββββββββββββββ΄βββ β
β β β State = Open β
β ββββββββββββββββββββββββΆββ ββ Timeout = Expired β β
β β βSet State = Half Open β
β β β βββββββββββββββββββββ¬βββ β
β ββββββββββββββββββββββββΆβ
β β β β β
β β β β β β β β β β β΄ β β β β β β β β β β
β β β β
β β Sample traffic to evaluate if β β
β β service is still unavailable. β β
β β See below examples β β
β β β β
β β β β β β β β β β β¬ β β β β β β β β β β
β β β β β
β β β β β¬ β β β β β β β β β β β β¬ β β β β β β β β β β β β¬ β β β β β
β β β
Half Open State Failβ β β β β β βββ β β β β β β β β β β βββ β β β β β
β β β
β β β βββββββββββββββ β β
β β β State = β β
β ββββββββββββββββββββββββΆββ ββ Half Open β β β
β β βββββββββββββββ β
β β β β β
β ββββββββββββββββββββββββΆβ
β β β β β
β β Error β
β β βββββββββββββββββββββββββ€ β
β ββ β
β β Unavailable β β ββββββββββββββββββ β β
βββββββββββββββββββββββββ€ βSet State = Openβ β
β β β β ββ Set Timeout β β β
β β ββββββββββββββββββ β
β β β β β
β β β β β¬ β β β β β β β β β β β β¬ β β β β β β β β β β β β¬ β β β β β
β β β
Half Open State Success β β β β βββ β β β β β β β β β β βββ β β β β β
β β β
β β β βββββββββββββββ β β
β β β State = β β
β ββββββββββββββββββββββββΆββ ββ Half Open β β β
β β βββββββββββββββ β
β β β β β
β ββββββββββββββββββββββββΆβ
β β β β β
β β Response β
β β βββββββββββββββββββββββββ€ β
β ββ ββββββββββββββββββββ΄βββββββββ
β β Response β β β Increment Consecutive β β
βββββββββββββββββββββββββ€ β Success Count β
β β β β ββ Success threshold met? β β
β β β Set State = Closed β
β β β ββββββββββββββββββββ¬βββββββββ β
β β β β β¬ β β β β β β β β β β β β¬ β β β β β β β β β β β β¬ β β β β β
β β β
β β β
βββββββββββββββββββββ βββββββββββββββββββββ βββββββββββββββββββββ
β β β β β β
β Request β β Circuit Breaker β β Service β
β β β β β β
βββββββββββββββββββββ βββββββββββββββββββββ βββββββββββββββββββββ
Notes on Half Open behavior: For our implementation, we transition the circuit breaker from Open to HalfOpen after a set duration the circuit has been open. When in the HalfOpen state, the circuit routes traffic to the dependent service. If N number of consecutive successes occur, the circuit will move to Closed. If any errors occur the circuit will move back to Open and the wait duration for the next HalfOpen attempt is reset.
I wasn't satisfied with existing circuit breaker implementations as their approaches for detecting if a service was unavailable were too simplistic eg using a fixed error count threshold or time since last successful response.
For this implementation my goals were to be able to:
- intelligently distinguish between error increases due to traffic spikes and genuine service degradation
- respond quickly to service degradations
- configure the sensitivity of what causes the circuit breaker to trip. For example a 5% error rate in a service may want to trip the circuit, however a less critical service may want to wait until a 40% error rate.
- Record events within a configurable time window
- Automatically expire data as the time window shifts
- This sliding window approach means we can detect changes quickly and accurately, as we are always retaining the short term history data
- The circuit breaker may need to process and store hundreds of thousands of events for high traffic services, so the solution must be optimised
Active Cursor β β β β β Node
βββββββββββββββββββ βββββββββββββββββββ
β β ErrorCount int β β β ErrorCount int β β
β TotalCount int ββββββββββΆβ TotalCount int β β
β β Expires time β β β Expires time β β
βββββββββββββββββββ βββββββββββββββββββ β
β β β β β ββ²β β β β β β β β
β β β
Node β Node βΌ β
β² βββββββββββββββββββ βββββββββββββββββββ β
β β ErrorCount int β β ErrorCount int β β
β β TotalCount int β β TotalCount int β β
β β Expires time β β Expires time β β
β βββββββββββββββββββ βββββββββββββββββββ β
β β² β β
β β β β
β Node β Node βΌ β
β βββββββββββββββββββ βββββββββββββββββββ β
β β ErrorCount int β β ErrorCount int β β
β β TotalCount int β ββββββ β TotalCount int β β
β β Expires time β β Expires time β β
β βββββββββββββββββββ βββββββββββββββββββ β
β β
βββββββββββββββββββEvaluation Windowβββββββββββββββββββββ
The approach I chose was to use a fixed size, time based ring buffer. A ring buffer is a linked list, where the last item(tail) has a pointer to the head. In most cases we can make use of languages inbuilt libraries for achieving this, such as Goβs standard library container/ring.
The active cursor is used to store incoming events, and calculating the circuit breaker state is done by traversing all nodes, excluding the active cursor, to calculate the overall error rate based on recent history of traffic.
The fixed size, time based ring buffer is a great approach for performance. In the above diagram, updating the circuit breaker state can be done in constant time O(1). Traversing the nodes in the buffer operates in linear time O(n), however this is mitigated by the fact that the size of the ring buffer will be small. There are limited use cases where the ring would exceed 10 nodes. The ring maintains a fixed size, and incoming data simply overwrites expired data, meaning memory will not increase with additional traffic.