congestion_control::Component (aka CC) limits the number of active requests. CC has an RPS (requests per second) limit mechanism that turns on and off automatically depending on the workload of the main task processor. In case of overload, CC responds to some requests with HTTP 429 codes, allowing your service to properly process the rest. The RPS limit is determined by a heuristic algorithm inside CC. All the significant parts of the component are configured by the dynamic config options USERVER_RPS_CCONTROL and USERVER_RPS_CCONTROL_ENABLED.
CC can run in fake-mode, in which no RPS limit is applied but the state machine (FSM) keeps working. CC goes into fake-mode in the following cases:
Fake-mode can be useful for more flexible traffic restriction settings based on more complex logic, which can be implemented in a middleware.
congestion_control::Component can be useful if your service, when overloaded, stops handling requests, significantly increases response times, responds to requests with HTTP 500 codes, or consumes excessive memory.
Including CC in your service will help you handle a reasonable part of the request flow, returning HTTP error codes for the rest.
congestion_control::Component cannot be useful if:
true the value of USERVER_RPS_CCONTROL_ENABLED dynamic config.

It is a good idea to disable congestion_control::Component in unit tests to avoid getting HTTP 429 on an overloaded CI server.
The Congestion Control logic is implemented as sensors (overloads_ps, rps) and a state machine with variables.
Sensor Data:
- overloads_ps – Number of tasks in the last second that waited in the execution queue for more than USERVER_TASK_PROCESSOR_QOS.default-service.default-task-processor.wait_queue_overload.sensor_time_limit_us microseconds (default: 3 ms).
- rps – Number of requests received in the last second.

State Machine Variables:
- is_overloaded – Whether the server is stably under unsustainable load.
- is_overloaded_now – Whether the server is currently under unsustainable load.
- current_limit – Current RPS limit.

The congestion control state machine has 5 states:
- current_limit
- is_overloaded=true, is_overloaded_now=true – The service is currently overloaded and has been overloaded for a long time. Congestion Control decreases the service RPS limit by down_rate_percent%.
- is_overloaded=true, is_overloaded_now=false – The service is not overloaded now but was overloaded very recently.
- is_overloaded=false, is_overloaded_now=true – The service is currently overloaded but was minimally overloaded recently.
- is_overloaded=false, is_overloaded_now=false – The service is not overloaded now and was minimally overloaded recently. Congestion Control increases the service RPS limit by up_rate_percent%.

Semantics of the is_overloaded flag:
- When is_overloaded=true, the service either decreases the RPS limit or waits (depending on is_overloaded_now).
- When is_overloaded=false, the service either increases the RPS limit or waits (depending on is_overloaded_now).

Key State Transitions:
- is_overloaded becomes false when overloads_ps <= down-level holds for overload-off seconds (i.e., the service experienced no overload for overload-off consecutive seconds).
- is_overloaded becomes true when overloads_ps > up-level holds for overload-on seconds (i.e., the service experienced overload for overload-on consecutive seconds).

Transitions between is_overloaded_now states:
- When is_overloaded=true, compute is_overloaded_now = overloads_ps > down-level.
- When is_overloaded=false, compute is_overloaded_now = overloads_ps > up-level.

Purpose of splitting is_overloaded and is_overloaded_now: This design mitigates flapping. Services may frequently execute isolated long tasks that monopolize the task processor for significant periods without critically impacting performance (wait times are negligible). At the same time, the design ensures that the limit is increased progressively when many tasks experience low wait times.
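The transitions described above can be sketched as a small once-per-second update function. This is a minimal illustration, not the userver implementation: the `Settings`, `State`, and `Step` names, the two counters, and all default values are assumptions made for the example. The sketch also assumes that `current_limit` has already been initialized by the caller, and it models only the four flag combinations, not the remaining state.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical settings; the default values are illustrative only.
struct Settings {
    std::uint64_t up_level = 2;    // "up-level" threshold for overloads_ps
    std::uint64_t down_level = 1;  // "down-level" threshold for overloads_ps
    int overload_on_seconds = 3;   // "overload-on"
    int overload_off_seconds = 3;  // "overload-off"
    double up_rate_percent = 2.0;
    double down_rate_percent = 2.0;
};

// Hypothetical state: the three variables from the text plus two counters
// of consecutive seconds, needed to detect the "holds for N seconds" rules.
struct State {
    bool is_overloaded = false;
    bool is_overloaded_now = false;
    double current_limit = 0.0;
    int seconds_above_up = 0;
    int seconds_at_or_below_down = 0;
};

// Advances the state machine by one second using the overloads_ps sensor.
void Step(State& s, const Settings& cfg, std::uint64_t overloads_ps) {
    // Hysteresis only makes sense if the lower threshold is not above the upper one.
    assert(cfg.down_level <= cfg.up_level);

    // Count consecutive seconds above up-level and at or below down-level.
    s.seconds_above_up =
        overloads_ps > cfg.up_level ? s.seconds_above_up + 1 : 0;
    s.seconds_at_or_below_down =
        overloads_ps <= cfg.down_level ? s.seconds_at_or_below_down + 1 : 0;

    // Slow flag: flips only after the condition held long enough.
    if (!s.is_overloaded && s.seconds_above_up >= cfg.overload_on_seconds) {
        s.is_overloaded = true;
    } else if (s.is_overloaded &&
               s.seconds_at_or_below_down >= cfg.overload_off_seconds) {
        s.is_overloaded = false;
    }

    // Fast flag: the threshold depends on the slow flag.
    s.is_overloaded_now =
        overloads_ps > (s.is_overloaded ? cfg.down_level : cfg.up_level);

    // Adjust the limit only when both flags agree; otherwise wait.
    if (s.is_overloaded && s.is_overloaded_now) {
        s.current_limit *= 1.0 - cfg.down_rate_percent / 100.0;
    } else if (!s.is_overloaded && !s.is_overloaded_now) {
        s.current_limit *= 1.0 + cfg.up_rate_percent / 100.0;
    }
}
```

Calling `Step` with a high `overloads_ps` for `overload_on_seconds` consecutive seconds switches `is_overloaded` to true and starts lowering `current_limit`; a single isolated spike only sets `is_overloaded_now` and leaves the limit unchanged, which is the anti-flapping behavior the split is meant to provide.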
In some situations default settings are ineffective. For example:
In those situations, congestion_control::Component settings need adjusting.
Basic dynamic configuration options:
- default-service.default-task-processor.wait_queue_overload.sensor_time_limit_us. This setting defines the time a task waits in the queue after which overload events for RPS congestion control are generated. It is recommended to set it to at least 2000 (2 ms), because the time unit of the system scheduler (CFS) is 2 ms by default.

If the RPS limit mechanism is triggered, it is recommended to make sure that this is not a mistake. If the triggering coincided with peak CPU consumption, then there is no mistake, and the lack of resources needs to be resolved:
If the RPS limit triggering did not coincide with peak CPU consumption, then there is no lack of resources but a different kind of problem. Most likely your service has synchronous operations that block the coroutine flow. If this is the case, you need to either: