Chaos Testing

A system cannot achieve high availability unless it tolerates failures. In microservices and other distributed network architectures, the hardware and the network themselves are sources of failures.

Luckily, userver provides a network proxy that simulates network errors and whose behavior can be controlled from functional tests. In other words, it is a tool for chaos testing.

To implement the tests, pytest_userver.chaos.TcpGate is placed between the client and the server. The proxy can then be used to test the fault tolerance of both the client and the server. The following network conditions can be simulated:

Network condition | Possible reasons for the behavior in production
Server doesn't accept new connections | Server is overloaded; network misconfigured
Client/server does not read data from socket | Client/server is overloaded; deadlock in client/server thread; OS lost the data on socket due to overload
Client/server reads data, but does not respond | Client/server is overloaded; deadlock in client/server thread
Client/server closes connections | Client/server restart; client/server is overloaded and shuts down connections
Client/server closes the socket when it receives data | Client/server congestion control in action
Client/server corrupts the response data | Stack overflow or other memory corruption at client/server side
Network works with send/receive delays | Poorly performing network; interference in the network
Limited bandwidth | Poorly performing network
Connections close on timeout | Timeouts on client/server/network interactions imposed by the client/server provider
Data slicing | Interference in the network; poor router configuration
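As a rough orientation, the conditions in the table can be paired with gate methods. The sketch below is an assumption rather than an authoritative reference: only the to_*_corrupt_data, to_*_close_on_data, to_*_pass and start_accepting calls appear on this page, and the remaining method names should be verified against the pytest_userver.chaos API reference. The dictionary and the helper gate_calls_for are hypothetical names introduced for illustration.

```python
# Assumed correspondence between the table rows and chaos.TcpGate methods.
# Names not shown elsewhere on this page are assumptions; check them against
# the pytest_userver.chaos reference before relying on them.
GATE_CALLS_BY_CONDITION = {
    "Server doesn't accept new connections": ('stop_accepting',),
    'Client/server does not read data from socket': (
        'to_client_noop', 'to_server_noop',
    ),
    'Client/server reads data, but does not respond': (
        'to_client_drop', 'to_server_drop',
    ),
    'Client/server closes connections': ('sockets_close',),
    'Client/server closes the socket when it receives data': (
        'to_client_close_on_data', 'to_server_close_on_data',
    ),
    'Client/server corrupts the response data': (
        'to_client_corrupt_data', 'to_server_corrupt_data',
    ),
    'Network works with send/receive delays': (
        'to_client_delay', 'to_server_delay',
    ),
    'Limited bandwidth': ('to_client_limit_bps', 'to_server_limit_bps'),
    'Connections close on timeout': (
        'to_client_limit_time', 'to_server_limit_time',
    ),
    'Data slicing': ('to_client_smaller_parts', 'to_server_smaller_parts'),
}


def gate_calls_for(condition):
    """Return the gate method names assumed to simulate `condition`."""
    return GATE_CALLS_BY_CONDITION[condition]
```

Some of these methods take arguments (for example a delay or a size limit), so the mapping only tells you where to look, not how to call them.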

chaos.TcpGate

Keep in mind the following picture:

Client <-> (client) chaos.TcpGate (server) <-> Server

chaos.TcpGate allows you to simulate various network conditions on the client side and the server side independently. For example, to corrupt data that goes from Client to Server, use gate.to_server_corrupt_data(). To corrupt data that goes from Server to Client, use gate.to_client_corrupt_data().

To restore the normal, failure-free behavior, use gate.to_*_pass().
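To make the per-direction idea concrete, here is a minimal sketch of a toy proxy built only on the standard asyncio library. It is not the chaos.TcpGate implementation: ToyGate, its internals and the demo function are hypothetical, and only the method names to_server_corrupt_data and to_*_pass mirror the calls mentioned above. Each direction has its own interceptor, so corrupting Client to Server traffic leaves Server to Client traffic untouched.

```python
import asyncio


class ToyGate:
    """Toy TCP proxy (hypothetical, not chaos.TcpGate) with one
    switchable interceptor per traffic direction."""

    def __init__(self, server_host, server_port):
        self._server = (server_host, server_port)
        self._to_server = self._pass
        self._to_client = self._pass
        self._listener = None

    @staticmethod
    def _pass(data):
        return data

    @staticmethod
    def _corrupt(data):
        # Flip every bit so the payload is guaranteed to differ.
        return bytes(b ^ 0xFF for b in data)

    def to_server_pass(self):
        self._to_server = self._pass

    def to_client_pass(self):
        self._to_client = self._pass

    def to_server_corrupt_data(self):
        self._to_server = self._corrupt

    def to_client_corrupt_data(self):
        self._to_client = self._corrupt

    async def start(self):
        # Port 0 lets the OS pick a free port, which is returned.
        self._listener = await asyncio.start_server(
            self._on_client, '127.0.0.1', 0,
        )
        return self._listener.sockets[0].getsockname()[1]

    def stop(self):
        self._listener.close()

    async def _on_client(self, client_reader, client_writer):
        server_reader, server_writer = await asyncio.open_connection(
            *self._server,
        )
        await asyncio.gather(
            self._pump(client_reader, server_writer, '_to_server'),
            self._pump(server_reader, client_writer, '_to_client'),
        )

    async def _pump(self, reader, writer, interceptor_name):
        try:
            while data := await reader.read(4096):
                # Look the interceptor up per chunk so that switching
                # it affects connections that are already open.
                interceptor = getattr(self, interceptor_name)
                writer.write(interceptor(data))
                await writer.drain()
        finally:
            writer.close()


async def demo():
    """Send b'ping' through the gate to an echo server three times:
    pass, corrupt to-server traffic, pass again."""

    async def echo(reader, writer):
        while data := await reader.read(4096):
            writer.write(data)
            await writer.drain()
        writer.close()

    server = await asyncio.start_server(echo, '127.0.0.1', 0)
    gate = ToyGate('127.0.0.1', server.sockets[0].getsockname()[1])
    gate_port = await gate.start()

    reader, writer = await asyncio.open_connection('127.0.0.1', gate_port)
    replies = []
    for switch in (
        gate.to_server_pass, gate.to_server_corrupt_data, gate.to_server_pass,
    ):
        switch()
        writer.write(b'ping')
        await writer.drain()
        replies.append(await reader.readexactly(4))

    writer.close()
    gate.stop()
    server.close()
    return replies
```

Running asyncio.run(demo()) yields an intact reply, then a corrupted one, then an intact one again, which is the same switch-and-restore pattern the real gate is driven with in tests.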

Usage Sample

Import chaos to use chaos.TcpGate:

from pytest_userver import chaos

The proxy should usually be started before the service and keep working between the service test runs. To achieve that, make a fixture with the session scope and add it to the service fixture dependencies.

An example of how to set up the proxy between the service and a PostgreSQL database:

import pytest

from testsuite.databases.pgsql import discover


@pytest.fixture(name='for_client_gate_port', scope='session')
def _for_client_gate_port():
    # Port on which the proxy accepts connections from the service
    return 11433


@pytest.fixture(scope='session')
def pgsql_local(service_source_dir, pgsql_local_create):
    databases = discover.find_schemas(
        'pg', [service_source_dir.joinpath('schemas/postgresql')],
    )
    return pgsql_local_create(list(databases.values()))


@pytest.fixture(scope='session')
async def _gate_started(loop, for_client_gate_port, pgsql_local):
    gate_config = chaos.GateRoute(
        name='postgres proxy',
        host_for_client='localhost',
        port_for_client=for_client_gate_port,
        host_to_server='localhost',
        port_to_server=pgsql_local['key_value'].port,
    )
    # The gate listens on port_for_client and forwards to PostgreSQL
    async with chaos.TcpGate(gate_config, loop) as proxy:
        yield proxy


@pytest.fixture
def client_deps(_gate_started, pgsql):
    pass

Between test runs the proxy should be reset to its default state, in which connections are accepted and data passes both from client to server and from server to client:

@pytest.fixture(name='gate')
async def _gate_ready(service_client, _gate_started):
    # Undo any failure mode left over from the previous test
    _gate_started.to_server_pass()
    _gate_started.to_client_pass()
    _gate_started.start_accepting()
    # Do not start the test until the service has reconnected
    await _gate_started.wait_for_connections()
    yield _gate_started

After that, everything is ready to write chaos tests:

async def test_pg_congestion_control(service_client, gate):
    gate.to_server_close_on_data()
    gate.to_client_close_on_data()

    for _ in range(gate.connections_count()):
        response = await service_client.get('/chaos/postgres?type=fill')
        assert response.status == 500

    await _check_that_restores(service_client, gate)

Note that the most important part of the test is to make sure that the service restores functionality after the failures. Wait for connections to the proxy and make sure that the service responses are fine:

import asyncio


async def _check_that_restores(service_client, gate):
    gate.to_server_pass()
    gate.to_client_pass()
    gate.start_accepting()

    try:
        await gate.wait_for_connections(timeout=30.0)
    except asyncio.TimeoutError:
        assert False, 'Timeout while waiting for restore'

    assert gate.connections_count() >= 1

    # Give the service a few attempts to become healthy again
    for _ in range(10):
        response = await service_client.get('/chaos/postgres?type=fill')
        if response.status == 200:
            return
        await asyncio.sleep(0.75)

    assert False, 'Bad results after connection restore'
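The retry loop at the end of the recovery check can be factored into a reusable helper when several chaos tests need it. The following is a minimal sketch under stated assumptions: wait_until_healthy and its probe argument are hypothetical names, not part of pytest_userver, and the defaults simply mirror the 10 attempts and 0.75 second pause used above.

```python
import asyncio


async def wait_until_healthy(probe, *, attempts=10, delay=0.75):
    """Await `probe()` until it returns a truthy value.

    Returns the 1-based number of the attempt that succeeded and
    raises AssertionError if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        if await probe():
            return attempt
        # Pause before the next attempt to give the service time to recover.
        await asyncio.sleep(delay)
    raise AssertionError(
        f'Service did not recover after {attempts} attempts',
    )
```

In a test, the probe would typically be a small coroutine function that issues the request through service_client and returns whether the response status is 200.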

Full Sources