A system cannot achieve high availability if it does not tolerate failures. In microservices and other distributed network architectures, the hardware and the network themselves are sources of failures.
Luckily, userver provides a network proxy that simulates network errors and whose behavior can be controlled from functional tests. In other words, it is a tool for chaos testing.
To implement the tests, pytest_userver.chaos.TcpGate or pytest_userver.chaos.UdpGate is placed between the client and the server. The proxy can then be used to test the fault tolerance of both the client and the server. The following network conditions can be simulated:
Network condition | Possible reasons for the behavior in production |
---|---|
Server doesn't accept new connections | Server is overloaded; network misconfigured |
Client/server does not read data from socket | Client/server is overloaded; deadlock in client/server thread; OS lost the data on socket due to overload |
Client/server reads data, but does not respond | Client/server is overloaded; deadlock in client/server thread |
Client/server closes connections | Client/server restart; client/server is overloaded and shuts down connections |
Client/server closes the socket when it receives data | Client/server congestion control in action |
Client/server corrupts the response data | Stack overflow or other memory corruption at client/server side |
Network works with send/receive delays | Poorly performing network; interference in the network |
Limited bandwidth | Poorly performing network |
Connections close on timeout | Timeout on client/server/network interactions from client/server provider |
Data slicing | Interference in the network; poor router configuration |
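To make a few rows of the table concrete, the sketch below models a fault as a function applied to every chunk of bytes travelling in one direction. It is a toy model for illustration only: the names `pass_through`, `drop`, `corrupt`, `slice_by` and `deliver` are hypothetical and are not part of pytest_userver, and the real gates work on live sockets rather than on lists of chunks.

```python
from typing import Callable, Iterable, List

# Hypothetical names for illustration; none of them come from pytest_userver.
Interceptor = Callable[[bytes], List[bytes]]


def pass_through(data: bytes) -> List[bytes]:
    """Healthy network: deliver the chunk unchanged."""
    return [data]


def drop(data: bytes) -> List[bytes]:
    """Peer reads the data but nothing is forwarded."""
    return []


def corrupt(data: bytes) -> List[bytes]:
    """Corrupted data: flip every bit of the payload."""
    return [bytes(byte ^ 0xFF for byte in data)]


def slice_by(size: int) -> Interceptor:
    """Data slicing: deliver the chunk in pieces of at most `size` bytes."""
    def interceptor(data: bytes) -> List[bytes]:
        return [data[i:i + size] for i in range(0, len(data), size)]
    return interceptor


def deliver(chunks: Iterable[bytes], interceptor: Interceptor) -> List[bytes]:
    """Run every chunk through the interceptor and collect what arrives."""
    received: List[bytes] = []
    for chunk in chunks:
        received.extend(interceptor(chunk))
    return received
```

In this model, switching the simulated network condition amounts to swapping the interceptor for one direction while the other direction keeps its own.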
Keep in mind the following picture:
For client requests:
For server responses:
Gates allow you to simulate various network conditions on the client and server sides independently. For example, to corrupt data going from Client to Server, use gate.to_server_corrupt_data(). To corrupt data going from Server to Client, use gate.to_client_corrupt_data().
To restore the normal, failure-free behavior, use gate.to_client_pass() and gate.to_server_pass() respectively.
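One way to avoid leaving a fault enabled by accident is to pair the two calls in a context manager. The sketch below is not part of pytest_userver: `corrupted_requests` and the `RecordingGate` stand-in are hypothetical names, and the code assumes the gate methods are plain synchronous calls, as they are written above. If your version exposes them as coroutines, the same idea applies with `contextlib.asynccontextmanager` and `await`.

```python
import contextlib


@contextlib.contextmanager
def corrupted_requests(gate):
    """Corrupt Client -> Server traffic inside the block, then restore it."""
    gate.to_server_corrupt_data()
    try:
        yield gate
    finally:
        # Always return to pass-through, even if the test body raised.
        gate.to_server_pass()


class RecordingGate:
    """Stand-in that only records calls; used here instead of a real gate."""

    def __init__(self):
        self.calls = []

    def to_server_corrupt_data(self):
        self.calls.append('to_server_corrupt_data')

    def to_server_pass(self):
        self.calls.append('to_server_pass')


demo_gate = RecordingGate()
with corrupted_requests(demo_gate):
    calls_inside = list(demo_gate.calls)  # the fault is active here
calls_after = list(demo_gate.calls)       # the fault has been lifted
```

Only the Client to Server direction is touched here, so the Server to Client direction keeps whatever behavior it had before the block.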
Use chaos.TcpGate when you need to deal with messages over the TCP protocol (for example, HTTP). This gate supports multiple connections from different clients to one server. TcpGate accepts incoming connections, and each accepted connection triggers the spawning of a new client connected to the server.
Use chaos.UdpGate when you work with the UDP protocol (for example, DNS). UdpGate forwards data between exactly one client and one server. To connect more than one client to the server, use a separate UdpGate instance for each client.
To use chaos.TcpGate, import chaos from the pytest_userver package (from pytest_userver import chaos).
The proxy should usually be started before the service and keep running between the service's test runs. To achieve that, make a session-scoped fixture and add it to the service fixture dependencies.
An example of how to set up the proxy to stand between the service and a PostgreSQL database:
The proxy should be reset to its default state between test runs, that is, the state in which connections are accepted and data passes from client to server and from server to client:
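A reset along these lines could be sketched as a small helper. This is an illustration under stated assumptions, not the library's own fixture: `reset_gate`, `_call` and `FakeGate` are hypothetical names, and `start_accepting()` is assumed to be the gate method that re-enables accepting connections (it is not named in the text above). Because a gate method may be either a plain function or a coroutine depending on the version, the helper awaits the result only when it is awaitable.

```python
import asyncio
import inspect


async def _call(method):
    """Call a gate method and await the result only if it is awaitable."""
    result = method()
    if inspect.isawaitable(result):
        await result


async def reset_gate(gate):
    """Return the gate to its default state before each test."""
    await _call(gate.to_server_pass)   # Client -> Server data passes as is
    await _call(gate.to_client_pass)   # Server -> Client data passes as is
    # Assumption: start_accepting() exists and re-enables new connections.
    await _call(gate.start_accepting)


class FakeGate:
    """Stand-in mixing sync and async methods to exercise both code paths."""

    def __init__(self):
        self.calls = []

    def to_server_pass(self):
        self.calls.append('to_server_pass')

    def to_client_pass(self):
        self.calls.append('to_client_pass')

    async def start_accepting(self):
        self.calls.append('start_accepting')


fake_gate = FakeGate()
asyncio.run(reset_gate(fake_gate))
```

In a pytest suite, such a helper would typically be called from a function-scoped fixture that depends on the session-scoped proxy fixture, so every test starts from the same healthy state.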
After that, everything is ready to write chaos tests:
Note that the most important part of the test is to make sure that the service restores functionality after the failures. Touch the dead connections and make sure that the service recovers:
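The recovery check can be sketched as a bounded polling loop. This is a generic illustration, not code from pytest_userver: `wait_until_restored` and `make_flaky_probe` are hypothetical names, and `probe` stands for whatever request your test sends to the service after the gate has been switched back to pass-through. The first few probes are expected to fail while the service discards its dead connections.

```python
import asyncio


async def wait_until_restored(probe, attempts=10, delay=0.1):
    """Call probe() until it returns True and return the number of calls made.

    Raises TimeoutError if the service never recovers within `attempts` calls.
    """
    for attempt in range(1, attempts + 1):
        if await probe():
            return attempt
        await asyncio.sleep(delay)
    raise TimeoutError(f'service did not recover after {attempts} attempts')


def make_flaky_probe(failures):
    """Build a probe that fails `failures` times, imitating dead connections."""
    state = {'left': failures}

    async def probe():
        if state['left'] > 0:
            state['left'] -= 1
            return False
        return True

    return probe


# Two probes hit dead connections, the third one succeeds.
attempts_used = asyncio.run(wait_until_restored(make_flaky_probe(2), delay=0))
```

Bounding the number of attempts keeps a service that never recovers from hanging the test run; it fails with a clear error instead.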