Server restart test automation
The History
Until 2020, “test automation” at Nuabee meant manually scheduling restoration and restart tests for our clients: an operator had to specify the day and time of each test on each machine.
By 2020, we had several hundred servers to test regularly, most of them requiring the restoration of multiple disks. This generated a significant amount of planning and monitoring work for Nuabee’s technical teams.
We also wanted to increase the frequency of server restart tests to improve the quality of the Nuabee solution.
The goals set were:
- To allow Operational teams to save time on this planning through complete automation. This system would autonomously choose the best time to restore & restart a server, based on multiple criteria, requiring an operator’s attention only in case of a problem.
- To optimize the costs of server restart tests while increasing their frequency. Testing client servers more regularly also reduces the risk of surprises during a DRP (disaster recovery plan) test.
Industrialization of Restoration Automation
Firstly, we asked ourselves: What does automation entail?
This term, though daunting at first, makes sense when we detail its principles in our context.
Here, the aim is to restore and restart clients’ servers as often as possible, and at the best possible time. Each server test must be allotted its own time slot so that tests do not overlap. To illustrate, we want to avoid a scenario where one kitten from a litter eats all the other kittens’ food! The time allocated to each server test must therefore be sized so that an ongoing restoration can complete without error, regardless of the day or time.
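To make the idea of non-overlapping time slots concrete, here is a minimal sketch in Python. It is not Nuabee’s scheduler: the `Slot` type, the `allocate_slots` and `overlaps` helpers, and the idea of feeding in an estimated duration per server are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Slot:
    server: str
    start: datetime
    end: datetime


def allocate_slots(durations, first_start):
    """Give each server its own window, back to back, so that no two tests overlap.

    `durations` maps a server name to an estimated test duration
    (a hypothetical input; the article does not say how durations are obtained).
    """
    slots = []
    cursor = first_start
    for server, duration in durations.items():
        slots.append(Slot(server, cursor, cursor + duration))
        cursor += duration  # the next test starts only when this one ends
    return slots


def overlaps(a, b):
    """Return True if two slots share any instant (touching ends do not count)."""
    return a.start < b.end and b.start < a.end
```

With this layout, no kitten eats from another’s bowl: a slot begins exactly where the previous one ends, so `overlaps` is false for any pair of allocated slots.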
The second important concept to consider is the idea of “weight”. If you’re familiar with Google’s PageRank system, then you’ve already grasped part of the project. PageRank assigns each web page a weight, meaning a value that can increase or decrease. The higher the value, the more important the page.
Here, we apply the same principle, but to servers: the higher a server’s weight, the more important it is in the restart test process.
Considering this concept of weight, let’s detail the criteria we decided to apply to define which server is more important than another and should therefore be restored and restarted first:
- Servers that have NEVER been restored (those of new clients) have a base weight of 360.
- Servers whose last restoration succeeded (i.e., a successful test) have a base weight of 0.
- Servers whose restoration failed take a weight of -1. They are blacklisted and must be analyzed by a human operator. Once the issue has been resolved, the operator can reintroduce the server into the restoration pool.
- Machines for which an operator has manually requested a priority restoration take a base weight of 1000.
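The rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our production code: the `Server` and `LastTest` types and the `base_weight` and `next_server` functions are hypothetical names, and the choice to check a manual priority request before the other rules is an assumption, since the list above does not state a precedence.

```python
from dataclasses import dataclass
from enum import Enum

# Weights taken from the list above.
PRIORITY_WEIGHT = 1000
NEVER_RESTORED_WEIGHT = 360
SUCCESS_WEIGHT = 0
BLACKLISTED_WEIGHT = -1


class LastTest(Enum):
    NEVER = "never"
    SUCCESS = "success"
    FAILURE = "failure"


@dataclass
class Server:
    name: str
    last_test: LastTest
    priority_requested: bool = False


def base_weight(server):
    """Map a server's state to its base weight."""
    # Assumption: a manual priority request overrides the other rules.
    if server.priority_requested:
        return PRIORITY_WEIGHT
    if server.last_test is LastTest.FAILURE:
        return BLACKLISTED_WEIGHT
    if server.last_test is LastTest.NEVER:
        return NEVER_RESTORED_WEIGHT
    return SUCCESS_WEIGHT


def next_server(servers):
    """Return the heaviest eligible server, or None if every server is blacklisted."""
    eligible = [s for s in servers if base_weight(s) >= 0]
    return max(eligible, key=base_weight, default=None)
```

In this sketch a blacklisted server (weight -1) is simply filtered out of the pool until an operator changes its state, which mirrors the manual reintroduction step described above.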
Increasing the frequency of server restart tests
To achieve the second objective, we opted to rent a dedicated instance for these tests (or several if needed) and to use it continuously. In practice this means reserving the instance over several years, which reduces the cost of restart tests.
Thanks to this modularity, we can optimize costs in real time. Think of it like rail management: tracks can be opened or closed depending on traffic.
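The “open or close tracks” idea can be illustrated with a small sizing function. This is a rough sketch, not a description of how Nuabee decides: the `instances_needed` function, its parameters, and the rule of always keeping the reserved instance open are assumptions made for the example.

```python
import math


def instances_needed(pending_test_hours, hours_per_instance, reserved=1):
    """Estimate how many test instances ("tracks") to keep open.

    pending_test_hours: total estimated duration of the queued tests (hypothetical input).
    hours_per_instance: test hours one instance can absorb in the planning window.
    reserved: instances already reserved long term; never go below this number.
    """
    if hours_per_instance <= 0:
        raise ValueError("hours_per_instance must be positive")
    needed = math.ceil(pending_test_hours / hours_per_instance)
    return max(reserved, needed)
```

When traffic is light the result stays at the reserved instance; extra instances are opened only when the queue exceeds what the reserved capacity can absorb.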
As for the industrial logic this implies, we had to deal with several major principles that needed to be independent of one another in the code, yet compatible.
In Summary
Thanks to these mechanisms, we have quadrupled the number of client server restart tests without a significant increase in costs.
However, we must find the right balance between the quality of our solution and the carbon footprint of the tests. These studies are currently a focus at Nuabee, and we will soon return with our research and findings on this topic.