Catapult's consultants reviewed the existing monitoring and support processes and identified two main problems:
- The monitoring solution (Elastic / ELK stack) was running an outdated version. It collected only a limited amount of data, had no dashboards, and included no infrastructure or middleware data, which made it very difficult to identify the causes of errors.
- The support team did not have enough data to identify and diagnose the root causes of problems. This led to the same issues recurring, causing frustration between the development and support teams.
Catapult recommended moving to an observability model that brought logs, metrics and traces into the Elastic stack. Our engineers upgraded the tools and added collectors to bring infrastructure, database and queue log data into the monitoring tool. We worked with the development and support teams to identify key metrics and Service Level Agreements (SLAs), and then set up rules and alerts for when these were breached. Dashboards were created to visualise service health and performance.
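To illustrate the idea of a rule that alerts on an SLA breach, here is a minimal sketch in plain Python. It is not the Elastic configuration used on the project: the `SlaRule` class, the metric names and the thresholds are all hypothetical, and the rule simply averages the most recent samples of a metric and compares the result with an agreed limit.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class SlaRule:
    """Hypothetical SLA rule: alert when a metric's recent average exceeds a limit."""
    name: str         # alert name shown to the support team
    metric: str       # key of the metric series to inspect
    threshold: float  # agreed SLA limit for the metric
    window: int       # number of most recent samples to average


def breached(rule, samples):
    """Return True when the mean of the last `window` samples exceeds the threshold."""
    recent = samples[-rule.window:]
    if not recent:
        return False  # no data yet, so nothing to alert on
    return mean(recent) > rule.threshold


def evaluate(rules, metrics):
    """Return the names of all rules whose SLA is currently breached."""
    return [r.name for r in rules if breached(r, metrics.get(r.metric, []))]


# Illustrative rules and data only; these values are assumptions.
rules = [
    SlaRule("checkout-latency", "latency_ms", threshold=500.0, window=3),
    SlaRule("queue-depth", "queue_depth", threshold=100.0, window=2),
]
metrics = {
    "latency_ms": [200, 600, 700, 800],
    "queue_depth": [50, 60],
}
alerts = evaluate(rules, metrics)
```

In this sketch only the latency rule fires, because the last three latency samples average 700 ms against a 500 ms limit, while the queue depth stays well under its threshold. Averaging over a window rather than alerting on a single sample is one common way to reduce noisy alerts.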
We also set up a runbook framework for the support team to use when an alert fired or an incident occurred. The runbooks provided steps to fix problems, along with processes to prevent recurrence.
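As a rough sketch of how a runbook could be tied to an alert, the Python below maps an alert name to a set of fix steps and prevention steps. The `Runbook` structure, the alert name and the step text are hypothetical examples rather than the framework delivered to the client.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Runbook:
    """Hypothetical runbook: steps to fix a problem and to stop it recurring."""
    title: str
    fix_steps: list[str]
    prevention_steps: list[str] = field(default_factory=list)


# Illustrative content only; the alert name and steps are assumptions.
RUNBOOKS = {
    "queue-depth": Runbook(
        title="Message queue backlog",
        fix_steps=[
            "Check consumer service health on the dashboard",
            "Restart stalled consumers",
        ],
        prevention_steps=[
            "Raise a ticket with the development team to review consumer capacity",
        ],
    ),
}


def runbook_for(alert_name, runbooks=RUNBOOKS):
    """Return the runbook registered for an alert, or None if there is none."""
    return runbooks.get(alert_name)


def render(runbook):
    """Format a runbook as plain text for the support engineer."""
    lines = [runbook.title, "Fix:"]
    lines += [f"  {i}. {step}" for i, step in enumerate(runbook.fix_steps, 1)]
    lines.append("Prevent recurrence:")
    lines += [f"  {i}. {step}" for i, step in enumerate(runbook.prevention_steps, 1)]
    return "\n".join(lines)


book = runbook_for("queue-depth")
text = render(book) if book else "No runbook registered for this alert"
```

Keeping the prevention steps alongside the fix steps means each incident prompts a follow-up action as well as an immediate repair.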
There was an immediate impact once the additional monitoring was in place:
- Fewer incidents due to proactive alerting
- Faster Root Cause Analysis (RCA) of problems through additional data
- Improved MTTR due to runbooks
- Better prevention of recurrence due to improved RCA and collaboration between the development and support teams