Taking rapid action to resolve a major IT system failure and implementation crisis
When one of the world’s leading mobile network providers consolidated billing, customer and network management into a single unified system, chaos ensued. The new platform was so unstable that the provider was losing 243 trading hours per week.
It didn’t take long for us to see that the new system was unstable, oversized and overly complex, resulting in unnecessary processing overheads and poor performance. The stack was also failing because requirements were not properly understood, and little testing had been done to ensure a smooth customer experience.
The Catapult Resolve team created and led two workstreams. The first focused on stabilisation and performance. Here, we delivered a number of optimisations and enhancements, including migrating Oracle databases to new SAN storage and accelerating system start-up and shutdown to shorten release cycles.
Our second workstream centred on simplification and right-sizing, where our team delivered a number of other improvements. We reorganised the server layout to boost uptime, reduced the number of hops between stack components for key instructions, and improved error handling between components by introducing defined error scenarios and testing.
Our work resulted in a dramatic improvement in performance. We freed up 42 physical machines to boost uptime, reduced message failures from 12% to 0.3% and regained 243 trading hours per week. Both stack and CRM performance improved by 30%, whilst billing system performance improved more than 1,000-fold. On top of all that, the share of useful work being done on key databases leapt from below 50% to 90%, and the release outage window went from 21 days to just 30 minutes.