IT crisis management in the banking sector
Philippe Desforges, an interim CIO, recounts three major IT incidents that marked his career in the banking sector, revealing essential lessons for any organization seeking to strengthen its resilience.
Fire crisis: when the data center catches fire
At the end of August 2005, it was 5 p.m. in the La Défense offices when the alarm sounded: a fire had ravaged the main Lognes data center. "We were neither warned, nor trained, nor ready," recalls Philippe Desforges. The fire department flooded the technical rooms, and part of the team had to stay on site to manage the crisis and the recovery once the rooms were dry and the data center could be restarted. At 8 p.m., power was finally restored: the starting signal for a long night of reboots. Several hundred servers were brought back up in order of priority - account management first, then international payments - under the direction of a handful of volunteers. No dinner break, no way out (the doors of the La Défense building were locked from 8 p.m.): hungry and exhausted, the team battled sleep and tension until the early hours of the morning.
"In these crisis situations, you need to know your team's strengths: who can hold up under pressure, who keeps a cool head, who can relaunch scripts without flinching," explains Philippe.
Key lessons
- Knowing and valuing key skills: identifying employees who can handle the pressure and keep up the effort right to the end.
- Clear leadership: calm management is vital to maintaining cohesion and efficiency, making it possible to allocate roles and encouraging volunteers to step forward.
- Collective resilience: spending the night without eating or going out, in a locked building, requires shared mental strength; team spirit then becomes the driving force behind recovery.
"When you get on well in a team because you're a good manager and you avoid having conflicts with people and you work with trust, the day it goes wrong, our employees aren't panicked. The team counts for a lot!"
Software malfunctions and inadequate testing: when a testing oversight becomes a threat
Another critical episode, again in a bank. "At the time, I was in charge of application development for international transfers. I managed the migration of our mainframe platform to a new multi-CPU machine, touted as 'four times more powerful'," recalls Philippe. Backed by laboratory benchmarks on limited volumes, the production teams deemed load tests in real-life conditions unnecessary: "The machine performs better, why bother?"
- D-Day, Monday morning - The switchover goes off without a hitch and the first batches of transactions are processed.
- 10 a.m. - The critical Monday morning batch is almost 30% behind schedule.
- Noon - Only half the usual volume has been processed; corporate customers, who depend on these flows for their bulk transfers, are threatening to leave the bank.
- Monday evening and Tuesday - A parade of crisis committees: development and infrastructure teams try in vain to optimize code and configurations.
- Wednesday - Despite successive rounds of tinkering, performance remains disappointing; pressure mounts on all sides.
- Thursday morning - After three days of deadlock, a load test in real-life conditions finally reveals the cause: the operating system runs on only one of the four processor cards at a time, so most of the machine's advertised power is never used (see the utilization check sketched after this timeline).
- Thursday afternoon - Decision taken to uninstall the "super machine" during the next weekend shutdown, and replace it with a server already tried and tested in pre-production.
- Weekend - In two days, we dismantle the multi-CPU machine, install the backup Unix machine, reload the data, and redo all the tests.
- The following Monday - The platform restarts on the temporary machine, and flows immediately return to normal.
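The root cause - hardware sitting idle while a single processor card does all the work - is exactly what a quick utilization check under load makes visible. The sketch below is purely illustrative (the incident involved a mainframe-class machine, not a Linux server): it samples /proc/stat on a modern Linux host and reports how busy each CPU actually is, which would immediately show three processors out of four idling.

```python
#!/usr/bin/env python3
"""Rough per-CPU utilization check, to be run while the platform is under load.

If one CPU (or one processor board) is saturated while the others stay idle,
the extra hardware is not actually being used.
"""
import time


def cpu_times():
    """Return {cpu_name: (busy_ticks, total_ticks)} parsed from /proc/stat."""
    stats = {}
    with open("/proc/stat") as f:
        for line in f:
            # keep only the per-CPU lines (cpu0, cpu1, ...), skip the aggregate "cpu" line
            if line.startswith("cpu") and line[3].isdigit():
                name, *fields = line.split()
                values = [int(v) for v in fields]
                idle = values[3] + values[4]  # idle + iowait ticks
                stats[name] = (sum(values) - idle, sum(values))
    return stats


def utilization(interval: float = 5.0) -> dict:
    """Sample twice, `interval` seconds apart, and return per-CPU busy percentages."""
    before = cpu_times()
    time.sleep(interval)
    after = cpu_times()
    report = {}
    for cpu, (busy_after, total_after) in after.items():
        busy_before, total_before = before[cpu]
        delta_total = (total_after - total_before) or 1
        report[cpu] = 100.0 * (busy_after - busy_before) / delta_total
    return report


if __name__ == "__main__":
    for cpu, pct in sorted(utilization().items()):
        print(f"{cpu}: {pct:5.1f}% busy")
```

Run during a full-scale load test, a check of this kind turns "the machine is four times more powerful" from a brochure claim into something measurable.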
After events like these, of course, you need to take the time to thank the teams, let them catch their breath, and give them a day off the next day. You also need to call the service providers' managers to acknowledge their teams' involvement and good work.
Impact and pressure
The repercussions of the crisis were felt at various levels:
- Customers and reputation: the blocked wire transfers drove several companies to other service providers, undermining both customer confidence and HSBC's reputation for solidity.
- Operational: daily crisis committee meetings, saturated help desks and short nights created an atmosphere of tension and general exhaustion.
- Hierarchy: an investigation was launched by HSBC London in the form of a Major Incident Report, proposing improvements in architecture governance.
Key lessons
Lessons to be learned from this crisis:
- Never skip testing on the target environment: manufacturer documentation is no substitute for full-scale testing with production volumes and scenarios (a minimal load-test sketch follows this list).
- Involve business and infrastructure from the outset: co-construct test scenarios and share success criteria to avoid grey areas.
- Plan B and reversibility: prepare a proven backup solution in advance, plan for its rapid deployment, and rehearse the procedures during the test phases.
- Acknowledging team effort: after a week of tension and sleepless nights, a word of thanks, highlighting successes, or even a symbolic reward, strengthens cohesion and prepares the team for the next challenge.
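To make the first lesson concrete, here is a minimal load-test sketch. It is not how the bank tested anything - the article gives no detail on tooling - only an illustration of the principle: replay a production-sized day of batches against the target machine and check that it holds the batch window. `submit_batch`, the batch counts and the window are all hypothetical placeholders.

```python
"""Minimal load-test sketch: replay production-sized batches against the
target platform and compare elapsed time with the batch window."""
import time
from concurrent.futures import ThreadPoolExecutor

BATCH_WINDOW_SECONDS = 4 * 3600   # e.g. the critical Monday-morning window
PRODUCTION_BATCHES = 500          # replayed from a real production day
TRANSFERS_PER_BATCH = 2_000


def submit_batch(batch_id: int) -> int:
    """Hypothetical stand-in for whatever feeds the transfer engine
    (file drop, MQ message, API call...). Returns the number processed."""
    time.sleep(0.01)              # placeholder for the real submission
    return TRANSFERS_PER_BATCH


def run_load_test(parallelism: int = 8) -> None:
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        processed = sum(pool.map(submit_batch, range(PRODUCTION_BATCHES)))
    elapsed = time.monotonic() - start
    print(f"{processed} transfers in {elapsed:,.0f}s "
          f"(window: {BATCH_WINDOW_SECONDS}s)")
    if elapsed > BATCH_WINDOW_SECONDS:
        raise SystemExit("FAIL: the target machine cannot hold the batch window")


if __name__ == "__main__":
    run_load_test()
```

The success criterion (holding the batch window) is exactly the kind of shared target that business and infrastructure teams should agree on before the switchover, as the second lesson suggests.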
The Pre-Christmas Transfer Incident: when pressure turns a mistake into a national crisis
Another crisis in the banking world, this time during the Christmas period. Bank branches are swamped with retail customers making transfers, especially for Christmas presents. That morning, the bank's President goes to a branch to transfer money to his children or grandchildren. To his surprise, the transfer fails. His wrath is immediately relayed to management, and the IT team is put under extreme pressure to fix the problem in record time.
Course of the crisis
- Express" patch and head office tests
To "reassure" the highest level, a patch is developed immediately and tested in a few hours under head office "office" conditions, according to a linear scenario where the transfer is entered, validated and submitted without interruption. - Crucial omission of field tests
No tests were carried out in branches, with real counter staff using the application: navigation between several screens, frequent backtracking, reopening of forms already entered... - Immediate effect in production
When a customer makes his first transaction, each transfer is duplicated: side by side, two identical debits appear. The cause? An ill-adjusted loop of code which, when a user returns to the confirmation screen, sends back two calls to the server. In just a few hours, hundreds of duplicate transfers generated over €80 million in transactions (€40 million net loss). - Diagnosis by field reproduction
Only testing in real branch conditions can reveal the bug: by reproducing the exact route taken by a counter clerk, the team identifies the double execution of the validation loop. - Correction and investigation
A new patch is designed and deployed today. The British Major Incident Report highlights the lack of field testing and the haste dictated by pressure.
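The article does not describe the corrected code, so the sketch below shows only one common way to make such a confirmation safe: the screen attaches a unique idempotency key to the transfer, and a repeated call with the same key - the double call triggered by returning to the confirmation screen - returns the stored result instead of posting a second debit. All names here are illustrative.

```python
"""Sketch of an idempotent transfer confirmation (one possible guard,
not the bank's actual fix)."""
import uuid
from dataclasses import dataclass


@dataclass
class Transfer:
    key: str          # idempotency key generated when the entry screen opens
    account: str
    amount_eur: float


class TransferService:
    def __init__(self) -> None:
        self._processed: dict[str, str] = {}   # idempotency key -> transfer id

    def confirm(self, transfer: Transfer) -> str:
        """Post the debit once; replayed calls return the original transfer id."""
        if transfer.key in self._processed:
            return self._processed[transfer.key]   # duplicate call: no new debit
        transfer_id = f"TRF-{uuid.uuid4().hex[:8]}"
        # ... post the single debit to the core banking system here ...
        self._processed[transfer.key] = transfer_id
        return transfer_id


if __name__ == "__main__":
    service = TransferService()
    order = Transfer(key=str(uuid.uuid4()), account="FR7612345", amount_eur=150.0)
    first = service.confirm(order)    # normal validation
    second = service.confirm(order)   # user reopens the confirmation screen
    assert first == second            # only one debit was posted
```

Whatever the exact implementation, the principle is the same: the server, not the screen, decides whether a validation has already been executed.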
Key lessons
- Don't panic
- Carry out an exhaustive business acceptance test in the field: A patch validated under controlled conditions can fail in the face of real user practices. It is imperative to test in branches, with real counter staff and their non-linear paths (backtracking, multiple screens, interruptions). A sketch of such a scenario, scripted as a test, follows this list.
- An unalterable validation process: Even under high pressure, follow a structured acceptance plan, document each scenario and require formal sign-off by end users before deployment.
- Integrate the complexity of real usage from the design stage: Anticipate and model, from the specification phase onwards, all the alternative and "out-of-process" paths counter operators may take, to ensure that the patch actually works in production.
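As an illustration of what "testing the non-linear path" can look like once scripted, here is a minimal acceptance-style test. `BranchApp` is a hypothetical stand-in for the real counter application (which would in practice be driven through its UI or API); the point is that the scenario reproduces the clerk's actual behaviour - enter, go back, validate twice - and asserts on the business outcome: exactly one debit.

```python
"""Sketch of a scripted acceptance scenario for the non-linear counter path."""


class BranchApp:
    """Minimal stand-in for the counter application: one debit per distinct validation."""

    def __init__(self) -> None:
        self.debits = []        # debits actually posted
        self._pending = None    # transfer currently on screen

    def enter_transfer(self, amount_eur: float) -> None:
        self._pending = amount_eur

    def back_to_entry(self) -> None:
        pass                    # clerk reopens the entry screen (the step missed at head office)

    def validate(self) -> None:
        if self._pending is not None:
            self.debits.append(self._pending)
            self._pending = None    # a second validation must not debit again


def test_back_and_revalidate_posts_a_single_debit() -> None:
    app = BranchApp()
    app.enter_transfer(150.0)
    app.back_to_entry()         # the non-linear step
    app.validate()
    app.validate()              # impatient double validation
    assert app.debits == [150.0]


if __name__ == "__main__":
    test_back_and_revalidate_posts_a_single_debit()
    print("non-linear path OK: exactly one debit posted")
```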
"It's not just a question of code, it's a question of understanding actual usage and being able to adjust our procedures accordingly," concludes Philippe.
Conclusion: transforming crises into opportunities for improvement
Philippe Desforges' three experiences show that, in the banking sector, a crisis is more than a technical failure; it also reveals shortcomings in communication, governance and preparation. To turn these emergency situations to advantage, it is essential to anticipate weak signals and invest in reliable redundancy and appropriate business continuity plans; to carry out real-life tests that bridge the gap between theory and practice; to strengthen communication between all players, technical and operational alike, so the response is collective and coordinated; and to demonstrate calm leadership capable of transforming the crisis experience into a lever for continuous improvement for the entire organization.
As Philippe Desforges points out,
"A good crisis manager isn't just a technical expert, he's above all a leader capable of uniting his teams and transforming every challenge into an opportunity for progress."
Through his feedback, he reminds us that a crisis, though unpredictable and trying, is also an opportunity to rethink our methods, improve coordination between teams and build more robust systems for the future.