Senior Recovery Lead and Head of Service Reliability

Overview
Join a digital-first bank that's powered by people. Our technology team builds innovative digital solutions rapidly and at scale to deliver the next generation of banking services for our customers around the world.

Service Management's purpose is to protect the availability, integrity and confidentiality of the IT Services that underpin the customer and colleague experience of the HSBC brands. It is a multi-functional team comprising Change Management, Incident Management, Problem Management, Service Level Management, Outage Management, Service Recovery, and Service Insights and Reporting.

We are seeking a senior technology leader to take on the dual role of Senior Recovery Lead and Global Head of Service Reliability. This is a highly visible, high-impact position reporting to the Global Head of Service Management, with a mandate to transform how we recover from incidents and build long-term service resilience. This individual will lead a global team of technical experts who act as technical escalation partners during major incidents, helping reduce time to recover (TTR) through deep technical engagement, coordination, and engineering-driven solutions. Beyond recovery, this leader will also own the strategic and tactical roadmap for building reliable, self-healing systems through collaboration with Problem Management, SRE, and Platform teams.

Responsibilities
Incident Recovery Leadership
Lead a global, follow-the-sun team that acts as technical escalation partners during major incidents.
Partner with Incident Managers and Service Owners to accelerate incident diagnosis and resolution, reducing TTR and restoring services quickly and safely.
Bring calm, coordination, and engineering clarity to high-pressure recovery efforts.

Root Cause Ownership and Long-Term Resilience
Collaborate with Problem Managers, Product SRE, and Platform Engineering teams to identify and eliminate systemic causes of major incidents.
Own and drive long-term remediation plans, including automation, reliability engineering, and platform guardrails to reduce future risk.
Track and govern follow-up actions to ensure completeness, accountability, and measurable reduction in incident recurrence.

Service Reliability Engineering Strategy
Define and implement strategies for resilience engineering, including self-healing capabilities, automation of recovery workflows, and risk mitigation patterns.
Advocate for operational excellence by embedding reliability standards, testing practices, and continuous improvement processes into engineering workflows.
Partner with Architecture and Engineering leaders to influence system design with reliability in mind.

Scenario Planning and Recovery Preparedness
Own the global incident scenario planning framework, ensuring that Technology is prepared to recover from widespread, complex failures.
Design and run mass recovery simulations, chaos testing, and resilience drills to expose weaknesses and improve readiness.
Work with regional and global risk teams to align with regulatory and operational resilience requirements.

Leadership, Influence and Culture
Build, scale, and lead a high-performing global team with deep technical skills and a culture of urgency, ownership, and collaboration.
Drive a blameless, learning-focused culture that emphasizes root cause thinking, accountability, and continuous improvement.
Act as a trusted partner and thought leader across Engineering, Infrastructure, Risk, and Service Management functions.

Requirements
Proven experience in Site Reliability Engineering, Infrastructure, DevOps, or Technical Operations.
Demonstrated experience leading global technical teams in complex, high-scale environments.
Deep expertise in incident recovery, automation, systems design, and platform reliability.
Strong working knowledge of problem management, root cause analysis frameworks, and resilience engineering principles.
Experience designing and running resilience exercises, chaos engineering, or incident scenario testing at scale.
Comfortable operating in regulated environments and partnering with Risk and Compliance functions.
Excellent stakeholder management and communication skills, with the ability to lead through influence at senior levels.

Technical Depth - Ability to dive deep across infrastructure, applications, and cloud-native architectures.
Recovery Leadership - Skilled in coordinating technical resources under pressure to resolve incidents rapidly.
Reliability Thinking - Strategic mindset focused on system robustness, automation, and prevention.
Change Agent - Drives cultural and engineering change to improve stability and accountability.
Cross-Functional Collaboration - Adept at aligning goals and actions across engineering, operations, and risk domains.