HPC SRE - Quant Research, HFT
Network Site Reliability Engineer - Python/GO, Observability, Monitoring, HPC Within the Network Engineering Team, this role is critical in ensuring our clients High-Performance Computing (HPC) environments are supported by a resilient, data-driven, and software-defined network foundation. We are seeking a Networks focused Site Reliability Engineer (SRE) with a focus on Observability, Telemetry, and Monitoring. In this role, you will apply a software engineering mindset to network operations, bridging the gap between traditional networking and modern Site Reliability Engineering (SRE). You will be responsible for ensuring our high-performance network infrastructure is not just functional, but deeply visible. You will build the tooling and automation that allow the team to move from reactive troubleshooting to proactive, automated remediation and "self-healing" infrastructure. Key Responsibilities: Reliability Engineering: Apply SRE principles to the network; define and maintain SLIs, SLOs, and Error Budgets for network latency, packet loss, and availability.HPC Connectivity andamp; Performance: Support low-latency, high-throughput network architectures (eg, RDMA, RoCE) designed for intensive HPC and financial data workloads.Advanced Telemetry: Design and manage high-cardinality telemetry pipelines to collect and analyze flow logs, metrics, and traces at scale.Network Automation (Python/Go): Build and maintain internal software tools, APIs, and "self-healing" scripts to aut
Perform a fresh search...
-
Create your ideal job search criteria by
completing our quick and simple form and
receive daily job alerts tailored to you!