Senior HPC Cluster Engineer
The company
Nebius AI is an AI cloud platform with one of the largest GPU capacities in Europe. Launched in November 2023, the Nebius AI platform provides high-end, training-optimized infrastructure for AI practitioners. As an NVIDIA preferred cloud service provider, Nebius AI offers a variety of NVIDIA GPUs for training and inference, as well as a set of tools for efficient multi-node training.
Nebius AI owns a data center in Finland, built from the ground up by the company’s R&D team and showcasing our commitment to sustainability. The data center is home to ISEG, the most powerful commercially available supercomputer in Europe and the 16th most powerful globally (Top 500 list, November 2023).
Nebius’s headquarters are in Amsterdam, Netherlands, with teams working out of R&D hubs across Europe and the Middle East.
Nebius AI is built by more than 500 highly skilled engineers with a proven track record in developing sophisticated cloud and ML solutions and designing cutting-edge hardware. This allows all the layers of the Nebius AI cloud – from hardware to UI – to be built in-house, setting Nebius AI apart from the majority of specialized clouds: Nebius customers get a true hyperscaler-cloud experience tailored for AI practitioners.
The role
We’re looking for a Senior HPC Cluster Engineer to contribute to the development of our hyperscaler platform.
The Hypervisor team supports and develops the parts of the Cloud platform that directly affect the KVM hypervisor and the QEMU device emulator. We understand the granular details of hardware virtualization and device emulation, paying close attention to performance and protection against untrusted code.
In this position, your responsibility will be to:
- Improve infrastructure around GPU-accelerated computing
- Analyze root causes and suggest corrective actions for problems at both large and small scales
- Add support for new hardware throughout the entire infrastructure software stack
- Detect emerging issues and fix them before they become problems
We expect you to have:
- 5+ years of professional software development experience
- 3+ years of experience with Linux
- System-level understanding of server architecture, PCIe devices, NICs, the Linux OS, and kernel drivers
- Strong knowledge of at least one of the following programming languages: C, C++, Go, Java, Python
It would be an added bonus if you had:
- Experience analyzing and tuning performance for a variety of HPC workloads
- Familiarity with RDMA, RoCE, and InfiniBand
- Background in software-defined networking and HPC cluster networking
- Understanding of the QEMU/KVM virtualization stack
- Familiarity with deep learning frameworks like PyTorch and TensorFlow
- Familiarity with collective communication libraries (MPI, NCCL)
Does all that sound like your kind of challenge? Then join us!