Job Application for Senior Site Reliability Engineer at Nebius

About Nebius AI

Nebius AI is an AI cloud platform with one of the largest GPU capacities in Europe. Launched in November 2023, the Nebius AI platform provides high-end, training-optimized infrastructure for AI practitioners. As an NVIDIA preferred cloud service provider, Nebius AI offers a variety of NVIDIA GPUs for training and inference, as well as a set of tools for efficient multi-node training.

Nebius AI owns a data center in Finland, built from the ground up by the company’s R&D team and showcasing our commitment to sustainability. The data center is home to ISEG, the most powerful commercially available supercomputer in Europe and the 16th most powerful globally (Top 500 list, November 2023).

Nebius’s headquarters are in Amsterdam, Netherlands, with teams working out of R&D hubs across Europe and the Middle East.

Nebius AI is built with the talent of more than 500 highly skilled engineers with a proven track record in developing sophisticated cloud and ML solutions and designing cutting-edge hardware. This allows all the layers of the Nebius AI cloud – from hardware to UI – to be built in-house, distinctly differentiating Nebius AI from the majority of specialized clouds: Nebius customers get a true hyperscaler-cloud experience tailored for AI practitioners. We’re growing and expanding our products every day.

About the role

Nebius is looking for a Senior Site Reliability Engineer. You’re welcome to work in our offices in Amsterdam or Belgrade.

Your responsibilities will include:

Ensuring fault tolerance, scalability, and uninterrupted operation of the service;
Using cutting-edge cloud technology to solve a variety of infrastructure problems;
Implementing and improving CI/CD processes.

We expect you to have:

Solid experience with programming languages (such as Go, Python, or C++);
Solid understanding of classic algorithms and data structures;
Commercial experience with and deep understanding of Unix systems and network technology;
Experience with systems for containerization and configuration management (Ansible, Salt, Terraform, Docker, K8s, Helm).

It will be a bonus if you have:

A desire to be involved in backend development;
Experience designing, developing, and running high-load distributed systems;
Commercial experience with a variety of cloud platforms.

If you’re up to the challenge and are as excited about AI and ML as we are, join us!
