Job Application for MLOps Engagement Engineer at Nebius

About Nebius AI

Launched in November 2023, the Nebius AI platform provides high-end infrastructure and tools for training, fine-tuning and inference. Based in Europe with a global footprint, we aspire to become the leading AI cloud for AI practitioners around the world.

Nebius is built around the talents of more than 500 highly skilled engineers with a proven track record in developing sophisticated cloud and ML solutions and designing cutting-edge hardware. This allows all the layers of the Nebius AI cloud – from hardware to UI – to be built in-house, clearly differentiating Nebius from the majority of specialized clouds: Nebius customers get a true hyperscaler-cloud experience tailored for AI practitioners.

As an NVIDIA preferred cloud service provider, Nebius offers the latest NVIDIA GPUs, including the H100 and L40S, with H200 and Blackwell chips coming soon.

Nebius owns a data center in Finland, built from the ground up by the company’s R&D team. We are expanding our infrastructure: we plan to add new colocation data centers in Europe and North America this year and to build several greenfield data centers in the near future.

Our Finnish data center is home to ISEG, the most powerful commercially available supercomputer in Europe and the 19th most powerful globally (Top 500 list, June 2024). It also epitomizes our commitment to sustainability, with energy efficiency levels significantly above the global average and an innovative system that recovers waste heat to warm 2,000 residential buildings in the nearby town of Mäntsälä.

Nebius is headquartered in Amsterdam, Netherlands, with R&D and commercial hubs across North America, Europe and Israel.

The role

We are seeking an experienced MLOps Engagement Engineer to join our team, focusing on designing, implementing and maintaining large-scale distributed machine learning (ML) training and inference workflows. Working together with our Solutions Architect and support team, the MLOps Engagement Engineer provides hands-on expertise for our largest customers and internal teams alike.

The successful candidate will have a strong background in MLOps, containerization and distributed computing, with a passion for optimizing ML pipelines. If you have experience with K8s, Slurm and ML frameworks like TensorFlow, PyTorch or MXNet, and if you’re eager to dive into high-load GPU workloads, this position is for you!

You’re welcome to work in our office in Amsterdam, or remotely.

Your responsibilities will include:

Design and implement distributed ML training and inference workflows: develop and maintain scalable, efficient and reliable ML training pipelines on K8s and Slurm, leveraging containerization (e.g. Docker) and orchestration (e.g. K8s).
Optimize ML training performance: collaborate with data scientists and engineers to optimize ML model training and inference performance.
Develop and contribute to the training and inference Solutions Library: design, deploy and manage K8s and Slurm clusters for large-scale ML training, leveraging our ready-to-deploy solutions.
Integrate with ML frameworks: integrate K8s and Slurm with popular ML frameworks like TensorFlow, PyTorch or MXNet, ensuring seamless execution of distributed ML training workloads.
Monitor and troubleshoot distributed training: develop monitoring and logging tools to track distributed training performance, identify bottlenecks and troubleshoot issues.
Develop automation scripts and tools: create automation scripts and tools to streamline ML training workflows, leveraging technologies like Ansible, Terraform or Python.
Stay up to date with industry trends: participate in industry conferences, meetups and online forums to keep up with the latest developments in MLOps, K8s, Slurm and ML.

We expect you to have:

3+ years of experience in MLOps, DevOps or a related field
Strong experience with K8s and containerization (e.g. Docker)
Experience with Slurm or other distributed computing frameworks
Proficiency in Python, with experience in ML frameworks like TensorFlow, PyTorch or MXNet
Strong understanding of distributed computing concepts, including parallel processing and job scheduling
Experience with automation tools like Ansible, Terraform or Python
Excellent problem-solving skills with the ability to troubleshoot complex issues
Strong communication and collaboration skills, with experience working with cross-functional teams

It will be an added bonus if you have:

Experience with cloud providers like AWS, GCP or Azure
Knowledge of ML model serving and deployment
Familiarity with CI/CD pipelines and tools like Jenkins, GitLab CI/CD or CircleCI
Experience with monitoring and logging tools like Prometheus, Grafana or ELK Stack

We’re growing and expanding our products every day. If you’re up to the challenge and are as excited about AI and ML as we are, join us!
