ML Infrastructure Engineer
At JetBrains, code is our passion. Since 2000, we have strived to make the most effective developer tools on earth. By automating routine checks and corrections, our tools speed up production, freeing developers to grow, discover, and create.
The ML Research team at JetBrains applies machine learning – in particular reinforcement learning, agentic approaches, and federated learning – to help developers and enhance the software development process. At the heart of our work is a fast, flexible, and reliable infrastructure designed to run and scale experiments efficiently.
We’re looking for an ML Infrastructure Engineer to manage our research infrastructure and boost its efficiency and scalability. This role involves maintaining GPU clusters and standalone machines, building robust monitoring, and removing bottlenecks that slow down experimentation. You will work alongside ML researchers and engineers to solve practical problems across a wide range of projects involving distributed training, reinforcement learning, training agentic backbones, and more.
In this role, you will:
- Design, operate, and continuously improve our Kubernetes GPU cluster, including NVIDIA drivers, MIG, and high-speed networking (InfiniBand/NVLink).
- Manage and tune our job orchestrator (Ray) so researchers can launch distributed training and benchmarking jobs with minimal friction.
- Implement robust monitoring, logging, and alerting with Prometheus, Thanos, Loki, and Grafana to track resource utilization and optimize costs.
- Identify and resolve infrastructure bottlenecks to maximize GPU utilization.
- Collaborate closely with our SRE, IT, and Security teams to ensure our research environment integrates smoothly with company standards.
- Educate and support researchers on infrastructure-related topics and troubleshoot ad hoc requests.
We’ll be happy to have you on our team if you have:
- Proven hands-on experience administering Kubernetes clusters (control-plane operations, RBAC, CNI, storage, upgrades).
- Solid Linux fundamentals, including networking, containers, and troubleshooting.
- Experience diagnosing performance issues in distributed systems and setting up monitoring for them.
- Strong communication skills to work collaboratively with engineers and researchers of different backgrounds.
We would be especially thrilled if you:
- Have worked with GPU clusters.
- Have MLOps experience, including training pipelines, experiment tracking, and data versioning.
- Are familiar with infrastructure-as-code (Terraform, Ansible, or similar).
- Have worked with Ray, Slurm, Flyte, Airflow, or other workload orchestrators.
- Understand cloud platforms (AWS, GCP, or Azure) and hybrid/on-premises networking setups.