
MLOps Engineer - Data Platform Team
Surf AI is building the world’s first context-driven, agentic security platform. We focus on systems that don’t just surface risk, but actively help organizations resolve it.
Surf is backed by Cyberstarts and Boldstart Ventures, investors behind category-defining security companies, and founded by repeat entrepreneurs with deep experience in identity, security, and enterprise risk.
We’re a small, senior team working at the intersection of security, AI, and distributed systems. Our work blends agentic systems with data-driven analysis and applied security research to operate safely in real enterprise environments.
Who are we looking for?
As an MLOps Engineer, you will architect and operate scalable, production-grade ML systems. You’ll build cloud-native infrastructure that supports reproducible experimentation, reliable deployment, and continuous monitoring.
You will also help shape the company’s ML architecture, contributing to foundational design decisions and long-term platform strategy.
What you'll do
- Build and operate end-to-end ML pipelines across training, validation, deployment, and monitoring.
- Design reproducible training workflows with experiment tracking and dataset versioning.
- Ensure data and feature consistency between training and production using feature stores and versioned data pipelines.
- Implement CI/CD for ML systems, including automated testing, evaluation gates, model packaging, and staged promotion.
- Manage model versioning, registry workflows, and artifact management.
- Deploy and operate models using scalable inference patterns (e.g., batch, real-time, async).
- Monitor model performance, data drift, concept drift, and infrastructure health with automated alerting and retraining triggers.
- Optimize ML infrastructure for performance, cost efficiency, and scalability.
- Collaborate with ML engineers to productionize research models into robust, observable systems.
Required Skills & Experience
- 3+ years building and operating distributed systems in production.
- 2+ years hands-on MLOps experience across the full model lifecycle.
- Strong experience with Kubernetes in production (deployments, autoscaling, observability, troubleshooting).
- Experience building CI/CD pipelines (GitOps preferred) for ML or data systems.
- Solid understanding of model lifecycle management, reproducibility, and artifact/version control.
- Experience with monitoring stacks (e.g., Prometheus/Grafana, OpenTelemetry) and ML observability.
- Experience with cloud platforms (AWS/GCP/Azure) and infrastructure-as-code (e.g., Terraform).
Nice to Have
- Experience with Databricks or similar ML platforms.
- Experience with orchestration frameworks such as Dagster or Airflow.
- Experience with feature stores and model registries (e.g., MLflow).
- BSc or MSc in Computer Science, Mathematics, or Engineering.
Why Join Us?
If you want to work on foundational systems, ship AI into production, and help define how agentic security actually operates, this is an opportunity to do it early and with real ownership.