Principal Observability Platform Engineer
About Nscale
Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale simplifies AI development while enabling superior results, supporting strategic business outcomes such as cost management, rapid innovation, and environmental responsibility.
We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you’ll build trust through openness and transparency while contributing to the technology that powers the future.
About the Role
As a Principal/Staff Observability Platform Engineer, you'll own the technical direction of Nscale's observability platform: the systems that give us deep visibility into GPU clusters, AI workloads, and the infrastructure running them. You treat observability as a product and a discipline, not a tooling exercise. You'll set the architectural roadmap, raise the engineering bar across teams, and ensure our platform scales ahead of the business, not behind it.
You understand that complexity is a cost. Solutions that require constant babysitting don't scale, and neither does the operational burden that comes with them. The platforms you build should be simple to operate, easy to understand, and self-evidently correct when something goes wrong.
This isn't a "maintain and operate" role. It's a "define, build, and lead" role.
What You'll Do
- Own the technical strategy and architecture for observability across metrics, logs, traces, and alerting at scale.
- Drive platform decisions that have multi-year impact: tooling, data models, ingestion patterns, retention, cardinality management.
- Identify systemic gaps before they become incidents; design platforms that make failure visible and fast to diagnose.
- Partner with SRE, infrastructure, and AI/ML teams to embed observability natively into how Nscale builds and operates.
- Define standards and patterns that other engineers adopt, not by mandate, but because they're clearly better.
- Mentor and technically grow the observability team; raise the ceiling on what the team can build and own.
- Lead incident postmortems and use them to drive durable platform improvements.
- Evaluate and introduce tooling that meaningfully improves signal quality, operational efficiency, or scalability, and retire what doesn't.
About You
- 8+ years in SRE, infrastructure engineering, platform engineering, or observability-focused roles.
- You've operated observability infrastructure at serious scale. You know what breaks at 10x and you design for it.
- You have a strong bias toward simplicity. You've seen over-engineered observability stacks collapse under their own weight and you build accordingly.
- Deep hands-on experience with a significant subset of: Prometheus, Thanos, VictoriaMetrics, Grafana, Loki, Tempo, OpenTelemetry, ClickHouse, Elastic.
- Strong engineering fundamentals: proficient in Python, Go, or a similar language, and comfortable owning complex systems end to end.
- Experience with Kubernetes at scale; familiarity with GPU infrastructure or HPC environments (Slurm) is a strong plus.
- You can architect systems, write the code, review others' work, and explain the tradeoffs clearly, all in the same week.
- Infrastructure-as-Code is the default, not optional (Terraform, Ansible, or equivalent).
- You influence without authority. Teams want your opinion because it makes their work better.
Preferred
- Experience with high-volume streaming pipelines for observability data (Kafka, Vector, Fluent Bit, etc.).
- Background in AI/ML infrastructure observability: GPU utilisation, training job visibility, inference latency.
- Prior experience defining observability strategy at an organisation level.
Equal Opportunities Statement
We strongly encourage applications from people of color, the LGBTQ+ community, people with disabilities, neurodivergent individuals, parents, carers, and people from lower socio-economic backgrounds.
If there’s anything we can do to accommodate your specific situation, please let us know.
Note: Responsibilities outlined are not exhaustive and may evolve as business needs change.
The range below reflects the base salary for the position. Actual compensation may vary based on job-related factors such as skill set, experience, education, and location. In addition to base salary, this role may be eligible for bonus, equity, and/or commission programs. Nscale may offer a competitive benefits package including medical, dental, vision, flexible paid time off, parental leave, and retirement plan participation.
Salary Range
$150,000 - $215,000 USD
For information on how Nscale handles candidate personal data, please see our Employee & Candidate Privacy Notice.
