Senior Site Reliability Engineer
ABOUT TALON.ONE:
Talon.One is the most powerful incentives engine that unifies loyalty, promotions and gamification into one holistic platform. Backed by enterprise-grade security and scalability, Talon.One empowers companies to build personalized, profitable promotions and loyalty programs using any data.
Today, over 250 of the world’s most-loved brands, including Adidas, Sephora and Carlsberg, work with Talon.One to drive deeper engagement and lasting loyalty with their customers.
ABOUT THE TEAM:
As our Senior Site Reliability Engineer, you will own and drive reliability across the Talon.One platform. This is a hands-on senior role with broad impact. You will shape how we design, measure, and improve reliability across the entire engineering organization.
You will build and evolve our reliability foundations, from observability architecture and SLO frameworks to incident management and production standards. You will not only respond to incidents, but systematically eliminate their root causes. You will reduce operational toil through automation, improve signal quality across our monitoring systems, and guide engineering teams in building resilient, scalable services by design.
If you enjoy building practical systems, setting technical direction, and delivering measurable reliability improvements across a complex distributed platform, this role is for you.
ONCE YOU ARE HERE YOU WILL:
- Own reliability outcomes: availability, latency, error rates, and overall operational health.
- Define and introduce SLOs and error budgets to establish clear reliability targets and drive engineering prioritization.
- Guide the engineering organization with designs, standards, and best practices to ensure reliability and stability across the Talon.One product.
- Build and evolve observability across metrics, logs, and traces, making the system understandable, not just monitored.
- Design and improve our monitoring/observability architecture end-to-end, including data pipelines, signal quality, alert strategy, dashboards, SLO implementation, and cost-aware scalability.
- Eliminate operational toil by building reliability tooling and automation that reduces repetitive work and improves system resilience.
- Drive structural improvements by identifying and addressing the underlying causes of incidents, not just managing their symptoms.
- Lead and continuously improve incident management: on-call readiness, severity handling, stakeholder communication, blameless postmortems, and strong follow-through.
- Drive continuous improvement: reduce noisy alerts, close reliability gaps, and automate recurring operational work.
- Work deeply in Kubernetes and cloud environments, especially Google Cloud, and make deployments safer and more predictable.
- Operate with GitOps principles: reliability changes are versioned, reviewed, traceable, and reproducible.
WHAT WE NEED YOU TO BRING TO THE TABLE:
- A strong sense of ownership for production health, proactively driving improvements in stability, performance, and resilience.
- The ability to establish and evolve SLO-driven reliability practices in an organization that is building this muscle.
- Strong observability instincts with a focus on signal over noise, turning metrics, logs, and traces into actionable insight through clean dashboards, meaningful alerts, and well-defined SLOs instead of alert fatigue.
- Hands-on experience with the Grafana stack, including Prometheus, Grafana Alloy, Loki, and Tempo, with practical knowledge of pipeline design, scaling considerations, and maintaining high signal quality.
- Experience designing or significantly improving monitoring and observability architectures across collection, storage, retention, cardinality control, tagging strategy, cost awareness, and ensuring the reliability of the observability stack itself.
- Solid understanding of Kubernetes workloads, networking, scaling patterns, and failure modes, with real-world experience operating systems in Google Cloud environments.
- Understanding of the OpenTelemetry protocol and its role in modern observability architectures.
- A proactive mindset. You bring solutions, clearly articulate design options and trade-offs, and drive initiatives through to completion.
- Strong communication skills under pressure. You explain clearly during incidents, align teams quickly, and document systems in a way others can follow.
- The ability to raise the reliability bar across teams by setting expectations, influencing engineering practices, and embedding a culture of observability and operational excellence.
OUR TECH STACK:
- Datadog and Datadog Agents
- Grafana Alloy, Prometheus, Loki, Tempo
- OpenTelemetry
- Kubernetes running on Google Cloud
- GitOps and ArgoCD
- Thanos
WHAT WE OFFER:
- A team of 90+ engineers, product managers and product designers in Berlin
- Leaders with 8+ years of experience building our promotions engine
- €1,000 annual learning budget, full LinkedIn Learning access, and free German language courses to boost your skills
- 30 days of annual leave, plus extra paid days for your birthday and moving day
- Home office setup budget and a monthly home office allowance
- Freedom to work from abroad for up to 90 days worldwide!
- Mental health support with nilo.health and a discounted Urban Sports Club membership
- 20% company subsidy on your pension contributions
- Subsidised BVG public transport ticket and a dog-friendly Berlin office where your furry friend is welcome
- Lease your ideal bike through BusinessBike