Expert SRE Engineer
The Role: SRE Expert
As an SRE Expert, you will play a pivotal role in guiding one or more tribes toward achieving their Site Reliability Engineering objectives. You'll operate with a high degree of autonomy, balancing the priorities of individual tribes with those of the broader organization. This role requires strategic thinking, strong stakeholder alignment, and the ability to collaborate within a distributed team in which each member focuses on a specific tribe.
The Team
The team of expert leads has no hierarchical structure and reports directly to the IT lead. Each SRE Expert leads one or more tribes in their SRE practices and implementations. You will also participate in global SRE guilds and take responsibility for a set of SRE capabilities.
Roles & Responsibilities
As an SRE Expert, you will be dedicated to supporting the PSS SRE Team in achieving its objectives. The focus is on creating reliable and available services for customers. Depending on your skills, responsibilities can include:
- Performing in-depth reviews and analysis of the monitoring and alerting setups of asset/service implementations.
- Assessing resilience in architectural designs and providing recommendations.
- Delivering knowledge sessions on SRE topics (e.g., SLI/SLO definition, resilience testing, DR plan reviews, toil reduction).
- Supporting root cause analysis and post-mortems, driving improvements to prevent recurrence.
- Analyzing incident, problem, and change data to identify structural improvements and advocate for them.
- Providing hands-on support to application teams struggling with monitoring and alerting standards.
- Contributing actively to the Engineering & Reliability (E&R) organization.
- Participating in the global SRE guild.
This role requires flexibility, technical depth, and strong communication skills to help the organization evolve.
How to Succeed
We're looking for someone with a broad skill set across both modern and legacy IT stacks, a problem-solving mentality, and the curiosity to tackle any challenge.
Key skills include:
- Extensive knowledge of Linux (RHEL) or Windows.
- Strong SQL knowledge and familiarity with RDBMS (Oracle, MS SQL) or NoSQL (Cassandra).
- Solid understanding of CI/CD standards (preferably Azure DevOps).
- Experience with IT collaboration tools.
- Background in monolithic application landscapes.
- Proficiency in at least one programming language.
- Strong networking knowledge (IPv4 and IPv6).
- Experience with monitoring and alerting tools (Prometheus, ELK, Grafana). Knowledge of OpenTelemetry is a bonus.
- Familiarity with SRE concepts (SLI, SLO, error budgeting, availability reporting).
- Ability to engage with both consumers and engineers.
- Strong problem-solving mentality and healthy skepticism.
- Knowledge of containers and cloud tooling.
- Ability to mentor and coach other SRE experts.
Nice to have:
- Experience with OpenTelemetry.
- Understanding of IT security principles.