Staff Site Reliability Engineer (SRE)
Full remote · Paris, Paris, France
About the job
Alma shapes the fintech landscape. We strive to serve and empower consumers and merchants by developing innovative solutions that redefine their purchase experience.
About the mission
- Organize and prioritize SRE roadmaps to ensure that the infrastructure is aligned with the needs of internal and external customers.
- Lead cross-functional initiatives within the product teams.
- Regularly interact with stakeholders and senior management, ensuring alignment and effective communication on key initiatives.
- Promote automation and SRE best practices to optimize operational efficiency.
- Develop and maintain backup and disaster recovery strategies to protect data and ensure business continuity.
- Design, implement and maintain monitoring tools to track key system metrics, health indicators and our SLAs/SLOs.
- Provide technical support and expertise to engineering teams for the resolution of application and infrastructure incidents.
- Carry out in-depth analyses of incidents to identify the underlying causes and put corrective measures in place.
- Maintain the platform in operational condition by implementing updates, security patches and continuous improvements.
- Help optimize the platform's operating costs.
- Support and guide SREs through knowledge-sharing and collaboration, fostering continuous improvement across the team.
About you
- At least 8 years of experience managing cloud infrastructure.
- Experience in project management, enabling you to oversee and drive initiatives from planning to successful delivery.
- Strong presentation and communication skills to collaborate with different teams and share problems and solutions effectively.
- Deep knowledge of Google Cloud Platform or other cloud providers.
- Good networking knowledge.
- Experience in setting up and maintaining monitoring tools, and in analyzing metrics and malfunctions.
- Hands-on experience with Infrastructure as Code.
- Ability to solve problems methodically and work effectively under pressure during critical incidents.
- Working knowledge of English.
Our technical stack
- Cloud providers: GCP, Cloudflare, AWS
- Backend: Python + FastAPI and Flask
- Frontend: React / TypeScript
- Database technologies: PostgreSQL, Redis, BigQuery
- Log and error management: Datadog, Sentry
- CI/CD: GitHub Actions, Docker
- Monitoring: Datadog
- Infrastructure as Code: Terraform
About the recruitment process
- Interview with Talent Acquisition (30-45 min)
- Interview with Engineering Manager (45-60 min)
- Take-home coding test, followed by a remote feedback session and a system design test (90 min)
- Team Fit interview (30 min)