Observability Architect
Groupon is a marketplace where customers discover new experiences and services every day and local businesses thrive. To date we have worked with over a million merchant partners worldwide, connecting over 16 million customers with deals across various categories. In a world often dominated by e-commerce giants, we stand out as one of the few platforms uniquely committed to helping local businesses succeed on a performance basis.
Groupon is on a radical journey to transform our business with a relentless pursuit of results. Even with thousands of employees spread across multiple continents, we still maintain a culture that inspires innovation, rewards risk-taking and celebrates success. The impact here can be immediate due to our scale and the speed of our transformation. We're a "best of both worlds" kind of company: big enough to have the resources and scale, but small enough that a single person has a surprising amount of autonomy and can make a meaningful impact.
Position Summary:
The Observability Architect is responsible for designing and implementing comprehensive observability solutions to ensure the health, performance, and reliability of enterprise applications and infrastructure. This role involves defining strategies and frameworks for monitoring, logging, tracing, and alerting, enabling proactive issue detection and resolution. The Observability Architect will collaborate with cross-functional teams to build scalable observability systems that support the organization’s business objectives.
Key Responsibilities:
Strategy and Framework Development:
Develop and maintain the observability strategy, ensuring alignment with business goals and technology standards.
Create frameworks and best practices for observability, including monitoring, logging, tracing, and alerting.
System Design and Implementation:
Design and implement scalable observability solutions across various platforms and technologies.
Integrate observability tools and platforms (e.g., Prometheus, Grafana, ELK Stack, Jaeger) into existing infrastructure.
Ensure end-to-end visibility into system performance, health, and reliability.
Administer GitHub Enterprise Server, including upgrades and maintenance.
Design CI/CD flows and develop maintainable, extensible code and pipelines.
Collaboration and Integration:
Work closely with DevOps, development, and IT operations teams to integrate observability practices into the software development lifecycle.
Partner with stakeholders to understand requirements and translate them into observability solutions.
Data Analysis and Insights:
Analyze monitoring and logging data to identify trends, patterns, and potential issues.
Develop dashboards and reports to provide insights into system performance and reliability.
Use observability data to drive continuous improvement initiatives.
Incident Management and Troubleshooting:
Establish and maintain alerting and escalation processes for timely issue detection and resolution.
Lead incident response efforts, utilizing observability tools to diagnose and resolve issues.
Continuous Improvement:
Stay updated with the latest trends and advancements in observability and monitoring technologies.
Continuously evaluate and enhance observability tools and practices to improve system reliability and performance.
Qualifications:
Education and Experience:
Bachelor’s degree in Computer Science, Information Technology, or a related field.
Minimum of 5 years of experience in observability, monitoring, or related fields.
Proven experience in designing and implementing observability solutions.
Technical Skills:
Proficiency in observability tools and platforms (e.g., Prometheus, Grafana, ELK Stack, Jaeger, Splunk).
Strong understanding of cloud infrastructure (AWS, Azure, Google Cloud) and containerization technologies (Docker, Kubernetes).
Familiarity with scripting and automation (e.g., Python, Shell, Ansible).
Soft Skills:
Excellent problem-solving and analytical skills.
Strong communication and collaboration abilities.
Ability to work independently and as part of a team.
Attention to detail and a commitment to delivering high-quality work.
Preferred Qualifications:
Certifications in relevant technologies (e.g., AWS Certified DevOps Engineer, Certified Kubernetes Administrator).
Experience with AIOps and machine learning for observability.
Working Conditions:
This role may require on-call responsibilities for incident management and resolution.
Flexible working hours and remote work options may be available.
Groupon’s purpose is to build strong communities through thriving small businesses. To learn more about the world’s largest local e-commerce marketplace, click here. You can also find out more about us in the latest Groupon news and learn about our DEI approach. If all of this sounds like a great fit for you, then click apply and join us on a mission to become the ultimate destination for local experiences and services.
Beware of Recruitment Fraud: Groupon follows a merit-based recruitment process without charging job seekers any fees. We've noticed an increase in recruitment fraud, including fake job postings and fraudulent interviews and job offers aimed at stealing personal information or money. Be cautious of individuals falsely representing Groupon's Talent Acquisition team with fake job offers. If you encounter any suspicious job offers or interview calls demanding money, recognize these as scams. Groupon is not responsible for losses from such dealings. For legitimate job openings (and a sneak peek into life at Groupon), always check our official career website at grouponcareers.com.