Job Application for SRE Leader at Bybit

About Us

Established in 2018, Bybit is one of the world’s leading cryptocurrency exchanges and digital financial platforms, serving over 80 million users across more than 200 countries and regions. Powered by world-class technology and a user-first mindset, Bybit delivers a seamless ecosystem across trading, payments, wealth management, custody, institutional services, and Web3 — connecting users to the future of digital finance.

Our core values define how we build. We listen, care, and improve to create products and experiences that put users first. Backed by a global team of ambitious builders, problem-solvers, and innovators, we foster a high-performance, fast-moving environment where talent is empowered to drive real impact at global scale. Supported by 24/7 multilingual customer service and a strong commitment to innovation, we are shaping the future of finance through technology, collaboration, and bold execution.

Today, Bybit is recognized as one of the most trusted and transparent platforms in the digital asset industry, continuing to expand its global presence while building the infrastructure for the next generation of financial services.

Core responsibilities

Building the reliability engineering system

Establish a company-wide SLO/SLA system: define quantifiable reliability indicators (availability, latency, error rate) for each line of business, and use error budgets to drive release cadence and investment decisions

Build an MTTD/MTTR measurement system, set targets by severity level, and optimize continuously: for P-1 incidents, target MTTD < 1 min and MTTR < 5 min

Build fault self-healing capabilities: an automated fault detection → diagnosis → recovery pipeline that reduces reliance on manual intervention

Promote chaos engineering practice: run regular fault drills to proactively uncover weak points in the system

Establish a change risk control system: standardized canary releases, pre-change impact assessment, and automatic rollback mechanisms

Cost Governance System (Key Focus)

Build a data-driven, closed-loop cost governance process, automated end to end: cost visualization → attribution analysis → optimization decision → execution verification → continuous monitoring

Establish a rigorous capacity planning model based on the correlation between business metrics (QPS/TPS/number of users) and resource consumption, instead of arbitrarily reserving N times the current capacity

Promote the adoption of a FinOps culture:

Cost chargeback and showback per line of business and application

Define cost efficiency metrics ($/transaction, $/user, $/QPS) and conduct industry benchmarking

Embed cost assessment into the resource request process so that 100% of new resources undergo capacity assessment

Automated cost optimization engine:

Automatic identification of underutilized resources with scale-down recommendations (AI-based anomaly detection and prediction models)

Automated purchase decision system for Reserved Instances and Savings Plans

Optimization of elastic scaling strategies: pre-scaling based on predictive models to reduce over-provisioning

Automatic reclamation and lifecycle management of idle resources

Goal: reduce annual cloud costs by 15-20% without affecting business SLOs.

Automated operations (key focus)

Toil elimination system: measure the team's toil ratio (target < 30%), and systematically identify and automate high-frequency repetitive operations

Full adoption of GitOps/IaC:

100% of infrastructure defined as code, with all changes made through PR review and automated pipelines

Environment consistency: use IaC to detect and automatically remediate configuration drift across dev/staging/prod

AIOps (intelligent operations) development:

AI-based alert aggregation, root cause analysis, and remediation suggestions

Automatic detection of log/metric anomalies, moving from reactive alerting to proactive discovery

Knowledge base AI: query operational status and execute standard operations in natural language

Self-service platform development:

Business teams can complete more than 80% of routine operations tasks (scaling up and down, configuration changes, permission requests) on their own

Target automated handling rate for operations tickets: > 60%

On-call system optimization:

Alert accuracy > 95% (eliminating alert fatigue)

Establish automated runbook execution capability

On-call quality measurement and continuous improvement

Financial-grade cloud isolation and deployment across multiple compliance sites (key focus)

Design and operation of a financial-grade network isolation architecture:

Design and implement network isolation strategies across multiple accounts, VPCs, and regions

Standardized management of security groups, endpoints, and dedicated lines across compliance sites

Zero Trust network architecture implementation: micro-segmentation, least privilege, dynamic access control

Rapid compliance site provisioning capability:

Goal: reduce the time to deploy new compliance site infrastructure from weeks to hours (fully automated)

Standardized compliance site templates: one-click delivery of network topology, security policies, middleware, and monitoring

Automated inter-site isolation verification: Regular automated scans ensure no cross-site data leakage

Multi-cloud and multi-region operations:

Unified operations abstraction layer across AWS/Tencent Cloud/Huawei Cloud that hides underlying differences

Cross-region disaster recovery architecture design: RPO/RTO definition and verification through drills

Data sovereignty guarantees for independently deployed compliance sites (data residency, encryption, audit)

Financial-grade assurance for the core wallet/transaction path:

Operational assurance for the isolated cold/hot wallet architecture

Zero-downtime change capability for the transaction path

SOPs and periodic drills for multi-active/disaster recovery failover

Team Building and Talent Cultivation

Drive the team's transformation from "traditional operations" to "Site Reliability Engineering": solving operations problems with engineering methods

Establish an SRE competency model and growth path: which capabilities are expected at each level from P5 to P7 and how they are measured

Establish knowledge capture and sharing mechanisms: runbooks, a post-mortem culture, internal tech talks

Eliminate single points of failure in staffing: at least 2 people can independently handle each core system

Talent pipeline: develop 2-3 senior SREs who can independently own reliability for a line of business

Job requirements

Required qualifications

More than 10 years of experience in infrastructure/operations/SRE, and more than 5 years of experience leading an SRE/infrastructure team of more than 10 people

Deep understanding of SRE methodology: SLO/SLI/error budgets, toil management, capacity planning, and incident management as hands-on practice, not just concepts

Hands-on experience with large-scale cost management:

Has managed environments with annual cloud spending above $5 million

Systematic FinOps experience (data-driven cost optimization, not resource decisions made on a hunch)

Capacity modeling skills: able to predict resource requirements from business metrics

In-depth practice of automated operations:

A track record of reducing toil from > 50% to < 30%

Proficient in IaC tools (Terraform/Pulumi/CloudFormation) and experienced in large-scale implementation

Experience exploring and implementing AIOps or intelligent operations

Operations experience in financial-grade/compliance environments:

Infrastructure operations experience in the financial industry (banks, exchanges, payments) or in environments with equivalent security requirements

Familiar with multi-account/multi-VPC network isolation architecture design

Experience independently deploying and operating multiple regions and compliance sites

Understanding of the infrastructure implications of compliance requirements and frameworks such as data sovereignty, PCI-DSS, and SOC2

Multi-cloud experience: AWS (required) + at least one other cloud (Tencent Cloud/GCP/Azure)

Programming ability: able to build operations tools and automation systems in Go/Python (building systems, not just writing scripts).

Bonus points

SRE management experience in cryptocurrency exchanges, traditional securities firms, or payment companies

Experience operating large-scale Kubernetes clusters (100+ clusters/10,000+ nodes)

Familiar with high-availability architecture for trading systems (master-slave failover, multi-active deployment, zero-downtime release).

Experience in building internal cost platforms or FinOps tools

Practical experience in chaos engineering (Chaos Monkey/Litmus/in-house tools)

Experience participating in infrastructure preparation for compliance audits such as SOC2/ISO27001/PCI-DSS

The leadership traits we value

Engineering mindset: when facing an operations problem, the first reaction is "how can the system prevent this kind of problem" rather than "be more careful next time".

Data-driven: all decisions are based on metrics, with no acceptance of "it feels okay" or "it has always been this way".

Internalized cost awareness: not passively running cost optimization projects, but integrating cost efficiency into everyday architectural decisions

Scale thinking: when designing a solution, ask "If the number of compliance sites grows from 10 to 30, will this still work?"

Talent developer: able to develop "conventional" engineers into independent SRE experts through method, patience, and high standards

Why Join Us
At Bybit, we are committed to fostering a supportive and enriching work environment.
Our benefits include:
- Study Growth Fund: We support your professional development and continuous learning.
- Internal Events: Participate in regular team-building activities, workshops, and events designed to promote collaboration and innovation.
- Global Collaboration: Be part of a diverse, international team, working alongside colleagues from around the world.
- Career Advancement: Access opportunities for growth and advancement within a rapidly expanding global company.
- Internal Mobility: Grow with us. Your long-term development is important to us, and we offer internal job opportunities to help you build your career path.
