SRE Leader

Kuala Lumpur, Malaysia

About Us
Established in March 2018, Bybit is one of the fastest-growing cryptocurrency derivatives exchanges, with more than 70 million registered users. We offer a professional platform where crypto traders can find an ultra-fast matching engine, excellent customer service, and multilingual community support. We provide innovative online spot and derivatives trading services, mining and staking products, and API support to retail and institutional clients around the world, and we strive to be the most reliable exchange for the emerging digital asset class.

Our core values define us. We listen, care, and improve to create a faster, fairer, and more humane trading environment for our users. Our innovative, highly advanced, user-friendly platform has been designed from the ground up using best-in-class infrastructure to provide our users with the industry's safest, fastest, fairest, and most transparent trading experience. Built on customer-centric values, we endeavour to provide professional, 24/7 multilingual customer support that helps our users in a timely manner.

As of today, Bybit is one of the most trusted, reliable, and transparent cryptocurrency derivatives platforms in the space.

Core responsibilities
 
Building the reliability engineering system
 
  • Establish a company-wide SLO/SLA system: define quantifiable reliability indicators (availability, latency, error rate) for each Line of Business, and use the Error Budget to drive release cadence and investment decisions
  • Build an MTTD/MTTR measurement system, set tiered targets by severity, and optimize continuously: for P-1 incidents, target MTTD < 1 min and MTTR < 5 min
  • Build fault self-healing capabilities: an automated fault detection → diagnosis → recovery pipeline that reduces reliance on manual intervention
  • Promote chaos engineering practice: conduct regular fault drills and proactively uncover weak links in the system
  • Establish a change risk control system: standardized canary releases, pre-assessment of change impact, and automatic rollback mechanisms
 
Cost governance system (key focus)
 
  • Build a data-driven, closed-loop cost governance process: cost visualization → attribution analysis → optimization decision → execution verification → continuous monitoring, automated end to end
  • Establish a rigorous capacity planning model based on the correlation between business indicators (QPS, TPS, number of users) and resource consumption, instead of arbitrary N-fold over-provisioning
  • Promote the adoption of a FinOps culture:
    • Cost chargeback and showback by Line of Business and application
    • Define cost efficiency metrics ($/transaction, $/user, $/QPS) and benchmark them against the industry
    • Embed cost assessment into the resource request process so that 100% of new resources receive a capacity assessment
  • Automated cost optimization engine:
    • Automatic detection of underutilized resources and downsizing recommendations (AI-based anomaly detection and prediction models)
    • Automated purchase decision system for Reserved Instances and Savings Plans
    • Optimized autoscaling strategies: pre-scaling based on predictive models to reduce over-provisioning
    • Automatic reclamation and lifecycle management of idle resources
  • Goal: reduce annual cloud costs by 15-20% without affecting business SLOs
 
Automated operations (key focus)
 
  • Toil elimination system: measure the team's toil ratio (target < 30%), and systematically identify and automate high-frequency repetitive operations
  • Full implementation of GitOps/IaC:
    • 100% of infrastructure defined as code, with all changes executed through PR review and automated pipelines
    • Environment consistency: use IaC to provide drift detection and automatic remediation for dev/staging/prod configuration
  • Building intelligent operations (AIOps):
    • AI-based alert aggregation, root cause analysis, and remediation suggestions
    • Automatic detection of log and metric anomalies, moving from reactive alerting to proactive discovery
    • Knowledge base AI: natural-language queries of operational status and execution of standard operations
  • Building a self-service platform:
    • Business teams can complete more than 80% of routine operations (scaling up and down, configuration changes, permission requests) on their own
    • Target automated handling rate for operations tickets > 60%
  • On-call system optimization:
    • Alert accuracy > 95% (eliminating alert fatigue)
    • Establish automated Runbook execution
    • On-call quality measurement and continuous improvement
 
Financial-grade cloud isolation and multi-site compliance deployment (key focus)
 
  • Design and operation of a financial-grade network isolation architecture:
    • Design and implementation of network isolation strategies across multiple accounts, multiple VPCs, and multiple regions
    • Standardized management of security groups, endpoints, and dedicated lines across compliance stations
    • Implementation of a Zero Trust network architecture: micro-segmentation, least privilege, dynamic access control
  • Rapid provisioning of compliance stations:
    • Goal: reduce deployment time for new compliance station infrastructure from weeks to hours (fully automated)
    • Standardized compliance station templates: one-click delivery of network topology, security policy, middleware, and monitoring
    • Automated inter-site isolation verification: regular automated scans to ensure no cross-site data leakage
  • Multi-cloud and multi-region operations:
    • A unified operations abstraction layer across AWS, Tencent Cloud, and Huawei Cloud that hides the underlying differences
    • Cross-region disaster recovery architecture design: RPO/RTO definition and verification through drills
    • Data sovereignty guarantees for independently deployed compliance stations (data residency, encryption, audit)
  • Financial-grade assurance for the core wallet and transaction path:
    • Operational assurance for the isolated cold and hot wallet architecture
    • Zero-downtime change capability on the transaction path
    • SOPs and periodic drills for multi-active and disaster recovery switchover
 
Team Building and Talent Cultivation
 
  • Drive the team's transformation from "traditional operations" to "Site Reliability Engineering": solve operations problems with engineering methods
  • Establish an SRE competency model and growth path: which abilities are expected at each level from P5 to P7, and how to measure them
  • Establish knowledge capture and sharing mechanisms: Runbooks, a post-mortem culture, internal Tech Talks
  • Eliminate single points of failure in staffing: at least 2 people can handle each core system independently
  • Talent pipeline: develop 2-3 senior SREs who can independently own reliability for a Line of Business
 
 
Job requirements
 
Required qualifications
 
  • More than 10 years of experience in infrastructure/operations/SRE, including more than 5 years leading an SRE/Infra team of more than 10 people
  • Deep understanding of SRE methodology: SLO/SLI/Error Budget, Toil Management, Capacity Planning, and Incident Management as hands-on practice, not just concepts
  • Hands-on experience with large-scale cost management:
    • Has managed environments with annual cloud spending above $5 million
    • Systematic FinOps experience (data-driven cost optimization, not ad hoc resource cuts)
    • Capacity modeling skills: able to predict resource requirements from business metrics
  • In-depth practice of operations automation:
    • A track record of reducing toil from > 50% to < 30%
    • Proficient in IaC tools (Terraform/Pulumi/CloudFormation), with experience implementing them at scale
    • Experience exploring and implementing AIOps or intelligent operations
  • Operations experience in financial-grade or compliance-driven environments:
    • Infrastructure operations experience in the financial industry (banks, exchanges, payments) or in environments with equivalent security requirements
    • Familiar with multi-account/multi-VPC network isolation architecture design
    • Experience with independent deployment and operation across multiple regions and compliance stations
    • Understanding of the infrastructure requirements of compliance frameworks and obligations such as data sovereignty, PCI-DSS, and SOC2
  • Multi-cloud experience: AWS (required) plus at least one other cloud (Tencent Cloud/GCP/Azure)
  • Programming ability: able to build operations tools and automation systems in Go/Python (building systems, not just writing scripts)
 
Bonus points
 
  • SRE management experience at cryptocurrency exchanges, traditional securities firms, or payment companies
  • Experience operating Kubernetes at scale (100+ clusters / 10,000+ nodes)
  • Familiar with high-availability architecture for trading systems (primary/standby failover, multi-active deployment, zero-downtime releases)
  • Experience building internal cost platforms or FinOps tools
  • Hands-on experience with chaos engineering (Chaos Monkey/Litmus/in-house tooling)
  • Experience preparing infrastructure for compliance audits such as SOC2/ISO27001/PCI-DSS
 
The leadership traits we value
 
- Engineering thinking: when facing an operations problem, the first reaction is "how do we prevent this class of problem in the system" rather than "be more careful next time"
- Data-driven: all decisions are based on metrics, not on "it feels okay" or "it has always been like this"
- Internalized cost awareness: integrates cost efficiency into daily architectural decisions rather than treating cost optimization as a reactive project
- Scale thinking: when designing a plan, asks "If the number of compliance stations grows from 10 to 30, will this plan still work?"
- Talent cultivator: able to develop "conventional" engineers into independent SRE experts, with method, patience, and high standards

Why Join Us
At Bybit, we are committed to fostering a supportive and enriching work environment. 
Our benefits include:
- Study Growth Fund: We support your professional development and continuous learning.
- Internal Events: Participate in regular team-building activities, workshops, and events designed to promote collaboration and innovation.
- Global Collaboration: Be part of a diverse, international team, working alongside colleagues from around the world.
- Career Advancement: Access opportunities for growth and advancement within a rapidly expanding global company.
- Internal Mobility: Grow with us. Your long-term development is important to us, and we offer internal job opportunities to help you build your career path.
