Senior Site Reliability Engineer
Locations: San Diego, CA or Seattle, WA | Full-Time | Hybrid or Remote Eligible in Select States
Engineering | High-Traffic, Mission-Critical Systems
Our client, a fast-growing global technology leader, is seeking a Senior Site Reliability Engineer to join its high-impact infrastructure team. As a trusted staffing partner, we are managing the hiring process on the client's behalf.
This role is ideal for someone passionate about scaling reliable, high-performance systems in a cloud-native, automation-driven environment. You’ll work on large-scale, mission-critical applications powering real-time services used by millions.
What You’ll Be Doing
Ensure 24/7 uptime by participating in a rotating on-call schedule and managing production incidents across distributed environments.
Operate and maintain core systems such as Elasticsearch, Kafka, RabbitMQ, and Redis, with a focus on reliability and performance.
Architect monitoring solutions, define SLOs/SLIs, and implement scalable observability tools (e.g., Grafana, Prometheus, Zabbix).
Collaborate with engineering teams to optimize capacity, auto-scaling, and system utilization.
Develop and maintain automation tools and workflows to support a culture of minimal manual intervention.
Troubleshoot infrastructure bottlenecks and improve full-stack performance across services.
Own the design and execution of new infrastructure patterns to support continued scale and speed.
Maintain clear technical documentation including runbooks, incident response procedures, and architectural diagrams.
What You Bring
Bachelor’s degree in Computer Science, Information Systems, or a related technical field.
5+ years of experience supporting mission-critical, real-time, high-traffic systems in a cloud-based or hybrid production environment.
Deep expertise in Linux, distributed systems, cloud architecture, and containerized workloads (Docker, Kubernetes, etc.).
Skilled in system-level debugging and end-to-end performance optimization.
Strong programming/scripting ability in Python, Go, or similar.
Experience managing open-source components such as Kafka, Elasticsearch, and Redis.
Proven ability to reduce incident rates and drive down MTTR through process improvements and tooling.
Excellent communication skills and experience working across distributed teams.
Bonus Points
Experience with big data infrastructure (e.g., Hadoop, Spark, Hive, HBase).
Background in data infrastructure, DBRE, or DBA responsibilities at scale.
Familiarity with service mesh technologies and zero-trust architectures.
Compensation & Perks
Base Salary: $107,600 – $180,200/year
Additional Compensation: Annual bonus and equity (RSUs)
Benefits:
Full medical, dental, and vision insurance
HSA with company contributions + FSA options
401(k) plan with discretionary company match and financial advising
Company-paid life, AD&D, short-term & long-term disability insurance
Paid holidays, generous PTO, and floating days
Employee discounts and perks
Weekly catered lunches, stocked snacks, and beverages
Gym access & dog-friendly office (select locations)
Swag, holiday parties, and internal community events
Additional Notes
Hybrid work is encouraged (onboarding may require a brief relocation to San Diego).
Remote work is available in CA, WA, NY, TX, PA, DC, VA, MD, NC, and IN.
Mandarin fluency is a plus for collaboration with global technical teams.
Ready to help build systems that scale?
Apply now to join a high-impact engineering team building some of the most reliable digital infrastructure on the planet.