Senior Site Reliability Engineer

San Diego, CA
Full Time
Mid Level

Locations: San Diego, CA or Seattle, WA | Full-Time | Hybrid or Remote Eligible in Select States
Engineering | High-Traffic, Mission-Critical Systems

Our client, a fast-growing global technology leader, is seeking a Senior Site Reliability Engineer to join their high-impact infrastructure team. As their staffing partner, we are managing the hiring process on their behalf.

This role is ideal for someone passionate about scaling reliable, high-performance systems in a cloud-native, automation-driven environment. You’ll work on large-scale, mission-critical applications powering real-time services used by millions.

What You’ll Be Doing

  • Ensure 24/7 uptime by participating in a rotating on-call schedule and managing production incidents across distributed environments.

  • Operate and maintain core systems such as Elasticsearch, Kafka, RabbitMQ, and Redis, with a focus on reliability and performance.

  • Architect monitoring solutions, define SLOs/SLIs, and implement scalable observability tools (e.g., Grafana, Prometheus, Zabbix).

  • Collaborate with engineering teams to optimize capacity, auto-scaling, and system utilization.

  • Develop and maintain automation tools and workflows to support a culture of minimal manual intervention.

  • Troubleshoot infrastructure bottlenecks and improve full-stack performance across services.

  • Own the design and execution of new infrastructure patterns to support continued scale and speed.

  • Maintain clear technical documentation including runbooks, incident response procedures, and architectural diagrams.

What You Bring

  • Bachelor’s degree in Computer Science, Information Systems, or a related technical field.

  • 5+ years of experience supporting mission-critical, real-time, high-traffic systems in a cloud-based or hybrid production environment.

  • Deep expertise in Linux, distributed systems, cloud architecture, and containerized workloads (Docker, Kubernetes, etc.).

  • Skilled in system-level debugging and end-to-end performance optimization.

  • Strong programming/scripting ability in Python, Go, or similar.

  • Experience managing OSS components such as Kafka, Elasticsearch, Redis, and more.

  • Proven ability to reduce incident rates and drive down MTTR through process improvements and tooling.

  • Excellent communication skills and experience working across distributed teams.

Bonus Points

  • Experience with big data infrastructure (e.g., Hadoop, Spark, Hive, HBase).

  • Background in data infrastructure, DBRE, or DBA responsibilities at scale.

  • Familiarity with service mesh technologies and zero-trust architectures.

Compensation & Perks

  • Base Salary: $107,600 – $180,200/year

  • Additional Compensation: Annual bonus and equity (RSUs)

  • Benefits:

    • Full medical, dental, and vision insurance

    • HSA with company contributions + FSA options

    • 401(k) plan with discretionary company match and financial advising

    • Company-paid life, AD&D, short-term & long-term disability insurance

    • Paid holidays, generous PTO, and floating days

    • Employee discounts and perks

    • Weekly catered lunches, stocked snacks, and beverages

    • Gym access & dog-friendly office (select locations)

    • Swag, holiday parties, and internal community events

Additional Notes

  • Hybrid work setup encouraged (onboarding may require brief relocation to San Diego).

  • Remote work options available in: CA, WA, NY, TX, PA, DC, VA, MD, NC, IN.

  • Mandarin fluency is a plus for collaboration with global technical teams.


Ready to help build systems that scale?
Apply now to join a high-impact engineering team building some of the most reliable digital infrastructure on the planet.
