Site Reliability Engineer I (SRE)

San Diego, CA
Full Time
Experienced

Location: San Diego, CA (initial onsite onboarding), with remote options available in select U.S. states
Full-Time | Engineering | Mission-Critical Platform

Our client, a fast-growing global technology leader, is seeking a Site Reliability Engineer I (SRE) to join its high-impact infrastructure team.

This role is ideal for someone passionate about scaling reliable, high-performance systems in a cloud-native, automation-driven environment. You’ll work on large-scale, mission-critical applications powering real-time services used by millions.

What You'll Do

  • Ensure 24/7 system availability by participating in an on-call rotation and responding to production incidents.

  • Monitor and manage service capacity, scaling infrastructure in collaboration with engineering and ops teams.

  • Own and operate core open-source services such as Elasticsearch, Kafka, RabbitMQ, and Redis.

  • Improve observability and system resiliency by building tools and optimizing infrastructure.

  • Define and monitor SLOs/SLIs, reduce MTTR, and drive service reliability efforts end-to-end.

  • Maintain runbooks, network diagrams, and technical documentation, and automate manual operations.

  • Support system design efforts for reliability, scalability, and fault tolerance across distributed systems.

  • Proactively resolve issues through full-stack debugging, root cause analysis, and automation.

What You Bring

  • Bachelor’s degree in Computer Science, Information Systems, or related technical field.

  • 3+ years of experience in 24/7 high-traffic, mission-critical environments, preferably in the cloud.

  • Strong troubleshooting skills across application, infrastructure, and network layers.

  • Experience with observability stacks (e.g., Grafana, Prometheus, Zabbix).

  • Familiarity with cloud architecture (load balancers, CI/CD, caching, distributed systems).

  • Proficiency in scripting/programming languages like Python or Go.

  • Hands-on experience with container technologies (Docker, Kubernetes, etc.).

  • Working knowledge of open-source systems like Kafka, Redis, Elasticsearch.

  • Passion for automation and eliminating manual toil through software-driven solutions.

  • Excellent written and verbal communication skills, and comfort collaborating with globally distributed teams.

Nice to Have

  • Background in big data environments (e.g., Hadoop, Hive, Spark).

  • Solid understanding of Linux systems and internals.

  • Familiarity with SRE best practices and production-grade incident response.

Compensation & Perks

  • Base Salary: $101,400 – $166,800/year

  • Additional Compensation: Bonus and RSU/equity offering

  • Insurance: Medical, dental, vision, life, disability

  • 401(k): With discretionary company match and financial advising

  • Time Off: Generous PTO, sick leave, floating holidays

  • Work Environment:

    • Weekly catered lunches

    • Free snacks & beverages

    • Dog-friendly office (select locations)

    • Gym access (select locations)

    • Frequent swag drops, annual holiday events, and pop-up activations

Additional Info

  • Remote work is available for candidates based in CA, DC, IN, MD, NC, PA, VA, WA, TX, and NY.

  • Initial onboarding may require temporary relocation to San Diego (10 days to 3 months, expenses covered).

  • Mandarin proficiency is a plus, as you'll collaborate with technical teams based in China.


Ready to help scale resilient systems powering global operations?
Apply now and become part of a team committed to performance, automation, and reliability at scale.
