Site Reliability Engineer I (SRE)
Location: San Diego, CA (initial onsite onboarding), with remote options available in select U.S. states
Full-Time | Engineering | Mission-Critical Platform
Our client, a fast-growing global technology leader, is seeking a Site Reliability Engineer I (SRE) to join its high-impact infrastructure team.
This role is ideal for someone passionate about scaling reliable, high-performance systems in a cloud-native, automation-driven environment. You’ll work on large-scale, mission-critical applications powering real-time services used by millions.
What You'll Do
Ensure 24/7 system availability by participating in an on-call rotation and responding to production incidents.
Monitor and manage service capacity, scaling infrastructure in collaboration with engineering and ops teams.
Own and operate core open-source services such as Elasticsearch, Kafka, RabbitMQ, and Redis.
Improve observability and system resiliency by building tools and optimizing infrastructure.
Define and monitor SLOs/SLIs, reduce MTTR, and drive service reliability efforts end-to-end.
Maintain runbooks, network diagrams, and technical documentation, and automate manual operations.
Support system design efforts for reliability, scalability, and fault tolerance across distributed systems.
Proactively resolve issues through full-stack debugging, root cause analysis, and automation.
What You Bring
Bachelor’s degree in Computer Science, Information Systems, or related technical field.
3+ years of experience in 24/7 high-traffic, mission-critical environments, preferably in the cloud.
Strong troubleshooting skills across application, infrastructure, and network layers.
Experience with observability stacks (e.g., Grafana, Prometheus, Zabbix).
Familiarity with cloud architecture (load balancers, CI/CD, caching, distributed systems).
Proficiency in scripting/programming languages like Python or Go.
Hands-on experience with container technologies (Docker, Kubernetes, etc.).
Working knowledge of open-source systems like Kafka, Redis, Elasticsearch.
Passion for automation and eliminating manual toil through software-driven solutions.
Excellent written and verbal communication skills, and comfort collaborating with globally distributed teams.
Nice to Have
Background in big data environments (e.g., Hadoop, Hive, Spark).
Solid understanding of Linux systems and internals.
Familiarity with SRE best practices and production-grade incident response.
Compensation & Perks
Base Salary: $101,400 – $166,800/year
Additional Compensation: Bonus plus RSU/equity offering
Insurance: Medical, dental, vision, life, disability
401(k): With discretionary company match and financial advising
Time Off: Generous PTO, sick leave, floating holidays
Work Environment:
Weekly catered lunches
Free snacks & beverages
Dog-friendly office (select locations)
Gym access (select locations)
Frequent swag drops, annual holiday events, and pop-up activations
Additional Info
Remote work is available for candidates based in CA, DC, IN, MD, NC, PA, VA, WA, TX, and NY.
Initial onboarding may require temporary relocation to San Diego (10 days to 3 months, expenses covered).
Mandarin proficiency is a plus, as you'll collaborate with technical teams based in China.
Ready to help scale resilient systems powering global operations?
Apply now and become part of a team committed to performance, automation, and reliability at scale.