Site Reliability Engineer I (SRE)

San Diego, CA

Full Time

Experienced

Location: San Diego, CA (initial onsite onboarding), with remote options available in select U.S. states
Full-Time | Engineering | Mission-Critical Platform

Our client — a fast-growing, global technology leader — is seeking a Site Reliability Engineer I (SRE) to join their high-impact infrastructure team.

This role is ideal for someone passionate about scaling reliable, high-performance systems in a cloud-native, automation-driven environment. You’ll work on large-scale, mission-critical applications powering real-time services used by millions.

What You'll Do

Ensure 24/7 system availability by participating in an on-call rotation and responding to production incidents.
Monitor and manage service capacity, scaling infrastructure in collaboration with engineering and ops teams.
Own and operate core open-source services such as Elasticsearch, Kafka, RabbitMQ, and Redis.
Improve observability and system resiliency by building tools and optimizing infrastructure.
Define and monitor SLOs/SLIs, reduce MTTR, and drive service reliability efforts end-to-end.
Maintain runbooks, network diagrams, and technical documentation, and automate manual operations.
Support system design efforts for reliability, scalability, and fault tolerance across distributed systems.
Proactively resolve issues through full-stack debugging, root cause analysis, and automation.

What You Bring

Bachelor’s degree in Computer Science, Information Systems, or related technical field.
3+ years of experience in 24/7 high-traffic, mission-critical environments, preferably in the cloud.
Strong troubleshooting skills across application, infrastructure, and network layers.
Experience with observability stacks (e.g., Grafana, Prometheus, Zabbix).
Familiarity with cloud architecture (load balancers, CI/CD, caching, distributed systems).
Proficiency in scripting/programming languages like Python or Go.
Hands-on experience with container technologies (Docker, Kubernetes, etc.).
Working knowledge of open-source systems such as Kafka, Redis, and Elasticsearch.
Passion for automation and eliminating manual toil through software-driven solutions.
Excellent written and verbal communication skills, and comfort collaborating with globally distributed teams.

Nice to Have

Background in big data environments (e.g., Hadoop, Hive, Spark).
Solid understanding of Linux systems and internals.
Familiarity with SRE best practices and production-grade incident response.

Compensation & Perks

Base Salary: $101,400 – $166,800/year
Additional Compensation: Bonus plus RSU/equity offering
Insurance: Medical, dental, vision, life, disability
401(k): With discretionary company match and financial advising
Time Off: Generous PTO, sick leave, floating holidays
Work Environment:
- Weekly catered lunches
- Free snacks & beverages
- Dog-friendly office (select locations)
- Gym access (select locations)
- Frequent swag drops, annual holiday events, and pop-up activations

Additional Info

Remote work is available for candidates based in CA, DC, IN, MD, NC, PA, VA, WA, TX, and NY.
Initial onboarding may require temporary relocation to San Diego (10 days to 3 months, expenses covered).
Mandarin proficiency is a plus, as you'll collaborate with technical teams based in China.

Ready to help scale resilient systems powering global operations?
Apply now and become part of a team committed to performance, automation, and reliability at scale.


IntelliPro Group Inc.
