Site Reliability Engineer (SRE) - AI Infrastructure (San Francisco) Job at Hamilton Barnes Associates Limited, San Francisco, CA

  • Hamilton Barnes Associates Limited
  • San Francisco, CA

Job Description

Are you looking for an exciting new opportunity?

Join a stealth-mode hyperscale data center startup building a next-generation AI and cloud platform designed for startups and advanced research, powered by thousands of H100, H200, and B200 GPUs available on demand. Their platform supports everything from rapid experimentation to full-scale model training and inference, with flexible orchestration via Slurm, Kubernetes, or direct SSH access.

This is a rare opportunity to work at the intersection of hyperscale infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment. If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it.

Responsibilities

  • Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
  • Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
  • Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
  • Collaborate with ML, networking, and platform teams to optimize resource scheduling, GPU utilization, and data flow.
  • Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
  • Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Skills / Must Have

  • 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
  • Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
  • Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
  • Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
  • Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
  • Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
  • Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.

Benefits

  • Equity

Salary

  • $300,000 gross per year

Job Tags

Full time, Flexible hours
