- xAI (Palo Alto, CA)
- …should have a Bachelor's degree and 3+ years of experience in site reliability engineering or systems engineering, alongside strong problem-solving and ... AI company in Palo Alto is seeking a Software Engineer to join its SuperComputing team. In this role, you will ensure the reliability and performance of HPC infrastructure while…
- Crusoe (San Francisco, CA)
- Staff Site Reliability Engineer, Compute ... ensuring performance, security, and scale for modern AI and HPC workloads. What You'll Be Working On In this ... Join to apply for the Staff Site Reliability Engineer, Compute role at Crusoe Crusoe's ... focus will be on optimizing performance for AI and HPC workloads across CPU, GPU, and DPU/NIC resources. You…
- Hamilton Barnes (San Francisco, CA)
- …B200s, ready to go for experimentation, full-scale model training, or inference. As a Platform Engineer/Senior Site Reliability Engineer, you'll own the ... utilisation, and data flow. Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes. Diagnose performance bottlenecks…
- Crusoe (San Francisco, CA)
- …we are building the most sustainable, AI-first cloud infrastructure, and our Compute-focused Site Reliability Engineers are the backbone of that mission. This ... AI platform is recognized as the "gold standard" for reliability and performance. Our data centers are optimized for ... ensuring performance, security, and scale for modern AI and HPC workloads. What You'll Be Working On In this…
- Hamilton Barnes Associates Limited (San Francisco, CA)
- …utilization, and data flow. Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes. Diagnose performance bottlenecks ... and drive continuous improvements in reliability, latency, and throughput. Skills / Must Have 7+ ... Loki) and incident response frameworks. Familiarity with high‑performance computing (HPC) or AI/ML training infrastructure at scale. Background in…
- NVIDIA (Santa Clara, CA)
- …drive foundational improvements and automation to improve engineer's productivity. As a Site Reliability Engineer, you are responsible for the big ... be doing: + Troubleshoot incoming support requests in a large-scale HPC environment. + Contribute enhancements to existing deployment automation, configuration…
- SpaceX (Hawthorne, CA)
- Site Reliability Engineer, GNC (Falcon) ... maintain virtual and physical servers + Work with SpaceX HPC team to monitor and maintain a 4000+ thread ... the ultimate goal of enabling human life on Mars. SITE RELIABILITY ENGINEER, GNC (FALCON) ... HPC cluster + Closely collaborate with GNC software engineers…
- SLAC National Accelerator Laboratory (Menlo Park, CA)
- Senior High Performance Computing Engineer Job ID 6383 Location SLAC - Menlo Park, CA Full-Time Regular **SLAC Job Postings** **About SLAC:** The SLAC National ... the nature of this position, SLAC is open to on-site and hybrid work options.** **Position Overview:** As a Senior High Performance Computing Engineer in the Scientific Computing Services Division of the…
- Amazon (Cupertino, CA)
- Description We are seeking an experienced engineer to work on distributed AI/ML systems. This role involves working on collective operations - the fundamental ... Experience with embedded systems is valued, and experience with high-speed networking or HPC interconnects is valued highly. If you like solving hard problems, want…
- Cadence Design Systems, Inc. (San Jose, CA)
- …Job Description: We are looking for a highly skilled and motivated 3DIC Design Flow Engineer to implement system planning and integration of complex HPC and AI ... AI applications. This is a challenging and rewarding opportunity for a highly motivated engineer with a passion for innovation and a proven track record of success…