- NVIDIA (Santa Clara, CA)
- …operational capacity of our bare-metal, accelerated compute infrastructure and codify reliability best practices in the broader DGX Cloud platform ecosystem. What ... working with or developing multi-cloud infrastructure services. Experience teaching reliability engineering (e.g., SRE) and/or other scale-oriented cloud systems…
- Florida Crystals Corporation (Crockett, CA)
- …RESPONSIBILITIES + Drives utility improvements in cost, availability, and reliability, while ensuring full regulatory compliance, including environment, health, ... ability to support equipment and technology. + Develops and reviews operation, reliability, and maintenance reports and statistics to plan and modify utilities…
- NVIDIA (Santa Clara, CA)
- …other leads to design & build data center health management workflow. + Drive reliability and optimization in firmware architecture from a data center viewpoint. + ... Speed of Light + Own firmware delivered to data centers in terms of quality, reliability and telemetry performance. What we need to see: + 15+ years of relevant…
- NVIDIA (Santa Clara, CA)
- …+ Implement monitoring and health management capabilities that enable industry-leading reliability, availability, and scalability of GPU assets. You will be ... systems (Kubernetes, SLURM) + Understanding of performance, security and reliability in complex distributed systems. Familiarity with system level architecture,…
- Amazon (Cupertino, CA)
- …systems that validate hardware quality in manufacturing; monitoring and improving hardware reliability in data centers and platform. We cover everything from low ... while taking into consideration our customer needs from a cost, performance, and reliability perspective. About the team Why AWS Amazon Web Services (AWS) is the… more
- Amazon (Cupertino, CA)
- …systems that validate hardware quality in manufacturing; monitoring and improving hardware reliability in data centers and platform. We cover everything from low ... while taking into consideration our customer needs from a cost, performance, and reliability perspective. About the team Within AWS AWS Board Core Design & Services… more
- DoorDash (San Francisco, CA)
- …operating a high-performance, scalable, reliable data abstraction layer that optimizes reliability and efficiency. You will help us bootstrap and scale our internal ... distributed database infrastructure centered around Cassandra, with a focus on reliability, operability, security and efficiency. This infrastructure will be the…
- DoorDash (Sunnyvale, CA)
- …and support strategic initiatives in the search platform. + Build for scale, reliability, and efficiency - Lead design and implementation of critical components to ... or equivalent + 8+ years of industry experience + Passion for reliability & performance - you've built and operated large-scale low-latency distributed systems…
- NVIDIA (Santa Clara, CA)
- …Implementing monitoring and health management capabilities that enable industry-leading reliability, availability, and scalability of GPU assets. You will be ... Working with teams across NVIDIA to ensure production AI clusters run reliably and consistently with maximum performance. Evaluating system failures and improving…
- DoorDash (San Francisco, CA)
- …platform into one that can operate at massive scale with unwavering reliability. + Enjoy solving complex distributed systems problems where reliability, ... latency, throughput, and correctness are all critically important. + Are driven by impact at scale, knowing that your work will directly influence decision-making across the company. + Get energy from simplifying complex architectures without compromising…