- NVIDIA (Santa Clara, CA)
- …+ Implement monitoring and health management capabilities that enable industry-leading reliability, availability, and scalability of GPU assets. You will be ... systems (Kubernetes, SLURM) + Understanding of performance, security and reliability in complex distributed systems. Familiarity with system level architecture,…
- NVIDIA (Santa Clara, CA)
- …and board designers, software/firmware engineers, HW/SW applications engineering, process/reliability specialists, ATE engineers, product managers, sales, and ... path analysis, power analysis, process technologies, transistor/device physics, silicon reliability, and aging mechanisms. + Familiarity with Perl, C/C++, tool…
- Amazon (San Francisco, CA)
- …are responsible for the app architecture, developer onboarding, mobile app releases, reliability, and ensuring production issues get routed to the right teams and ... language experience - 5+ years of leading design or architecture (design patterns, reliability and scaling) of new and existing systems experience - Experience as a…
- NVIDIA (Santa Clara, CA)
- …Implementing monitoring and health management capabilities that enable industry-leading reliability, availability, and scalability of GPU assets. You will be ... Working with teams across NVIDIA to ensure production AI clusters run reliably and consistently with maximum performance. Evaluating system failures and improving…
- Ford Motor Company (Palo Alto, CA)
- …specifications, and interfaces considering factors such as bandwidth, latency, reliability, and scalability. Characterize the Vehicle Network by laying ... performance analysis, simulation, and testing to validate the functionality, reliability, and robustness of network communication systems under various operating…
- Google (Sunnyvale, CA)
- …who use Google services around the world. We prioritize security, efficiency, and reliability across everything we do - from developing our latest TPUs to running ... a global network, while driving towards shaping the future of hyperscale computing. Our global impact spans software and hardware, including Google Cloud's Vertex AI, the leading AI platform for bringing Gemini models to enterprise customers. The US base…
- Google (Sunnyvale, CA)
- …for customer events and launches, partnering with Support, Engineers, and Site Reliability Engineers to ensure customer success, and work with customers and support ... to guide issues/escalations to resolution. + Develop best practices and assets based on learnings from customer engagements to support initiatives to scale through partners and accelerate Google Cloud adoption. Google is proud to be an equal opportunity…
- Google (Mountain View, CA)
- …who use Google services around the world. We prioritize security, efficiency, and reliability across everything we do - from developing our latest TPUs to running ... a global network, while driving towards shaping the future of hyperscale computing. Our global impact spans software and hardware, including Google Cloud's Vertex AI, the leading AI platform for bringing Gemini models to enterprise customers. The US base…