- NVIDIA (Santa Clara, CA)
- …design, network validation and troubleshooting + Proven expertise in designing large-scale distributed systems, AI clusters, or HPC infrastructure + Ability ... is building the world's most groundbreaking and innovative accelerated computing platforms for AI and HPC. Because of our work, scientists, researchers, and…
- Amazon (Cupertino, CA)
- Description We are seeking an experienced engineer to work on distributed AI/ML systems. This role involves working on collective operations - the fundamental ... operations that enable AI to scale across multiple accelerators & servers. Most...building networking solutions that for Machine Learning (ML) and High-Performance Computing (HPC) workloads on AWS. We…
- NVIDIA (Santa Clara, CA)
- …technical leader to define a vision and roadmap for distributed data platform and observability systems for large-scale AI and HPC clusters and workloads and ... and visualization to spectacularly improve efficiency, performance, and productivity of AI and HPC workloads. You will lead technical teams to develop,…
- NVIDIA (Santa Clara, CA)
- …by great technology - and amazing people. Today, we're tapping into the unlimited potential of AI to define the next era of computing. An era in which our tightly ... software architecture. With a targeted charter to enable best-in-class datacenter-scale performance and efficiency for our next generation of datacenter products,…
- NVIDIA (Santa Clara, CA)
- …Prepare and deliver technical presentations and workshops to customers + Address and optimize customer AI systems performance issues What we need to see: + ... and interpersonal skills to analyze, define, implement and optimize AI/ML and HPC software and system solutions...performance + Experience in designing, running and troubleshooting performance benchmarks for AI systems…
- Oracle (Sacramento, CA)
- …network fabric, supporting millions of devices, multi-region interconnects, and high-performance compute (HPC/AI/GPU) environments. + Integrate ML ... Development Team within OCI's Network Availability organization. This team builds the AI, analytics, and automation systems that power OCI's self-healing cloud…
- NVIDIA (Santa Clara, CA)
- …and Data Structures, Computer Architecture, Compiler Development, Open Source Programming, High-Performance Computing (HPC), Automation Tools (XLA, TVM, ... you're expressing interest in one of our 2026 Systems Software Engineering Internships. We'll review resumes on an...challenges no one else can solve. Our work in AI and digital twins is transforming the world's largest…
- NVIDIA (Santa Clara, CA)
- …etc. + Familiarity with at-scale GPU systems in general, encompassing performance testing, AI benchmarking, and more. + Practical involvement in cluster ... expertise in data center design, development and execution for AI and HPC. + Efficient time management...HPC cluster settings. + Practical knowledge of NVIDIA systems technology such as NCCL, DCGM, UFM, Mission Control,…
- NVIDIA (Santa Clara, CA)
- …10+ years of experience in at least two of the following: HPC/large-scale cluster administration, Linux systems engineering, infrastructure automation (e.g., ... optimization. + Hands-on experience using cluster telemetry and dashboard tools to assess HPC and AI clusters (e.g., Prometheus, Grafana, DCGM, and similar…
- Oracle (Sacramento, CA)
- …the forefront of building a cutting-edge, ultra-high-performance GPU platform designed to support AI/ML/HPC workloads. This is your chance to be part of the ... AI revolution, creating systems that allow customers...and diagnostic services. These are essential for running distributed AI/ML/HPC workloads across thousands of GPUs, leveraging…