- NVIDIA (Santa Clara, CA)
- …design, network validation and troubleshooting + Proven expertise in designing large-scale distributed systems, AI clusters, or HPC infrastructure + Ability ... is building the world's most groundbreaking and innovative accelerated computing platforms for AI and HPC. Because of our work, scientists, researchers, and…
- Amazon (Cupertino, CA)
- Description We are seeking an experienced engineer to work on distributed AI/ML systems. This role involves working on collective operations - the fundamental ... operations that enable AI to scale across multiple accelerators & servers. Most ... building networking solutions that for Machine Learning (ML) and High-Performance Computing (HPC) workloads on AWS. We…
- NVIDIA (Santa Clara, CA)
- …technical leader to define a vision and roadmap for distributed data platform and observability systems for large-scale AI and HPC clusters and workloads and ... and visualization to spectacularly improve efficiency, performance, and productivity of AI and HPC workloads. You will lead technical teams to develop,…
- NVIDIA (Santa Clara, CA)
- …by great technology - and amazing people. Today, we're tapping into the unlimited potential of AI to define the next era of computing. An era in which our tightly ... software architecture. With a targeted charter to enable best-in-class datacenter-scale performance and efficiency for our next generation of datacenter products,…
- NVIDIA (Santa Clara, CA)
- …Prepare and deliver technical presentations and workshops to customers + Address and optimize customer AI systems performance issues What we need to see: + ... and interpersonal skills to analyze, define, implement and optimize AI/ML and HPC software and system solutions ... performance + Experience in designing, running and troubleshooting performance benchmarks for AI systems…
- Oracle (Sacramento, CA)
- …network fabric, supporting millions of devices, multi-region interconnects, and high-performance compute (HPC/AI/GPU) environments. + Integrate ML ... Development Team within OCI's Network Availability organization. This team builds the AI, analytics, and automation systems that power OCI's self-healing cloud…
- NVIDIA (Santa Clara, CA)
- …and Data Structures, Computer Architecture, Compiler Development, Open Source Programming, High-Performance Computing (HPC), Automation Tools (XLA, TVM, ... you're expressing interest in one of our 2026 Systems Software Engineering Internships. We'll review resumes on an ... challenges no one else can solve. Our work in AI and digital twins is transforming the world's largest…
- NVIDIA (Santa Clara, CA)
- …10+ years of experience in at least two of the following: HPC/large-scale cluster administration, Linux systems engineering, infrastructure automation (e.g., ... optimization. + Hands-on experience using cluster telemetry and dashboard tools to assess HPC and AI clusters (e.g., Prometheus, Grafana, DCGM, and similar…
- NVIDIA (Santa Clara, CA)
- …etc. + Familiarity with at-scale GPU systems in general, encompassing performance testing, AI benchmarking, and more. + Practical involvement in cluster ... expertise in data center design, development and execution for AI and HPC. + Efficient time management ... HPC cluster settings. + Practical knowledge of NVIDIA systems technology such as NCCL, DCGM, UFM, Mission Control,…
- Oracle (Sacramento, CA)
- …in the RDMA cluster networking domain and enable seamless, accelerated High-Performance Compute (HPC), Artificial Intelligence and Machine Learning advancements. ... force, driving the development and design of state-of-the-art RDMA clusters tailored specifically for AI, ML, HPC workloads. We strive to be the go-to experts in…