- Meta (Austin, TX)
- …fabric and host networking, comms lib and scheduling infrastructure. **Required Skills:** AI / HPC Systems Performance Engineer Responsibilities: 1. ... **Summary:** Meta's AI Training and Inference Infrastructure is growing exponentially ... workloads that expect a lossless fabric interconnect. To improve performance of these systems we constantly look…
- Meta (Menlo Park, CA)
- …hardware and software components, co-design 15. Experience in developing or debugging AI / HPC systems, performance optimizations, including familiarity ... or supporting production hardware at scale 9. Experience in deploying and productionizing AI / HPC systems and/or related components at scale 10. Experience in…
- NVIDIA (Santa Clara, CA)
- …to work effectively with diverse teams and individuals. + Experience analyzing and tuning performance for a variety of AI / HPC workloads. + Passion for ... GPU compute clusters that run demanding deep learning, high performance computing, and computationally intensive workloads. We seek a ... storage systems like Lustre and GPFS for AI / HPC workloads + Familiarity with deep learning…
- NVIDIA (Santa Clara, CA)
- …designing and operating large scale storage infrastructure. + Experience analyzing and tuning performance for a variety of AI / HPC workloads. + Experience ... join us today! As a member of the GPU AI / HPC Infrastructure team, you will provide leadership ... solutions to enable runs of demanding deep learning, high performance computing, and computationally intensive workloads. We seek an…
- Argonne National Laboratory (Lemont, IL)
- …on designing the communication infrastructure for next-generation High-Performance Computing (HPC) and Artificial Intelligence (AI) systems. This ... and optimize workload-specialized interconnects and network-aware communication strategies to enhance the performance of AI and HPC workloads. + Implement…
- NVIDIA (Santa Clara, CA)
- …looking for a technical leader to define a vision and roadmap for distributed observability systems for large-scale AI and HPC clusters and workloads and ... and visualization to spectacularly improve efficiency, performance, and productivity of AI and HPC workloads. You will lead technical teams to develop,…
- Meta (Menlo Park, CA)
- …requirements of RDMA workloads that expect a lossless fabric interconnect. To improve performance of these systems we constantly look for opportunities across ... fabric and host networking, comms lib and scheduling infrastructure. **Required Skills:** AI / HPC Network Engineer Responsibilities: 1. Design, develop, test and…
- Ford Motor Company (Dearborn, MI)
- We are seeking a highly skilled and motivated HPC SRE Systems Engineer to join our growing team. You will be responsible for designing, building, and maintaining ... + Design, implement, and maintain a robust and scalable HPC infrastructure to support containerized AI/ML workloads ... Troubleshoot and resolve complex technical issues related to Linux systems, networking, storage, and HPC applications. +…
- NVIDIA (Santa Clara, CA)
- …Be Doing: + Primary responsibilities will include building and enabling robust AI / HPC infrastructure for customers + Support operational and reliability aspects ... of large-scale AI clusters, focusing on performance at scale, ... in working with customers + Expertise with parallel file systems (e.g., Lustre, GPFS, BeeGFS, WekaIO) and high-speed interconnects…
- General Dynamics Information Technology (Fairfax, VA)
- …High Speed Networks, Parallel File systems. Experience running and optimizing HPC performance benchmarks or MPI codes would be a plus. Experience ... Able to Obtain:** None **Public Trust/Other Required:** NACI (T1) **Job Family:** Systems Engineering **Skills:** High-Performance Computing (HPC) Systems…