• Production Engineer

    Meta (Menlo Park, CA)
    …14. 7. Networking protocols including at least one of the following: RDMA, InfiniBand, DHCP, NTP, SCP, SFTP or SNMP 15. 8. Maintaining web-based applications using ... at least one of the following: Apache, Memcached, or Squid 16. 9. Version Control tools including Mercurial 17. 10. Diagnosing and troubleshooting issues ranging from low-level hardware issues to large scale failures within datacenter clusters **Public…
    Meta (08/01/25)
  • Production Engineering Manager

    Meta (Menlo Park, CA)
    …large scale. 27. Experience with testing open-source hardware. 28. Familiarity with InfiniBand (IB), Remote Direct Memory Access (RDMA), and RDMA over Converged ... Ethernet (RoCE) network topologies, as well as their respective physical infrastructure. **Public Compensation:** $177,000/year to $251,000/year + bonus + equity + benefits **Industry:** Internet **Equal Opportunity:** Meta is proud to be an Equal Employment…
    Meta (08/01/25)
  • Software Engineer, SystemML - AI Networking

    Meta (Menlo Park, CA)
    …7. Experience with NCCL and distributed GPU performance analysis on RoCE/InfiniBand 8. PhD in Computer Science, Computer Engineering, or relevant technical ... field 9. Knowledge of GPU architectures and CUDA programming 10. Knowledge of ML, deep learning and LLM 11. Experience with both data parallel and model parallel training, such as Distributed Data Parallel, Fully Sharded Data Parallel (FSDP), Tensor Parallel,…
    Meta (08/01/25)
  • Technical Program Manager, AI Network Infra

    Meta (Menlo Park, CA)
    …with ODMs and silicon vendors. 22. Experience in Network protocols (RoCE, InfiniBand, Ethernet). 23. Experience working with large scale distributed systems. 24. ... Experience with data center architecture & Deployment. 25. Experience with AI training and inference model deployments to physical infrastructure. **Public Compensation:** $167,000/year to $230,000/year + bonus + equity + benefits **Industry:** Internet…
    Meta (08/01/25)
  • Senior Software Architect, AI and HPC

    NVIDIA (Santa Clara, CA)
    …with designing communication middleware for high-performance computing systems, including InfiniBand, DPUs, Ethernet, and Shared Memory; + Experience developing and ... implementing features for compilers, optimizations for compilers, particularly Clang/LLVM, and NVIDIA compilers; + Experience implementing communications libraries, particularly MPI, OpenSHMEM, NCCL, NVSHMEM, UCX, UCC, or PGAS; + Background with CUDA…
    NVIDIA (07/31/25)
  • Senior AI-HPC Cluster Engineer - MLOps

    NVIDIA (Santa Clara, CA)
    …and models. + Familiarity with High-Speed Networking pertaining to HPC including InfiniBand, RDMA, RoCE and Amazon EFA. + Understanding of fast, distributed storage ... systems like Lustre and GPFS for AI/HPC workload. Experience working with deep learning frameworks including PyTorch, MegatronLM and TensorFlow. + Familiarity with metrics collection and visualization at scale with Prometheus, OpenSearch and Grafana. NVIDIA…
    NVIDIA (07/31/25)
  • Tech Engagement Lead - Model Builder

    NVIDIA (Santa Clara, CA)
    …This includes NVIDIA GPU architectures, DGX systems, high-performance networking (InfiniBand), CUDA-X libraries, NeMo frameworks, and inference libraries like ... TensorRT. Integrate these into the training and inference pipelines of large model builders. + Strengthen Partnerships: Support and strengthen technical implementation plans with partner AI engineering and researchers. Define clear technical objectives,…
    NVIDIA (07/29/25)
  • Engineering Manager - Rack Scale AI Systems

    NVIDIA (Santa Clara, CA)
    …cluster computing (MPI), data center design include high speed interconnect InfiniBand, Cluster Storage and Scheduling related design and/or management experience. ... + Experience with converged and hyper-converged hardware and servers. + Strong background on Windows & Linux administration. NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing and Visualization. The…
    NVIDIA (07/29/25)
  • Senior Software Engineer - Simulation…

    NVIDIA (Santa Clara, CA)
    …These platforms bring together the full power of NVIDIA GPUs, NVIDIA NVLink, NVIDIA InfiniBand networking, NVIDIA Grace CPUs, and a fully optimized NVIDIA AI and HPC ... software stack. We are hiring Sr. Software Engineer who will help build simulators for our DGX Server platforms. Simulations play a significant role in building scalable systems at Speed of Light! You will work with world class engineering teams across HW and…
    NVIDIA (07/26/25)
  • Senior DGX Cloud Software Engineer…

    NVIDIA (Santa Clara, CA)
    …with accelerated compute and communications technologies such as BlueField Networking, InfiniBand topologies, NVMesh, and/or the NVIDIA Collective Communication Library ... (NCCL). + Experience working with a centralized security organization to prioritize and mitigate security risks. Prior experience in a ML/AI focused role or on a team matching specific keywords is welcome but not required. NVIDIA is leading the way in…
    NVIDIA (07/26/25)