• Senior GPU and HPC Infrastructure Engineer - DGX…

    NVIDIA (Santa Clara, CA)
    …of cluster management systems (Kubernetes, SLURM) + Understanding of performance, security and reliability in complex distributed systems. Familiarity with system ... level architecture, data synchronization, fault tolerance and state management. Ways...in Machine Learning Operations. Hands-on experience with Bright Cluster Manager. + Hands-on experience developing and/or operating hardware fleet…
    NVIDIA (07/10/25)