Alerted.org


  • Senior ML Storage Engineer - GPU Clusters

    NVIDIA (Santa Clara, CA)




    NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s a unique legacy of innovation that’s fueled by great technology—and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join the team and see how you can make a lasting impact on the world.

     

    We are seeking a highly skilled and experienced Site Reliability Engineer to design, deploy, and manage high-speed storage offerings in our large-scale GPU clusters. These clusters will power AI workloads across multiple teams and projects, making a significant impact on the future of machine learning and artificial intelligence at NVIDIA. Join our engineering team and collaborate with researchers, AI engineers, and infrastructure teams to ensure our GPU clusters perform efficiently, scale well, and remain reliable. The ideal candidate has a passion for operational excellence, automation, and working in a multi-cloud environment, and focuses on identifying architectural changes encompassing file, block, and object storage to meet the requirements of an expanding cloud infrastructure. As a member of the team, you will help us address the strategic challenges of next-generation storage solutions: storage design for large-scale, high-performance workloads, evolving our private/public cloud strategy, capacity modelling, and growth planning across our global computing environment.

    What you will be doing:

    + Research and implement distributed storage services.

    + Design and implement scalable and efficient storage solutions tailored for data-intensive AI applications, optimizing performance and cost-effectiveness.

    + Continuously improve storage infrastructure provisioning, management, observability, and day-to-day operations through automation.

    + Ensure the highest level of uptime and quality of service (QoS) through operational excellence, proactive monitoring, and incident resolution.

    + Support globally distributed on-premises and cloud environments such as AWS, GCP, Azure, or OCI.

    + Define and implement service level objectives (SLOs) and service level indicators (SLIs) to measure and ensure infrastructure quality.

    + Write high-quality Root Cause Analysis (RCA) reports for production-level incidents and work towards preventing future occurrences.

    + Support our researchers in running their flows on our clusters, including performance analysis and optimization of deep learning workflows, and participate in the team's on-call rotation to support critical infrastructure.

    + Drive the evaluation and integration of storage solutions with new GPUs, such as GB200, and cloud technologies to improve system performance.

    What we need to see:

    + Minimum BS degree in Computer Science (or equivalent experience), with 6+ years managing high-speed storage solutions deployed for GPU clusters or similar high-performance computing environments.

    + Expertise in designing, deploying, and running production-level cloud services.

    + Experience with one or more parallel or distributed filesystems, such as Lustre or GPFS, is a must, including experience analyzing and tuning performance for a variety of AI/HPC workloads.

    + Experience with the architectural design and operation of storage solutions on any of the leading cloud environments (AWS, Azure, or GCP).

    + Proficiency with orchestration and containerization tools like Kubernetes, Docker, or similar.

    + Experience coding/scripting in at least two high-level programming languages (e.g., Python, Go, Ruby).

    + Proficient in modern CI/CD techniques and Infrastructure as Code (IaC) using tools such as Terraform or Ansible.

    + Diligent with strong communication and documentation skills.

    Ways to stand out from the crowd:

    + Experience running large-scale Slurm/LSF and/or BCM deployments in production environments.

    + Expertise in modern container networking and storage architecture.

    + Experience with machine learning and deep learning concepts, algorithms, and models.

    + Consistent record of defining and driving operational excellence in highly distributed, high-performance environments.

     

    NVIDIA provides competitive salaries and a comprehensive benefits package. Our engineering teams are expanding rapidly due to exceptional growth. If you're a hardworking and independent engineer with a love for technology, we want to hear from you.

     

    Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5.

     

    You will also be eligible for equity and benefits (https://www.nvidia.com/en-us/benefits/).

     

    Applications for this job will be accepted at least until August 5, 2025.

     

    NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

     






© 2025 Alerted.org