Alerted.org | Alerted.org - Powering better job alerts

Senior ML Storage Engineer - GPU Clusters

NVIDIA (Santa Clara, CA)

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s a unique legacy of innovation that’s fueled by great technology—and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join the team and see how you can make a lasting impact on the world.

We are seeking a highly skilled and experienced Site Reliability Engineer to design, deploy, and manage high-speed storage offerings in our large-scale GPU clusters. These clusters will power AI workloads across multiple teams and projects, making a significant impact on the future of machine learning and artificial intelligence at NVIDIA. Join our engineering team and collaborate with researchers, AI engineers, and infrastructure teams to ensure our GPU clusters perform efficiently, scale well, and remain reliable. The ideal candidate has a passion for operational excellence, automation, and working in a multi-cloud environment, and focuses on identifying architectural changes across file, block, and object storage to meet the requirements of an expanding cloud infrastructure. As a member of the team, you will help us with the strategic challenges of next-generation storage solutions: storage design for large-scale, high-performance workloads, evolving our private/public cloud strategy, capacity modeling, and growth planning across our global computing environment.

What you will be doing:

+ Research and implement distributed storage services.

+ Design and implement scalable and efficient storage solutions tailored for data-intensive AI applications, optimizing performance and cost-effectiveness.

+ Continuously improve storage infrastructure provisioning, management, observability, and day-to-day operations through automation.

+ Ensure the highest level of uptime and quality of service (QoS) through operational excellence, proactive monitoring, and incident resolution.

+ Support globally distributed on-premises and cloud environments such as AWS, GCP, Azure, or OCI.

+ Define and implement service level objectives (SLOs) and service level indicators (SLIs) to measure and ensure infrastructure quality.

+ Write high-quality Root Cause Analysis (RCA) reports for production-level incidents and work towards preventing future occurrences.

+ Support our researchers in running their flows on our clusters, including performance analysis and optimization of deep learning workflows, and participate in the team's on-call rotation to support critical infrastructure.

+ Drive the evaluation and integration of storage solutions with new GPUs (such as GB200) and cloud technologies to improve system performance.

What we need to see:

+ BS degree in Computer Science at minimum (or equivalent experience), with 6+ years of experience managing high-speed storage solutions deployed for GPU clusters or similar high-performance computing environments.

+ Expertise in designing, deploying, and running production-level cloud services.

+ Experience with one or more parallel or distributed filesystems, such as Lustre or GPFS, is a must, including experience analyzing and tuning performance for a variety of AI/HPC workloads.

+ Experience with the architecture, design, and operation of storage solutions on any of the leading cloud environments (AWS, Azure, or GCP).

+ Proficiency with orchestration and containerization tools like Kubernetes, Docker, or similar.

+ Experience coding/scripting in at least two high-level programming languages (e.g., Python, Go, Ruby).

+ Proficiency in modern CI/CD techniques and Infrastructure as Code (IaC) using tools such as Terraform or Ansible.

+ Diligent with strong communication and documentation skills.

Ways to stand out from the crowd:

+ Experience running large-scale Slurm/LSF and/or BCM deployments in production environments.

+ Expertise in modern container networking and storage architecture.

+ Experience with Machine Learning and Deep Learning concepts, algorithms, and models.

+ Consistent record of defining and driving operational excellence in highly distributed, high-performance environments.

NVIDIA provides competitive salaries and a comprehensive benefits package. Our engineering teams are expanding rapidly due to exceptional growth. If you're a hardworking and independent engineer with a love for technology, we want to hear from you.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5.

You will also be eligible for equity and benefits (https://www.nvidia.com/en-us/benefits/).

Applications for this job will be accepted at least until August 5, 2025.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
