Alerted.org


  • Director, System Reliability Engineering

    Microsoft Corporation (Redmond, WA)




    Microsoft Silicon, Cloud Hardware Infrastructure Engineering (SCHIE) is the team behind Microsoft’s expanding Cloud Infrastructure and responsible for powering Microsoft’s “Intelligent Cloud” mission. SCHIE delivers the core infrastructure and foundational technologies for Microsoft's over 200 online businesses including Bing, MSN, Office 365, Xbox Live, Teams, OneDrive and the Microsoft Azure platform globally with our server and data center infrastructure, security and compliance, operations, globalization, and manageability solutions. Our focus is on smart growth, high efficiency, and delivering a trusted experience to customers and partners worldwide, and we are looking for passionate, high-energy engineers to help achieve that mission.

     

    As Microsoft's Cloud business continues to grow, the ability to deploy new offerings and HW infrastructure on time, in high volume, with high quality and at the lowest cost is of paramount importance. To achieve this goal, the Hardware, Infrastructure Management, and Fundamentals Engineering (HIFE) team is instrumental in defining and delivering operational measures of success for Cloud infrastructure reliability, improving the planning process, manufacturing, quality, delivery at scale, serviceability and sustainability. We are looking for a System Reliability Engineering Leader with a passion for customer-focused solutions, insight and industry knowledge to envision and implement future technical solutions that will optimize the Cloud infrastructure and its reliability.

     

    We are looking for an experienced **Director, System Reliability Engineering** who will be responsible for driving reliability performance across architecture, design, component and material selections, manufacturing and integration of datacenter hardware, ensuring that all electrical, mechanical, thermal, environmental, transportation and operational aspects, along with telemetry, diagnostics and the SW/FW stack of the cloud solution, are optimized throughout the lifecycle of each cloud service. The candidate will interact with Engineering, Supply Chain, Sourcing, Manufacturing & Quality, Fleet Management, Datacenter Operations, and other internal and external stakeholders.

    Responsibilities

    + Lead the design, implementation, and continuous improvement of reliability practices across our AI infrastructure. Ensure the performance, scalability, and resilience of AI systems in production environments

    + Lead the development and execution of both systems and components’ reliability engineering strategies for all Cloud platforms and services

    + Collaborate across HW and SW architecture, data engineering, and platform teams to ensure robust deployment of resilient solutions and services

    + Lead strategic innovations and develop processes to integrate industry practices to ensure efficiency in achieving high reliability and quality

    + Design and implement observability frameworks tailored to AI workloads

    + Drive incident response, root cause analysis, and postmortem processes for HW system outages or degradations

    + Establish and monitor SLAs (Availability, Node In Service, Time to restore Availability) for all cloud services, ensuring alignment with business goals and product requirements

    + Foster a culture of reliability, automation, consistency of execution and continuous improvement across engineering teams

    + Support manufacturing, datacenter operation, troubleshooting and diagnostic methods to optimize the cloud infrastructure reliability

    Qualifications

    Required/minimum qualifications

    + Bachelor's Degree in Mechanical Engineering, Materials Engineering, Reliability Engineering, Electrical Engineering, or related field AND 8+ years technical engineering experience

    + OR Master's Degree in Mechanical Engineering, Materials Engineering, Reliability Engineering, Electrical Engineering, or related field AND 7+ years technical engineering experience

    + OR Doctorate Degree in Mechanical Engineering, Materials Engineering, Reliability Engineering, Electrical Engineering, or related field AND 5+ years technical engineering experience.

    + 5+ years of people management including resource planning, career development and performance management.

    + 5+ years of experience in system reliability, site reliability engineering, or infrastructure engineering, with at least 1 year focused on AI systems

    Other Requirements

    + Ability to meet Microsoft, customer and/or government security screening requirements is necessary for this role. These requirements include, but are not limited to, the following specialized security screenings: Microsoft Cloud Background Check.

    Preferred Qualifications:

    + Bachelor's Degree in Mechanical Engineering, Materials Engineering, Reliability Engineering, Electrical Engineering, or related field AND 12+ years technical engineering experience

    + OR Master's Degree in Mechanical Engineering, Materials Engineering, Reliability Engineering, Electrical Engineering, or related field AND 10+ years technical engineering experience

    + OR Doctorate Degree in Mechanical Engineering, Materials Engineering, Reliability Engineering, Electrical Engineering, or related field AND 7+ years technical engineering experience.

    + Experience in AI lifecycle, including model training, deployment, monitoring, and retraining

    + Experience in cloud fleet management, telemetry, diagnostics and troubleshooting of IT systems

    + Experience and knowledge in the server industry product development process

    + Experience in managing cross-functional teams and large-scale distributed systems

    + Experience with system reliability, manufacturing process and datacenter operations, leading continuous improvements through automation

    + Experience with liquid cooling infrastructure for IT racks

     

    Reliability Engineering M5 - The typical base pay range for this role across the U.S. is USD $137,600 - $267,000 per year. There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area, and the base pay range for this role in those locations is USD $180,400 - $294,000 per year.

     

    Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay

     

    Microsoft will accept applications for the role until May 26th, 2025.

     

    #azurehwjobs #HIFE #Azure #Cloud #Hardware #AHSI

     

    Microsoft is an equal opportunity employer. Consistent with applicable law, all qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations (https://careers.microsoft.com/v2/global/en/accessibility.html).

     





© 2025 Alerted.org