Senior Software Engineer
Microsoft Corporation (Redmond, WA)

The Microsoft Azure High Performance Computing & AI Engineering (HPC & AI Eng) team is responsible for managing the core platform and fleet of AI High Performance Computing products that customers use to run their most performant and demanding workloads. The AI Customer Experience (AICE) engineering team within the HPC & AI Eng team is on the front lines, managing the flagship supercomputers used by top-tier AI customers. These systems enable breakthroughs such as ChatGPT and are highlighted in the Top500, MLPerf, and Graph500 rankings. Operating at supercomputing scale requires specialized tools and techniques to ensure system reliability, runtime performance, and job health while continuing to meet customer Service Level Agreements (SLAs).

As a Senior Supercomputing Software & Systems Engineer, you will be responsible for diagnosing and troubleshooting the largest-scale supercomputing systems across the infrastructure stack (GPU hardware, networking, datacenter, and core software). In this role, you will develop and apply advanced tools, identify operational gaps, and implement features that support the smooth operation of cloud-native supercomputers. This opportunity will give you hands-on experience developing capabilities to manage the largest-scale supercomputers delivered to our customers.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Responsibilities
+ Collaborates with appropriate stakeholders to determine user requirements for a scenario.
+ Drives identification of dependencies and the development of design documents for a product, application, service, or platform.
+ Creates, implements, optimizes, debugs, refactors, and reuses code to establish and improve performance and maintainability, effectiveness, and return on investment (ROI).
+ Leverages subject-matter expertise of product features and partners with appropriate stakeholders (e.g., project managers) to drive a workgroup's project plans, release plans, and work items.
+ Acts as a Designated Responsible Individual (DRI) and guides other engineers by developing and following the playbook, working on call to monitor system/product/service for degradation, downtime, or interruptions, alerting stakeholders about status, and initiating actions to restore system/product/service for simple and complex problems when appropriate.
+ Proactively seeks new knowledge and adapts to new trends, technical solutions, and patterns that will improve the availability, reliability, efficiency, observability, and performance of products while also driving consistency in monitoring and operations at scale.

Qualifications

Required Qualifications:
+ Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR equivalent experience.
+ 3+ years of experience in operating AI/HPC systems, developing and running AI/HPC applications on clusters, or operating Cloud Infrastructure.
+ 2+ years of specialized experience with one of the following: AI/HPC system management, High-Speed Networks, HPC Storage, or managing Cloud Infrastructure.

Other Requirements:
+ Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:
  + Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Preferred Qualifications:
+ Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR Master's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR equivalent experience.
+ 1+ year(s) of experience running and troubleshooting machine learning workloads on Graphics Processing Unit (GPU)-based High Performance Computing (HPC) systems, including familiarity with the HPC software stack.
+ 1+ year(s) of experience with cloud computing, virtualization, and container technologies.

Software Engineering IC4 - The typical base pay range for this role across the U.S. is USD $119,800 - $234,700 per year. A different range applies to specific work locations within the San Francisco Bay area and New York City metropolitan area; the base pay range for this role in those locations is USD $158,400 - $258,000 per year. Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay

Microsoft will accept applications for the role until November 13, 2025.

#azurecorejobs

Microsoft is an equal opportunity employer.
Consistent with applicable law, all qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations (https://careers.microsoft.com/v2/global/en/accessibility.html).
 
 