Software Engineer II
Microsoft Corporation (Redmond, WA)
Be at the forefront of Microsoft's AI revolution. The **CoreAI** organization at Microsoft builds the end-to-end Azure AI stack that powers Microsoft’s AI innovation and differentiation. We operate the global Azure AI infrastructure that runs some of the largest AI workloads on the planet. We don’t just value different perspectives - we seek them out and bring them together to better serve our customers.
Within CoreAI, the **Azure SRE Agent Platform** team designs, builds, and operates production AI agents that keep Azure’s app platforms healthy, fast, and secure. This team thrives in a very agile environment: short cycles, thin slices, feature flags, progressive delivery, and constant learning. We pair SRE fundamentals (SLOs, automation, incident response) with agentic systems (planning/execution loops, tool orchestration, evaluators, safety guardrails). If you like turning fuzzy problem statements into code that ships this week, you’ll fit right in.
We are seeking a **Software Engineer II** to help advance these capabilities in a fast, iterative environment.
Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
Responsibilities
+ **Design & implementation:** Contribute to the architecture and delivery of SRE agents and platform services: author design docs, threat models, and rollout plans, and build scoped features.
+ **Applied AI for reliability:** Build LLM-powered detection, triage, mitigation, and post-incident learning loops; integrate evaluation frameworks and safety guardrails.
+ **SRE fundamentals at scale:** Define SLIs/SLOs and error budgets; connect them to alerting, release gates, and agent action limits to reduce MTTR and change-fail rate.
+ **Progressive delivery:** Implement feature flags, canaries, and staged rollouts; run shadow and A/B experiments with baked-in evaluations as part of safe rollouts.
+ **Runbooks-as-code:** Convert on-call procedures and “pager” into policy and automated mitigations; maintain clear playbooks and tooling.
+ **Operations ownership:** Participate in on-call, mitigate live incidents, and drive post-incident reviews with iterative hardening.
+ Optimize, debug, and establish best practices for performance, cost, and latency across agents and platform components.
+ Conduct code and design reviews to ensure adherence to standards and resolve issues proactively using telemetry and diagnostics.
+ Stay updated on AI, SRE, and Kubernetes advancements and relevant regulations while fostering collaboration across teams to meet customer and partner needs.
Qualifications
Required Qualifications:
+ Bachelor's Degree in Computer Science or related technical field AND 2+ years technical engineering experience with coding in languages including, but not limited to, C#, Go, or Python (one or more) for automation and services
+ OR equivalent experience.
+ 2+ years in Production/Platform Engineering for large-scale cloud services.
+ Experience with Azure (AKS, Container Apps/App Service, Functions, Service Bus/Event Grid, Storage, Cosmos DB, VNets/Private Link), IaC (Terraform or Bicep) and CI/CD (GitHub Actions or Azure DevOps).
+ 6+ months of experience building and operating LLM/agent systems: function calling, multi-step planning, retrieval, memory, evaluator harnesses.
Other Requirements:
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
Preferred Qualifications:
+ Experience with Azure OpenAI, Azure AI Search/vector stores; prompt/response optimization; cost & latency tuning.
+ Experience with Linux, containers (Docker), orchestration (Kubernetes), and SRE fundamentals (SLI/SLO, error budgets, incident management).
+ Keeping up with the latest AI research and blogs.
+ Experience with observability tools: Azure Monitor/Log Analytics (KQL), OpenTelemetry, Prometheus/Grafana; canary/experiment platforms.
Software Engineering IC3 - The typical base pay range for this role across the U.S. is USD $100,600 - $199,000 per year. A different range applies to specific work locations within the San Francisco Bay area and the New York City metropolitan area; the base pay range for this role in those locations is USD $131,400 - $215,400 per year.
Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay
Microsoft posts positions for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.
\#DDJL #DevDiv #CoreAI
Microsoft is an equal opportunity employer. Consistent with applicable law, all qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations (https://careers.microsoft.com/v2/global/en/accessibility.html).