Senior SRE
- IBM (San Jose, CA)
Introduction
A career in IBM Software means you’ll be part of a team that transforms our customers’ challenges into solutions. Seeking new possibilities and always staying curious, we are a team dedicated to creating the world’s leading AI-powered, cloud-native software solutions for our customers. Our renowned legacy creates endless global opportunities for our IBMers, so the door is always open for those who want to grow their careers.
IBM’s product and technology landscape includes Research, Software, and Infrastructure. Entering this domain positions you at the heart of IBM, where growth and innovation thrive.
Your role and responsibilities
We are seeking a Senior Customer Support / SRE engineer to join the team responsible for delivering Astra Streaming (Apache Pulsar as a Service). You will help our users succeed by resolving complex incidents, improving service reliability, and driving operational excellence across environments.
You will work closely with engineering, product, and customer support teams to ensure the Astra Streaming platform runs with high availability, low latency, and predictable performance, so that it meets and exceeds enterprise workload expectations.
Key Responsibilities
* Serve as the Tier 2/Tier 3 escalation point for customer-reported incidents, performance issues, and operational anomalies.
* Troubleshoot issues across the full stack
* Develop and maintain runbooks, monitoring dashboards, and alerting rules
* Participate in and improve the on-call rotation, including leading incident response and post-mortems when necessary
* Collaborate with Engineering to identify root causes and drive fixes for long-term improvements
* Implement SLOs, SLIs and error budgets to ensure platform reliability aligns with customer expectations
* Automate common tasks (toil)
* Contribute to and lead observability and telemetry improvements (Prometheus, Grafana, Thanos, or equivalent).
* Provide detailed and empathetic customer communication during incidents and post-incident reviews.
* Act as a voice of the customer in reliability, scalability, and usability discussions
* Mentor junior support and operations engineers
Success in this Role
In the first six months, success means:
* Handling escalations independently and guiding complex incident responses.
* Improving MTTR through new automation or monitoring enhancements.
* Earning customer trust by delivering transparent communication and reliable resolution
* Identifying recurring failure modes and driving engineering changes to eliminate them.
Required technical and professional expertise
* 5+ years of experience in SRE, DevOps, or Production Engineering for large-scale distributed systems.
* Deep understanding of Apache Pulsar, Apache BookKeeper, or similar messaging systems (Kafka, RabbitMQ)
* Experience operating Pulsar clusters on Kubernetes in public clouds
* Solid troubleshooting skills across Linux, networking, JVM-based applications, and containers / Kubernetes as a service.
* Strong knowledge of monitoring, logging, and tracing tools (Prometheus, Grafana, Splunk, etc.)
Preferred technical and professional experience
* Experience contributing to open-source Apache Pulsar or BookKeeper
* Familiarity with multi-tenant architectures and managed-service operations
* Experience with IaC and GitOps workflows
IBM is committed to creating a diverse environment and is proud to be an equal-opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, gender, gender identity or expression, sexual orientation, national origin, caste, genetics, pregnancy, disability, neurodivergence, age, veteran status, or other characteristics. IBM is also committed to compliance with all fair employment practices regarding citizenship and immigration status.