Site Reliability Engineer (Java focused) Sr…
- ERCOT (Taylor, TX)
At ERCOT, our diverse and dynamic work environment provides a platform on which employees can work together to build the future of the Texas power grid and wholesale market utilizing the latest technologies and resources. We encourage you to join our talented, dedicated workforce to develop world-class solutions for today’s and tomorrow’s energy challenges while learning new skills and growing your career.
ERCOT is committed to fostering inclusion at all levels of our company. It is the cornerstone of our corporate values of accountability, leadership, innovation, trust, and expertise. We know that individuals with a wide variety of talents, ideas, and experiences propel the innovation that drives our success. An inclusive and diverse workforce strengthens us and allows for a collaborative environment to solve the challenges that face our industry today and in the future.
JOB SUMMARY
ERCOT is seeking a Senior or Lead Site Reliability Engineer (SRE) with strong Java application expertise to ensure the availability, performance, and reliability of mission-critical systems. This role follows ERCOT-specific SRE processes and principles, which include managing site failover between two datacenters and, in the future, treating Azure as an extended datacenter. You will work deeply with Java codebases while owning production health and operational excellence.
JOB DUTIES INCLUDE:
Core Responsibilities
- Own reliability, availability, latency, and scalability of Java-based systems
- Define and track SLIs, SLOs, and error budgets
- Design and maintain monitoring, alerting, logging, and dashboards
- Lead incident response and conduct blameless postmortems
- Reduce operational toil through automation and tooling
- Review system designs for reliability and failure modes
- (Lead level) Establish reliability standards and mentor engineers
Java & Application Responsibilities
- Debug and improve Java applications (Spring Boot preferred)
- Perform JVM tuning and performance analysis
- Diagnose failures across databases, messaging, and APIs
- Partner with development teams to improve resilience and recovery
On-Call & Incident Response
- Participate in an on-call rotation for supported services
- Focus on engineering solutions rather than repetitive manual work
- Emphasis on post-incident learning and automation
- Toil is tracked and actively reduced
EXPERIENCE:
- 5+ years (Senior) or 10+ years (Lead) in SRE, DevOps, or Production Engineering
- Strong Java experience (Spring-based systems)
- Experience with distributed, high-availability systems
- Expertise in observability tools (metrics, logs, traces)
- CI/CD experience (Git, Maven, Jenkins)
- Strong cross-layer debugging skills
- CS or related degree required
PREFERRED
- Python
- Kubernetes or OpenShift
- Microsoft Azure
- Kafka or ActiveMQ
- Infrastructure automation (Terraform, Azure Resource Manager, Ansible, Liquibase)
- Chaos or load testing experience
Observability & Production Tooling
- Strong hands-on experience with observability and APM platforms such as Splunk, Dynatrace, and Datadog
- Expertise in using Metrics, Logs, Traces, and Profiling (MLTP) to troubleshoot complex production incidents
- Experience with the Grafana LGTM stack for observability (Loki for logs, Grafana for dashboards and visualization, Tempo for traces, and Mimir for metrics)
- Experience correlating application performance data with system behavior to identify root causes and prevent recurrence
WORK LOCATION – Taylor, TX:
+ Employees are required to be on-site in Taylor, TX at least 2 days per week, or more often when business needs require it, as determined by management.
+ On-site schedules are flexible or may be rotated based on business needs as determined by the Manager.
+ Remote work is required to be performed from your Texas residence.
+ Employees may opt to work on-site more than required or 100% of the time.
The foregoing description reflects the minimum qualifications and the essential functions of the position that must be performed proficiently with or without reasonable accommodation for individuals with disabilities. It is not an exhaustive list of the duties expected to be performed, and management may, at its discretion, revise or require that other or different tasks be performed as assigned. This job description is not intended to create a contract of employment with ERCOT. Both ERCOT and the employee may exercise their employment-at-will rights at any time.
ERCOT is firmly committed to equal employment for all qualified persons without regard to race, sex, medical condition, religion, age, creed, national origin, citizenship status, marital status, sexual orientation, physical or mental disability, ancestry, veteran status, genetic information or any other protected category under federal, state or local law.
Expected Salary Range:
$99,230 - $168,715