Alerted.org | Alerted.org - Powering better job alerts

Site Reliability Engineer (SRE) - Data Center

Insight Global (Memphis, TN)

Job Description

As a Data Center Site Reliability Engineer (SRE) at our client, you will play a pivotal role in ensuring the reliability, scalability, and performance of our advanced data center infrastructure, including high-density GPU clusters that support large-scale AI/ML workloads. This infrastructure powers mission-critical computing environments and cutting-edge applications, requiring exceptional operational excellence and resilience.

This is a hands-on technical role in a dynamic environment, offering the opportunity to tackle complex challenges at the intersection of data center operations, distributed systems, and software reliability.

Core Responsibilities:

• Maintain and improve the reliability and uptime of client’s on-premises and cloud-based environments, including HPC and GPU clusters.

• Design, implement, and manage monitoring, logging, and alerting systems (e.g., Prometheus, Grafana, PagerDuty).

• Develop and maintain infrastructure-as-code (Pulumi, Terraform) and CI/CD pipelines (Buildkite, ArgoCD).

• Participate in on-call rotations, respond to incidents, perform root cause analysis, and drive post-mortem processes.

• Analyze system performance, forecast capacity needs, and optimize resource utilization for compute, storage, and networking.

• Collaborate with hardware, networking, and software teams to design resilient, scalable solutions (e.g., RDMA fabrics, liquid cooling).

• Create and maintain documentation and SOPs, ensuring clear shift handoffs and incident reporting.

• Support production rollouts and canary deployments, including rollback strategies.

• Identify and mitigate bottlenecks in traffic flow, storage APIs, and network performance.

We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to [email protected]. To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy: https://insightglobal.com/workforce-privacy-policy/.

Skills and Requirements

• Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience).

• 5+ years in SRE, data center operations, or large-scale infrastructure management.

• Experience in HPC environments and familiarity with hardware stack troubleshooting.

• Strong knowledge of Kubernetes (on-prem and cloud), infrastructure-as-code, and CI/CD tools.

• Proficiency in Python, Bash, and basic systems programming concepts.

• Understanding of database queries, storage APIs, NFS, and object storage.

• Proven experience with incident response, including major outages and post-mortem analysis.

• Strong troubleshooting skills across hardware, networking, and distributed systems.

• Experience supporting AI/ML workloads or high-density compute environments.

• Familiarity with data center electrical, cooling, and network systems.

• Certifications in SRE, Kubernetes, or data center operations.

• Knowledge of traffic management, load balancers, CPU optimization, memory leak detection, and network bottleneck resolution.

• Hands-on experience with Docker, GitHub, and observability tools (e.g., Prometheus, Grafana, Chronosphere).

• Deep understanding of hardware stack and ability to analyze system logs.

• Kubernetes expertise.

• Dashboard monitoring and interpretation.

• Single-node optimization and traffic flow analysis.

PLUS:

Someone who has handled major outages, understands production rollouts and rollback strategies, and can optimize cloud and on-prem environments.
