Alerted.org

  • Site Reliability Engineer (SRE) - Data Center

    Insight Global (Memphis, TN)




    Job Description

    As a Data Center Site Reliability Engineer (SRE) at the client, you will play a pivotal role in ensuring the reliability, scalability, and performance of the client's advanced data center infrastructure, including high-density GPU clusters that support large-scale AI/ML workloads. This infrastructure powers mission-critical computing environments and cutting-edge applications, so it demands exceptional operational excellence and resilience.

     

    This is a hands-on technical role in a dynamic environment, offering the opportunity to tackle complex challenges at the intersection of data center operations, distributed systems, and software reliability.

    Core Responsibilities:

    • Maintain and improve the reliability and uptime of the client’s on-premises and cloud-based environments, including HPC and GPU clusters.

    • Design, implement, and manage monitoring, logging, and alerting systems (e.g., Prometheus, Grafana, PagerDuty).

    • Develop and maintain infrastructure-as-code (Pulumi, Terraform) and CI/CD pipelines (Buildkite, ArgoCD).

    • Participate in on-call rotations, respond to incidents, perform root cause analysis, and drive post-mortem processes.

    • Analyze system performance, forecast capacity needs, and optimize resource utilization for compute, storage, and networking.

    • Collaborate with hardware, networking, and software teams to design resilient, scalable solutions (e.g., RDMA fabrics, liquid cooling).

    • Create and maintain documentation and SOPs, ensuring clear shift handoffs and incident reporting.

    • Support production rollouts and canary deployments, including rollback strategies.

    • Identify and mitigate bottlenecks in traffic flow, storage APIs, and network performance.

     

    We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to [email protected]. To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy: https://insightglobal.com/workforce-privacy-policy/.

    Skills and Requirements

    • Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience).

    • 5+ years in SRE, data center operations, or large-scale infrastructure management.

    • Experience in HPC environments and familiarity with hardware stack troubleshooting.

    • Strong knowledge of Kubernetes (on-prem and cloud), infrastructure-as-code, and CI/CD tools.

    • Proficiency in Python, Bash, and basic systems programming concepts.

    • Understanding of database queries, storage APIs, NFS, and object storage.

    • Proven experience with incident response, including major outages and post-mortem analysis.

    • Strong troubleshooting skills across hardware, networking, and distributed systems.

    • Experience supporting AI/ML workloads or high-density compute environments.

    • Familiarity with data center electrical, cooling, and network systems.

    • Certifications in SRE, Kubernetes, or data center operations.

    • Knowledge of traffic management, load balancers, CPU optimization, memory leak detection, and network bottleneck resolution.

    • Hands-on experience with Docker, GitHub, and open-source observability tools (e.g., Prometheus, Grafana, Chronosphere).

    • Deep understanding of hardware stack and ability to analyze system logs.

    • Kubernetes expertise.

    • Dashboard monitoring and interpretation.

    • Single-node optimization and traffic flow analysis.

    PLUS:

    Someone who has handled major outages, understands production rollouts and rollback strategies, and can optimize cloud and on-prem environments.

     





© 2025 Alerted.org