Alerted.org | Alerted.org - Powering better job alerts

Senior System Administrator

TISTA Science and Technology (Austin, TX)

Overview

Are you a Senior Systems Administrator who would like to have a positive impact for millions of people? If so, we may have an opportunity for you!

TISTA associates enjoy above-industry healthcare benefits, remote working options, paid time off, training and certification opportunities, a Healthcare Savings Account and Flexible Savings Account, paid life insurance, short-term and long-term disability, 401K match, tuition reimbursement, an Employee Assistance Program, paid holidays, military leave, and much more!

Responsibilities

The Senior System Administrator/Site Reliability Engineer (SRE) in the VA’s Enterprise Cloud is responsible for ensuring the resilience, performance, reliability, and compliance of mission-critical cloud services that support Veterans and VA stakeholders. This role bridges software engineering, systems engineering, and operations to deliver highly available, secure, and efficient cloud-based platforms aligned with VA’s modernization strategy and federal compliance mandates, with a focus on reliability, performance, scalability, and automation. Day-to-day tasks vary with the organizations and systems supported, but the role’s daily work generally falls into the following categories:

+ Proactively monitor system health, availability, and performance using observability tools (e.g., Prometheus, Grafana, Datadog, Splunk).

+ Respond to alerts and incidents, triage issues, and perform root cause analysis (RCA).

+ Lead on-call rotations to ensure 24/7 uptime and quick recovery from outages.

+ Document incident reports and contribute to postmortems to prevent recurrence.

+ Automate manual operational tasks such as deployments, scaling, and configuration using tools like Ansible, Terraform, or Puppet.

+ Manage infrastructure as code (IaC) to ensure consistency across environments.

+ Optimize CI/CD pipelines for reliable and repeatable software delivery.

+ Build self-healing systems to minimize downtime.

+ Conduct load and stress testing to validate system performance under peak demand.

+ Establish and enforce Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs).

+ Identify and reduce sources of latency, bottlenecks, and single points of failure.

+ Work with development teams to design reliability, scalability, and fault tolerance into customer servers.

+ Patch operating systems, containers, and dependencies to address vulnerabilities.

+ Ensure compliance with organizational and regulatory requirements.

+ Implement access controls, secrets management, and least-privilege principles.

+ Monitor resource utilization (CPU, memory, storage, network) to anticipate scaling needs.

+ Plan for growth by forecasting demand and preparing infrastructure accordingly.

+ Optimize cloud costs by rightsizing instances, using autoscaling, and leveraging reserved/spot instances.

+ Partner with software engineers to embed reliability practices into development.

+ Mentor teams on best practices for observability, automation, and incident handling.

+ Participate in blameless postmortems and contribute to knowledge-sharing sessions.

+ Continuously evaluate new tools and technologies to improve system reliability.

+ Design, monitor, and maintain Customer Servers to meet VA’s 99.9%+ uptime and SLA requirements across multi-cloud and hybrid environments.

+ Implement fault-tolerant and self-healing architectures leveraging automation.

+ Develop and manage observability frameworks (logging, metrics, tracing) to detect, respond to, and remediate incidents quickly.

+ Lead blameless postmortems and drive corrective actions to strengthen VAEC resilience.

+ Engineer scalable automation pipelines for provisioning, patching, and compliance (e.g., Ansible, Terraform, Puppet, GitHub Actions).

+ Reduce manual effort through self-service tools for operations teams.

+ Monitor and optimize application and infrastructure performance to meet demand from VA Medical Centers, Enterprise Data Warehouses, and end users.

+ Ensure latency, throughput, and resource utilization align with mission needs.

+ Integrate VA 6500, NIST 800-53, FedRAMP, and Zero Trust requirements into daily operations.

+ Partner with cybersecurity teams to enforce continuous ATO (cATO) practices and vulnerability remediation.

+ Collaborate with Release Management, Engineering, and Operations teams to improve change management, deployment pipelines, and reliability practices.

+ Drive the adoption of SRE principles (error budgets, SLIs, SLOs, SLAs) into VA’s IT Service Management (ITSM) processes.

+ Operate across VA’s Enterprise Cloud (VAEC), on-premises data centers, and hybrid platforms, ensuring seamless integration and interoperability.

+ Support workloads across AWS GovCloud, Microsoft Azure Government, and Oracle Cloud Infrastructure (OCI) where applicable.

+ Mission Assurance: Continuous availability of systems supporting Veterans’ health, benefits, and administrative services.

+ Operational Efficiency: Automated and standardized cloud operations reduce manual risk and speed delivery.

+ Compliance Assurance: Alignment with VA 6500, NIST, and federal mandates, minimizing audit risks.

+ Veteran-Centered Reliability: Ensure services that Veterans depend on are consistently reliable, secure, and performant.

Qualifications

+ 5 years of experience in Site Reliability Engineering, DevOps, or Systems Engineering

+ Strong experience with Linux/Unix systems administration and troubleshooting

+ Proficient with cloud platforms (AWS and/or Azure), especially in deploying production workloads

+ Deep understanding of monitoring, metrics, alerting, and observability

+ Proficient in designing, implementing, and managing automation solutions using Ansible

+ Experience with CI/CD tools (e.g., GitHub Actions, Jenkins, GitLab CI, Azure DevOps)

+ Hands-on with containers and orchestration (Docker, Kubernetes, EKS, AKS)

+ Familiarity with networking concepts (TCP/IP, DNS, TLS, VPCs, load balancing)

+ Solid understanding of software development lifecycle (SDLC) and Agile methodologies

+ Comfortable participating in on-call rotations and handling high-priority incidents

Preferred Qualifications:

+ AWS Certified SysOps Administrator or DevOps Engineer

+ Microsoft Certified: Azure Administrator or DevOps Engineer Expert

+ Certified Kubernetes Administrator (CKA)

+ Experience in chaos engineering, capacity modeling, or SRE tooling

+ Excellent analytical and problem-solving skills

+ Ability to work in cross-functional teams and communicate effectively with developers, operations, and leadership

+ A strong bias for automation and self-healing systems

+ Ownership mindset with a commitment to reliability and continuous improvement

Education:

+ Bachelor’s degree in computer science, electronics engineering or related technical discipline and 5+ years’ work experience

+ Eight (8) years of additional relevant experience may be substituted for education (13 years total)

Clearance:

+ The ability to pass a Tier 4/HIGH Background Investigation

Location:

+ Department of Veterans Affairs (100% On-site)

+ Monday - Friday (8:00 AM - 4:30 PM EST)

+ Austin Information Technology Center (AITC), 1615 Woodward Street, Austin, TX 78741
