HPC Sr. Systems Administrator (IT@JH Research…
- Johns Hopkins University (Baltimore, MD)
IT@JH Research Computing is seeking an **_HPC Sr. Systems Administrator_** who will support the daily operation and maintenance of Johns Hopkins University’s high-performance computing and AI infrastructure. This role ensures the reliability, availability, and security of compute, storage, and network resources used by faculty, students, and research staff. Responsibilities include monitoring system health, managing user accounts, deploying software, applying patches, and assisting in node provisioning and configuration. Work assignments are a mix of ticket-based support, routine maintenance, and project participation under the direction of senior engineers. The position collaborates closely with senior HPC staff to deliver stable, efficient, and well-documented systems that underpin advanced computational research across disciplines.
Specific Duties & Responsibilities
_Systems Analysis/Design (Environment/Platform)_
+ Design business, clinical, education, or infrastructure solutions by meeting with customers to observe and understand current processes and the issues related to those processes. Provide written documentation and diagrams of findings to share with the client and other IT colleagues.
+ Design solutions that conform to institutional policies, standards, and guidelines, fit the infrastructure environment, and follow vendor and industry best practices to deliver a quality product.
+ Recommend infrastructure applications that reside between end user applications and hardware operating systems by working with vendors, customers, and other sources (e.g., open source or Internet2 initiatives) to provide configurable tools to the customers.
_Install & Configure_
+ Install and configure server hardware and operating systems by following technical documentation to provide a working product.
+ Evaluate, implement, and manage appropriate software and hardware solutions by using best practices for the environment to ensure system integrity.
+ Install and configure infrastructure applications by following product installation and configuration directions and industry best practices to deliver a solution to the customers.
+ Implement a schedule of system backups and archive operations by using best practices for the environment to ensure data/media recoverability.
_Maintain & Troubleshoot_
+ Provide server-level administration (hardware and software management, maintenance, upgrades and patches, account maintenance, backups and recoveries, and user assistance) by following documented procedures to ensure a stable environment.
+ Monitor and tune the system by following documentation and procedures to achieve optimum performance levels.
+ Develop scripts and solutions by using departmental standards to automate systems management.
+ Perform system software upgrades including planning and scheduling, testing, and coordination by following documentation and departmental standards to provide a stable product for the environment.
+ Audit and maintain user access and authorization by following access and authorization documentation to provide for system security.
+ Generate and maintain periodic and ongoing system-specific reports by using appropriate tools to assess system performance, integrity, and capacity in order to deliver a stable environment to the users.
+ Follow and maintain IT security awareness and best practices by understanding security principles as they pertain to environments supported in order to deliver secure solutions to customers.
+ Utilize system management and monitoring tools and incident tracking systems by following documentation and standards to detect incidents, take corrective actions, and determine root cause.
+ Monitor changes and resolve incidents as they occur by reviewing all processing and output of the newly implemented solution and proactively confirming that it works, in order to satisfy customer requirements and provide a smooth transition to the new solution.
_Project Collaboration & Lifecycle Participation_
+ Implement changes while adhering to the change management policies and procedures in order to deliver a successful solution to the customer. Communicate to all parties the nature, significance, and risk factors.
+ Evaluate vendor proposals by reviewing requirements for the product to select the most appropriate vendor.
+ Assist vendors, consultants, and inside Enterprise groups in developing applications by meeting with the team on a regular basis to deliver quality products to customers.
+ Participate in scheduled project team meetings by attending all meetings to provide input to the project team.
+ Create and maintain documentation by writing audience-appropriate materials to serve as technical and/or end user reference.
+ Test all changes by using the appropriate test scenarios to ensure all delivered solutions work as expected and errors are handled in a meaningful way. Contribute and make recommendations to the development of test scenarios.
+ Other duties as assigned.
_In Addition to the Duties Described Above_
+ Monitor cluster health, performance, and utilization using standard tools (Prometheus, Grafana, ELK, Nagios).
+ Perform user account management, permissions, and access control in a multi-tenant Linux environment.
+ Install and update operating systems, drivers, and software modules on compute and storage nodes.
+ Assist with provisioning new nodes, applying BIOS and firmware updates, and validating configurations.
+ Support the maintenance of storage systems, including quota management and backup verification.
+ Work with senior staff to troubleshoot hardware, network, and application issues.
+ Document configuration changes, operational procedures, and troubleshooting workflows.
+ Participate in an on-call rotation to ensure high availability and timely response to system alerts.
+ Help users with environment setup, job submission, and debugging within scheduler queues.
+ Contribute to automation efforts and continuous improvement of cluster operations under guidance from senior engineers.
+ Support compliance and security standards through consistent patching, auditing, and logging practices.
Minimum Qualifications
+ Bachelor's Degree.
+ Three years of related experience.
+ Additional education may substitute for required experience and additional related experience may substitute for required education beyond a high school diploma/graduation equivalent, to the extent permitted by the JHU equivalency formula.
Preferred Qualifications
+ Two to five years of experience administering Linux systems in a high-performance computing, academic, or enterprise environment.
+ Familiarity with HPC concepts such as cluster scheduling, distributed storage, and job submission (experience with Slurm or similar is a plus).
+ Working knowledge of configuration management tools (Ansible, Puppet) and scripting languages (Bash, Python).
+ Basic understanding of networking, system security, and user authentication (LDAP, Kerberos, Active Directory).
+ Experience installing and maintaining software packages, libraries, and user environments.
+ Comfortable supporting researchers and users in troubleshooting job failures, file system issues, or application environment problems.
+ Awareness of GPU computing, containerization (Apptainer/Singularity), and parallel file systems desirable but not required.
+ Strong documentation and communication skills with a commitment to operational excellence and teamwork.
Technical Skills & Expected Level of Proficiency
+ Automation - Advanced
+ Directory Services - Advanced
+ Operating Software - Advanced
+ Scripting - Advanced
+ Software Development Life Cycle - Advanced
+ Systems Analysis - Advanced
+ Systems Configuration - Advanced
+ Systems Development - Advanced
+ Systems Integration - Advanced
Classified Title: Sr. Systems Administrator
Job Posting Title (Working Title): HPC Sr. Systems Administrator (IT@JH Research Computing)
Role/Level/Range: ATP/04/PD
Starting Salary Range: $62,900 - $110,100 Annually (Commensurate w/exp.)
Employee group: Full Time
Schedule: Mon-Fri, 8:30am-5pm
FLSA Status: Exempt
Department name: IT@JH Research Computing
Personnel area: University Administration
Equal Opportunity Employer
All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or status as a protected veteran.