Sr. HPC Systems Engineer (IT@JH Research…
- Johns Hopkins University (Baltimore, MD)
IT@JH Research Computing is seeking a **_Sr. HPC Systems Engineer_** who will design, build, and maintain advanced high-performance computing environments supporting Johns Hopkins University’s research mission. This position focuses on the reliable operation, configuration, and optimization of HPC and AI systems, including multi-node CPU and GPU clusters, high-speed InfiniBand and Ethernet networks, and large-scale parallel and object storage. The engineer implements and automates secure, efficient, and reproducible computing platforms used by faculty, researchers, and students across diverse scientific disciplines. Assignments include both ticket-based support and project-based deployments. The role operates with moderate independence, collaborating closely with the IT Architect, Research Computing, and reporting to the IT Manager for Research Computing to ensure scalable, sustainable, and high-performance systems that enable cutting-edge scientific discovery.
Specific Duties & Responsibilities
+ Support and administer production systems used by researchers and Research Centers.
+ Provide technical leadership/project management for system configuration, implementation, management, and user support for both new and existing systems.
+ Research and recommend new functionality for HPC management and administration tools by exploring system-wide impacts and working with functional users to define current and future processes.
+ Apply expertise in architecting, operating, and debugging large-scale HPC network and storage infrastructure, including MPI, NCCL, RDMA, InfiniBand, and parallel file systems.
+ Work with scientific support specialists to assign tasks and provide oversight, as appropriate, to the HPC engineering team in support of scientific researchers who use a broad spectrum of applications from diverse fields.
+ Analyze results of server monitoring and implement changes to improve performance, processing, and utilization.
+ Propose, maintain, and enforce policies, practices and security procedures.
+ Provide break/fix support, setup/installation support, escalation support, and solutions support.
+ Collaborate closely with a variety of stakeholders, both internal and external, on all aspects of projects.
+ Other duties as assigned.
_In Addition to the Duties Described Above_
+ Deploy, configure, and maintain large-scale Linux-based HPC clusters comprising CPU and GPU nodes, high-speed interconnects, and parallel file systems.
+ Implement and optimize workload schedulers (Slurm) and job submission policies to maximize system throughput and fair-share usage.
+ Administer and monitor distributed storage systems (GPFS, Lustre, WekaFS, Ceph, MinIO) to ensure reliability and performance across multi-petabyte environments.
+ Maintain high-speed fabric and network infrastructure (InfiniBand, Ethernet) to support low-latency data transfer and MPI workloads.
+ Support research groups in deploying, testing, and optimizing scientific applications and AI/ML workflows on shared computing resources.
+ Develop and maintain automation and monitoring frameworks for system provisioning, metrics collection, and alerting (Prometheus, Grafana, ELK).
+ Participate in capacity planning, hardware lifecycle management, and evaluation of new technologies in collaboration with architects and management.
+ Ensure security and compliance through configuration hardening, patch management, and integration with campus identity and access control systems.
+ Document system designs, procedures, and troubleshooting guides to support knowledge transfer and team continuity.
+ Contribute to a collaborative engineering culture that emphasizes service quality, innovation, and continuous improvement in research computing operations.
+ Participate in on-call rotation to ensure high availability and timely response to system alerts.
Minimum Qualifications
+ Bachelor’s Degree.
+ Six years of related experience.
+ Additional education may substitute for required experience and additional related experience may substitute for required education beyond a high school diploma/graduation equivalent, to the extent permitted by the JHU equivalency formula.
Preferred Qualifications
+ Eight or more years of experience in high-performance computing systems administration or engineering, including experience with cluster management, workload scheduling (e.g., Slurm), and distributed or parallel storage.
+ Deep proficiency in Linux systems administration, configuration management (Ansible, Puppet, or Salt), performance monitoring, and tuning for HPC workloads.
+ Experience with high-speed interconnects (InfiniBand, 100/400 Gb Ethernet) and parallel file systems (e.g., GPFS, Lustre, BeeGFS, or WekaFS).
+ Working knowledge of containerization and orchestration (Singularity, Docker, Kubernetes for HPC).
+ Ability to automate deployments and routine operations through scripting (Bash, Python).
+ Familiarity with data-center operations, GPU acceleration, and research software environments (e.g., CUDA, MPI, AI/ML frameworks).
+ Strong analytical and troubleshooting skills, with proven ability to support complex research workloads in multi-user, multi-tenant environments.
+ Experience collaborating with faculty and research groups to translate scientific requirements into practical and performant computing solutions.
Technical Skills & Expected Level of Proficiency
+ Automation - Authority
+ Cloud Infrastructure - Authority
+ Cloud Migration - Authority
+ Cloud Security - Authority
+ Cloud Strategy - Authority
+ Job Scheduling Systems - Authority
+ Operating Software - Authority
+ Scripting - Authority
+ Software Development Life Cycle - Authority
+ Systems Architecture - Authority
+ Systems Analysis - Authority
+ Systems Configuration - Authority
+ Systems Design - Authority
+ Systems Development - Authority
+ Systems Engineering - Authority
+ Systems Integration - Authority
Classified Title: Sr. HPC Systems Engineer
Job Posting Title (Working Title): Sr. HPC Systems Engineer (IT@JH Research Computing)
Role/Level/Range: ATP/04/PF
Starting Salary Range: $85,500 - $149,800 Annually (Commensurate with experience)
Employee group: Full Time
Schedule: Mon-Fri, 8:30am-5pm
FLSA Status: Exempt
Department name: IT@JH Research Computing
Personnel area: University Administration
Equal Opportunity Employer
All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or status as a protected veteran.