HPC Sr. Scientific Software Engineer (IT@JH…
Johns Hopkins University (Baltimore, MD)
IT@JH Research Computing is seeking an **_HPC Sr. Scientific Software Engineer_** who will design, build, and support Johns Hopkins University’s high-performance computing and AI research infrastructure. This role integrates elements of both systems and software engineering, ensuring scalable, secure, and reproducible environments for scientific and data-intensive research. The Engineer develops and automates system and application workflows across CPU/GPU clusters, parallel storage, and hybrid cloud platforms. Responsibilities include configuring and optimizing large-scale Linux environments, implementing job scheduling and orchestration frameworks, containerizing applications, and supporting researchers in optimizing performance and reproducibility. Work combines project-based engineering with operational support, requiring both independent problem-solving and close collaboration with the Research Computing team and faculty stakeholders.
Specific Duties & Responsibilities
_Software Deployment and Design_
+ Develop and refine deployment strategies for scientific software on HPC and AI systems.
+ Design computational workflows, select optimal software configurations, and use tools like Ansible for automation.
+ Assist teams in implementing, tuning, and optimizing AI models and gateway applications (e.g., XDMoD, ColdFront, Open OnDemand, CryoSPARC Live, SBGrid, AI agents).
_Performance Optimization_
+ Analyze and optimize the performance of AI models and HPC applications, focusing on GPU-enabled computing.
+ Implement parallel processing, distributed computing, and resource management techniques for efficient job execution.
_Integration and Optimization_
+ Develop, debug, and maintain software tools, libraries, and frameworks supporting HPC and AI workloads.
+ Collaborate with the systems team and software vendors (e.g., NVIDIA, Intel, MATLAB) to optimize systems for maximum performance.
+ Utilize CUDA, DNN, TensorRT, and Intel Compilers to enhance system performance.
_HPC Scientific Software Support_
+ Manage and support scientific software deployment across HPC, cloud-based, and colocation facilities.
+ Oversee installation, configuration, and maintenance of HPC packages with tools like CMake, Make, EasyBuild, Spack, and Lua module files.
_Collaboration and Mentorship_
+ Work closely with cross-functional teams, including researchers, data scientists, and software developers, to address complex HPC/AI challenges.
+ Mentor junior engineers and foster a culture of continuous learning.
_Technical Support, Training Workshops, and Troubleshooting_
+ Resolve complex technical issues and perform root cause analysis for HPC/AI software challenges.
+ Implement effective solutions to prevent recurrence and improve system reliability.
+ Provide training workshops for researchers and students, focusing on troubleshooting, optimizing workflows, and effectively using HPC systems.
_Learning and Development_
+ Stay current with advances in HPC and AI technologies and methodologies.
+ Incorporate new research findings into existing systems to improve performance and capabilities.
_Container Orchestration_
+ Develop and manage container orchestration strategies to ensure scalability, reliability, and security of applications.
+ Oversee the container lifecycle from creation and deployment to scaling and removal.
_Documentation and Compliance_
+ Create comprehensive documentation for system designs, performance metrics, and project status.
+ Ensure compliance with security and regulatory standards for all HPC and AI systems.
_In Addition to the Duties Described Above_
+ Design, deploy, and maintain large-scale Linux HPC clusters with CPU/GPU resources, high-speed networks, and distributed storage.
+ Develop and maintain automation frameworks for provisioning, monitoring, and software lifecycle management.
+ Implement and optimize job scheduling, container orchestration, and workflow automation tools to support diverse research workloads.
+ Collaborate with faculty and research teams to parallelize, containerize, and scale computational workflows for multi-GPU and distributed environments.
+ Benchmark and tune application performance across architectures, documenting findings and sharing best practices.
+ Integrate and support AI/ML frameworks, scientific libraries, and workflow engines (Snakemake, Nextflow, Dask, Ray).
+ Ensure system and application reliability through proactive monitoring (Prometheus, Grafana, ELK) and incident response participation.
+ Support reproducibility and FAIR data principles through version-controlled, containerized environments.
+ Contribute to documentation, training materials, and technical guidance to enhance user experience and self-service capabilities.
+ Participate in evaluation and adoption of new technologies to advance performance, efficiency, and sustainability in research computing.
Minimum Qualifications
+ PhD in a quantitative discipline.
+ Five years of experience in HPC user support, software deployment, and performance optimization within an academic or research environment.
+ Additional education may substitute for required experience and additional related experience may substitute for required education beyond a high school diploma/graduation equivalent, to the extent permitted by the JHU equivalency formula.
Preferred Qualifications
+ Eight or more years of professional experience in high-performance computing, large-scale systems, or research software engineering.
+ Deep proficiency in Linux systems administration, performance tuning, and automation tools (Ansible, Terraform, Jenkins, or similar).
+ Experience with cluster management, workload schedulers (e.g., Slurm), and distributed or parallel file systems (e.g., GPFS, Lustre, WekaFS, Ceph).
+ Strong background in programming or scripting (Python, Bash, C/C++, Go, or Rust).
+ Familiarity with containerization and orchestration technologies used in HPC (Singularity, Apptainer, Docker, Kubernetes).
+ Understanding of high-speed interconnects (InfiniBand, 100/400 Gb Ethernet) and storage/data access patterns for AI and analytics.
+ Experience developing or maintaining CI/CD pipelines and module environments (Lmod/Spack) for research software.
+ Knowledge of GPU computing (CUDA, ROCm), MPI/OpenMP, and AI/ML frameworks.
+ Demonstrated ability to collaborate with researchers on performance optimization, workflow design, and reproducible computing.
Classified Title: HPC Sr. Scientific Software Engineer
Job Posting Title (Working Title): HPC Sr. Scientific Software Engineer (IT@JH Research Computing)
Role/Level/Range: ATP/04/PG
Starting Salary Range: $99,800 - $175,000 Annually (commensurate with experience)
Employee group: Full Time
Schedule: Mon-Fri, 8:30am-5pm
FLSA Status: Exempt
Department name: IT@JH Research Computing
Personnel area: University Administration