Senior Data Engineer - Data Impact & Governance
MD Anderson Cancer Center (Houston, TX)
*Summary:*
The mission of The University of Texas M. D. Anderson Cancer Center is to eliminate cancer in Texas, the nation, and the world through outstanding programs that integrate patient care, research, prevention, and education. Core to the success of our mission is the ability to orchestrate multidimensional data, data analytics, and machine learning to drive decisions that are safer, faster, and proven to improve outcomes. Join us as we turn data into lasting impact for every patient we serve.
We are seeking a Senior Data Engineer to lead enterprise-scale AI/ML data engineering efforts across the entire AI lifecycle, including data/feature/vector storage, management, and high-throughput data pipelines for inference and monitoring.
This role will architect, build, and optimize robust, secure, and scalable data pipelines that power advanced machine learning, generative AI, and agentic-AI applications in clinical and business operations.
You will directly shape the AI data infrastructure foundation, mentor our broader team of healthcare data engineers, and collaborate with Health IT, researchers, and clinical stakeholders to integrate cutting-edge AI/ML capabilities into real-world healthcare environments.
*Core Responsibilities include:*
*Build and Scale AI/ML Data Pipelines:*
• Design, implement, and maintain batch and streaming data pipelines supporting ML training, deployment, inference, and monitoring using Azure, Dataiku, and open-source tools.
*Data, Feature and Vector Store Engineering:*
• Deploy and manage raw data, feature, and vector stores to enable fast, reliable access to feature data for production AI/ML systems.
*Automate Infrastructure and Deployments:*
• Use IaC and CI/CD workflows to automate infrastructure and pipeline deployments, improving reliability and efficiency across environments.
*Ensure Data Quality and Trust:*
• Implement validation, lineage, anomaly detection, and drift monitoring to deliver fresh, accurate, and compliant data.
*Security and Compliance by Design:*
• Enforce encryption, RBAC, tokenization, and audit logging, ensuring compliance with HIPAA/HITRUST and institutional standards while enabling scalable AI operations.
*Collaborate and Lead:*
• Partner with healthcare-system-focused data engineers, ML engineers, data scientists, product teams, and application owners to deliver scalable AI solutions. Provide mentorship, drive best practices, and foster a high-performance, learning-oriented culture.
*Own and Operate:*
• Work within a DevOps model, owning pipelines and infrastructure end-to-end, including monitoring, alerting, incident management, and continuous improvement.
*Technical Expertise*
*Programming:*
Python and SQL for large-scale data engineering; Spark for distributed processing.
*Cloud and On-Prem Data Platforms:*
Azure (Fabric OneLake, Synapse, Blob Storage, etc.) and on-prem RDBMS, Neo4j, and MongoDB.
*Pipeline Orchestration:*
Airflow and Dataiku for workflow management; Spark/Fabric for high-volume processing.
*Feature and Vector Stores:*
Feast, Pinecone, pgvector, Azure Feature Store.
*Data Integration & Streaming:*
CDC, Kafka/Event Hubs, event-driven architectures; HL7, FHIR, DICOM standards.
*Deployment Automation:*
Terraform, Bicep, Helm for Kubernetes deployments, GitHub Actions, and Azure DevOps for CI/CD workflows.
*Monitoring & Observability:*
Data validation, lineage, anomaly detection, pipeline monitoring.
*Security & Compliance:*
Encryption, RBAC, audit logging; HIPAA/HITRUST compliance.
*Analytical Expertise*
*Exploratory & descriptive analytics:*
Conduct large-scale SQL/Python EDA to surface trends and anomalies.
*Data profiling & quality assessment:*
Execute profiling checks and root-cause analysis to remediate defects.
*Statistical inference & trend analysis:*
Apply statistical tests and time-series methods to monitor pipeline performance and data-quality drift.
*Data lineage & impact analysis:*
Quantify upstream/downstream dependencies and model impacts of schema changes.
*Healthcare datasets & data management:*
Deep understanding of data standards and practical experience with EHR (HL7, FHIR, clinical notes), medical imaging (DICOM/PACS), pathology, claims, and other healthcare data assets; experienced in extraction, normalization, and de-identification for analytics.
*Oral and Written Communication*
Collaborate with research data scientists, ML engineers, and software engineers to integrate machine learning models into existing systems.
Document processes, pipelines, workflows, and machine learning experiments.
Apply project management frameworks (e.g., Agile) to track progress and communicate risks and impacts to a variety of audiences.
Present complex technical concepts clearly to both technical and non-technical audiences through reports, meetings, and professional forums.
Manage stakeholder relations to facilitate solution adoption and address issues.
Other duties as assigned.
*Required Education:* Bachelor's degree.
*Preferred Education:* Master's degree.
*Required Certification:* Must obtain at least one Epic Data Model certification (Clinical, Access, or Revenue) issued by Epic within 180 days of date of entry into job.
*Preferred Certification:* Any of the following:
Azure Data Engineer Associate (DP-203),
Epic Cogito Certification,
HIPAA Privacy & Security Certification,
HL7/FHIR Certification.
*Required Experience:* Five years of relevant information technology experience. May substitute required education with years of related experience on a one-to-one basis. With preferred degree, three years of experience required.
*Preferred Experience:* Two years of industry experience in a Senior Data Scientist role; experience in the AI/machine learning space; and knowledge of data privacy, security, and HIPAA compliance in healthcare.
It is the policy of The University of Texas MD Anderson Cancer Center to provide equal employment opportunity without regard to race, color, religion, age, national origin, sex, gender, sexual orientation, gender identity/expression, disability, protected veteran status, genetic information, or any other basis protected by institutional policy or by federal, state or local laws unless such distinction is required by law. http://www.mdanderson.org/about-us/legal-and-policy/legal-statements/eeo-affirmative-action.html
Additional Information
* Requisition ID: 175514
* Employment Status: Full-Time
* Employee Status: Regular
* Work Week: Days
* Minimum Salary: US Dollar (USD) 123,000
* Midpoint Salary: US Dollar (USD) 154,000
* Maximum Salary: US Dollar (USD) 185,000
* FLSA: exempt and not eligible for overtime pay
* Fund Type: Hard
* Work Location: Remote (within Texas only)
* Pivotal Position: Yes
* Referral Bonus Available?: Yes
* Relocation Assistance Available?: Yes
* Science Jobs: No