What is the PHT?
Documentation and User Guide for the Personal Health Train (PHT), an open-source, container-based, secure distributed analysis platform. For more information about the PHT team, projects and collaborations, you can also visit our website.
The Personal Health Train (PHT) is a paradigm proposed within the GO:FAIR initiative as one solution for distributed analysis of medical data, enhancing their FAIRness. Rather than transferring data to a central analysis site, the analysis algorithm (wrapped in a ‘train’) travels between multiple sites (e.g., hospitals, the so-called ‘train stations’) that host the data in a secure fashion.
Healthcare can profit from machine learning (ML) by ‘learning’ models that support clinical practice in treatment decision support systems (TDSS). The robustness of an obtained model, and whether its results are meaningful, generally depends on the number of training samples and on data quality.
However, meaningful data for improving predictions in medical research and healthcare is often distributed across multiple sites and is not easily accessible. This data contains highly sensitive patient information, may be stored in different formats at each site, and cannot be shared without the explicit consent of the patient. Our goal is to make this data available to trains through stations, supporting privacy-preserving distributed machine learning in healthcare with our open-source implementation of the PHT.
Implementing trains as light-weight containers enables even complex data analysis workflows to travel between sites, for example, genomics pipelines or deep-learning algorithms – analytics methods that are not easily amenable to established distributed queries or simple statistics.
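The idea of sending the algorithm to the data can be illustrated with a few lines of plain Python. This is only a minimal sketch under stated assumptions: the `Train`, `Station` and `dispatch` names are hypothetical and not part of the PHT code base, and a real train is a container image rather than a Python object. The sketch shows a train that computes a global mean while each station's records stay local.

```python
from dataclasses import dataclass


@dataclass
class Train:
    """Hypothetical train: carries only aggregate state between stations."""
    total: float = 0.0
    count: int = 0

    def run(self, local_values: list[float]) -> None:
        # Only the running sum and count leave the station, never the records.
        self.total += sum(local_values)
        self.count += len(local_values)

    @property
    def mean(self) -> float:
        return self.total / self.count


@dataclass
class Station:
    """Hypothetical station: hosts data that is never transferred."""
    name: str
    records: list[float]

    def execute(self, train: Train) -> Train:
        train.run(self.records)
        return train


def dispatch(train: Train, route: list[Station]) -> Train:
    """Send the train along its route, one station after another."""
    for station in route:
        train = station.execute(train)
    return train


if __name__ == "__main__":
    route = [
        Station("hospital_a", [1.0, 2.0, 3.0]),
        Station("hospital_b", [5.0]),
    ]
    finished = dispatch(Train(), route)
    print(f"global mean over {finished.count} samples: {finished.mean}")
```

The same pattern extends to more complex workloads: whatever state the algorithm needs (for example model weights) travels with the train, while the input data remains at the station.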
The PHT is built from the following services and components:
RabbitMQ - Message broker for consuming and publishing commands & events between different services
Harbor - Docker registry to manage (train) images
Vault - Secret storage to securely store sensitive information
User Interface (UI) - Frontend application for proposal and train management, downloading of results and much more
API - Backend application to manage resources and trigger commands & events through the message broker
Train Manager - Microservice serving different components:
  Train Building - Build and distribute train images to a registry
  Train Routing - Move trains between projects & registries according to the route of the train
  Result Extracting - Download, extract & serve encrypted results from the registry
Airflow - Open-source tool to create and schedule workflows that enables persistent access to data as well as execution and monitoring of trains
Keycloak - Identity and Access Management (IAM) to manage users and roles
Desktop App - GUI to manage key pairs and decrypt results locally
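To make the interplay between the API, the message broker and the Train Manager more concrete, the following sketch mimics the command flow in a single process. It rests on several assumptions: Python's standard `queue.Queue` stands in for RabbitMQ, and the command names (`build`, `route`, `extract`), payload fields and handler functions are hypothetical, since the actual PHT message schema is not described here.

```python
import queue


def build_train(payload: dict) -> str:
    # Stand-in for Train Building: build an image and push it to a registry.
    return f"built image for train {payload['train_id']}"


def route_train(payload: dict) -> str:
    # Stand-in for Train Routing: move the train to the next stop on its route.
    route, position = payload["route"], payload["position"]
    if position + 1 < len(route):
        return f"train {payload['train_id']} moved to {route[position + 1]}"
    return f"train {payload['train_id']} finished its route"


def extract_result(payload: dict) -> str:
    # Stand-in for Result Extracting: fetch encrypted results from the registry.
    return f"results of train {payload['train_id']} ready for download"


# Hypothetical mapping from command name to the component that handles it.
HANDLERS = {"build": build_train, "route": route_train, "extract": extract_result}


def publish(broker: queue.Queue, command: str, payload: dict) -> None:
    """What the API would do: put a command on the broker."""
    broker.put({"command": command, "payload": payload})


def consume_all(broker: queue.Queue) -> list[str]:
    """What the Train Manager would do: consume commands and emit events."""
    events = []
    while not broker.empty():
        message = broker.get()
        handler = HANDLERS.get(message["command"])
        if handler is None:
            events.append(f"ignored unknown command {message['command']!r}")
            continue
        events.append(handler(message["payload"]))
    return events


if __name__ == "__main__":
    broker = queue.Queue()
    publish(broker, "build", {"train_id": 7})
    publish(broker, "route", {"train_id": 7, "route": ["station_a", "station_b"], "position": 0})
    for event in consume_all(broker):
        print(event)
```

Decoupling the API from the Train Manager through a broker means that commands can be queued while a component is busy, which is the general motivation for placing a message broker between services.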
The following flow chart depicts the security protocol, which protects participating stations against malicious code and encrypts any stored results using envelope encryption.
This ensures that only approved algorithms are executed and that only previously registered participants in an analysis can access the results.
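As a rough illustration of envelope encryption in general, the sketch below encrypts results once with a fresh symmetric data key and then wraps that key separately for each registered participant. It is not the PHT implementation: the function names and the choice of Fernet and RSA-OAEP are assumptions made for this example, and it requires the third-party `cryptography` package.

```python
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

# OAEP padding used to wrap (encrypt) the symmetric data key with RSA.
OAEP = padding.OAEP(
    mgf=padding.MGF1(algorithm=hashes.SHA256()),
    algorithm=hashes.SHA256(),
    label=None,
)


def encrypt_results(results: bytes, participant_public_keys: dict) -> dict:
    """Encrypt results with a one-time data key, then wrap the key per participant."""
    data_key = Fernet.generate_key()
    ciphertext = Fernet(data_key).encrypt(results)
    wrapped_keys = {
        name: public_key.encrypt(data_key, OAEP)
        for name, public_key in participant_public_keys.items()
    }
    # The plain data key is discarded; only wrapped copies are stored.
    return {"ciphertext": ciphertext, "wrapped_keys": wrapped_keys}


def decrypt_results(envelope: dict, name: str, private_key) -> bytes:
    """Unwrap the data key with a participant's private key and decrypt."""
    data_key = private_key.decrypt(envelope["wrapped_keys"][name], OAEP)
    return Fernet(data_key).decrypt(envelope["ciphertext"])


if __name__ == "__main__":
    # Hypothetical participant registered for the analysis.
    alice_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    envelope = encrypt_results(b"model weights", {"alice": alice_key.public_key()})
    print(decrypt_results(envelope, "alice", alice_key))
```

Because only the small data key is encrypted asymmetrically, results of any size can be shared with several participants without re-encrypting the payload for each of them, and anyone without a registered key pair has no wrapped key to unwrap.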
Wikipedia: Python (https://python.org) is an interpreted high-level general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant indentation. Its language constructs as well as its object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.