Overview

Welcome to the E-Health Predictive Platform Proof of Concept. This demonstration showcases an analytical platform for processing medical data and building predictive models to support clinical decision-making.

Our platform features two primary modules:

  • Heart Disease Risk Module – Predicting cardiovascular risk using tabular machine learning
  • Influenza Seasonality Module – Monitoring and forecasting seasonal influenza patterns using time series analysis

Key advantages:

  • Reproducible ETL and ML pipeline
  • High-quality metrics with clinical interpretability
  • Model explainability using SHAP and feature importance
  • Modular architecture ready for EHR integration

Use Cases

Heart Disease Risk Prediction

Machine learning models trained on 1,025 patients with 13 clinical features to predict cardiovascular disease risk. XGBoost and LightGBM achieve AUC ~0.99 with high sensitivity and explainability aligned with cardiology guidelines.

  • 1,025 patient records
  • 13 clinical features
  • Multiple ML models (XGBoost, LightGBM, RF, GBDT, SVC)
  • SHAP-based explainability

Influenza Seasonality Monitoring

Time series analysis of weekly influenza incidence across 16 Polish regions. Uses STL decomposition, ACF/PACF analysis, k-means clustering, and classical forecasting methods to identify seasonal patterns and support public health planning.

  • 97 weekly observations
  • 16 regions monitored
  • Seasonal decomposition and forecasting
  • Epidemiological clustering

How It Works

Our platform follows a systematic approach to transform raw clinical data into actionable insights:

1. Data Collection – Clinical data from electronic health records, epidemiological databases, and medical registries

2. ETL & Validation – Data cleaning, quality checks, outlier analysis, and clinical validation

3. ML Models – Tabular ML for classification, time series analysis for forecasting

4. Insights & Dashboards – Visualizations, alerts, and clinical recommendations

Limitations & Next Steps

Important Disclaimer: This is a Proof of Concept demonstration and not a certified medical device. It is intended to showcase technical capabilities and analytical approaches.

Current Limitations

  • Limited dataset size (POC-scale data)
  • No regulatory certification or clinical validation
  • Requires integration with larger medical registries
  • Needs multi-center validation studies
  • Missing some advanced biomarkers and imaging data

Roadmap

  • Integration with Electronic Health Records (EHR) systems
  • Expansion to additional disease modules
  • Real-time monitoring and alerting capabilities
  • Clinical validation in partnership with healthcare institutions
  • Regulatory compliance and certification processes

Dataset & Clinical Context

The Heart Disease Risk Module is built on a dataset of 1,025 patients with 13 clinical features commonly used in cardiovascular risk assessment. The binary target variable indicates the presence or absence of heart disease.

Clinical Application

This module supports:

  • Early risk stratification – Identifying high-risk patients before severe symptoms develop
  • Clinical decision support – Providing evidence-based risk scores to complement physician judgment
  • Telemedicine integration – Enabling remote patient triage and monitoring
  • Alignment with guidelines – Features selected based on ESC/ACC/AHA cardiovascular assessment protocols

The dataset represents a diverse patient population with various cardiovascular risk profiles, allowing models to learn complex patterns associated with heart disease.

Variables Table

Explanatory variables and the response variable

Column   | Description                                        | Type
age      | Patient age                                        | Numerical
sex      | Sex (1 = male, 0 = female)                         | Categorical
cp       | Chest pain type (0–3)                              | Categorical
trestbps | Resting blood pressure (mm Hg)                     | Numerical
chol     | Serum cholesterol level (mg/dl)                    | Numerical
fbs      | Fasting blood sugar > 120 mg/dl (1 = yes, 0 = no)  | Binary
restecg  | Resting ECG results (0–2)                          | Categorical
thalach  | Maximum heart rate achieved                        | Numerical
exang    | Exercise-induced angina (1 = yes, 0 = no)          | Binary
oldpeak  | ST depression induced by exercise relative to rest | Numerical (continuous)
slope    | Slope of the peak exercise ST segment (0–2)        | Categorical
ca       | Number of major vessels (0–4)                      | Numerical (discrete)
thal     | Thalassemia test result (1–3 or 0–3)               | Categorical
target   | Heart disease presence (1 = yes, 0 = no)           | Binary (response variable)

Data Quality & Distributions

The dataset exhibits high quality with minimal missing values. Clinical features show realistic distributions consistent with cardiovascular patient populations:

  • No imputation required – Dataset is complete with no missing values
  • Outliers are clinically valid – High blood pressure and cholesterol values reflect genuine high-risk patients
  • Balanced target distribution – Approximately equal representation of patients with and without heart disease
  • Numerical features properly scaled – Blood pressure, cholesterol, and heart rate in expected clinical ranges

Figure: Distribution of selected clinical features in the heart disease dataset (boxplots). Outliers represent genuine high-risk clinical values.
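The completeness checks described above amount to a few pandas calls. In this sketch, `df` is a synthetic stand-in with the documented shape, since the actual dataset is not bundled here:

```python
# Sanity checks mirroring the documented data-quality claims.
# The DataFrame below is synthetic: random values with the documented
# shape (1,025 records, 13 features + target), only for illustration.
import numpy as np
import pandas as pd

cols = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
        "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 2, size=(1025, 14)), columns=cols)

print("shape:", df.shape)                          # (1025, 14)
print("missing values:", df.isna().sum().sum())    # 0 -> no imputation required
print(df["target"].value_counts(normalize=True).round(2))  # target balance check
```

On the real dataset the same three calls verify the "no imputation required" and "balanced target" statements directly.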

Model Benchmark

We compared five machine learning algorithms to identify the best approach for heart disease prediction. All models were evaluated using stratified cross-validation with standard metrics:

Model             | Accuracy | Recall | Precision | AUC
XGBoost           | 0.98     | 0.98   | 0.98      | 0.99
LightGBM          | 0.97     | 0.97   | 0.97      | 0.99
Random Forest     | 0.95     | 0.94   | 0.96      | 0.98
Gradient Boosting | 0.94     | 0.93   | 0.95      | 0.97
SVC               | 0.89     | 0.85   | 0.91      | 0.93
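The evaluation setup can be sketched with scikit-learn's stratified cross-validation. The data below is synthetic (`make_classification`, matching the 1,025 × 13 shape) and only the scikit-learn models are shown; XGBoost and LightGBM classifiers plug into the same loop:

```python
# Stratified cross-validation benchmark over several classifiers.
# Synthetic data stands in for the real dataset; scores are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.svm import SVC

X, y = make_classification(n_samples=1025, n_features=13, random_state=42)

models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "SVC": SVC(probability=True, random_state=42),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv,
                            scoring=["accuracy", "recall", "precision", "roc_auc"])
    # Average each test metric across the 5 stratified folds
    print(name, {k: round(v.mean(), 3)
                 for k, v in scores.items() if k.startswith("test_")})
```

Stratification keeps the class ratio identical in every fold, which matters for honest recall estimates on medical data.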

Key Findings:

  • Boosting methods dominate – XGBoost and LightGBM significantly outperform other approaches
  • High recall is critical – Minimizing false negatives (missed diagnoses) is paramount in medical applications
  • SVC underperforms – Too many false negatives make it unsuitable for clinical use despite decent accuracy

Figure: ROC curves comparing all five models. XGBoost and LightGBM achieve near-perfect AUC scores.

Selected Model: XGBoost

XGBoost was selected as the primary model due to its superior performance across all metrics and excellent balance between sensitivity and specificity.

Performance Metrics

  • Accuracy: 98% – Correctly classifies nearly all patients
  • Recall: 98% – Catches almost all heart disease cases (minimal false negatives)
  • Precision: 98% – Very few false alarms
  • AUC: 0.99 – Excellent discrimination capability

Figure: XGBoost confusion matrix showing high true positive and true negative rates with minimal misclassifications.
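The reported metrics are standard scikit-learn computations on a held-out test set. The tiny label and probability arrays below are purely illustrative, not model output:

```python
# How the four reported metrics are computed; the arrays are invented
# for illustration and do not come from the trained model.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

y_true = [1, 1, 1, 1, 0, 0, 0, 0]                  # ground-truth labels
y_prob = [0.95, 0.90, 0.85, 0.40, 0.10, 0.20, 0.05, 0.60]  # predicted risk
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]    # threshold at 0.5

print("accuracy :", accuracy_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))   # sensitivity: missed cases are costly
print("precision:", precision_score(y_true, y_pred))
print("auc      :", roc_auc_score(y_true, y_prob))  # uses probabilities, not labels
```

Note that AUC is computed from the raw risk scores, while accuracy, recall, and precision depend on the chosen decision threshold.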

Clinical Significance

The low false negative rate is particularly important in medical diagnostics. Missing a heart disease diagnosis (false negative) can have severe consequences, while false positives can be addressed through follow-up testing. XGBoost achieves an optimal balance.

Example Prediction: A 58-year-old male patient with chest pain type 2, elevated ST depression, and 2 major vessels showed a predicted risk score of 0.98, correctly identified as high-risk for heart disease.

Explainability & Clinical Validation

Model explainability is critical for clinical adoption. We used SHAP (SHapley Additive exPlanations) and feature importance analysis to understand which factors drive predictions.

Top Predictive Features

  1. cp (chest pain type) – Different types of chest pain have varying associations with heart disease
  2. thal (thalassemia test) – Blood disorder marker with strong predictive power
  3. ca (number of major vessels) – Direct indicator of coronary artery disease severity
  4. exang (exercise-induced angina) – Classic symptom of insufficient cardiac blood flow
  5. oldpeak (ST depression) – ECG indicator of cardiac stress

Figure: XGBoost feature importance showing the relative contribution of each clinical variable to model predictions.
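A sketch of the feature-importance step, using scikit-learn's gradient boosting as a stand-in for XGBoost and synthetic data (so the resulting ranking itself is not meaningful); the analogous SHAP call is noted in a comment:

```python
# Feature-importance ranking sketch. Feature names follow the variables
# table; the data is synthetic, so the printed ranking is illustrative only.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

feature_names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                 "thalach", "exang", "oldpeak", "slope", "ca", "thal"]
X, y = make_classification(n_samples=1025, n_features=13, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Impurity-based importances, normalized to sum to 1
ranking = pd.Series(model.feature_importances_,
                    index=feature_names).sort_values(ascending=False)
print(ranking.head(5))

# The SHAP analysis for tree models follows the same pattern:
#   explainer = shap.TreeExplainer(model)
#   shap_values = explainer.shap_values(X)
```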

Clinical Alignment

The top features identified by the model align perfectly with established cardiology guidelines (ESC/ACC/AHA). This concordance validates that the model has learned clinically meaningful patterns rather than spurious correlations.

Figure: SHAP decision plot (Random Forest model) showing how individual features contribute to predictions for specific patients.

Limitations

While the model shows excellent performance, several limitations must be acknowledged:

Dataset Limitations

  • Small sample size – 1,025 patients is insufficient for full clinical validation
  • Missing biomarkers – No troponin, BNP, or other modern cardiac biomarkers
  • No imaging data – Echocardiography, CT, and MRI could enhance predictions
  • Limited demographic metadata – Missing data on ethnicity, socioeconomic factors, and comorbidities
  • Single-center origin – Dataset may not generalize to different populations or healthcare settings

Model Limitations

  • No external validation – Model has not been tested on independent datasets
  • Static predictions – Does not incorporate temporal changes or disease progression
  • Black-box complexity – Despite SHAP analysis, boosting models remain less interpretable than simple scores

Next Steps

  • Validate on larger, multi-center datasets
  • Incorporate additional biomarkers and imaging data
  • Develop temporal models to track disease progression
  • Conduct prospective clinical trials

Demo Example

Interactive demonstration of the heart disease risk prediction system. Enter patient data to see real-time risk assessment.

⚠️ IMPORTANT: This is a demonstration only and not intended for clinical use. Results are illustrative and based on simplified heuristic scoring. Always consult qualified healthcare professionals for medical decisions.
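As the disclaimer notes, the demo relies on simplified heuristic scoring rather than the trained model. A purely illustrative sketch of such a heuristic (every threshold here is invented and has no clinical validity):

```python
# Hypothetical demo-style heuristic: the fraction of triggered risk
# flags. Cut-offs are invented for illustration and carry no clinical
# meaning; the real module uses a trained XGBoost model instead.
def demo_risk_score(age, cp, oldpeak, ca, exang):
    flags = [
        age >= 55,       # older patient
        cp >= 2,         # higher chest-pain category
        oldpeak > 1.0,   # notable exercise ST depression
        ca >= 1,         # at least one affected major vessel
        exang == 1,      # exercise-induced angina present
    ]
    return sum(flags) / len(flags)  # 0.0 (no flags) .. 1.0 (all flags)

print(demo_risk_score(age=58, cp=2, oldpeak=2.3, ca=2, exang=1))  # → 1.0
```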

Patient Data Entry

The demo accepts individual patient records (ID, age, sex, chest pain type, cholesterol, resting blood pressure, and oldpeak) and displays each patient's computed risk percentage and risk level. A statistics panel summarizes the entered cohort: total patients; average, minimum, and maximum risk; and the distribution across LOW, MEDIUM, and HIGH risk bands.

Dataset Overview

The Influenza Seasonality Module analyzes weekly influenza incidence data across 16 Polish regions (województwa). After filtering, the dataset contains 97 weekly observations representing approximately two influenza seasons.

Epidemiological Significance

Influenza surveillance is critical for:

  • Public health planning – Anticipating seasonal peaks to allocate medical resources
  • Vaccination campaigns – Timing immunization efforts based on seasonal patterns
  • Early warning systems – Detecting unusual activity that may indicate pandemic risk
  • Regional coordination – Understanding geographic spread patterns across provinces

Data is measured as cases per 100,000 inhabitants, allowing fair comparison across regions with different population sizes.
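The normalization is simple but worth making explicit; the populations and case counts below are invented for illustration:

```python
# Per-100k normalization makes regions of different sizes comparable.
# Populations and case counts here are illustrative, not real figures.
population = {"mazowieckie": 5_400_000, "opolskie": 980_000}
cases = {"mazowieckie": 8_100, "opolskie": 1_470}

incidence = {region: cases[region] / population[region] * 100_000
             for region in cases}
print(incidence)  # both regions show the same per-100k rate despite
                  # very different absolute case counts
```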

Figure: Weekly influenza incidence per 100k inhabitants across Polish regions, showing clear seasonal patterns with winter peaks.

Time Series Pattern

The time series reveals strong seasonal characteristics typical of influenza epidemiology in temperate climates:

Observed Patterns

  • Winter seasonality – Sharp peaks during December–February, when cold weather and indoor crowding facilitate viral transmission
  • Summer troughs – Near-zero incidence during the warm months (June–August)
  • Rapid rise and fall – Epidemic waves build quickly over 4–6 weeks and decline over 6–8 weeks
  • Inter-regional synchrony – Most regions peak simultaneously, suggesting nationwide spread

Clinical Implications

These patterns align with known influenza biology: the virus survives longer in cold, dry air and people spend more time indoors during winter. Understanding these cycles helps healthcare systems prepare for predictable seasonal surges.

Autocorrelation & Lag Features

Time series analysis reveals strong autocorrelation: the current week's incidence is highly predictable from the preceding weeks.

Stationarity Testing

The Augmented Dickey-Fuller (ADF) test confirms the series is stationary (p < 0.05), meaning statistical properties remain consistent over time despite seasonal fluctuations. This validates the use of classical forecasting methods.

Figure: Autocorrelation (ACF) and partial autocorrelation (PACF) functions showing significant correlations at seasonal lags.

Lag Features for Prediction

We created lag features (lag_1, lag_2, lag_3) representing incidence from 1, 2, and 3 weeks prior. These features show strong predictive power:

  • lag_1 – Previous week is the strongest predictor (correlation > 0.9)
  • lag_2 – Two weeks prior still highly correlated
  • lag_3 – Three weeks prior provides additional context for trend direction

This autocorrelation structure makes influenza incidence an ideal candidate for time series forecasting models like SARIMA and Holt-Winters.
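The lag features described above are a one-liner per lag with `pandas.Series.shift`; the incidence series here is a synthetic stand-in:

```python
# Building lag_1..lag_3 features from a weekly incidence series.
# The series is synthetic; real data would be loaded from surveillance files.
import numpy as np
import pandas as pd

incidence = np.abs(np.sin(np.linspace(0, 12, 97))) * 100  # ~2 synthetic seasons
df = pd.DataFrame({"incidence": incidence})

for k in (1, 2, 3):
    df[f"lag_{k}"] = df["incidence"].shift(k)  # value from k weeks prior
df = df.dropna()  # the first 3 weeks have no full lag history

# Correlation of each lag with the current week's incidence
print(df.corr()["incidence"].round(3))
```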

Clustering (K-Means)

We applied k-means clustering to identify groups of regions with similar epidemic patterns. Surprisingly, the clusters follow epidemiological similarity rather than geographic proximity.

Optimal Cluster Count

Using the elbow method and silhouette scores, we identified k=4 as the optimal number of clusters. This suggests four distinct epidemic profiles across Poland.
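The cluster-count selection can be sketched with scikit-learn's silhouette score. The 16 region vectors below are synthetic descriptors (one row per region, e.g. amplitude, timing, volatility), not the real epidemic profiles:

```python
# Silhouette-based selection of k for k-means. Synthetic "region"
# vectors are drawn around four invented epidemic profiles.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
centers = rng.normal(0, 5, size=(4, 3))  # four synthetic epidemic profiles
# 16 regions: 4 noisy samples around each profile
X = np.vstack([c + rng.normal(0, 0.5, size=(4, 3)) for c in centers])

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))  # higher is better
```

On the real data, the elbow of the inertia curve and the silhouette maximum jointly pointed to k=4.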

Figure: PCA projection of the k-means clusters. Regions group by epidemic characteristics (amplitude, timing, volatility) rather than geography.

Cluster Interpretation

  • High-amplitude clusters – Urban regions with larger seasonal peaks
  • Stable clusters – Regions with consistent, predictable patterns
  • Volatile clusters – Regions with irregular fluctuations and secondary peaks
  • Low-incidence clusters – Rural or lower-population areas with muted epidemics

Public Health Insight

This clustering reveals that geographic neighbors may have very different epidemic dynamics. Public health interventions should account for epidemiological profiles, not just administrative boundaries.

Decomposition & Forecasting

We used STL (Seasonal and Trend decomposition using Loess) to separate the time series into three components: trend, seasonality, and residuals.

STL Decomposition

  • Trend component – Shows gradual changes in baseline incidence over multiple seasons
  • Seasonal component – Captures the repeating annual winter peak pattern
  • Residual component – Represents random noise and irregular events

Why Classical Models?

With only 97 data points, classical statistical models (SARIMA, Holt-Winters) are more appropriate than deep learning:

  • Deep learning (LSTM, GRU) requires hundreds or thousands of observations
  • SARIMA and Holt-Winters are specifically designed for seasonal data
  • Classical methods provide interpretable parameters
  • Lower risk of overfitting on small datasets

Forecast Performance

We demonstrate 4–8 week forecasts using SARIMA and Holt-Winters, showing reasonable accuracy for near-term predictions. Forecast uncertainty increases with horizon, as expected.

Applications & Limitations

Practical Applications

  • Seasonal alerts – Automated warnings when incidence exceeds historical thresholds
  • Resource planning – Hospitals can anticipate bed, staff, and supply needs 4–8 weeks ahead
  • Vaccination campaigns – Optimal timing based on predicted epidemic onset
  • Policy decisions – Evidence-based recommendations for school closures or public health measures during severe seasons
  • Research tool – Understanding regional differences to investigate social and environmental factors

Current Limitations

  • Short time horizon – Only ~2 seasons of data limits long-term trend analysis
  • Missing external factors – No data on weather, vaccination rates, population mobility, or virus strain
  • No sub-regional granularity – Province-level data may mask important local hotspots
  • Lacks severity metrics – Counts cases but not hospitalizations or mortality
  • Reporting delays – Real-world forecasting must account for a 1–2 week lag in confirmed case data

Future Enhancements

  • Incorporate weather data (temperature, humidity) as exogenous variables
  • Integrate vaccination coverage rates by region
  • Add mobility data from cell phone networks to track viral spread
  • Extend to include other respiratory diseases (RSV, COVID-19) for multi-pathogen surveillance
  • Develop ensemble forecasting combining multiple models

Demo Example

Interactive visualization of influenza incidence data across Polish regions with forecasting capabilities.

⚠️ IMPORTANT: This is a demonstration with synthetic data and not intended for public health decisions. Data shown is illustrative and generated for demonstration purposes only. Real epidemiological decisions require validated data and professional analysis.

Influenza Incidence Time Series

The demo charts weekly incidence and produces a short-term outlook: a forecast index (out of 100), a seasonal-pressure indicator, and point forecasts for the next four weeks. A statistics panel reports the peak week and peak value, the mean value, the current level relative to the mean, and the number of weeks above the alert threshold (>120 per 100k).

End-to-End Pipeline

Our platform follows a systematic data flow from raw clinical data to actionable insights:

1. Data Import

Ingest data from Electronic Health Records (EHR), epidemiological databases, medical registries, and clinical trials

2. ETL & Validation

Extract, Transform, Load pipeline with data quality checks, outlier detection, clinical validation, and standardization

3. Feature Engineering

Create derived features, lag variables for time series, interaction terms, and domain-specific transformations

4. Model Training & Validation

Train ML models with cross-validation, hyperparameter tuning, and performance evaluation on held-out test sets

5. Explainability Analysis

Generate SHAP values, feature importance, and clinical validation reports to ensure interpretability

6. Dashboard & Alerts

Deploy models to interactive dashboards with real-time predictions, visualizations, and automated alerting

System Architecture

The platform is designed with a three-tier architecture ensuring modularity, scalability, and security:

Layer 1: Presentation (Web UI)

  • Interactive dashboards for clinicians and administrators
  • Responsive design for desktop and mobile access
  • Real-time visualization of predictions and trends
  • User authentication and role-based access control

Layer 2: Logic (API & ML Engine)

  • RESTful API for model serving and data queries
  • ML model registry with versioning and A/B testing
  • Background job scheduler for periodic retraining
  • Business logic for clinical rules and alerting thresholds

Layer 3: Data (Database & Storage)

  • Relational database for structured clinical data
  • Time-series database for epidemiological surveillance
  • Object storage for model artifacts and logs
  • Data warehouse for analytics and reporting

Deployment Model

On-premises deployment ensures compliance with medical data regulations:

  • All data remains within hospital or health authority infrastructure
  • No cloud upload of patient data
  • Full control over security and access policies
  • Compliance with GDPR, HIPAA, and local health data regulations

Security & Compliance

Data Security

  • Encryption at rest – All databases encrypted using AES-256
  • Encryption in transit – TLS 1.3 for all network communication
  • Access control – Role-based permissions with audit logging
  • Anonymization – Patient identifiers stripped or pseudonymized for analytics
  • Backup and recovery – Automated daily backups with tested restore procedures

Regulatory Compliance

While this POC is not a certified medical device, the architecture is designed with regulatory pathways in mind:

  • GDPR compliance – Data minimization, right to erasure, consent management
  • Medical device readiness – Documentation and validation framework aligned with EU MDR and FDA guidance
  • Clinical validation protocols – Structured process for prospective evaluation in real-world settings
  • Audit trails – Complete logging of all predictions, data access, and model updates

Ethical Considerations

  • Models are decision-support tools, not autonomous diagnostic systems
  • Final clinical decisions always rest with qualified healthcare professionals
  • Bias monitoring and fairness audits across demographic groups
  • Transparent communication of model limitations to users

Future Modules

The modular architecture allows for straightforward expansion to additional clinical domains:

Planned Disease Modules

  • Diabetes risk stratification – Predict progression from prediabetes to Type 2 diabetes
  • Chronic kidney disease – Early detection of declining renal function
  • Sepsis early warning – Real-time ICU monitoring for sepsis onset
  • Stroke risk prediction – Combining cardiovascular risk factors with imaging data
  • COVID-19 severity forecasting – Predict which patients require ICU admission
  • Multi-pathogen surveillance – Extend influenza module to RSV, COVID-19, and other respiratory diseases

Advanced Capabilities

  • EHR integration – Direct ingestion from Epic, Cerner, and other major systems
  • Telemedicine support – Risk scores accessible during virtual consultations
  • Mobile applications – Patient-facing apps for self-monitoring and education
  • Natural language processing – Extract insights from clinical notes and radiology reports
  • Computer vision – Analyze medical imaging (X-rays, CT, MRI) alongside tabular data
  • Federated learning – Train models across multiple hospitals without sharing patient data

Research & Development

  • Partnerships with academic medical centers for validation studies
  • Open-source contributions to advance healthcare AI
  • Participation in medical AI competitions and benchmarks
  • Publication of methods and results in peer-reviewed journals

Project Goals

This Proof of Concept demonstrates the technical and clinical feasibility of an AI-powered e-health analytics platform. Our primary objectives:

  • Demonstrate reproducible ML pipelines – Show that we can process real clinical data with proper ETL, validation, and model training workflows
  • Extract clinically meaningful insights – Prove that even with limited POC-scale data, we can derive actionable conclusions aligned with medical guidelines
  • Achieve high predictive accuracy – Deliver models with performance metrics suitable for clinical decision support (AUC > 0.95, high recall)
  • Ensure explainability – Use SHAP, feature importance, and clinical validation to make model predictions interpretable to healthcare professionals
  • Establish foundation for MVP – Create modular architecture that can scale to production-grade e-health platform with multiple disease modules

Success Criteria

We consider this POC successful if it demonstrates:

  1. Ability to ingest and process diverse medical data types (tabular, time series)
  2. Model performance meeting or exceeding published benchmarks
  3. Clinical interpretability through explainability methods
  4. Clear pathway to production deployment and regulatory compliance
  5. Stakeholder confidence in technical capabilities and clinical value proposition

Methodology

Tabular Machine Learning (Heart Disease Module)

  • Data preprocessing – Quality checks, outlier analysis, feature scaling
  • Model selection – Comparison of XGBoost, LightGBM, Random Forest, GBDT, and SVC
  • Cross-validation – Stratified k-fold to ensure robust performance estimates
  • Hyperparameter tuning – Grid search and Bayesian optimization
  • Evaluation metrics – Accuracy, recall, precision, AUC with focus on minimizing false negatives
  • Explainability – SHAP values and feature importance to validate clinical alignment

Time Series Analysis (Influenza Module)

  • Exploratory analysis – Visualization of seasonal patterns and regional differences
  • Stationarity testing – Augmented Dickey-Fuller test to validate modeling assumptions
  • Decomposition – STL (Seasonal-Trend decomposition using Loess) to separate components
  • Autocorrelation analysis – ACF and PACF to identify lag dependencies
  • Feature engineering – Lag features, rolling averages, seasonal indicators
  • Clustering – K-means to identify epidemiologically similar regions
  • Forecasting – SARIMA and Holt-Winters for 4-8 week predictions

Software Engineering Practices

  • Version control – Git for all code, data schemas, and documentation
  • Reproducibility – Random seeds, environment files, containerization
  • Testing – Unit tests for data processing, integration tests for pipelines
  • Documentation – Comprehensive README, API documentation, clinical interpretation guides

Disclaimers

Not a Medical Device

This platform is a Proof of Concept demonstration and NOT a certified medical device. It has not undergone regulatory review or approval by any health authority (FDA, EMA, etc.).

Important Limitations

  • No diagnostic claims – This system does not diagnose, treat, cure, or prevent any disease
  • Decision support only – Predictions are intended to support, not replace, clinical judgment by qualified healthcare professionals
  • Limited validation – Models are trained on small datasets and have not been validated in prospective clinical trials
  • Research use only – Suitable for academic research, algorithm development, and technical demonstrations—not for patient care
  • No liability – Creators assume no liability for decisions made using this system or consequences thereof

Intended Audience

This demonstration is intended for:

  • Healthcare administrators evaluating AI/ML capabilities
  • Clinical informaticists and data scientists
  • Academic researchers in medical AI
  • Potential investors and partners
  • Regulatory and compliance professionals

Path to Clinical Use

Before any clinical deployment, this system would require:

  1. Validation on large, multi-center datasets
  2. Prospective clinical trials demonstrating safety and efficacy
  3. Regulatory review and approval as a medical device
  4. Integration with hospital quality assurance and safety protocols
  5. Ongoing monitoring and periodic revalidation

Team & Acknowledgments

Project Team

This Proof of Concept was developed by a multidisciplinary team combining expertise in:

  • Machine Learning & Data Science
  • Clinical Medicine & Public Health
  • Software Engineering & DevOps
  • Regulatory Affairs & Healthcare Compliance

Data Sources

We acknowledge the following data sources:

  • Heart Disease Dataset – UCI Machine Learning Repository, publicly available research dataset
  • Influenza Surveillance Data – Weekly epidemiological reports from Polish health authorities

Open Source Tools

This project builds on excellent open-source software:

  • Python scientific stack (NumPy, Pandas, Scikit-learn)
  • XGBoost and LightGBM for gradient boosting
  • Statsmodels for time series analysis
  • SHAP for model explainability
  • Matplotlib and Seaborn for visualization

Contact & Collaboration

We welcome feedback, collaboration opportunities, and discussions about potential applications. Please contact us for:

  • Technical questions about the platform
  • Partnership and licensing inquiries
  • Academic collaborations and research projects
  • Pilot deployment discussions with healthcare organizations

Contact

Interested in collaboration, pilot deployment, or technical details? Get in touch with us using the information below or the contact form.

Author

Address

SKILL & CHILL / Sw. Marcin 29/8, Poznań

Contact Form