Overview

Welcome to the E-Health Predictive Platform Proof of Concept. This demonstration showcases an analytical platform for processing medical data and building predictive models to support clinical decision-making.

Our platform features two primary modules:

  • Heart Disease Risk Module – Predicting cardiovascular risk using tabular machine learning
  • Influenza Seasonality Module – Monitoring and forecasting seasonal influenza patterns using time series analysis

Key advantages:

  • Reproducible ETL and ML pipeline
  • High-quality metrics with clinical interpretability
  • Model explainability using SHAP and feature importance
  • Modular architecture ready for EHR integration

Use Cases

Heart Disease Risk Prediction

Machine learning models trained on 1,025 patients with 13 clinical features to predict cardiovascular disease risk. XGBoost and LightGBM achieve AUC ~0.99 with high sensitivity and explainability aligned with cardiology guidelines.

  • 1,025 patient records
  • 13 clinical features
  • Multiple ML models (XGBoost, LightGBM, RF, GBDT, SVC)
  • SHAP-based explainability

Influenza Seasonality Monitoring

Time series analysis of weekly influenza incidence across 16 Polish regions. Uses STL decomposition, ACF/PACF analysis, k-means clustering, and classical forecasting methods to identify seasonal patterns and support public health planning.

  • 97 weekly observations
  • 16 regions monitored
  • Seasonal decomposition and forecasting
  • Epidemiological clustering

How It Works

Our platform follows a systematic approach to transform raw clinical data into actionable insights:

1. Data Collection – Clinical data from electronic health records, epidemiological databases, and medical registries

2. ETL & Validation – Data cleaning, quality checks, outlier analysis, and clinical validation

3. ML Models – Tabular ML for classification, time series analysis for forecasting

4. Insights & Dashboards – Visualizations, alerts, and clinical recommendations

Limitations & Next Steps

Important Disclaimer: This is a Proof of Concept demonstration and not a certified medical device. It is intended to showcase technical capabilities and analytical approaches.

Current Limitations

  • Limited dataset size (POC-scale data)
  • No regulatory certification or clinical validation
  • Requires integration with larger medical registries
  • Needs multi-center validation studies
  • Missing some advanced biomarkers and imaging data

Roadmap

  • Integration with Electronic Health Records (EHR) systems
  • Expansion to additional disease modules
  • Real-time monitoring and alerting capabilities
  • Clinical validation in partnership with healthcare institutions
  • Regulatory compliance and certification processes

Dataset & Clinical Context

The Heart Disease Risk Module is built on a dataset of 1,025 patients with 13 clinical features commonly used in cardiovascular risk assessment. The binary target variable indicates the presence or absence of heart disease.

Clinical Application

This module supports:

  • Early risk stratification – Identifying high-risk patients before severe symptoms develop
  • Clinical decision support – Providing evidence-based risk scores to complement physician judgment
  • Telemedicine integration – Enabling remote patient triage and monitoring
  • Alignment with guidelines – Features selected based on ESC/ACC/AHA cardiovascular assessment protocols

The dataset represents a diverse patient population with various cardiovascular risk profiles, allowing models to learn complex patterns associated with heart disease.

Variables Table

Explanatory variables and the response variable

Column   | Description                                        | Type
age      | Patient age                                        | Numerical
sex      | Sex (1 = male, 0 = female)                         | Categorical
cp       | Chest pain type (0–3)                              | Categorical
trestbps | Resting blood pressure (mm Hg)                     | Numerical
chol     | Serum cholesterol level (mg/dl)                    | Numerical
fbs      | Fasting blood sugar > 120 mg/dl (1 = yes, 0 = no)  | Binary
restecg  | Resting ECG results (0–2)                          | Categorical
thalach  | Maximum heart rate achieved                        | Numerical
exang    | Exercise-induced angina (1 = yes, 0 = no)          | Binary
oldpeak  | ST depression induced by exercise relative to rest | Numerical (continuous)
slope    | Slope of the peak exercise ST segment (0–2)        | Categorical
ca       | Number of major vessels (0–4)                      | Numerical (discrete)
thal     | Thalassemia test result (1–3 or 0–3)               | Categorical
target   | Heart disease presence (1 = yes, 0 = no)           | Binary (response variable)

Data Quality & Distributions

The dataset exhibits high quality with minimal missing values. Clinical features show realistic distributions consistent with cardiovascular patient populations:

  • No imputation required – Dataset is complete with no missing values
  • Outliers are clinically valid – High blood pressure and cholesterol values reflect genuine high-risk patients
  • Balanced target distribution – Approximately equal representation of patients with and without heart disease
  • Numerical features properly scaled – Blood pressure, cholesterol, and heart rate in expected clinical ranges

Figure: Distribution of selected clinical features in the heart disease dataset (boxplots). Outliers represent genuine high-risk clinical values.
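The completeness checks described above amount to a few pandas calls. In this sketch, `df` is a synthetic stand-in with the documented shape, since the actual dataset is not bundled here:

```python
# Sanity checks mirroring the documented data-quality claims.
# The DataFrame below is synthetic: random values with the documented
# shape (1,025 records, 13 features + target), only for illustration.
import numpy as np
import pandas as pd

cols = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
        "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 2, size=(1025, 14)), columns=cols)

print("shape:", df.shape)                          # (1025, 14)
print("missing values:", df.isna().sum().sum())    # 0 -> no imputation required
print(df["target"].value_counts(normalize=True).round(2))  # target balance check
```

On the real dataset the same three calls verify the "no imputation required" and "balanced target" statements directly.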

Model Benchmark

We compared five machine learning algorithms to identify the best approach for heart disease prediction. All models were evaluated using stratified cross-validation with standard metrics:

Model             | Accuracy | Recall | Precision | AUC
XGBoost           | 0.98     | 0.98   | 0.98      | 0.99
LightGBM          | 0.97     | 0.97   | 0.97      | 0.99
Random Forest     | 0.95     | 0.94   | 0.96      | 0.98
Gradient Boosting | 0.94     | 0.93   | 0.95      | 0.97
SVC               | 0.89     | 0.85   | 0.91      | 0.93
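The evaluation setup can be sketched with scikit-learn's stratified cross-validation. The data below is synthetic (`make_classification`, matching the 1,025 × 13 shape) and only the scikit-learn models are shown; XGBoost and LightGBM classifiers plug into the same loop:

```python
# Stratified cross-validation benchmark over several classifiers.
# Synthetic data stands in for the real dataset; scores are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.svm import SVC

X, y = make_classification(n_samples=1025, n_features=13, random_state=42)

models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "SVC": SVC(probability=True, random_state=42),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv,
                            scoring=["accuracy", "recall", "precision", "roc_auc"])
    # Average each test metric across the 5 stratified folds
    print(name, {k: round(v.mean(), 3)
                 for k, v in scores.items() if k.startswith("test_")})
```

Stratification keeps the class ratio identical in every fold, which matters for honest recall estimates on medical data.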

Key Findings:

  • Boosting methods dominate – XGBoost and LightGBM significantly outperform other approaches
  • High recall is critical – Minimizing false negatives (missed diagnoses) is paramount in medical applications
  • SVC underperforms – Too many false negatives make it unsuitable for clinical use despite decent accuracy

Figure: ROC curves comparing all five models. XGBoost and LightGBM achieve near-perfect AUC scores.

Selected Model: XGBoost

XGBoost was selected as the primary model due to its superior performance across all metrics and excellent balance between sensitivity and specificity.

Performance Metrics

  • Accuracy: 98% – Correctly classifies nearly all patients
  • Recall: 98% – Catches almost all heart disease cases (minimal false negatives)
  • Precision: 98% – Very few false alarms
  • AUC: 0.99 – Excellent discrimination capability

Figure: XGBoost confusion matrix showing high true positive and true negative rates with minimal misclassifications.
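The reported metrics are standard scikit-learn computations on a held-out test set. The tiny label and probability arrays below are purely illustrative, not model output:

```python
# How the four reported metrics are computed; the arrays are invented
# for illustration and do not come from the trained model.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

y_true = [1, 1, 1, 1, 0, 0, 0, 0]                  # ground-truth labels
y_prob = [0.95, 0.90, 0.85, 0.40, 0.10, 0.20, 0.05, 0.60]  # predicted risk
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]    # threshold at 0.5

print("accuracy :", accuracy_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))   # sensitivity: missed cases are costly
print("precision:", precision_score(y_true, y_pred))
print("auc      :", roc_auc_score(y_true, y_prob))  # uses probabilities, not labels
```

Note that AUC is computed from the raw risk scores, while accuracy, recall, and precision depend on the chosen decision threshold.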

Clinical Significance

The low false negative rate is particularly important in medical diagnostics. Missing a heart disease diagnosis (false negative) can have severe consequences, while false positives can be addressed through follow-up testing. XGBoost achieves an optimal balance.

Example Prediction: A 58-year-old male patient with chest pain type 2, elevated ST depression, and 2 major vessels showed a predicted risk score of 0.98, correctly identified as high-risk for heart disease.

Explainability & Clinical Validation

Model explainability is critical for clinical adoption. We used SHAP (SHapley Additive exPlanations) and feature importance analysis to understand which factors drive predictions.

Top Predictive Features

  1. cp (chest pain type) – Different types of chest pain have varying associations with heart disease
  2. thal (thalassemia test) – Blood disorder marker with strong predictive power
  3. ca (number of major vessels) – Direct indicator of coronary artery disease severity
  4. exang (exercise-induced angina) – Classic symptom of insufficient cardiac blood flow
  5. oldpeak (ST depression) – ECG indicator of cardiac stress

Figure: XGBoost feature importance showing the relative contribution of each clinical variable to model predictions.
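A sketch of the feature-importance step, using scikit-learn's gradient boosting as a stand-in for XGBoost and synthetic data (so the resulting ranking itself is not meaningful); the analogous SHAP call is noted in a comment:

```python
# Feature-importance ranking sketch. Feature names follow the variables
# table; the data is synthetic, so the printed ranking is illustrative only.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

feature_names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                 "thalach", "exang", "oldpeak", "slope", "ca", "thal"]
X, y = make_classification(n_samples=1025, n_features=13, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Impurity-based importances, normalized to sum to 1
ranking = pd.Series(model.feature_importances_,
                    index=feature_names).sort_values(ascending=False)
print(ranking.head(5))

# The SHAP analysis for tree models follows the same pattern:
#   explainer = shap.TreeExplainer(model)
#   shap_values = explainer.shap_values(X)
```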

Clinical Alignment

The top features identified by the model align perfectly with established cardiology guidelines (ESC/ACC/AHA). This concordance validates that the model has learned clinically meaningful patterns rather than spurious correlations.

Figure: SHAP decision plot (Random Forest model) showing how individual features contribute to predictions for specific patients.

Limitations

While the model shows excellent performance, several limitations must be acknowledged:

Dataset Limitations

  • Small sample size – 1,025 patients is insufficient for full clinical validation
  • Missing biomarkers – No troponin, BNP, or other modern cardiac biomarkers
  • No imaging data – Echocardiography, CT, and MRI could enhance predictions
  • Limited demographic metadata – Missing data on ethnicity, socioeconomic factors, and comorbidities
  • Single-center origin – Dataset may not generalize to different populations or healthcare settings

Model Limitations

  • No external validation – Model has not been tested on independent datasets
  • Static predictions – Does not incorporate temporal changes or disease progression
  • Black-box complexity – Despite SHAP analysis, boosting models remain less interpretable than simple scores

Next Steps

  • Validate on larger, multi-center datasets
  • Incorporate additional biomarkers and imaging data
  • Develop temporal models to track disease progression
  • Conduct prospective clinical trials

Demo Example

Interactive demonstration of the heart disease risk prediction system. Enter patient data to see real-time risk assessment.

⚠️ IMPORTANT: This is a demonstration only and not intended for clinical use. Results are illustrative and based on simplified heuristic scoring. Always consult qualified healthcare professionals for medical decisions.
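As the disclaimer notes, the demo relies on simplified heuristic scoring rather than the trained model. A purely illustrative sketch of such a heuristic (every threshold here is invented and has no clinical validity):

```python
# Hypothetical demo-style heuristic: the fraction of triggered risk
# flags. Cut-offs are invented for illustration and carry no clinical
# meaning; the real module uses a trained XGBoost model instead.
def demo_risk_score(age, cp, oldpeak, ca, exang):
    flags = [
        age >= 55,       # older patient
        cp >= 2,         # higher chest-pain category
        oldpeak > 1.0,   # notable exercise ST depression
        ca >= 1,         # at least one affected major vessel
        exang == 1,      # exercise-induced angina present
    ]
    return sum(flags) / len(flags)  # 0.0 (no flags) .. 1.0 (all flags)

print(demo_risk_score(age=58, cp=2, oldpeak=2.3, ca=2, exang=1))  # → 1.0
```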

Patient Data Entry

The demo accepts individual patient records (ID, age, sex, chest pain type, cholesterol, resting blood pressure, and oldpeak) and displays each patient's computed risk percentage and risk level. A statistics panel summarizes the entered cohort: total patients; average, minimum, and maximum risk; and the distribution across LOW, MEDIUM, and HIGH risk bands.

Dataset Overview

The Influenza Seasonality Module analyzes weekly influenza incidence data across 16 Polish regions (województwa). After filtering, the dataset contains 97 weekly observations representing approximately two influenza seasons.

Epidemiological Significance

Influenza surveillance is critical for:

  • Public health planning – Anticipating seasonal peaks to allocate medical resources
  • Vaccination campaigns – Timing immunization efforts based on seasonal patterns
  • Early warning systems – Detecting unusual activity that may indicate pandemic risk
  • Regional coordination – Understanding geographic spread patterns across provinces

Data is measured as cases per 100,000 inhabitants, allowing fair comparison across regions with different population sizes.
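The normalization is simple but worth making explicit; the populations and case counts below are invented for illustration:

```python
# Per-100k normalization makes regions of different sizes comparable.
# Populations and case counts here are illustrative, not real figures.
population = {"mazowieckie": 5_400_000, "opolskie": 980_000}
cases = {"mazowieckie": 8_100, "opolskie": 1_470}

incidence = {region: cases[region] / population[region] * 100_000
             for region in cases}
print(incidence)  # both regions show the same per-100k rate despite
                  # very different absolute case counts
```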

Figure: Weekly influenza incidence per 100k inhabitants across Polish regions, showing clear seasonal patterns with winter peaks.

Time Series Pattern

The time series reveals strong seasonal characteristics typical of influenza epidemiology in temperate climates:

Observed Patterns

  • Winter seasonality – Sharp peaks during December–February, when cold weather and indoor crowding facilitate viral transmission
  • Summer troughs – Near-zero incidence during the warm months (June–August)
  • Rapid rise and fall – Epidemic waves build quickly over 4–6 weeks and decline over 6–8 weeks
  • Inter-regional synchrony – Most regions peak simultaneously, suggesting nationwide spread

Clinical Implications

These patterns align with known influenza biology: the virus survives longer in cold, dry air and people spend more time indoors during winter. Understanding these cycles helps healthcare systems prepare for predictable seasonal surges.

Autocorrelation & Lag Features

Time series analysis reveals strong autocorrelation: the current week's incidence is highly predictable from the preceding weeks.

Stationarity Testing

The Augmented Dickey-Fuller (ADF) test confirms the series is stationary (p < 0.05), meaning statistical properties remain consistent over time despite seasonal fluctuations. This validates the use of classical forecasting methods.

Figure: Autocorrelation (ACF) and partial autocorrelation (PACF) functions showing significant correlations at seasonal lags.

Lag Features for Prediction

We created lag features (lag_1, lag_2, lag_3) representing incidence from 1, 2, and 3 weeks prior. These features show strong predictive power:

  • lag_1 – Previous week is the strongest predictor (correlation > 0.9)
  • lag_2 – Two weeks prior still highly correlated
  • lag_3 – Three weeks prior provides additional context for trend direction

This autocorrelation structure makes influenza incidence an ideal candidate for time series forecasting models like SARIMA and Holt-Winters.
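The lag features described above are a one-liner per lag with `pandas.Series.shift`; the incidence series here is a synthetic stand-in:

```python
# Building lag_1..lag_3 features from a weekly incidence series.
# The series is synthetic; real data would be loaded from surveillance files.
import numpy as np
import pandas as pd

incidence = np.abs(np.sin(np.linspace(0, 12, 97))) * 100  # ~2 synthetic seasons
df = pd.DataFrame({"incidence": incidence})

for k in (1, 2, 3):
    df[f"lag_{k}"] = df["incidence"].shift(k)  # value from k weeks prior
df = df.dropna()  # the first 3 weeks have no full lag history

# Correlation of each lag with the current week's incidence
print(df.corr()["incidence"].round(3))
```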

Clustering (K-Means)

We applied k-means clustering to identify groups of regions with similar epidemic patterns. Surprisingly, the clusters follow epidemiological similarity rather than geographic proximity.

Optimal Cluster Count

Using the elbow method and silhouette scores, we identified k=4 as the optimal number of clusters. This suggests four distinct epidemic profiles across Poland.
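The cluster-count selection can be sketched with scikit-learn's silhouette score. The 16 region vectors below are synthetic descriptors (one row per region, e.g. amplitude, timing, volatility), not the real epidemic profiles:

```python
# Silhouette-based selection of k for k-means. Synthetic "region"
# vectors are drawn around four invented epidemic profiles.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
centers = rng.normal(0, 5, size=(4, 3))  # four synthetic epidemic profiles
# 16 regions: 4 noisy samples around each profile
X = np.vstack([c + rng.normal(0, 0.5, size=(4, 3)) for c in centers])

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))  # higher is better
```

On the real data, the elbow of the inertia curve and the silhouette maximum jointly pointed to k=4.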

Figure: PCA projection of the k-means clusters. Regions group by epidemic characteristics (amplitude, timing, volatility) rather than geography.

Cluster Interpretation

  • High-amplitude clusters – Urban regions with larger seasonal peaks
  • Stable clusters – Regions with consistent, predictable patterns
  • Volatile clusters – Regions with irregular fluctuations and secondary peaks
  • Low-incidence clusters – Rural or lower-population areas with muted epidemics

Public Health Insight

This clustering reveals that geographic neighbors may have very different epidemic dynamics. Public health interventions should account for epidemiological profiles, not just administrative boundaries.

Decomposition & Forecasting

We used STL (Seasonal and Trend decomposition using Loess) to separate the time series into three components: trend, seasonality, and residuals.

STL Decomposition

  • Trend component – Shows gradual changes in baseline incidence over multiple seasons
  • Seasonal component – Captures the repeating annual winter peak pattern
  • Residual component – Represents random noise and irregular events

Why Classical Models?

With only 97 data points, classical statistical models (SARIMA, Holt-Winters) are more appropriate than deep learning:

  • Deep learning (LSTM, GRU) requires hundreds or thousands of observations
  • SARIMA and Holt-Winters are specifically designed for seasonal data
  • Classical methods provide interpretable parameters
  • Lower risk of overfitting on small datasets

Forecast Performance

We demonstrate 4–8 week forecasts using SARIMA and Holt-Winters, showing reasonable accuracy for near-term predictions. Forecast uncertainty increases with horizon, as expected.

Applications & Limitations

Practical Applications

  • Seasonal alerts – Automated warnings when incidence exceeds historical thresholds
  • Resource planning – Hospitals can anticipate bed, staff, and supply needs 4–8 weeks ahead
  • Vaccination campaigns – Optimal timing based on predicted epidemic onset
  • Policy decisions – Evidence-based recommendations for school closures or public health measures during severe seasons
  • Research tool – Understanding regional differences to investigate social and environmental factors

Current Limitations

  • Short time horizon – Only ~2 seasons of data limits long-term trend analysis
  • Missing external factors – No data on weather, vaccination rates, population mobility, or virus strain
  • No sub-regional granularity – Province-level data may mask important local hotspots
  • Lacks severity metrics – Counts cases but not hospitalizations or mortality
  • Reporting delays – Real-world forecasting must account for a 1–2 week lag in confirmed case data

Future Enhancements

  • Incorporate weather data (temperature, humidity) as exogenous variables
  • Integrate vaccination coverage rates by region
  • Add mobility data from cell phone networks to track viral spread
  • Extend to include other respiratory diseases (RSV, COVID-19) for multi-pathogen surveillance
  • Develop ensemble forecasting combining multiple models

Demo Example

Interactive visualization of influenza incidence data across Polish regions with forecasting capabilities.

⚠️ IMPORTANT: This is a demonstration with synthetic data and not intended for public health decisions. Data shown is illustrative and generated for demonstration purposes only. Real epidemiological decisions require validated data and professional analysis.

Influenza Incidence Time Series

The demo charts weekly incidence and produces a short-term outlook: a forecast index (out of 100), a seasonal-pressure indicator, and point forecasts for the next four weeks. A statistics panel reports the peak week and peak value, the mean value, the current level relative to the mean, and the number of weeks above the alert threshold (>120 per 100k).

End-to-End Pipeline

Our platform follows a systematic data flow from raw clinical data to actionable insights:

1. Data Import

Ingest data from Electronic Health Records (EHR), epidemiological databases, medical registries, and clinical trials

2. ETL & Validation

Extract, Transform, Load pipeline with data quality checks, outlier detection, clinical validation, and standardization

3. Feature Engineering

Create derived features, lag variables for time series, interaction terms, and domain-specific transformations

4. Model Training & Validation

Train ML models with cross-validation, hyperparameter tuning, and performance evaluation on held-out test sets

5. Explainability Analysis

Generate SHAP values, feature importance, and clinical validation reports to ensure interpretability

6. Dashboard & Alerts

Deploy models to interactive dashboards with real-time predictions, visualizations, and automated alerting

System Architecture

The platform is designed with a three-tier architecture ensuring modularity, scalability, and security:

Layer 1: Presentation (Web UI)

  • Interactive dashboards for clinicians and administrators
  • Responsive design for desktop and mobile access
  • Real-time visualization of predictions and trends
  • User authentication and role-based access control

Layer 2: Logic (API & ML Engine)

  • RESTful API for model serving and data queries
  • ML model registry with versioning and A/B testing
  • Background job scheduler for periodic retraining
  • Business logic for clinical rules and alerting thresholds

Layer 3: Data (Database & Storage)

  • Relational database for structured clinical data
  • Time-series database for epidemiological surveillance
  • Object storage for model artifacts and logs
  • Data warehouse for analytics and reporting

Deployment Model

On-premises deployment ensures compliance with medical data regulations:

  • All data remains within hospital or health authority infrastructure
  • No cloud upload of patient data
  • Full control over security and access policies
  • Compliance with GDPR, HIPAA, and local health data regulations

Security & Compliance

Data Security

  • Encryption at rest – All databases encrypted using AES-256
  • Encryption in transit – TLS 1.3 for all network communication
  • Access control – Role-based permissions with audit logging
  • Anonymization – Patient identifiers stripped or pseudonymized for analytics
  • Backup and recovery – Automated daily backups with tested restore procedures

Regulatory Compliance

While this POC is not a certified medical device, the architecture is designed with regulatory pathways in mind:

  • GDPR compliance – Data minimization, right to erasure, consent management
  • Medical device readiness – Documentation and validation framework aligned with EU MDR and FDA guidance
  • Clinical validation protocols – Structured process for prospective evaluation in real-world settings
  • Audit trails – Complete logging of all predictions, data access, and model updates

Ethical Considerations

  • Models are decision-support tools, not autonomous diagnostic systems
  • Final clinical decisions always rest with qualified healthcare professionals
  • Bias monitoring and fairness audits across demographic groups
  • Transparent communication of model limitations to users

Future Modules

The modular architecture allows for straightforward expansion to additional clinical domains:

Planned Disease Modules

  • Diabetes risk stratification – Predict progression from prediabetes to Type 2 diabetes
  • Chronic kidney disease – Early detection of declining renal function
  • Sepsis early warning – Real-time ICU monitoring for sepsis onset
  • Stroke risk prediction – Combining cardiovascular risk factors with imaging data
  • COVID-19 severity forecasting – Predict which patients require ICU admission
  • Multi-pathogen surveillance – Extend influenza module to RSV, COVID-19, and other respiratory diseases

Advanced Capabilities

  • EHR integration – Direct ingestion from Epic, Cerner, and other major systems
  • Telemedicine support – Risk scores accessible during virtual consultations
  • Mobile applications – Patient-facing apps for self-monitoring and education
  • Natural language processing – Extract insights from clinical notes and radiology reports
  • Computer vision – Analyze medical imaging (X-rays, CT, MRI) alongside tabular data
  • Federated learning – Train models across multiple hospitals without sharing patient data

Research & Development

  • Partnerships with academic medical centers for validation studies
  • Open-source contributions to advance healthcare AI
  • Participation in medical AI competitions and benchmarks
  • Publication of methods and results in peer-reviewed journals

Project Goals

This Proof of Concept demonstrates the technical and clinical feasibility of an AI-powered e-health analytics platform. Our primary objectives:

  • Demonstrate reproducible ML pipelines – Show that we can process real clinical data with proper ETL, validation, and model training workflows
  • Extract clinically meaningful insights – Prove that even with limited POC-scale data, we can derive actionable conclusions aligned with medical guidelines
  • Achieve high predictive accuracy – Deliver models with performance metrics suitable for clinical decision support (AUC > 0.95, high recall)
  • Ensure explainability – Use SHAP, feature importance, and clinical validation to make model predictions interpretable to healthcare professionals
  • Establish foundation for MVP – Create modular architecture that can scale to production-grade e-health platform with multiple disease modules

Success Criteria

We consider this POC successful if it demonstrates:

  1. Ability to ingest and process diverse medical data types (tabular, time series)
  2. Model performance meeting or exceeding published benchmarks
  3. Clinical interpretability through explainability methods
  4. Clear pathway to production deployment and regulatory compliance
  5. Stakeholder confidence in technical capabilities and clinical value proposition

Methodology

Tabular Machine Learning (Heart Disease Module)

  • Data preprocessing – Quality checks, outlier analysis, feature scaling
  • Model selection – Comparison of XGBoost, LightGBM, Random Forest, GBDT, and SVC
  • Cross-validation – Stratified k-fold to ensure robust performance estimates
  • Hyperparameter tuning – Grid search and Bayesian optimization
  • Evaluation metrics – Accuracy, recall, precision, AUC with focus on minimizing false negatives
  • Explainability – SHAP values and feature importance to validate clinical alignment

Time Series Analysis (Influenza Module)

  • Exploratory analysis – Visualization of seasonal patterns and regional differences
  • Stationarity testing – Augmented Dickey-Fuller test to validate modeling assumptions
  • Decomposition – STL (Seasonal-Trend decomposition using Loess) to separate components
  • Autocorrelation analysis – ACF and PACF to identify lag dependencies
  • Feature engineering – Lag features, rolling averages, seasonal indicators
  • Clustering – K-means to identify epidemiologically similar regions
  • Forecasting – SARIMA and Holt-Winters for 4-8 week predictions

Software Engineering Practices

  • Version control – Git for all code, data schemas, and documentation
  • Reproducibility – Random seeds, environment files, containerization
  • Testing – Unit tests for data processing, integration tests for pipelines
  • Documentation – Comprehensive README, API documentation, clinical interpretation guides

Disclaimers

Not a Medical Device

This platform is a Proof of Concept demonstration and NOT a certified medical device. It has not undergone regulatory review or approval by any health authority (FDA, EMA, etc.).

Important Limitations

  • No diagnostic claims – This system does not diagnose, treat, cure, or prevent any disease
  • Decision support only – Predictions are intended to support, not replace, clinical judgment by qualified healthcare professionals
  • Limited validation – Models are trained on small datasets and have not been validated in prospective clinical trials
  • Research use only – Suitable for academic research, algorithm development, and technical demonstrations—not for patient care
  • No liability – Creators assume no liability for decisions made using this system or consequences thereof

Intended Audience

This demonstration is intended for:

  • Healthcare administrators evaluating AI/ML capabilities
  • Clinical informaticists and data scientists
  • Academic researchers in medical AI
  • Potential investors and partners
  • Regulatory and compliance professionals

Path to Clinical Use

Before any clinical deployment, this system would require:

  1. Validation on large, multi-center datasets
  2. Prospective clinical trials demonstrating safety and efficacy
  3. Regulatory review and approval as a medical device
  4. Integration with hospital quality assurance and safety protocols
  5. Ongoing monitoring and periodic revalidation

Team & Acknowledgments

Project Team

This Proof of Concept was developed by a multidisciplinary team combining expertise in:

  • Machine Learning & Data Science
  • Clinical Medicine & Public Health
  • Software Engineering & DevOps
  • Regulatory Affairs & Healthcare Compliance

Data Sources

We acknowledge the following data sources:

  • Heart Disease Dataset – UCI Machine Learning Repository, publicly available research dataset
  • Influenza Surveillance Data – Weekly epidemiological reports from Polish health authorities

Open Source Tools

This project builds on excellent open-source software:

  • Python scientific stack (NumPy, Pandas, Scikit-learn)
  • XGBoost and LightGBM for gradient boosting
  • Statsmodels for time series analysis
  • SHAP for model explainability
  • Matplotlib and Seaborn for visualization

Contact & Collaboration

We welcome feedback, collaboration opportunities, and discussions about potential applications. Please contact us for:

  • Technical questions about the platform
  • Partnership and licensing inquiries
  • Academic collaborations and research projects
  • Pilot deployment discussions with healthcare organizations

Contact

Interested in collaboration, pilot deployment, or technical details? Get in touch with us using the information below or the contact form.

Author

Address

SKILL & CHILL / Sw. Marcin 29/8, Poznań

Contact Form