Big Data Algorithms with Medical Applications
Yixin Chen
Outline
- Challenges to big data algorithms
- Clinical big data
- Our new algorithms
- Small data vs. big data
Key concerns: association, accuracy, high efficiency, interpretability
Large-scale Manifold Learning: Maximum Variance Correction (Chen et al., ICML'13)
Costs for ICU survivors are between six and seven times those for non-ICU care.
General hospital ward (GHW) patients are not under extensive electronic monitoring and nurse care.
Patients can suffer cardiopulmonary or respiratory arrest while in the GHW of the hospital.
Sudden deteriorations (e.g., septic shock, cardiopulmonary or respiratory arrest) of GHW patients can often be severe and life threatening.
Goal: provide early detection and intervention based on data mining to prevent these serious events.
Clinical data and wireless body sensor data: an NSF/NIH-funded clinical trial at Washington University/Barnes-Jewish Hospital.
Clinical data: high-dimensional, real-time time-series data. 34 vital signs: pulse, temperature, oxygen saturation, shock index, respirations, blood pressure, …
[Figure: example vital-sign time series; x-axis: time in seconds]
Main problem: most previous general work uses a snapshot method that takes all the features at a given time as input to a model, discarding the temporal evolution of the data.
Medical data mining combines medical knowledge with machine learning methods.
Existing clinical scores:
- SCAP and PSI
- Acute Physiology Score, Chronic Health Score, and APACHE score are used to predict renal failure
- Modified Early Warning Score (MEWS)
Machine learning methods: decision trees, neural networks, SVM
[Figure: Non-ICU vs. ICU instance counts]
Challenges: high-dimensional time-series data
Pipeline (Mao et al., KDD'12):
- Temporal feature extraction
- Bootstrap aggregating (bagging)
- Exploratory under-sampling
- Feature selection
- Exponential moving average smoothing (a minimal sketch follows this list)
- Basic classifier
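As a concrete illustration of the smoothing step, here is a minimal sketch of exponential moving average smoothing applied to one vital-sign series; the function name, the smoothing factor alpha = 0.3, and the example pulse values are illustrative assumptions, not the settings used in the actual pipeline.

```python
import numpy as np

def ema_smooth(series, alpha=0.3):
    """Exponential moving average of a 1-D vital-sign series.

    alpha is the smoothing factor (illustrative choice): larger alpha
    tracks the raw signal more closely, smaller alpha smooths more.
    """
    smoothed = np.empty(len(series), dtype=float)
    smoothed[0] = series[0]
    for t in range(1, len(series)):
        smoothed[t] = alpha * series[t] + (1 - alpha) * smoothed[t - 1]
    return smoothed

# Example: damp a transient spike in a pulse-rate reading
pulse = [82, 85, 120, 84, 83, 90, 88]
print(ema_smooth(pulse))
```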
| | kNN | NB | NN | LR | Linear SVM | Kernel SVM |
|---|---|---|---|---|---|---|
| Nonlinear classification ability | Y | N | Y | N | N | Y |
| Interpretability | N | Y | N | Y | Y | N |
| Direct support for mixed data types | Y | Y | N | N | N | N |
| Efficiency | Y | Y | Y | Y | Y | N |
| Multi-class classification | Y | Y | Y | Y | N | N |
Random Kitchen Sinks (RKS): a random nonlinear feature transformation followed by a parametric, linear classifier.
1. Transform each input x into exp(-i w_k·x), k = 1, …, K, with w_k drawn from a Gaussian distribution p(w).
2. Train a linear classifier on the transformed features.
Theory: based on the Fourier transform, RKS converges to the RBF-SVM as K grows. High efficiency, but no interpretability.
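A minimal sketch of this idea using the real-valued form of the random Fourier map (cosines with random Gaussian directions and random phases, equivalent to the complex exponential above); the Gaussian scale, K, and the choice of scikit-learn's LogisticRegression as the linear classifier are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # any linear classifier works here

def make_rks(D, K=500, gamma=1.0, seed=0):
    """Draw random directions w_k ~ N(0, 2*gamma*I) and random phases b_k."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(D, K))
    b = rng.uniform(0, 2 * np.pi, size=K)
    return W, b

def rks_transform(X, W, b):
    """Map inputs to K random Fourier features; inner products of these
    features approximate an RBF kernel as K grows."""
    K = W.shape[1]
    return np.sqrt(2.0 / K) * np.cos(X @ W + b)

# Example: a nonlinear decision boundary learned by a fast linear model
X = np.random.randn(300, 5)
y = (np.sum(X ** 2, axis=1) > 5).astype(int)
W, b = make_rks(D=5)
clf = LogisticRegression(max_iter=1000).fit(rks_transform(X, W, b), y)
print(clf.score(rks_transform(X, W, b), y))   # reuse the same W, b at test time
```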
Our approach: a non-parametric, nonlinear feature transformation + a parametric, linear classifier → efficiency, interpretability, and nonlinearity.
| | kNN | NB | NN | LR | Linear SVM | Kernel SVM | DLR |
|---|---|---|---|---|---|---|---|
| Nonlinear classification ability | Y | N | Y | N | N | Y | Y |
| Interpretability | N | Y | N | Y | Y | N | Y |
| Direct support for mixed data types | Y | Y | N | N | N | N | Y |
| Efficiency | Y | Y | Y | Y | Y | N | Y |
| Multi-class classification | Y | Y | Y | Y | N | N | Y |
DLR: Density-based Logistic Regression (Chen et al., KDD’13)
Each instance has D features: x = (x_1, …, x_D).
Training dataset: {(x^(i), y^(i)), i = 1, …, N}. Optimization: maximize the overall log-likelihood.
Assume: P(y = 1 | x) is an increasing function of a weighted sum of per-feature, density-based transformed features.
Numerical x_d: kernel density estimation (with kernel bandwidth h). Categorical x_d: smoothed histogram.
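A minimal sketch of the two per-feature estimators named above, computed from the training values of a single feature; the Gaussian kernel, the default bandwidth, and the add-alpha smoothing constant are illustrative assumptions (in DLR these estimates are combined per class, which is not shown here).

```python
import numpy as np

def kde_numeric(x, train_vals, h=1.0):
    """Gaussian kernel density estimate of a numerical feature value x,
    using the training values of that feature and bandwidth h."""
    z = (x - np.asarray(train_vals, dtype=float)) / h
    return np.exp(-0.5 * z ** 2).mean() / (h * np.sqrt(2 * np.pi))

def smoothed_histogram(x, train_vals, alpha=1.0):
    """Smoothed (add-alpha) histogram estimate for a categorical feature."""
    cats, counts = np.unique(train_vals, return_counts=True)
    hit = counts[cats == x]
    count = hit[0] if hit.size else 0
    return (count + alpha) / (counts.sum() + alpha * len(cats))

# Example: estimate densities of a heart-rate value and a gender category
print(kde_numeric(95.0, [80, 82, 90, 110, 130], h=5.0))
print(smoothed_histogram("F", ["M", "F", "F", "M", "F"]))
```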
Objective: maximize the overall log-likelihood over the training dataset; the objective is a function of both the weights w and the bandwidths h.
1. Initialize h and w.
2. Calculate the new feature vectors using the current h.
3. Update w.
4. Update h.
5. If not converged, go to step 2.
Fix h and optimize w (using an LR solver); fix w and optimize h (steepest gradient descent); repeat until convergence.
[Figure: estimated densities at the initial h and after iterations 1–3]
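A skeleton of the alternating loop above, stated with assumptions: `transform(X, h)` stands in for the density-based feature map with bandwidths h, the w-step uses scikit-learn's LogisticRegression as the LR solver, and the h-step below is a crude finite-difference gradient-ascent step rather than the exact gradient used in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def total_log_likelihood(clf, Phi, y):
    """Sum of log P(y_i | phi(x_i)) under the fitted linear classifier."""
    p = np.clip(clf.predict_proba(Phi)[:, 1], 1e-12, 1 - 1e-12)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def bump(h, d, eps):
    h2 = h.copy()
    h2[d] += eps
    return h2

def fit_alternating(X, y, transform, h, n_outer=10, step=0.01, eps=1e-3, tol=1e-4):
    """Alternating optimization: fix h and fit w with an LR solver,
    then fix w and take one (finite-difference) gradient step on h."""
    clf, prev = LogisticRegression(max_iter=1000), -np.inf
    for _ in range(n_outer):
        Phi = transform(X, h)            # recompute features for the current h
        clf.fit(Phi, y)                  # fix h, optimize w
        ll = total_log_likelihood(clf, Phi, y)
        if abs(ll - prev) < tol:         # converged?
            break
        prev = ll
        grad = np.array([(total_log_likelihood(clf, transform(X, bump(h, d, eps)), y) - ll) / eps
                         for d in range(len(h))])
        h = np.maximum(h + step * grad, 1e-6)   # fix w, update h; keep bandwidths positive
    return clf, h
```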
For example, suppose y represents a particular disease and x_d represents the blood pressure (BP) of a patient. On the disease level, ranking the learned weights can identify the risk factors of this disease. On the patient level, the transformed feature value indicates the abnormality of the patient's BP, and its weighted contribution indicates the extent to which BP contributes to his disease.
The standard density-estimation kernel doesn't consider the label information.
The DLR kernel does: one term applies when a training instance has the same label, and another when it has a different label.
[Figure: test data classified by the original LR vs. density-based LR]
[Figures: results on datasets with numerical and categorical features]
SVM: 0.9194 DLR: 0.9204
Early alert when the patient appears normal to the best doctors in the world
Estimation via kernel density smoothing is still too slow for big data: testing time grows as the training set gets larger.
Estimation via histograms is ultra-fast for both training and testing, and there is no curse of dimensionality since each feature's density is estimated separately.
Issues with histograms: the estimate is not smooth, and some bins may not contain enough data.
The per-bin estimate divides the number of instances with the given label in bin i by the total number of instances in bin i.
[Figure: histogram estimates with 5, 20, and 100 bins]
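A minimal sketch of the per-feature binned estimate described above: for one feature, count label occurrences per bin once at training time, then score a test value with a constant-time bin lookup. The bin count, the add-one smoothing, and all names are illustrative assumptions.

```python
import numpy as np

def binned_label_fraction(x_train, y_train, n_bins=20, smooth=1.0):
    """For one feature: per-bin (smoothed) fraction of positive-label instances.

    Returns bin edges and per-bin estimates so a test value can be scored
    with a single digitize() lookup."""
    edges = np.histogram_bin_edges(x_train, bins=n_bins)
    bins = np.clip(np.digitize(x_train, edges[1:-1]), 0, n_bins - 1)
    pos = np.bincount(bins, weights=y_train, minlength=n_bins)   # positives per bin
    tot = np.bincount(bins, minlength=n_bins)                    # instances per bin
    return edges, (pos + smooth) / (tot + 2 * smooth)            # smoothed fraction

def score(x, edges, est):
    """Constant-time lookup for a test value."""
    i = np.clip(np.digitize(x, edges[1:-1]), 0, len(est) - 1)
    return est[i]

# Example: one vital-sign feature with binary deterioration labels
x = np.array([60, 62, 65, 90, 120, 125, 130, 70], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 0], dtype=float)
edges, est = binned_label_fraction(x, y, n_bins=5)
print(score(118.0, edges, est))
```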
Accuracy (%):

| | Splice (1K) | Mush (8K) | w5a (10K) | w8a (50K) | Adult (30K) | kddcup (1.26M) |
|---|---|---|---|---|---|---|
| linear SVM | 75 | 100 | 98.15 | 98.57 | 60.03 | 99.99 |
| LR | 77 | 99.87 | 97.67 | 98.24 | 84.80 | 99.99 |
| RBF SVM | 80 | 99.23 | 97.14 | 97.20 | 75.29 | N/A |
| DLR-b | 88 | 99.95 | 98.26 | 98.55 | 85.54 | 99.99 |
Running time:

| | Splice (1K) | Mush (8K) | w5a (10K) | w8a (50K) | Adult (30K) | kddcup (1.26M) |
|---|---|---|---|---|---|---|
| linear SVM | 0.12 | 0.56 | 1.16 | 15 | 2847 | 81.70 |
| LR | 0.15 | 0.21 | 0.18 | 0.7 | 2.89 | 55.66 |
| RBF SVM | 0.09 | 1.63 | 1.60 | 29 | 217 | N/A |
| DLR-b | 0.22 | 0.32 | 2.65 | 7.6 | 0.6 | 17.93 |
Feature selection: train along with the constraints w_d ≥ 0.
Top features selected by DLR:
- standard deviation of heart rate
- ApEn of heart rate
- energy of oxygen saturation
- LF of oxygen saturation
- LF of heart rate
- DFA of oxygen saturation
- mean of heart rate
- HF of heart rate
- inertia of heart rate
- homogeneity of heart rate
- energy of heart rate
- linear correlation of heart rate and oxygen saturation
Try it out!
http://www.cse.wustl.edu/~wenlinchen/project/DLR/
nonlinearity/randomness
McKinsey Global Institute report: big data talent is scarce.
| | kNN | NB | NN | LR | Linear SVM | Kernel SVM | Random Kitchen Sinks |
|---|---|---|---|---|---|---|---|
| Nonlinear classification ability | Y | N | Y | N | N | Y | Y |
| Interpretability | N | Y | N | Y | Y | N | N |
| Direct support for mixed data types | Y | Y | N | N | N | N | N |
| Efficiency | Y | Y | Y | Y | Y | N | Y |
| Multi-class classification | Y | Y | Y | Y | N | N | N |
Both GNB and LR express P(y | x) as a linear model; GNB learns the weights under the GNB assumption, while LR learns them by maximizing the likelihood of the data.
GNB factorizes the joint distribution P(x, y) = P(y) ∏_d P(x_d | y), while LR directly models the conditional P(y | x).
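To see why both models yield a linear form in x, here is the standard log-odds derivation under the GNB factorization, written with class-shared, per-feature variances for concreteness (that shared-variance assumption and the symbols are added here; the bias w_0 absorbs the prior and the constant terms):

```latex
\log\frac{P(y{=}1\mid x)}{P(y{=}0\mid x)}
  = \log\frac{P(y{=}1)}{P(y{=}0)}
    + \sum_{d=1}^{D}\log\frac{P(x_d\mid y{=}1)}{P(x_d\mid y{=}0)}
  = w_0 + \sum_{d=1}^{D} w_d\, x_d,
\qquad w_d = \frac{\mu_{d,1}-\mu_{d,0}}{\sigma_d^{2}} .
```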