We first sought to find interpretable ML models that can predict seizure events with high sensitivity and accuracy. We chose four models based on (1) high interpretability, (2) computational efficiency, ensuring fast training and testing on large datasets, and (3) availability within the scikit-learn library, which provides a robust and cohesive framework. Specifically, we chose to test the decision tree model (DT), the Gaussian naïve Bayes model (GNB), and the stochastic gradient descent classifier (SGD), which provides a fast optimization method for logistic regression and support vector machine models. Lastly, we also chose the passive aggressive classifier (PAC), an alternative optimization method that can also implement a support vector machine model (for details, see Methods – Model Selection and Training).
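All four classifiers are available off the shelf in scikit-learn. For reference, a minimal sketch of how they can be instantiated is shown below; the hyperparameters are illustrative defaults, not the tuned values used in this study.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import SGDClassifier, PassiveAggressiveClassifier

models = {
    "DT": DecisionTreeClassifier(),
    "GNB": GaussianNB(),
    # hinge loss makes SGD optimize a linear support vector machine;
    # loss="log_loss" would instead fit logistic regression
    "SGD": SGDClassifier(loss="hinge"),
    "PAC": PassiveAggressiveClassifier(),
}
```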
To assess model performance, we used a large electrographic dataset from a well-established mouse model of temporal lobe epilepsy, with recordings obtained from a depth electrode in the ventral hippocampus and an EEG screw placed over the frontal cortex (See Methods – EEG/LFP Dataset for further information). Example seizures and spectral profiles from the training dataset can be viewed in Supplementary Figs. 1 and 2. We split the dataset into training (11 mice, 4224 h, 421 seizures) and testing (15 mice, 5511 h, 608 seizures) sets. Subsequently, features were extracted and six feature sets were created based on each feature's dependence on seizure presence (See Methods – Feature Selection). We then used k-fold (k = 5) validation to train 5 models for each feature set and model type (80% training, 20% validation), both to account for model variability and to obtain a more reliable estimate of model performance. Additionally, we compared model performance across two normalization strategies (normalization per recording file vs normalization of all files together) and three feature types (local features, relative features, or a combination of both) (for more information, see Methods – Feature Extraction and Feature Selection).
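For illustration, a minimal sketch of the k-fold training scheme is shown below; `X` and `y` are hypothetical stand-ins for the extracted feature matrix and ground-truth seizure labels, and GNB is used as the example classifier.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))      # toy feature matrix (10 features per segment)
y = rng.integers(0, 2, size=1000)    # toy seizure labels (1 = seizure)

# k = 5 folds: each model trains on 80% and validates on the remaining 20%
models = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = GaussianNB().fit(X[train_idx], y[train_idx])
    models.append(model)             # one trained model per fold
```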
To quantify the influence of these four parameters (normalization type, feature type, feature-set, and model type) on model performance, we used a four-way ANOVA to calculate their effect sizes. The performance metrics chosen for this analysis were the F1 score and balanced accuracy, as they provide a comprehensive assessment of overall accuracy in imbalanced datasets (See Methods – Model Metrics). Specifically, the F1 score balances precision and recall, focusing on the model's ability to correctly identify seizure segments, whereas balanced accuracy combines recall with specificity, incorporating the detection of both seizure and non-seizure segments.
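Both metrics are available in scikit-learn; a toy example with hypothetical per-segment labels (1 = seizure, 0 = non-seizure) is shown below.

```python
from sklearn.metrics import f1_score, balanced_accuracy_score

y_true = [0, 0, 1, 1, 1, 0, 0, 0]  # ground-truth segment labels
y_pred = [0, 1, 1, 1, 0, 0, 0, 0]  # model predictions

print(f1_score(y_true, y_pred))                 # harmonic mean of precision and recall
print(balanced_accuracy_score(y_true, y_pred))  # mean of recall and specificity
```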
The ANOVA showed that the two most important parameters (excluding interaction effects) were model type and normalization type: model type had the largest effect on F1 score, and normalization type on balanced accuracy (Fig. 2A-B). We therefore examined the effect of normalization type on F1 score and balanced accuracy broken down by model type (Fig. 2C-D). We found that per-file normalization resulted in better F1 and balanced accuracy scores across models compared to all-file normalization (Fig. 2C-D). One small exception was that some GNB models had higher F1 scores with all-file normalization (Fig. 2C); however, those models suffered from low recall and poor seizure detection (data not shown). We therefore focused on the effect of feature type on data with per-file normalization only. We found that feature type (local, relative, local + relative) did not robustly affect F1 and balanced accuracy scores across model types (Fig. 2E-F). Thus, we chose to focus our analysis on local features, as they are simpler and faster to calculate. We then compared six feature sets (top 5, top 10, top 15, with or without the least correlated features) on models trained using per-file normalization and local features. The effects of the six feature sets on F1 score and balanced accuracy were very modest (Fig. 2G-H), as also indicated by the effect sizes from the four-way ANOVA (Fig. 2A-B). Importantly, the GNB model had the highest F1 score, whereas the PAC model had the lowest F1 score across feature-sets.
Fig. 2 Comparing the effect of four parameters on seizure segment detection. A-B) Effect sizes calculated using a four-way ANOVA (four parameters) on A) F1 score and B) Balanced accuracy. Parameters: 1) normalization type (all files normalized together or per-file normalization), 2) model type (DT, GNB, PAC, SGD), 3) feature type (local, relative, or both), 4) feature set (Top 5, Top 10, Top 15, with or without 5 least correlated features). C-D) Effect of normalization type (norm. type) on C) F1 score and D) Balanced accuracy. E-F) Effect of feature type (feat. type) on E) F1 score and F) Balanced accuracy for per-file normalization only. G-H) Effect of feature set (feat. set) on G) F1 score and H) Balanced accuracy for per-file normalization with local feature types only
GNB and SGD Models Reliably Detect Seizure Segments Around the Seizure Center

We then investigated how seizure predictions were localized in time around the seizure center, focusing only on models trained using per-file normalization and local features. Specifically, we compared predicted seizure bins across time against the ground truth data for each model. When plotting seizure predictions against ground truth data, we observed that the pattern of predictions varied across models independent of feature-set (Supplementary Fig. 3). We therefore pooled model predictions across feature-sets and compared predicted seizure bins vs ground truth across models (Fig. 3). GNB and SGD model predictions were restricted around seizure events, indicating accurate seizure prediction (Fig. 3B, D). DT model predictions were also restricted around seizure events but overestimated the seizure termination zone (Fig. 3A).
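One way to build such a peri-seizure profile (as in Fig. 3) is to count predicted seizure bins at each time offset from the annotated seizure centers; a sketch under that assumption follows, where `pred` is a hypothetical binary prediction trace and `centers` holds the bin indices of the seizure centers.

```python
import numpy as np

def peri_seizure_counts(pred, centers, half_window=30):
    """Count predicted seizure bins at each offset from the seizure center."""
    offsets = np.arange(-half_window, half_window + 1)
    counts = np.zeros(offsets.size)
    for c in centers:
        for i, o in enumerate(offsets):
            if 0 <= c + o < len(pred):
                counts[i] += pred[c + o]
    return offsets, counts  # plot counts vs offsets to localize predictions
```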
Fig. 3 Seizure prediction across time vs ground truth data. A-D) Number of ground truth vs predicted seizure bins across time. A) DT, B) GNB, C) PAC, D) SGD across all feature sets together. Only models with per-file normalization and local feature types were compared. The difference between model predictions and ground truth data is plotted on the right. Dotted line indicates the seizure center
Finally, the PAC model predictions were not restricted around seizure periods, indicating that the model was not accurate. This was also reflected in the low F1 score observed when comparing the feature-sets (Fig. 2G). Additionally, expanding our hyperparameter search did not improve PAC model performance (data not shown). On the other hand, SGD models with the same loss function (hinge loss) as the PAC model had superior performance (Supplementary Fig. 4), indicating that the poor performance of the PAC model arises from its optimization algorithm. Therefore, the PAC model was excluded from further analysis.
Simple Post-Processing Methods Robustly Improve Seizure Detection Accuracy

Next, we applied simple post-processing methods to smooth the model predictions, reduce the number of false positives, and facilitate the detection of coherent seizure events (only models trained with per-file normalization and local features were used). Three post-processing methods were selected: Dilation-Erosion (D-E), Erosion-Dilation (E-D), and a moving average coupled with a dual threshold (M-DT) (Fig. 4A). For each of these methods, three parameter values were selected to keep the number of comparisons low (for more details, see Methods – Event Detection). Briefly, increasing the erosion factor makes the D-E and E-D methods more stringent by filtering out shorter segments, while increasing the window size of the M-DT method increases stringency by smoothing predictions over a broader range.
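A minimal sketch of the three methods is given below, assuming binary per-segment predictions in a NumPy array. The structuring-element sizes are illustrative, and the dual-threshold step is implemented here as a simple hysteresis (events seeded above the high threshold and extended down to the low threshold), which may differ in detail from the exact implementation in SeizyML.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion, uniform_filter1d

def dilation_erosion(pred, size=3):
    """D-E: bridge short gaps first, then trim the added borders."""
    return binary_erosion(binary_dilation(pred, np.ones(size)), np.ones(size))

def erosion_dilation(pred, size=3):
    """E-D: remove short segments first (more stringent), then restore length."""
    return binary_dilation(binary_erosion(pred, np.ones(size)), np.ones(size))

def moving_avg_dual_threshold(pred, window=6, high=0.5, low=0.2):
    """M-DT: smooth predictions, then apply a dual (hysteresis) threshold."""
    smooth = uniform_filter1d(pred.astype(float), size=window)
    seeds = smooth >= high          # confident event cores
    candidate = smooth >= low       # permissive event extent
    # grow seeds until they fill their surrounding candidate stretch
    return binary_dilation(seeds, np.ones(3), iterations=-1, mask=candidate)
```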
Fig. 4 Effect of post-processing on seizure segment and seizure event detection. A) Schematic of the three post-processing methods used: 1) Dilation-Erosion (D-E), 2) Erosion-Dilation (E-D), 3) Moving Average and Dual Threshold (M-DT). Effect sizes calculated using a three-way ANOVA (three parameters) on B) F1 score and C) Balanced accuracy. Only models with per-file normalization and local feature types were compared. Parameters: 1) model type (DT, GNB, PAC, SGD), 2) post-processing method (D-E, E-D, M-DT), 3) feature set (Top 5, Top 10, Top 15, with or without 5 least correlated features). Due to the small effect of feature-sets, only one feature set (Top 10) was used for the remaining plots for simplicity (D-I). D-I) Effect of post-processing on D) Balanced accuracy, E) Recall, F) Percent seizures detected, G) F1 score, H) Precision, I) False detection rate. Colored lines represent model types: DT-grey, GNB-blue, SGD-green. Solid lines represent model performance. Dotted lines indicate the performance of each model with no post-processing method
We used a three-way ANOVA to examine the effect sizes of model type, post-processing method, and feature-set. We found that the model type had the largest effect size on F1 score followed by post-processing method (Fig. 4B). Moreover, the post-processing method had the most substantial effect size on balanced accuracy while the contributions of other factors were relatively modest (Fig. 4C). Overall, feature-sets had minimal impact on both F1 and balanced accuracy scores, so we chose one representative feature-set (Top 10) to perform the remaining analyses.
We observed a robust increase in balanced accuracy when the M-DT method was applied compared to raw predictions (None), whereas the D-E and E-D methods had modest effects on balanced accuracy (Fig. 4D). The increase in balanced accuracy resulted from improved recall (Fig. 4E). This increase in recall did not reflect an increase in the number of seizure events detected (Fig. 4F) but rather resulted from detecting more seizure segments within each seizure event. Importantly, the SGD and GNB models detected 100% of seizures across all post-processing methods except the most stringent M-DT method (M-DT (10)) (Fig. 4F). In contrast, most post-processing methods decreased the number of seizure events detected by the DT model (Fig. 4F). More stringent methods such as E-D (2–4) and M-DT (8 & 10) resulted in the largest drops in seizure detection (Fig. 4F).
Additionally, we observed that post-processing methods increased the F1 score compared to raw predictions (Fig. 4G). The E-D and M-DT methods resulted in higher F1 scores than the D-E method (Fig. 4G). The increase in F1 score primarily resulted from enhanced model precision, as illustrated in Fig. 4H. Importantly, the seizure false detection rate decreased dramatically across post-processing methods compared to raw predictions (Fig. 4I). The M-DT methods, together with E-D (2–4), had the lowest false detection rates. Crucially, the GNB model had the best F1 score with a 100% seizure detection rate.
Given the above findings, we chose the M-DT (6) post-processing method for our subsequent analysis (only models trained with per-file normalization and local features were used) and as the default option in SeizyML for the following reasons: 1) it provides a better estimation of the seizure boundaries than the other methods (E-D, D-E), 2) it allows detection of 100% of seizures in SGD and GNB models, and 3) it substantially reduces the false detection rate. In SeizyML, the post-processing method and its parameters can be changed by the user to adjust stringency.
The GNB Model is Robust to Misclassification, Requiring Only a Small Training Dataset

Next, we investigated how well the models could train on smaller training datasets (1%, 2%, 5%, 10%, 25%, 50%, 75%, and 100% of the full training dataset), which would benefit practical applications of SeizyML. We found that models performed surprisingly well with just 1% of the training data (~ 55 h, Fig. 5A). This amount is less than three days of continuous 24/7 EEG recording, the gold-standard technique for capturing seizures in the field. The GNB model had the best performance at the 1% training size, as it detected 100% of seizures (as opposed to SGD, which required 10% of the training data to detect all seizures) and had the highest F1 score, highest balanced accuracy, and lowest false detection rate (Fig. 5A).
Fig. 5 Model robustness across training size and label permutations. A) Effect of training size on model performance, B) Effect of label permutation (shuffling) on model performance, C) Effect of label permutation on models trained on 1% of the full training dataset. Only models with per-file normalization, local feature types, and the top 10 feature set were used. Colored lines represent model types: DT-grey, GNB-blue, SGD-green. Solid lines represent model performance. Dotted lines indicate the performance of each model at baseline (1st point on graph)
To examine the robustness of the models to misclassifications, we first trained them on the full dataset while progressively shuffling the ground truth labels (0%, 1%, 5%, 10%, 20%, 50%, 100%). Surprisingly, we found that models maintained good performance until 50% of the training labels were shuffled (Fig. 5B), indicating that interpretable models with simple features are robust to misclassifications.
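The shuffling procedure amounts to permuting a chosen fraction of the training labels before fitting; a minimal sketch is shown below.

```python
import numpy as np

def shuffle_labels(y, fraction, seed=0):
    """Return a copy of y with `fraction` of its entries randomly permuted."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y).copy()
    idx = rng.choice(len(y), size=int(fraction * len(y)), replace=False)
    y[idx] = rng.permutation(y[idx])  # permute only the selected subset
    return y
```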
Lastly, we evaluated whether models trained on a small dataset were also robust to misclassifications. We trained models on 1% of the training dataset and increasingly shuffled the ground truth labels. Given the results observed when shuffling the full training dataset, we chose more evenly distributed shuffling percentages: 0%, 5%, 10%, 20%, 40%, 60%, 80%, and 100%. The GNB model detected 100% of seizures, with high balanced accuracy and a low false detection rate, up to 80% shuffled labels (Fig. 5C). In contrast, the DT and SGD models exhibited progressively worsening performance, as reflected by an increasing false detection rate and decreasing balanced accuracy (Fig. 5C). Overall, these results indicate that the GNB model is highly robust to misclassification and can effectively train on small datasets.
Feature Contributions Vary Significantly Between Model Types

To further understand how these models classified EEG segments, we extracted metrics that quantify the influence of each feature on model predictions (See Methods – Feature Contributions; DT: feature importance, SGD: feature weight, GNB: feature separation score) in models trained on 1% of the training dataset. This analysis revealed that in the DT model, line length vHPC emerged as the most significant feature (Fig. 6A). In contrast, the SGD model had more balanced weights across features, with line length vHPC, line length FC, and beta power FC being the three most influential features (Fig. 6C). Lastly, the GNB model does not have a built-in metric for feature importance; we therefore calculated a feature separation score based on the distribution of each feature in the trained GNB models (See Methods – Feature Contributions). This analysis indicated that the GNB model had the most uniform feature contributions of the three models (Fig. 6A-C), with line length vHPC having the highest separation score. To examine whether the reduced performance of the DT model arose from its heavy reliance on line length vHPC, we trained a GNB model with only that one feature. We found that this GNB model had a high F1 score, with predictions restricted around seizure events (Supplementary Fig. 5). In contrast, the only way we could restrict the predictions of the DT model was to limit the tree depth to 1 (data not shown), indicating that the DT model was prone to overfitting and struggled to generalize.
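The DT and SGD contribution metrics can be read directly from fitted scikit-learn models; the GNB separation score sketched below is only an illustrative guess (class-mean distance scaled by the pooled spread), since the exact formula is described in the Methods rather than built into scikit-learn. All three metrics are normalized to sum to 1, as in Fig. 6.

```python
import numpy as np

def feature_contributions(dt, gnb, sgd):
    """Extract per-feature contribution metrics from fitted binary classifiers."""
    dt_imp = dt.feature_importances_                    # impurity-based importance
    sgd_w = np.abs(sgd.coef_).ravel()                   # magnitude of linear weights
    mu0, mu1 = gnb.theta_                               # per-class feature means
    var0, var1 = gnb.var_                               # per-class feature variances
    gnb_sep = np.abs(mu1 - mu0) / np.sqrt(var0 + var1)  # assumed separation score
    return (dt_imp / dt_imp.sum(),
            gnb_sep / gnb_sep.sum(),
            sgd_w / sgd_w.sum())
```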
Fig. 6 Feature importance for classification. A-C) Feature contribution metrics for models trained on 1% of the full training dataset. D-F) Heatmaps showing the difference in feature contribution metrics from 0% label randomization. A, D) DT: feature importance, B, E) GNB: feature separation score, C, F) SGD: feature weight. Each feature metric was normalized to sum to 1. Only models with per-file normalization, local feature types, and the top 10 feature set were used
We also examined how shuffled labels affected feature contributions (Fig. 6D-F). In the DT model, the contribution of line length vHPC decreased progressively as the shuffling percentage increased, disappearing entirely at 100% shuffling (Fig. 6D). The effects on the GNB and SGD models were more widespread, with feature contributions at 100% label randomization being completely different from those at 0% and even 80% shuffling (Fig. 6E-F).
In summary, this analysis shows that the DT model relied heavily on one feature, whereas the GNB and SGD models had more uniform feature contributions. The shuffling experiments further support the robustness of the GNB model and its features to label randomization and misclassification. These findings emphasize the importance of interpretable machine learning approaches and highlight the valuable insights they can provide into model behavior and the reliability of model predictions.
Z-score Normalization Per-File Leads to Higher Seizure Detection Sensitivity

Finally, we examined the effects of normalization on GNB models, as normalization was a key factor in overall model performance (Fig. 2A-D). Specifically, we tested four different normalization methods (Z-score: StandardScaler, Min–Max: MinMaxScaler, Percentile: RobustScaler, Gaussian: PowerTransformer from the scikit-learn library) using either the per-file or all-file normalization strategy across our full mouse dataset. Per-file normalization consistently outperformed all-file normalization, leading to a higher percentage of detected seizures (Fig. 7A), improved F1 scores (Fig. 7B), and a lower false detection rate (Fig. 7C). This improvement was likely due to the increased feature amplitude during seizures with per-file normalization (Fig. 7D). Among the per-file methods, Z-score normalization yielded the highest seizure detection rate (100%), followed closely by Percentile (99.18%) and Gaussian (97.43%), while Min–Max performed worst (78.45%).
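The distinction between the two strategies reduces to whether the scaler is fit on each recording separately or on the pooled data; a brief sketch with scikit-learn's StandardScaler (Z-score) and toy per-file feature matrices follows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
files = [rng.normal(scale=s, size=(100, 10)) for s in (1.0, 3.0)]  # toy recordings

# per-file: each recording is scaled by its own mean and standard deviation
per_file = [StandardScaler().fit_transform(f) for f in files]

# all-file: a single scaler is fit on the concatenated recordings
all_file = StandardScaler().fit_transform(np.vstack(files))
```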
Fig. 7 Comparison of GNB model performance across normalization strategies. (A-D) Inter-subject classification in the mouse dataset: A) Percent seizures detected, B) F1 score, C) False detection rate. D) Line length vHPC amplitude during seizures is larger with per-file normalization compared to non-seizure periods. Arrows indicate line length peaks: blue for per-file normalization and dark grey for all-file normalization. (E-G) Model metrics for inter-subject classification in the CHB-MIT dataset: E) Percent seizures detected, F) F1 score, G) False detection rate. H) Mean amplitude during seizures is larger with per-file normalization compared to non-seizure periods. Arrows indicate line length peaks: blue for per-file normalization and dark grey for all-file normalization
To further test the validity of these normalization methods on human EEG recordings, we utilized the Children's Hospital Boston–MIT (CHB-MIT) dataset (Shoeb & Guttag, 2010), as it has been extensively used for benchmarking ML models (Siddiqui et al., 2020). We refined the dataset to include only seizures with reliable EEG profile changes (See Methods – Human Dataset). We found that per-file normalization outperformed all-file normalization, leading to improved seizure detection (Fig. 7E), higher F1 scores (Fig. 7F), lower false detection rates (Fig. 7G), and greater feature amplitude during seizures (Fig. 7H). Among the per-file methods, Z-score, Gaussian, and Min–Max normalization performed similarly (92–96% seizure detection), while Percentile normalization had the lowest performance (69.14%).
Thus, across both mouse and human datasets, per-file normalization consistently yielded superior results. Z-score and Gaussian normalization were the most reliable methods. Given the computational inefficiency of Gaussian normalization (PowerTransformer), we have implemented Z-score per-file normalization in SeizyML.