Pose-based tremor type and level analysis for Parkinson’s disease from video

To assess the efficacy of our proposed method, we validated it on two evaluation tasks: PT classification and tremor rating estimation. We carried out our experiments on an Ubuntu 18.04 PC with an NVIDIA 3080 GPU. GPU memory usage during training was minimal, averaging 1.46 GB. Training on the TIM-TREMOR dataset took approximately ten hours for the PT classification task and twelve hours for the tremor rating estimation task; these times include EVM preprocessing and the extraction of human pose features from the RGB videos. Regarding real-time application, PT classification or tremor rating estimation of a 33-s video with 1000 frames takes only around 48 s each, which is feasible for interactive diagnosis.

The dataset

We test our system on the TIM-TREMOR dataset [24], an open dataset consisting of 910 videos of 55 individuals performing 21 tasks. The videos are 18–112 s long. There are 572 videos depicting various forms of tremor, including 105 for parkinsonian tremor (PT), 182 for essential tremor (ET), 88 for functional tremor (FT) and 197 for dystonic tremor (DT). An additional 60 videos (NT) were recorded without convincing tremor during the assessment. The remaining 278 videos have inconclusive tremor classification results and are labeled as “Other.” For the tremor rating labels, eight levels from 0 to 7 are assigned to the individual’s left and right hands, evaluated using the Bain and Findley Tremor Clinical Rating Scale [25]. To ensure that there is only one label per video while preserving the characteristics of the video, we combine the labels of the left and right hands by taking the maximum of the two.
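For concreteness, the following minimal sketch illustrates this label combination step. The dictionary-based layout and variable names are our own illustrative assumptions, not the TIM-TREMOR annotation format.

```python
# Minimal sketch of the per-video label combination described above.
# `left_ratings` and `right_ratings` are hypothetical dicts mapping a
# video ID to the Bain-Findley rating (0-7) of each hand.
def combine_hand_ratings(left_ratings, right_ratings):
    """Keep one label per video by taking the maximum rating of both hands."""
    return {vid: max(left_ratings[vid], right_ratings[vid])
            for vid in left_ratings}

# Example: a video rated 2 on the left hand and 4 on the right hand
# receives the single video-level label 4.
print(combine_hand_ratings({"v001": 2}, {"v001": 4}))  # {'v001': 4}
```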

Setup

We eliminate inconsistent videos to minimize data noise, specifically videos that only capture motion tasks for a limited number of participants. For the tremor-type classification task only, we remove the videos with uncertain tremor-type labels of “other.” Next, each video is clipped into 100-frame samples, and the number of samples is determined by the duration of the consecutive frames in which the participant was visible and not obscured. Each sample was assigned the label of the original video and treated as an individual sample. We use a voting system to obtain the video-level classification results, which increases the system’s robustness and augments the training sample size [26]. We evaluate our proposed system through individual-based leave-one-out cross-validation. It means each subclip for a single individual is used for testing and excluded from the training set for each iteration. The subclips for each individual are never separated by the training or testing set. The total number of leave-one-out cross-validations are 39 and 55 for tremor-type classification and tremor rating estimation, respectively.
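The sketch below illustrates the clipping and voting procedure described above. The array layout (frames × joints × coordinates) and function names are illustrative assumptions rather than the released implementation.

```python
import numpy as np

def clip_to_subclips(pose_seq, clip_len=100):
    """Split a (T, J, C) pose sequence into non-overlapping 100-frame samples.

    Frames in which the participant is occluded are assumed to have been
    removed beforehand, so only consecutive visible frames are clipped.
    """
    n_clips = len(pose_seq) // clip_len
    return [pose_seq[i * clip_len:(i + 1) * clip_len] for i in range(n_clips)]

def video_level_prediction(subclip_predictions):
    """Majority vote over the per-subclip class predictions of one video."""
    values, counts = np.unique(subclip_predictions, return_counts=True)
    return values[np.argmax(counts)]

# Example: three subclips predicted as PT, PT, ET -> the video is labeled PT.
print(video_level_prediction(["PT", "PT", "ET"]))  # PT
```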

Evaluation metrics

We report the mean values calculated across all leave-one-out cross-validation folds with the following metrics: accuracy (AC), sensitivity (SE), specificity (SP) and F1-score for the binary classification; AC, macro-average F1-score, SE and SP for the multiclass classification.
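The exact metric implementation is not specified here; the sketch below shows one common way to compute these quantities for the multiclass case, with macro-average sensitivity taken as per-class recall and macro-average specificity computed one-vs-rest from the confusion matrix, using scikit-learn where possible.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, recall_score

def macro_specificity(y_true, y_pred, labels):
    """Macro-average specificity: TN / (TN + FP), computed one-vs-rest per class."""
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    specs = []
    for k in range(len(labels)):
        tp = cm[k, k]
        fp = cm[:, k].sum() - tp
        fn = cm[k, :].sum() - tp
        tn = cm.sum() - tp - fp - fn
        specs.append(tn / (tn + fp))
    return float(np.mean(specs))

y_true = [0, 1, 2, 2, 1]   # toy ground-truth class labels
y_pred = [0, 1, 2, 1, 1]   # toy predictions
print(accuracy_score(y_true, y_pred))                        # AC
print(recall_score(y_true, y_pred, average="macro"))         # macro SE
print(macro_specificity(y_true, y_pred, labels=[0, 1, 2]))   # macro SP
print(f1_score(y_true, y_pred, average="macro"))             # macro F1
```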

Table 1 Comparisons on the tremor-type classification task

Tremor-type classification

For this experiment, we first evaluate our system on the binary classification task of distinguishing PT from non-PT labels, achieving 91.3% accuracy and an 80.0% F1-score. In addition, we validate our method on a more complex multiclass classification task of classifying five types of tremor (PT, ET, DT, FT and NT). Our final system’s per-class multiclass tremor-type classification performance is shown in Fig. 3. It shows fairly balanced performance in classifying PT, ET, DT and NT, while FT has a lower SE and F1-score, which may be caused by this class having the fewest samples. The corresponding confusion matrices of the two tasks are displayed in Fig. 4.

Fig. 3 Per-class multiclass tremor-type classification results

Fig. 4 Confusion matrices for PT classification: (Left) binary; (Right) multiclass

Comparison with baseline methods

As this paper is the first work to provide individual-level evaluation results, we implemented the following video-based PT classification baselines to evaluate the effectiveness of our system: (i) ST-GCN [18], a spatial–temporal GCN for human pose data classification; (ii) a CNN with 1D convolutional layers (CNN-Conv1D) [27]; (iii) a decision tree (DT); and (iv) a support vector machine (SVM) [27]. Note that all baseline methods use the same EVM and pose extraction design. The results of our proposed SPA-PTA and the baselines are summarized in Table 1.

The binary classification results show that our full system consistently outperforms all other methods on all evaluation metrics. Our AC, SE, SP and F1 all exceed 80% under leave-one-out cross-validation, demonstrating the effectiveness and stability of our system on this binary classification task. Notably, our system performs better with only spatial convolution than with a deeper spatial–temporal convolution design such as ST-GCN [18]. This outcome supports the claim that the proposed PCSF block effectively enhances classification reliability and reduces the risk of overfitting on small datasets.

While the full system was initially designed for binary classification, it demonstrates effectiveness and generalizability on the multiclass classification task, surpassing the existing methods. The small differences between AC, SE and SP show that our system is consistent and effective at identifying positive samples and excluding negative ones. The high macro-average SP indicates that the system reliably recognizes individuals with a specific type of tremor without wrongly assigning them to other tremor types.

Fig. 5 (a) Average skeleton joint attention across all cross-validations in the PT classification experiment. (b) Attention visualization at a (b\(_1\)) successfully classified frame and a (b\(_2\)) unsuccessfully classified frame. The joint labels in (b) correspond to (a)

Ablation studies

We conduct an ablation analysis to assess the effectiveness of the EVM, the PCSF block and the entire attention module. In the lower part of Table 1, the positive effect of the PCSF block and the attention module is illustrated by the drop in metrics when either the PCSF block or the entire attention module is removed in the two classification tasks. We also find that the basic GNN architecture without attention outperforms the CNN-Conv1D model on both classification tasks, which highlights the efficacy of learning human pose features in the graph domain rather than the Euclidean domain. In addition, the variant “ours without attention” performs slightly better than “ours without attention and EVM preprocessing,” indicating that EVM effectively enhances the tremors.
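To make the architectural distinction concrete, the snippet below sketches, in PyTorch, a generic spatial graph convolution over skeleton joints followed by a learnable per-joint attention weighting. It only illustrates the general idea of spatial-only graph convolution with joint attention; it is not the actual PCSF block or attention module of SPA-PTA, whose exact design is not reproduced here, and the joint count and adjacency are placeholders.

```python
import torch
import torch.nn as nn

class SpatialGraphConvWithJointAttention(nn.Module):
    """Generic sketch: spatial graph convolution plus per-joint attention.

    This is NOT the paper's PCSF block; it only illustrates weighting
    skeleton joints after a spatial (non-temporal) graph convolution.
    """

    def __init__(self, in_channels, out_channels, adjacency):
        super().__init__()
        self.register_buffer("A", adjacency)          # (J, J) normalized adjacency
        self.fc = nn.Linear(in_channels, out_channels)
        self.joint_logits = nn.Parameter(torch.zeros(adjacency.shape[0]))

    def forward(self, x):                              # x: (batch, frames, joints, channels)
        x = torch.einsum("ij,btjc->btic", self.A, x)   # aggregate neighboring joints
        x = torch.relu(self.fc(x))                     # per-joint feature projection
        attn = torch.softmax(self.joint_logits, dim=0) # one attention weight per joint
        return x * attn.view(1, 1, -1, 1), attn

# Example with 18 joints and 2-D coordinates; identity adjacency as a placeholder.
A = torch.eye(18)
layer = SpatialGraphConvWithJointAttention(2, 64, A)
out, attn = layer(torch.randn(4, 100, 18, 2))
print(out.shape, attn.shape)  # torch.Size([4, 100, 18, 64]) torch.Size([18])
```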

Model interpretation

We present a visualization of the average attention value of each body keypoint in Fig. 5a, which we interpret as the importance our system assigns to that keypoint during classification. Our analysis reveals that the attention values are highest on the “Right Wrist” and “Left Wrist,” suggesting that our system prioritizes the wrists’ movements during task performance. Furthermore, the value associated with the “Neck” is significantly lower than those of the other keypoints. This may be explained by the fact that the participants remained seated during the video recording, resulting in minimal global variance of the neck joint throughout the experiment.
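A minimal sketch of how such per-joint attention values could be averaged across cross-validation folds and plotted as in Fig. 5a; the joint names and the random placeholder weights below are purely illustrative and do not correspond to the reported results.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical joint names and per-fold attention weights (n_folds x n_joints),
# e.g. collected from the attention layer of each leave-one-out fold.
joints = ["Nose", "Neck", "RShoulder", "RElbow", "RWrist",
          "LShoulder", "LElbow", "LWrist"]
per_fold_attention = np.random.rand(39, len(joints))   # placeholder values only

avg_attention = per_fold_attention.mean(axis=0)        # average across the 39 folds
plt.bar(joints, avg_attention)
plt.ylabel("Average attention weight")
plt.title("Per-joint attention averaged over cross-validation folds")
plt.tight_layout()
plt.show()
```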

Tremor rating estimation

For this experiment, we train SPA-PTA with the tremor rating labels without further architectural changes (e.g., converting the classification layer to a regression layer) to validate our system’s performance on the tremor rating estimation task. Since data with tremor ratings of 4 and above are insufficient for training via leave-one-out cross-validation (i.e., only 5 of the 55 individuals), we validate our system under two classification settings: (1) classifying ratings [1, 2, 3] and (2) classifying ratings [1, 2, 3+]. The latter is generally the more challenging task, since the imbalanced data of the “3+” rating introduces bias not present in the former, which does not contain such data (Figs. 6, 7).
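As an illustration of the two label groupings, the sketch below maps a per-video rating to the class labels of each setting. The handling of rating 0 and of ratings above 3 in the [1, 2, 3] setting is our assumption, since the exact filtering rule is not detailed above.

```python
def group_rating(rating, keep_three_plus=True):
    """Map a per-video Bain-Findley rating to the class labels used in each setting.

    With keep_three_plus=True, ratings >= 3 are merged into a single "3+" class
    (the [1, 2, 3+] setting); otherwise only ratings 1-3 are kept and higher
    ratings are dropped (the [1, 2, 3] setting). Returning None marks a video
    as excluded -- an assumption, not a rule stated in the text.
    """
    if rating < 1:
        return None
    if rating >= 3:
        return "3+" if keep_three_plus else (3 if rating == 3 else None)
    return rating

print([group_rating(r) for r in [1, 2, 3, 5]])                          # [1, 2, '3+', '3+']
print([group_rating(r, keep_three_plus=False) for r in [1, 2, 3, 5]])   # [1, 2, 3, None]
```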

Fig. 6 Confusion matrices for tremor rating estimation: (Left) [1, 2, 3+]; (Right) [1, 2, 3]

Fig. 7 Per-class tremor rating estimation results

Table 2 Comparisons on the tremor rating task

Comparison with baseline methods

We compare our SPA-PTA to the same baselines as in the tremor-type classification task, as shown in Table 2. SPA-PTA significantly outperforms the baselines, achieving a macro-average AC of 76.4%, SE of 77.3%, SP of 91.6% and F1-score of 76.7%. An interesting finding is that the machine learning-based decision tree achieves performance similar to the two deep learning-based baselines (i.e., ST-GCN and CNN-Conv1D), which highlights the challenge of improving deep learning models on a relatively small dataset. In addition, although our current model does not show strong robustness in the tremor rating estimation task, the ablation studies in the “Ours” rows of Table 2 still demonstrate the effectiveness of our PCSF layer and attention mechanism design. This indicates the potential to improve the model and system performance with a more task-specific architecture design and a more extensive dataset.

Ablation studies

Consistent results at the bottom of Table 2, obtained with the same ablation design as for the PT classification task, validate the effectiveness of each system component.

Model interpretation

We similarly visualize the average skeleton joint attention across all cross-validation sets in Fig. 8. The two data grouping approaches yield similar attention results, although the weights obtained with the [1, 2, 3] grouping attribute slightly more importance to the “Right Wrist” and “Left Wrist.” This may be due to the larger proportion of low tremor rating videos under this grouping compared to [1, 2, 3+]. In addition, we notice that the attention weight distribution of the tremor rating estimation task is similar to that of the PT classification task, although the former concentrates more attention on the “Right Wrist” and “Left Wrist” than on other joints.

Fig. 8 Average skeleton joint attention across all cross-validations in the tremor rating estimation task

Fig. 9 Estimated pose comparison between AlphaPose and OpenPose for a sitting and resting PD patient with clinically identified PT on the left side of the body. (a)–(c) are the estimated poses of an example video from AlphaPose, OpenPose and both, respectively. Each colored line with 0.05 transparency represents the connection between joints estimated in each frame. Numbers 1 to 5 mark the local scaling of specific joints for intuitive comparison. The raw video frames are shown in Fig. 10

Pose estimation evaluation

To evaluate the effectiveness of AlphaPose and quantify the pose estimation error, we conduct the following experiments:

Quantitative comparison with ground truth data

To quantify the pose estimation error of different methods, we employ the Lagrangian hand-tremor frequency estimation method [24] and compare the mean absolute error (MAE) of the hand-tremor frequencies estimated from AlphaPose features and conventional OpenPose features [11] against the ground-truth (GT) frequency obtained from accelerometer data. As suggested in [24], a tremor frequency computed from reliably estimated pose features should be close to (ideally within 1 Hz of) the GT accelerometer frequency. The MAE values in Table 3 indicate that AlphaPose consistently outperforms OpenPose on all listed tasks.
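The sketch below conveys the spirit of this evaluation with a simplified FFT-based dominant-frequency estimator and an MAE computation. The actual study uses the Lagrangian hand-tremor frequency estimation method of [24]; the 30-fps sampling rate and the synthetic signal here are assumptions for illustration only.

```python
import numpy as np

def dominant_frequency(signal, fps=30.0):
    """Estimate the dominant tremor frequency (Hz) of a 1-D wrist trajectory.

    Simplified FFT-based stand-in for the method of [24]; `fps` is assumed.
    """
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()                 # remove the DC component
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
    return freqs[np.argmax(spectrum[1:]) + 1]       # skip the 0 Hz bin

def frequency_mae(estimated, ground_truth):
    """Mean absolute error between estimated and accelerometer GT frequencies."""
    return float(np.mean(np.abs(np.asarray(estimated) - np.asarray(ground_truth))))

# Example: a synthetic 5 Hz tremor sampled at 30 fps for 100 frames.
t = np.arange(100) / 30.0
wrist_y = np.sin(2 * np.pi * 5.0 * t)
print(dominant_frequency(wrist_y))            # ~5 Hz
print(frequency_mae([5.1, 4.2], [5.0, 4.0]))  # ~0.15 Hz
```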

Table 3 MAE comparison between AlphaPose and OpenPose features on the top-10 best-performing tasks

Qualitative pose visualization and comparison

The visualizations in Fig. 9 and the reference video frames in Fig. 10 show that AlphaPose outperforms OpenPose in estimating joint positions. This is supported by the smoother trajectory lines of AlphaPose, depicted by the transparent colored lines. The numbered regions 1–5 in Fig. 9 demonstrate AlphaPose’s ability to track joint movement fluidly. Specifically, in region 5, AlphaPose produces a hand trajectory that aligns more closely with the anticipated tremor pattern, in contrast with OpenPose’s intermittent, jumping trajectory. This consistency suggests that AlphaPose may be more reliable for tasks related to PT classification. Furthermore, on the patient’s right side, particularly in regions 1 and 2, AlphaPose yields more consistent and stable outcomes, reflecting the patient’s condition of resting with observable tremor only in the left hand, as corroborated by Fig. 10. Finally, OpenPose estimates the neck joint position as the mean point of both shoulders, which is less accurate than the neck joint position estimated by AlphaPose [12].
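The overlay style of Fig. 9 can, in principle, be reproduced by drawing every per-frame skeleton segment with a low alpha value so that overlapping segments accumulate into visible trajectories. The sketch below is a generic matplotlib illustration with placeholder joint positions and edges; it is not the code used to generate the figure.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical (T, J, 2) array of per-frame 2-D joint positions from one pose
# estimator, plus skeleton edges as (joint_a, joint_b) index pairs (placeholders).
T, J = 300, 18
pose_xy = np.random.rand(T, J, 2)
edges = [(0, 1), (1, 2), (2, 3)]

for frame in pose_xy:
    for a, b in edges:
        # One semi-transparent segment per frame; overlaps build up trajectories.
        plt.plot([frame[a, 0], frame[b, 0]],
                 [frame[a, 1], frame[b, 1]],
                 color="tab:blue", alpha=0.05)
plt.gca().invert_yaxis()          # image coordinates: origin at the top left
plt.title("Per-frame skeleton overlay (alpha = 0.05)")
plt.show()
```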

Fig. 10 Raw video frames referenced in Fig. 9, consisting of consecutive images captured at intervals of 5 frames (approximately every 0.167 s). The lower right image is an aggregation of five transparent hand images, where the green dot shows the estimated trajectory of the left wrist joint during tremor

Classification performance comparison

We compare the effectiveness of AlphaPose and OpenPose by evaluating their impact on the system’s classification performance. Table 4 shows that using AlphaPose features yields a consistent improvement of approximately \(1-3\%\) over OpenPose features across the classification tasks, except for the binary tremor-type classification. These results highlight the precision of AlphaPose in delivering better pose-based features for classification tasks.

Table 4 Comparison of the influence of AlphaPose and OpenPose features on classification performance

In this study, we use the pre-trained AlphaPose model and opt not to retrain it due to the absence of GT 2D pose annotations in our dataset. The robust generalization capability of the pre-trained AlphaPose model, evidenced by its strong performance across multiple diverse and complex benchmark datasets [12], affirms its suitability for our task. In the future, we are interested in comparing the performance of pre-trained and tremor-specific pose estimation models, which will entail collecting the GT data needed to train a model adept at detecting the subtle nuances characteristic of tremor movement patterns.
