Evaluation of prediction and classification performances in different machine learning models for patient‐specific quality assurance of head‐and‐neck VMAT plans

Purpose

The purpose of this study is to evaluate how well different machine learning models predict and classify the gamma passing rate (GPR), and to select the best model for achieving machine learning-based patient-specific quality assurance (PSQA).

Methods

Measurement-based verification of 356 head-and-neck volumetric modulated arc therapy plans was performed using a diode array phantom (Delta4 Phantom), and GPR values at 2%/2 mm with global normalization and 3%/2 mm with local normalization were calculated. Machine learning models, including ridge regression (RIDGE), random forest (RF), support vector regression (SVR), and stacked generalization (STACKING), were used to predict the GPR. Each machine learning model was trained using 260 plans, and the prediction accuracy was evaluated using the remaining 96 plans. The prediction error between the measured and predicted GPR was evaluated. For the classification evaluation, the lower control limit for the measured GPR and the lower control limit for the predicted GPR (LCLp) were defined to identify whether the GPR values represent a “pass” or a “fail.” LCLp values with 99% and 99.9% confidence levels were calculated as the upper prediction limits for the GPR estimated from the linear regression between the measured and predicted GPR.
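The LCLp construction described above can be sketched with the textbook one-sided prediction limit for simple linear regression. This is a minimal illustration, not the study's implementation: the abstract does not state which variable is regressed on which, so the sketch assumes the predicted GPR (y) is regressed on the measured GPR (x) and that the limit is evaluated at the measured-GPR control limit (x0). The function name and the `t_crit` parameter are hypothetical; `t_crit` is the one-sided Student's t critical value for n - 2 degrees of freedom at the chosen confidence level, supplied by the caller.

```python
import math


def upper_prediction_limit(x, y, x0, t_crit):
    """Upper one-sided prediction limit for a new y at x0.

    Fits y = intercept + slope * x by ordinary least squares, then adds
    t_crit * s * sqrt(1 + 1/n + (x0 - mean_x)^2 / Sxx) to the fitted value.
    Assumption: x is measured GPR, y is predicted GPR, and t_crit is the
    one-sided Student's t critical value for n - 2 degrees of freedom.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")

    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual standard error with n - 2 degrees of freedom.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))

    half_width = t_crit * s * math.sqrt(1 + 1 / n + (x0 - mean_x) ** 2 / sxx)
    return intercept + slope * x0 + half_width
```

Under these assumptions, a larger critical value (for example, moving from a 99% to a 99.9% confidence level) widens the margin and raises the resulting limit, so more plans fall below it and are flagged.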

Results

The models tended to overestimate low measured GPR values. The maximum prediction errors for RIDGE, RF, SVR, and STACKING were 3.2%, 2.9%, 2.3%, and 2.2% at global 2%/2 mm and 6.3%, 6.6%, 6.1%, and 5.5% at local 3%/2 mm, respectively. At global 2%/2 mm, the sensitivity was 100% for all the machine learning models except RIDGE when using the 99% LCLp. The specificity was 76.1% for RIDGE, RF, and SVR and 66.3% for STACKING; however, the specificity decreased dramatically when the 99.9% LCLp was used. At local 3%/2 mm, by contrast, only STACKING showed 100% sensitivity when using the 99% LCLp. The decrease in the specificity using the 99.9% LCLp was smaller than that at global 2%/2 mm, and the specificity for RIDGE, RF, SVR, and STACKING was 61.3%, 61.3%, 72.0%, and 66.8%, respectively.
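As a rough illustration of how sensitivity and specificity of this kind can be computed, the sketch below assumes that a plan truly fails when its measured GPR is below the measured-GPR control limit, and that it is flagged as a fail when its predicted GPR is below LCLp. The function name, argument names, and the numbers in the usage note are hypothetical and are not taken from the study.

```python
def classification_performance(measured, predicted, lcl, lcl_p):
    """Return (sensitivity, specificity) for pass/fail classification.

    Assumptions: "fail" is the positive class; a plan truly fails when
    measured < lcl and is flagged as a fail when predicted < lcl_p.
    """
    tp = fn = tn = fp = 0
    for m, p in zip(measured, predicted):
        truly_fail = m < lcl
        flagged = p < lcl_p
        if truly_fail and flagged:
            tp += 1
        elif truly_fail:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1

    # Guard against empty classes rather than dividing by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity
```

With made-up values `measured = [88.0, 91.0, 95.0, 97.0, 99.0]`, `predicted = [90.5, 93.5, 94.0, 97.5, 98.0]`, and `lcl = 92.0`, raising `lcl_p` from 92.0 to 94.5 catches both failing plans at the cost of flagging one passing plan, which is the kind of sensitivity and specificity trade-off reported above.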

Conclusions

STACKING had better prediction accuracy for low GPR values than the other machine learning models. Applying LCLp to a regression model enabled the consistent evaluation of quantitative and qualitative GPR predictions. Adjusting the confidence level of the LCLp helped improve the balance between sensitivity and specificity. We suggest that STACKING can assist the safe and efficient operation of PSQA.

This article is protected by copyright. All rights reserved
