Automated cervical vertebral maturation staging using deep learning: Enhancing accuracy through random oversampling and memory optimization

INTRODUCTION

Accurate assessment of skeletal maturity is crucial for successful orthodontic treatment.[1,2] Precisely determining the timing of accelerated growth and skeletal development is essential for optimizing treatment outcomes and minimizing the need for complex surgical interventions.[3-5] Beyond orthodontics, bone age assessment is valuable in pediatric and forensic medicine, for example in determining developmental stage, predicting final height, and estimating legal age when identification is missing.[6,7] This information aids in diagnosing growth disorders, planning treatments, and forensic investigations.[8]

Radiographic analysis is commonly employed to assess skeletal maturation, pubertal development, and growth potential.[9,10] Conventionally, hand-wrist radiographs have served as the gold standard for evaluating skeletal age, offering a standardized method for comparison.[6,11,12] However, this technique necessitates specialized interpretation skills and exposes patients to ionizing radiation.[10,11,13,14] Alternatively, cervical vertebral maturation (CVM) evaluates skeletal maturity by analyzing morphological changes in cervical vertebrae on lateral cephalometric radiographs, a routine orthodontic diagnostic image.[15] Although CVM correlates well with hand-wrist assessments and avoids additional radiation, its application can be complex and time-consuming.[16] Despite these challenges, CVM offers a potential advantage in orthodontic diagnosis by providing a non-invasive method for evaluating skeletal growth and development.[17]

The modified Baccetti’s CVM classification provides a framework for assessing skeletal maturity based on observable changes in cervical vertebral morphology.[18] The six-stage CVM model delineates the morphological changes of C2, C3, and C4 vertebrae throughout adolescence. Stage 1 (Initiation) is characterized by trapezoidal vertebral bodies with flat inferior borders. As growth accelerates (Stage 2), concavity develops on the inferior borders of C2 and C3, with vertebral bodies transitioning to rectangular shapes. Stage 3 (Transition) marks a period of rapid morphological change, with increasing concavity and persistent rectangular shapes. Growth deceleration (Stage 4) is associated with pronounced concavity and the emergence of square-shaped vertebral bodies. Stage 5 (Maturation) signifies skeletal maturity with maximal concavity and square vertebral bodies. Finally, Stage 6 (Completion) represents the cessation of growth, characterized by deepened concavity and potentially vertically elongated vertebral bodies.[19,20]

Continued research is imperative to refine CVM methodology, establish standardized assessment criteria, and integrate it with other diagnostic tools.[21] Artificial intelligence (AI), specifically machine learning (ML), has revolutionized medical image analysis. Deep learning (DL), a subset of ML, employs multi-layered neural networks to learn complex patterns directly from data.[22,23] Convolutional neural networks (CNNs), a type of DL architecture built on convolution operations and trained via backpropagation, have excelled in image classification tasks, including medical imaging applications that can improve disease diagnosis and forensic analysis.[24]

Early studies employed traditional ML algorithms to classify CVM stages.[25] Most of these studies compared semiautomated systems that identify landmarks and analyze CVM stages.[13,26,27] However, recent investigations have increasingly adopted DL techniques, particularly CNNs, due to their superior performance in image analysis.[28] While some studies have compared different CNN architectures for CVM classification, most have utilized existing models such as ResNet and Inception, with limited exploration of newer models.[15] Recent advances in the field have explored fully automated systems that eliminate human landmark identification, which is prone to internal and external errors.[29] These studies either relied on very deep, complex architectures or did not develop a new model; excessive depth can increase computational cost without improving accuracy. Previous models may also have lacked data augmentation techniques such as rotation, flipping, or contrast normalization, leading to overfitting.[30] Imbalanced datasets were common because certain CVM stages were underrepresented, which affected classification accuracy.[31,32] Nogueira et al.[15] recently compared four CNN models (AlexNet,[16] VGG16,[17] ResNet18,[18] and Inception-v3[19]), and Shoari et al.[29] compared a custom model to ResNet18 for CVM analysis. Both studies found that model performance could be enhanced through more extensive data augmentation to improve robustness. The subtle differences between CVM stages, coupled with low image quality and imbalanced dataset distribution, posed challenges. Incorporating additional preprocessing layers, expanding the patient age range, and utilizing expert-labeled data could further optimize the models. In addition, exploring different CNN architectures and hyperparameter optimization techniques may yield improved results.

In real-world clinical applications, DL models frequently encounter challenges related to dataset imbalance, where certain classifications naturally occur more often than others. This imbalance can result in biased predictions, favoring dominant classes while underperforming underrepresented ones. To address this, random oversampling (ROS) was applied to enhance class balance, ensuring more effective learning across all categories.[33] In addition, early stopping was implemented to prevent overfitting, a common issue in medical AI models trained on limited datasets. A refined training-validation split and a dedicated unseen test set further improved model generalization. These enhancements contribute to a more robust and adaptable approach, not only for CVM classification but also for broader clinical applications where handling class imbalance and optimizing model reliability are essential.[31]

This study aims to address this gap by proposing a new CNN model for automated CVM stage classification on lateral cephalometric radiographs and evaluating its performance rate in detecting CVM processes.

MATERIAL AND METHODS

A total of 922 digital lateral cephalometric radiographs were acquired from individuals aged between 7 and 20 years for the purpose of pre-orthodontic assessment and treatment planning. Archived radiographs were obtained for research purposes between October 1, 2023, and April 1, 2024, from the Radiology Unit, Faculty of Dentistry, Universiti Teknologi MARA (UiTM). Ethical approval was granted by the UiTM Research Ethics Committee (reference number REC/06/2023 (PG/MR/205)), which waived the requirement for informed consent due to the retrospective nature of the study. Data management and analysis were conducted in accordance with the principles outlined in the International Council for Harmonisation Good Clinical Practice Guidelines, the Malaysian Good Clinical Practice Guidelines, and the Declaration of Helsinki.

The chronological age of the subjects was determined by subtracting their birth date from the date the radiographs were captured. Only radiographs devoid of artifacts and distortions, with clear visibility of the C2, C3, and C4 vertebrae, were considered for inclusion. All lateral cephalometric radiographs used in the research were obtained using the X-ray unit with a standardized protocol (73 kVp, 15 mA, and 14.9 s exposure time), adhering to the manufacturer’s guidelines for positioning and irradiation. Image analysis was conducted on a 24-inch medical display monitor (Philips, Luchu Hsiang, Taiwan) equipped with an NVIDIA Quadro FX 380 graphics card to ensure optimal visual representation.

The evaluation of CVM was independently conducted by two orthodontists (NHN and NA), each with more than 12 years of research and orthodontic clinical experience. Before grading, the orthodontists underwent training and calibration to ensure consistency. The inter-observer agreement was assessed using the kappa coefficient. To minimize fatigue and maintain accuracy, cephalometric images were assessed in multiple sessions. In cases of image labeling discrepancies among observers, a consensus was reached through re-examination for a final decision.

The methodology for building, training, and evaluating the custom CVM stage (CVMS) classification model is detailed as follows:[34]

Data preprocessing

The dataset used in this study comprises images categorized into six distinct classes. To mitigate class imbalance, ROS was employed to increase the number of samples in underrepresented classes, ensuring a more uniform distribution across all categories. Class imbalance, where certain classes contain significantly fewer samples, can lead to biased model training, making the model overly sensitive to majority classes while reducing its ability to recognize minority classes. To prevent distribution skew, where oversampling unintentionally creates a new majority class, the number of generated samples was capped at the original majority class size. This strategy preserved dataset balance without introducing artificial bias or overfitting risks.
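As a rough illustration (not the study's published code), ROS capped at the majority-class size can be implemented with NumPy as follows; `images` and `labels` are assumed to be aligned arrays of samples and integer stage labels:

```python
import numpy as np

def random_oversample(images, labels, seed=42):
    """Duplicate minority-class samples until every class matches the majority."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()  # cap: the original majority-class size
    out_x, out_y = [images], [labels]
    for cls, count in zip(classes, counts):
        deficit = int(target - count)
        if deficit > 0:
            idx = np.flatnonzero(labels == cls)
            picks = rng.choice(idx, size=deficit, replace=True)  # sample with replacement
            out_x.append(images[picks])
            out_y.append(labels[picks])
    return np.concatenate(out_x), np.concatenate(out_y)
```

Because the target count equals the original majority-class size, no class can overtake another, which is the capping strategy described above.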

To standardize input dimensions and enhance computational efficiency, all images were resized to 128 × 128 pixels and processed in batches of 32 during training and validation. Data preprocessing and model training were conducted on a Dell Precision 5690 workstation equipped with an Intel Core Ultra 9 185H processor (2.50 GHz), 32 GB of RAM, and an NVIDIA Quadro FX 380 GPU on a 64-bit x64-based architecture, ensuring robust computational performance. The preprocessing pipeline integrated TensorFlow/Keras for image augmentation, OpenCV for resizing and manipulation, NumPy for numerical operations, and Pandas and Matplotlib for data analysis and visualization. This combination of preprocessing techniques, high-performance hardware, and optimized software facilitated an efficient, scalable, and reproducible workflow, ultimately improving model generalization and computational efficiency.
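A minimal sketch of this resizing step, assuming OpenCV for image handling as the text indicates; the helper name and the [0, 1] normalization are illustrative:

```python
import cv2
import numpy as np

IMG_SIZE = 128   # target width/height from the text
BATCH_SIZE = 32  # batch size used during training and validation

def load_and_resize(path: str) -> np.ndarray:
    img = cv2.imread(path)                       # read image (BGR, uint8)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)   # convert to RGB channel order
    img = cv2.resize(img, (IMG_SIZE, IMG_SIZE))  # standardize to 128 x 128
    return img.astype("float32") / 255.0         # scale pixels to [0, 1]
```

Batching into groups of 32 can then be done with `tf.data.Dataset.from_tensor_slices(...).batch(32)` or via the `batch_size` argument of `model.fit`.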

Custom model building with ROS

Before finalizing the custom model architecture, we performed hyperparameter tuning using a random search to identify the optimal configuration for our image classification task. Hyperparameter tuning involves experimenting with different values for various model parameters to enhance performance and achieve better results. The model was compiled using the Adam optimizer, known for its adaptive learning rate capabilities, making it well-suited for image classification tasks. The loss function used was categorical cross-entropy, appropriate for multi-class classification, and accuracy was chosen as the evaluation metric.[35]

Hyperparameter tuning with random search

Random search was used to optimize key hyperparameters by randomly sampling values from predefined ranges, allowing for a more efficient exploration of potential configurations compared to grid search. The tuning process focused on critical parameters, including the learning rate, batch size, number of units in dense layers, dropout rate, and convolutional layer parameters such as the number of filters and kernel size. The learning rate, which controlled the step size for updating model weights, was varied between 0.0001 and 0.1 to balance convergence speed and stability. The batch size, determining how many training samples were processed before updating the model’s parameters, was tested in the range of 16–128 to identify an optimal trade-off between training speed and generalization. The number of neurons in the fully connected dense layers varied between 64 and 512, influencing the model’s ability to learn complex patterns.

To prevent overfitting, the dropout rate was adjusted between 0.2 and 0.5, helping to regulate model complexity by randomly deactivating a fraction of neurons during training. In addition, the convolutional layers, responsible for feature extraction, were optimized by varying the number of filters from 32 to 256 and testing kernel sizes of 3 × 3 and 5 × 5 to determine the most effective spatial feature extraction. A search space was defined for each hyperparameter, and random sampling was used to evaluate different configurations. Each selected combination was used to train the model, and performance was assessed on a validation set. The final model architecture consisted of four convolutional layers with optimized filter sizes, followed by batch normalization, max-pooling layers, and fully connected dense layers with an optimal neuron count and dropout rate. The best-performing configuration was selected based on validation accuracy and loss trends, ensuring improved generalization and robustness in CVMS classification. After assessing the performance metrics, the combination of hyperparameters that produced the best results was selected.[36]
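The study does not publish its search code; the sketch below shows how such a random search could be set up with the KerasTuner library, with ranges taken from the text (batch-size tuning is omitted for brevity, as it requires subclassing the tuner):

```python
import keras_tuner as kt
import tensorflow as tf

def build_model(hp):
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Input(shape=(128, 128, 3)))
    for i in range(2):  # convolutional feature-extraction blocks
        model.add(tf.keras.layers.Conv2D(
            filters=hp.Int(f"filters_{i}", 32, 256, step=16),  # 32-256 filters
            kernel_size=hp.Choice(f"kernel_{i}", [3, 5]),      # 3x3 or 5x5
            activation="relu"))
        model.add(tf.keras.layers.MaxPooling2D())
    model.add(tf.keras.layers.GlobalAveragePooling2D())
    model.add(tf.keras.layers.Dense(
        hp.Int("dense_units", 64, 512, step=16), activation="relu"))  # 64-512 units
    model.add(tf.keras.layers.Dropout(hp.Float("dropout", 0.2, 0.5)))  # 0.2-0.5
    model.add(tf.keras.layers.Dense(6, activation="softmax"))  # six CVM stages
    model.compile(
        optimizer=tf.keras.optimizers.Adam(
            hp.Float("learning_rate", 1e-4, 1e-1, sampling="log")),  # 0.0001-0.1
        loss="categorical_crossentropy",
        metrics=["accuracy"])
    return model

tuner = kt.RandomSearch(build_model, objective="val_accuracy", max_trials=20)
# tuner.search(x_train, y_train, validation_data=(x_val, y_val), epochs=10)
```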

Final model architecture

Based on the results of the hyperparameter search, the final model architecture is designed as follows: The model starts with an input layer that processes only the cropped cervical vertebrae (CV2, CV3, and CV4), ensuring the AI focuses exclusively on CVM-relevant regions. Each input image is resized to 128 × 128 pixels with three color channels (RGB) to maintain uniformity and optimize computational efficiency. To enhance contrast and improve feature extraction, normalization techniques are applied, followed by data augmentation (including rotation, flipping, and brightness adjustments) to improve model generalization.

The feature extraction process begins with a convolutional layer containing 96 filters and a kernel size of 3, which captures initial patterns in the vertebral structures. This is followed by a second convolutional layer with 112 filters and a kernel size of 3, further refining the extracted features. To reduce spatial dimensions and minimize parameters, a global average pooling layer aggregates information from the feature maps. A fully connected dense layer with 112 units and Rectified Linear Unit (ReLU) activation follows, introducing non-linearity to enhance the model’s learning capacity. To mitigate overfitting, a dropout layer (rate = 0.2) randomly deactivates a fraction of units during training.

Finally, the model concludes with an output layer of six neurons, each corresponding to one of the six CVMSs, with a softmax activation function generating probability distributions across the classes. By integrating region-specific preprocessing, data augmentation, and an optimized DL architecture, the model ensures precise feature extraction, robust learning, and accurate CVMS classification.[33]
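Under the description above, the final architecture could be expressed in Keras roughly as follows; this is a reconstruction from the text, not the authors' released code:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128, 128, 3)),        # cropped C2-C4 region, RGB
    tf.keras.layers.Conv2D(96, 3, activation="relu"),   # initial vertebral patterns
    tf.keras.layers.Conv2D(112, 3, activation="relu"),  # refined features
    tf.keras.layers.GlobalAveragePooling2D(),           # reduce spatial dimensions
    tf.keras.layers.Dense(112, activation="relu"),      # non-linear combination
    tf.keras.layers.Dropout(0.2),                       # regularization
    tf.keras.layers.Dense(6, activation="softmax"),     # six CVM stages
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```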

Training the custom model

The custom model was trained on the training dataset using the fit method, which performs backpropagation to iteratively update the model’s weights over multiple epochs. For this study, a custom DCNN model was designed, with hyperparameters optimized using random search, and trained for 100 epochs with the Adam optimizer and categorical cross-entropy loss. Early stopping is a widely used regularization technique in DL that prevents overfitting by halting training once the model’s performance stops improving. It works by monitoring a specific metric, such as validation loss or validation accuracy, and stopping training if no significant improvement is observed after a predefined number of epochs. For instance, if validation loss does not decrease for five consecutive epochs, training is terminated, and the model reverts to the best-performing weights to ensure optimal generalization. In addition, a minimum improvement threshold (min delta) can be set to avoid stopping too early due to minor fluctuations in performance. This approach helps maintain model efficiency by reducing unnecessary training cycles while ensuring that the model does not memorize noise from the training data. By applying early stopping, DL models, especially in medical imaging applications, can achieve better generalization and robustness, particularly when trained on limited or imbalanced datasets. During each epoch, the training dataset was divided into mini-batches, and the model was exposed to these mini-batches sequentially, allowing it to learn and update weights gradually.
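A sketch of this early-stopping configuration in Keras, using the five-epoch patience from the example above; the min_delta value is an assumed illustration:

```python
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",         # watch validation loss
    patience=5,                 # stop after 5 epochs without improvement
    min_delta=1e-4,             # ignore fluctuations smaller than this
    restore_best_weights=True,  # revert to the best-performing weights
)
# history = model.fit(x_train, y_train, epochs=100, batch_size=32,
#                     validation_data=(x_val, y_val), callbacks=[early_stop])
```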

After each epoch, the model’s performance was evaluated on a separate validation dataset. This evaluation helps monitor for overfitting and ensures that the model’s performance generalizes well to unseen data. Performance metrics such as training and validation accuracy and loss for each epoch were recorded, providing insights into the model’s learning progress and effectiveness.

This recorded training history was used for detailed analysis and to make any necessary adjustments to improve the model’s performance. The validation dataset results guided decisions to refine the model and ensure that it was learning effectively without overfitting.[37]

Refined training methodology and data preprocessing strategies

Further improvements were implemented to refine the model’s training methodology and enhance classification performance. One of the key modifications was adjusting the training-validation split from 80-20 to 70-30, providing a larger validation set for a more reliable performance assessment. This adjustment allowed the model to generalize better and reduced the risk of overfitting by exposing it to a more diverse set of validation samples.

The ROS was applied before training rather than during or after data processing. This ensured that all CVMS stages had sufficient representation in the training data, allowing the model to learn distinctive features for each stage more effectively. Another critical strategy involved resetting the model’s weights and clearing the computational graph before each training session. A memory reset function, K.clear_session() in Keras, was applied after ROS and before model training to optimize resource management. This approach reduced memory accumulation, improved training stability, and enhanced overall classification performance. By invoking this function, each training cycle started fresh, preventing unwanted bias accumulation from previous runs. This technique ensured that the model effectively learned from the newly structured dataset, enhancing its stability and generalization.
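A minimal sketch of this ordering, assuming the `random_oversample` helper from the earlier sketch and a hypothetical `build_final_model()` that recreates the architecture described above:

```python
import tensorflow as tf
from tensorflow.keras import backend as K

# 1. Balance classes first (ROS before training); labels are integer stages here.
x_bal, y_bal = random_oversample(images, labels)
y_bal = tf.keras.utils.to_categorical(y_bal, num_classes=6)  # one-hot for CCE loss

# 2. Reset Keras state so training starts from freshly initialized weights
#    and memory accumulated by previous runs is released.
K.clear_session()

# 3. Rebuild and train on the balanced data with a 70-30 train-validation split.
model = build_final_model()  # hypothetical builder for the architecture above
history = model.fit(x_bal, y_bal, validation_split=0.3, epochs=100, batch_size=32)
```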

The training methodology was further refined through optimized hyperparameter tuning, focusing on adjusting the learning rate, batch size, dropout rate, and the number of units in dense layers. Early stopping was also employed to prevent overfitting by halting training when validation loss plateaued. These refinements contributed to a more stable and efficient training process, ensuring that the model learned meaningful patterns for accurate classification.

Evaluation

Evaluation metrics included loss and accuracy, with validation accuracy indicating how well the model performed on data not used during training. To gain a deeper understanding of the model’s performance across different classes, the confusion matrix was computed. This matrix provides detailed insights into the number of true positive, false positive, false negative, and true negative predictions for each class, revealing any specific classes where the model might be struggling. The predicted labels for the validation set were obtained using the predict method, and these predictions were compared with the true labels to construct the confusion matrix.[38]
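A minimal sketch of this evaluation step, assuming scikit-learn for the metrics and one-hot validation labels (variable names are placeholders):

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

y_prob = model.predict(x_val)        # class probabilities per image
y_pred = np.argmax(y_prob, axis=1)   # predicted stage indices
y_true = np.argmax(y_val, axis=1)    # true stage indices from one-hot labels

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred,
                            target_names=[f"CVMS {i}" for i in range(1, 7)]))
```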

Statistical analyses were performed using IBM SPSS (Statistical Package for the Social Sciences) Statistics 23.0. Inter-rater reliability was assessed with the kappa coefficient, with values interpreted as follows: Slight (0.01–0.20), fair (0.21–0.40), moderate (0.41–0.60), substantial (0.61–0.80), and almost perfect (0.81–0.99) agreement.
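The kappa statistic itself was computed in SPSS; for readers reproducing the analysis in Python, scikit-learn's cohen_kappa_score gives an equivalent result (the label lists below are purely illustrative):

```python
from sklearn.metrics import cohen_kappa_score

rater_a = [1, 2, 3, 4, 5, 6, 5, 4]  # illustrative stage labels, observer 1
rater_b = [1, 2, 3, 4, 5, 6, 5, 3]  # illustrative stage labels, observer 2
kappa = cohen_kappa_score(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")
```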

RESULTS

The Kappa coefficient of 0.87 indicates an almost perfect level of agreement between the two observers in determining CVMSs, suggesting high reliability and consistency in the data. [Table 1] presents the demographic breakdown of the patient population, along with the distribution of CVMSs identified through visual assessment.

Table 1: Descriptive statistics of the patient’s age and cervical vertebral maturation stage.

CVM stage      Mean age (years)±SD    n (%)
CVM Stage 1    7.59±1.64              32 (3.47)
CVM Stage 2    9.75±1.53              47 (5.10)
CVM Stage 3    10.61±1.24             97 (10.52)
CVM Stage 4    12.16±1.31             196 (21.26)
CVM Stage 5    13.62±1.35             296 (32.10)
CVM Stage 6    17.21±3.04             254 (27.55)
Total          11.82±3.35             922 (100)

Initially, the custom model was developed without implementing ROS. During training, the model achieved a perfect accuracy of 100%. However, this high accuracy did not translate to validation performance, which was only 57%. This discrepancy suggests that while the model fits the training data exceptionally well, it struggled with generalization, likely due to class imbalance or insufficient representation of minority classes. [Figure 1] shows learning curves of training and validation accuracy without applying the ROS.

Figure 1: Learning curves of training and validation accuracy without random oversampling.

To address the class imbalance, ROS was applied. Initially, the dataset had varying numbers of images per class, ranging from 32 images for class CVMS 1 to 296 images for class CVMS 5. ROS increased the number of images in each class to 296, resulting in a final dataset of 1420 images for training and 356 images for validation. This increase in dataset size explains the discrepancy between the image count mentioned in different sections of the study. While the dataset originally contained fewer than 1000 images, the application of ROS expanded it to over 1700 images for training and validation, ensuring a balanced distribution across all six classes. [Figure 2] illustrates the confusion matrix for the custom model before applying ROS, where significant misclassifications were observed, particularly among classes CVMS 2 through CVMS 6.

Figure 2: Confusion matrix for the custom model without random oversampling. CVMS: Cervical vertebral maturation stage.

The training process does not start from scratch for each stage; instead, the model is trained in a single end-to-end session, learning to classify all six CVMSs simultaneously. Before each training run, previous training states are erased by reinitializing the model’s weights and reloading the dataset. Specifically, this is achieved by resetting the model’s weights, clearing the computational graph using K.clear_session() in Keras, and reloading the balanced dataset before retraining. This ensures that each training session begins without residual knowledge from previous runs, allowing the model to learn from the newly structured dataset effectively.

The hyperparameter search has been completed, yielding the optimal configurations for the model. [Figure 3] presents the optimal hyperparameters identified through the search. For the first convolutional layer, the optimal number of filters was determined to be 96, with a kernel size of 3. This configuration was selected to effectively capture initial features from the input images. The second convolutional layer was optimized with 112 filters and the same kernel size of 3, enhancing the model’s ability to refine and extract more complex features.

Figure 3: Detailed layer configuration of the custom model.

In the fully connected dense layer, the optimal number of units was found to be 112. This setting helps the model learn intricate patterns and relationships in the data. To mitigate overfitting, a dropout rate of 0.2 was identified as optimal, providing a balance between regularization and model capacity. These hyperparameters were chosen to maximize the model’s performance and efficiency in handling the image classification task, ensuring robust feature extraction and effective learning.

During the training of the model over 100 epochs, significant improvements in performance were observed. Initially, the model achieved a loss of 1.7053 and an accuracy of 33.24% on the training set, with a validation accuracy of 50.56%. By the fifth epoch, training accuracy had increased to 85.92%, with validation accuracy rising to 79.78%. As training progressed, accuracy continued to improve, reaching 97.39% by epoch 10, although validation accuracy experienced fluctuations, peaking at 87.36% in epoch 15. After the 50th epoch, training was stopped due to the early stopping function. Despite some variability in validation accuracy, ranging between 82.30% and 87.36%, the model achieved a final accuracy of 85.96% by epoch 50, indicating a robust model with good generalization capabilities. [Figure 4] shows the learning curve for classifying CVMS, illustrating the model’s performance.

Figure 4: Learning curves of training and validation accuracy with random oversampling.

The results presented reflect the model’s performance after testing on the validation dataset. [Table 2] illustrates the classification report obtained after the model evaluation. For CVMS 1, the model achieved near-perfect precision (98.0%) and perfect recall (100%), indicating highly reliable identification of this class with very few false positives or false negatives. CVMS 2 also performed exceptionally well, with a precision of 95.8% and a recall of 98.6%, suggesting strong predictive capability and reliable classification. CVMS 3 demonstrated balanced performance, with precision and recall of 96.0%, reflecting an effective and consistent ability to identify this class.

Table 2: The precision, recall, and F1-score values calculated according to the confusion matrix.

Stage      Precision   Recall   F1-score
1          0.98        1.00     0.99
2          0.96        0.99     0.97
3          0.96        0.96     0.96
4          0.85        0.88     0.87
5          0.77        0.71     0.74
6          0.79        0.80     0.79
Accuracy   0.88        0.88     0.88

However, CVMS 4 showed slightly lower precision at 85.7% and recall at 87.5%, pointing to some challenges in minimizing false positives and negatives. The performance of CVMS 5 was notably weaker, with a lower precision of 77.0% and recall of 71.2%, indicating difficulties in accurately classifying this class, which may be due to complex feature differentiation. CVMS 6 had moderate performance, with precision and recall around 79.0% and 80.0%, respectively, suggesting reasonable effectiveness but room for improvement.

The overall classification accuracy of the model is 88.2%, reflecting a generally strong performance. The macro and weighted averages further confirm that the model is performing well across different classes, with the macro average indicating balanced performance across all classes and the weighted average accounting for class imbalances. The classification report highlights that while the model excels in several classes, attention should be given to improving the classification of CVMS 5 to enhance overall robustness. The confusion matrix in [Figure 5] reveals varied performance across classes. This study focuses on classifying CVMSs rather than analyzing the shape and outline of individual vertebrae. Since the regions of interest (ROI) were pre-cropped to include only the relevant cervical vertebrae, the model is trained specifically on these maturation stages, eliminating the need for separate vertebra-level analysis.

Figure 5: Confusion matrix for the custom model with random oversampling. CVMS: Cervical vertebral maturation stage.

Further improvements were achieved by refining the training approach, adjusting the training-validation split to 70–30, and applying ROS before training. These modifications led to a significant performance boost, with the model achieving an overall accuracy of 90%. [Figure 6] shows the classification report for the proposed model. The macro and weighted average F1-scores of 0.90 indicate strong classification consistency across all CVMS stages. One of the key improvements was the balancing of class distribution using ROS. This technique ensured that all CVMS stages had sufficient representation, leading to a noticeable improvement in recall, particularly for CVMS 5 (0.85 vs. previous 0.71). This directly addressed previous misclassification issues, where CVMS 5 was frequently confused with adjacent stages due to an insufficient number of training samples.

Figure 6: Improved classification report for the proposed model. CVMS: Cervical vertebral maturation stage.

In addition, K.clear_session() was applied before each training session to reset the model’s weights and clear the computational graph. This ensured that every training cycle started fresh, preventing unwanted bias accumulation and allowing the model to learn effectively from the newly structured dataset. The refined training strategy also contributed to higher precision and recall in later CVMS stages, which previously exhibited performance inconsistencies. CVMS 4’s precision improved to 0.95, reducing false positives, while CVMS 6’s recall increased to 0.94, indicating better sensitivity in detecting skeletal maturity progression. These enhancements resulted in a more robust and stable model capable of distinguishing all CVMS stages with improved reliability. By integrating these refinements, the model demonstrated greater classification robustness and stability, particularly in distinguishing the more challenging CVMS 5 stage. The applied techniques further reinforce the model’s adaptability for other medical classification tasks, highlighting its potential real-world clinical applicability beyond CVM staging.

The accuracy comparison graph, presented in [Figure 7], illustrates the performance of the proposed method against several pre-trained models, including InceptionV3, ResNet50, and MobileNetV2, under identical experimental settings. The same training parameters, dataset preprocessing, and hyperparameter configurations were applied to ensure a fair comparison across all models. The proposed model achieves the highest training accuracy, approaching 100%, while maintaining a stable validation accuracy above 85%. The minimal gap between the training and validation curves indicates strong generalization, suggesting that the model effectively learns the classification patterns without significant overfitting.

Figure 7: Accuracy comparison of the proposed model versus pre-trained models (InceptionV3, ResNet50, and MobileNetV2) under identical experimental settings.

Among the pre-trained models, InceptionV3 exhibits lower training accuracy, peaking around 40–50%, with fluctuating validation accuracy, indicating unstable learning. ResNet50 performs the worst, struggling to surpass 20% in both training and validation accuracy, suggesting that it fails to learn effectively under the given settings. MobileNetV2 shows moderate performance, with training accuracy steadily increasing beyond 60%. However, its validation accuracy remains inconsistent and lower than the proposed model, suggesting weaker generalization capability.

[Figure 8] presents the classification report for the unseen dataset. The model achieved an overall accuracy of 74%, with macro and weighted average F1-scores of 74% and 79%, respectively. CVMS 1 and CVMS 3 showed perfect precision (100%) but lower recall (75% and 50%), suggesting underprediction of these classes. CVMS 5 had the highest recall (100%) but a slightly lower precision (80%), indicating that some images from other stages were misclassified as CVMS 5. CVMS 2 performed weakest, with 50% precision and 67% recall, often being confused with other classes. CVMS 4 and CVMS 6 exhibited moderate performance, with precision and recall between 60% and 75%.

Figure 8: Classification performance on the unseen dataset. CVMS: Cervical vertebral maturation stage.

DISCUSSION

Resizing the training and validation images is essential for several reasons. First, ML models, particularly CNNs, require input images of the same size for the architecture to function correctly; resizing ensures uniform dimensions, leading to more consistent and reliable training outcomes, and avoids the bias that could be introduced if images of varying sizes were used. Second, smaller, uniformly sized images reduce the computational load and memory usage, making training faster and more efficient. By resizing to 128 × 128, the model can process images more quickly, which is particularly important when working with large datasets or complex models. In clinical applications, if input images differ in size, model accuracy may suffer due to variations in spatial features. Because the model was trained at this fixed size, resizing clinical images to 128 × 128 before inference helps maintain accuracy and ensures compatibility with the trained network; deviations from this resolution may require additional preprocessing steps or adaptive resizing techniques to preserve model performance.
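As an illustration, inference on a new clinical image could reuse the `load_and_resize` helper from the preprocessing sketch (the file name and helper are assumptions, not part of the study's pipeline):

```python
import numpy as np

img = load_and_resize("example_ceph.png")           # resize to 128 x 128 before inference
probs = model.predict(np.expand_dims(img, axis=0))  # add batch dimension -> (1, 128, 128, 3)
stage = int(np.argmax(probs)) + 1                   # stages are numbered 1-6
print(f"Predicted CVM stage: {stage}")
```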

Standardizing image sizes also prevents inconsistencies that could affect the model’s learning process. For CNNs, resizing ensures that the model can consistently extract relevant features across all images. Furthermore, resizing to the specific size used by pre-trained models (e.g., 224 × 224) ensures compatibility, allowing for effective transfer learning and leveraging pre-trained weights. Overall, resizing is a crucial pre-processing step that optimizes the dataset for better model performance and efficient use of resources.

Developing a network from scratch without annotations offers flexibility, deep understanding, and optimization opportunities, enabling tailored architectures for specific tasks. It fosters skills enhancement, allows for unsupervised or self-supervised learning, encourages innovative approaches, and provides full control over the data processing and training pipeline, making it ideal for research and experimentation despite requiring significant effort and expertise.

ROS aims to balance the class distribution by duplicating samples from minority classes, providing a more representative dataset. This adjustment is intended to improve the model’s generalization capabilities and enhance validation performance, thereby reducing the gap between training and validation accuracy and leading to more reliable classification results across all classes. When performing ROS, the classification performance of the majority class can sometimes be quite low due to several factors. First, ROS can lead to overfitting on the minority class, as the algorithm might memorize the duplicated instances rather than learn generalizable patterns, resulting in poor performance on unseen data and especially affecting the majority class. Second, although oversampling balances the class distribution, it does not create new information, and the oversampled data might not represent the underlying distribution of the classes well, causing difficulty in generalization. In addition, oversampling can amplify noise present in the minority class; duplicating noisy instances makes the model more prone to misclassifying the majority class. Furthermore, by balancing the classes, the model might shift its focus toward the minority class, leading to better performance for the minority class but potentially lowering the performance of the majority class.

Initially, the model exhibited a high training accuracy of 100%, but validation accuracy was substantially lower at 57%, indicating issues with class imbalance. To address this, ROS was implemented. This technique increased the number of images for each class to balance the dataset, enhancing its representation and improving model training. As a result, the final dataset consisted of 1420 images for training and 356 for validation. The application of ROS led to a notable improvement in the model’s performance metrics, addressing the previously observed misclassifications, particularly in classes CVMS 2 through CVMS 6. The confusion matrix and performance metrics post-ROS demonstrated a more balanced and effective classification across all classes, reducing the number of misclassifications and improving overall model robustness.

Li et al.[37] collected an extensive dataset of 10,200 radiographs, significantly larger than those of previous studies. They utilized YOLOv3 for detecting ROI and achieved an overall accuracy of 70%. Their approach highlighted the potential of DL in CVM classification and suggested incorporating additional factors, such as intervertebral disc space and dental age, for further improvements. In contrast, our custom network with ROS implementation achieved superior accuracy, with an overall performance of 88%. Although our customized model showed improvements in overall classification, it still faced challenges in accurately identifying CVMSs 5 and 6. These stages are the majority classes in the dataset, and while ROS effectively balances the dataset by oversampling minority classes, it does not modify the intrinsic distribution of the majority classes. As a result, the model may become more adept at identifying underrepresented stages but struggle to distinguish frequently occurring ones, such as CVMSs 5 and 6, which exhibit subtle structural differences. In addition, the complexity of these later stages, where skeletal maturity progresses with finer variations, adds another layer of difficulty. While ROS improves minority class representation, it does not enhance the model’s ability to differentiate subtle variations within majority classes. Consequently, the model may still struggle with frequently occurring stages that require more nuanced feature extraction. To mitigate this issue, alternative techniques such as synthetic data generation, adaptive weighting strategies, or focal loss could be explored. These approaches would enhance feature diversity, allowing the model to learn more discriminative patterns and improve its classification performance for harder-to-classify stages.
