Identifying Laryngeal Neoplasms in Laryngoscope Images via Deep Learning Based Object Detection: A Case Study on an Extremely Small Data Set

In the clinical practice of Otolaryngology, the laryngoscope is the most common and important instrument for examining the structure of the larynx. Through a laryngoscope, rigid or flexible, lesions of the larynx can be identified by an ENT (Ear, Nose and Throat) practitioner. Physicians' interpretation of laryngoscope images depends on clinical experience, and doctors of different seniority from hospitals of different levels may draw different conclusions from the same images. A reliable automated system for screening laryngoscopic images could provide preliminary recommendations that assist doctors in their interpretation.

Artificial Intelligence (AI) has demonstrated technical advantages in image recognition, as shown in the imaging diagnosis of skin cancer [1]. As the technology advances, objective analysis of laryngoscope images is gaining momentum. Du et al. [2] apply an Artificial Neural Network (ANN) to laryngoscope images to determine the validity of color and texture abnormalities for diagnosing LPR (Laryngopharyngeal Reflux). The disadvantage of this method is that using only two features may omit more substantial information contained in the image.

In recent years, a great number of publications have applied computer vision techniques to medical imagery in radiology, pathology, ophthalmology, and dermatology, benefiting from the growing availability of highly structured images [3]. For example, EchoNet [4] was proposed to recognize cardiac structures and predict systemic phenotypes that are difficult for humans to interpret. For clinical skin images, DermGAN [5] leveraged Generative Adversarial Nets (GANs) [6] to synthesize images with skin conditions as data augmentation. A Convolutional Neural Network (CNN) is employed in [7] for automated detection of diabetic retinopathy and diabetic macular edema in retinal fundus photographs.

Nevertheless, these works are often supported by extensive, easily accessible medical image collections. For example, CT images are automatically preserved in digital form, so researchers do not have to consciously collect and build a dataset. Moreover, large-scale public datasets such as MURA [8] of bone X-rays and LUNA16 [9] of lung nodules can be used directly for training and evaluation.

For laryngoscope images, however, no public dataset is available to the best of our knowledge, and building one from scratch is challenging. First, the number of patients with throat disease is smaller than for many other diseases, and only a fraction of patients are willing to undergo laryngoscopy. Second, many hospitals did not save laryngoscope images in digital form in previous years, which makes it hard to collect enough data. Third, privacy policies further limit the amount of related metadata. Fourth, annotations can only be made by senior physicians to ensure label correctness, which increases the cost of annotation.

There are few works on larynx image recognition. For example, Yao et al. [10] employ a CNN to select informative frames from laryngoscopic videos, but their method cannot predict the category of neoplasms. The work most related to ours is [11], where a widely used ResNet-101 model [12] pretrained on the ImageNet natural image dataset is transferred to classify laryngoscope images. In more detail, they built a dataset of 24667 consecutive laryngoscope images from 9231 patients, divided into a training set of 14340 images, a validation set of 5093 images, and a test set of 5234 images. Their dataset has five categories: Normal, Vocal nodule, Leukoplakia, Benign, and Malignancy. Using the ResNet-101 model, they achieved an overall accuracy of 96.24%, a sensitivity of 99.02%, and a specificity of 99.36% on the test set.
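The accuracy, sensitivity, and specificity figures above follow the standard definitions used in diagnostic studies. As a quick reference, they can be computed from one-vs-rest confusion counts; a minimal sketch (the counts below are illustrative only, not data from [11]):

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity, and specificity from one-vs-rest counts.

    tp/fp/fn/tn: true/false positives and negatives for one class
    treated as 'positive' against all other classes.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    return accuracy, sensitivity, specificity

# Illustrative counts only (not from [11]).
acc, sens, spec = binary_metrics(tp=90, fp=5, fn=10, tn=95)
```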

However, we argue that the method of [11] has several limitations. First, its high accuracy was achieved on a large training set (14340 images from 5250 patients) collected over six years (2012 to 2017). Currently, their dataset cannot be made publicly available, so other researchers must build their own datasets from scratch, possibly with only a small number of laryngoscope images; for many small hospitals, building a dataset of such scale would be almost impossible. It is questionable whether the transferred ResNet-101 model can achieve satisfactory performance on these small datasets. Even if the model were released, it could not be directly employed by others, since different laryngoscope scanning tools cause large domain differences among images. Second, their image classification model can only report the category of the whole image, lacking interpretability when an otolaryngologist attempts to examine the results of the computer-aided diagnosis system.

To address these problems, we propose a diagnosis method for laryngoscope images under small-scale data. Learning from small data [13] has recently attracted much attention due to the expensive cost of annotation and training, its wide applicability in real scenarios, and the pursuit of Artificial General Intelligence (AGI). For this small-data problem, we adopt an object detection model rather than conventional image classification, since image classification cannot focus on the small region of interest in a laryngoscopic image, particularly in the small-data setting. In this way, we simultaneously predict the category and detect the region of neoplasms from the input RGB images. To learn from small data, we use DropBlock regularization to prevent overfitting. To demonstrate the effectiveness of our method, we built a dataset of 279 images from 279 patients (one image per patient), which is much smaller than that of [11]. As shown in Figure 1, our dataset contains five categories that differ from [11]: normal, cyst, nodules, laryngeal carcinoma, and precancerous lesions. We use a rectangular bounding box as the location annotation of the neoplasm (if one exists) and the type of neoplasm as the category label.
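As background on the regularizer: DropBlock zeroes contiguous square regions of a feature map rather than independent units, which removes spatially correlated activations more effectively than unstructured dropout. A simplified single-channel NumPy sketch of the training-time behavior (our model applies it to 4D feature batches; this is an illustration, not our implementation):

```python
import numpy as np

def dropblock(x, block_size=3, keep_prob=0.9, rng=None):
    """Apply DropBlock to one feature map x of shape (H, W).

    Simplified sketch: samples block centers at rate gamma so that the
    expected fraction of dropped units matches 1 - keep_prob, zeroes a
    block_size x block_size region around each center, then rescales.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = x.shape
    # Center-sampling rate from the DropBlock paper, corrected for the
    # valid region where a full block fits.
    gamma = ((1.0 - keep_prob) / block_size**2) * (
        (h * w) / ((h - block_size + 1) * (w - block_size + 1)))
    centers = rng.random((h - block_size + 1, w - block_size + 1)) < gamma
    mask = np.ones((h, w))
    for i, j in zip(*np.nonzero(centers)):
        mask[i:i + block_size, j:j + block_size] = 0.0  # drop whole block
    kept = mask.mean()
    # Rescale kept activations to preserve the expected magnitude.
    return x * mask / kept if kept > 0 else x * mask
```

At test time (keep_prob = 1.0) the map passes through unchanged, mirroring standard dropout behavior.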

In summary, our contributions are as follows:

1. An extremely small dataset of 279 images from 279 patients is built as a more realistic benchmark for laryngology practices.

2. A method based on object detection is proposed. By outputting both the category label and the location of the pathology, our results are more interpretable than those of black-box image classification methods that offer only the category.

3. Experimental results demonstrate the effectiveness of our method. Even on a small-scale dataset, it achieves 73.00% overall accuracy, outperforming clinical visual assessments (CVAs) and a state-of-the-art automated method.
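For detection outputs such as ours, a predicted bounding box is conventionally matched to the annotated box by intersection-over-union (IoU). A minimal sketch of this standard computation, assuming the common corner format (x1, y1, x2, y2) rather than our specific annotation convention:

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) corners."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # overlap area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

A prediction is typically counted as correct when its class matches and its IoU with the ground-truth box exceeds a threshold such as 0.5.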
