The accuracy of an Online Sequential Extreme Learning Machine in detecting voice pathology using the Malaysian Voice Pathology Database

Study design and study subject selection

This is a cross-sectional study that was conducted over a duration of two years in an academic tertiary laryngology clinic. The ethics board of the institution approved the study prior to data collection. Each participant confirmed that his or her participation was voluntary and that the decision would not affect the medical care received.

A total of 382 participants were recruited for the study. The subjects were divided into two groups: the normal voice group and the dysphonic voice group. Video laryngostroboscopy, voice recording, and acoustic analysis are routine procedures for patients with voice problems. The data for the dysphonic group were obtained from the clinic’s database of patients with voice disorders, which included video laryngostroboscopy, voice recording, acoustic analysis, and clinical diagnosis. Records that were incomplete, or that belonged to patients who had undergone laryngeal surgery or who were aphonic, were excluded from the study.

Participants in the normal voice group were identified among the staff and students of Universiti Kebangsaan Malaysia and screened by using two questionnaires: the Voice Handicap Index-10 (VHI-10) [6, 7] and Reflux Symptom Index (RSI) questionnaire [18]. The inclusion criteria were a VHI-10 score of less than 7.5 [19] and RSI score of less than 13 [18] and age between 18 and 60 years old. The exclusion criteria were previous vocal fold pathology, history of smoking, history of intubation within six months, and history of upper respiratory tract infection within two weeks. Participants who met the screening criteria were further evaluated with video laryngostroboscopy, voice recording, and acoustic analysis to ensure that they were free from any vocal fold pathology. Those who exhibited normal video laryngostroboscopy, voice recording, and acoustic analysis were included in the study.

The collected data (including video laryngostroboscopy and voice recording) were stored in a voice database named MVPD according to the two groups (normal voice and dysphonic voice). For the dysphonic voice group, the diagnoses were classified into two subgroups based on the causes of dysphonia: (1) structural, comprising malignant and premalignant, benign, and inflammatory lesions; and (2) non-structural, consisting of functional and neurogenic dysphonia. To keep the participants anonymous, the files of the collected data were assigned new names. The study methodology is summarized in Fig. 1.

Fig. 1

File name terminology

To ensure the confidentiality of the participants, all voice recordings were given new names with six parts. For the dysphonic group, the first part indicates the patient’s disorder, with ‘ml’ representing malignant, ‘pm’ premalignant, ‘bn’ benign, ‘in’ inflammatory disease, ‘fc’ functional, and ‘ne’ neurogenic. For normal subjects, the abbreviation ‘no’ is used. The second part is a numerical code specific to each participant, while the third part denotes the participant’s age. The fourth part indicates the participant’s gender, whereby ‘m’ denotes male and ‘f’ denotes female. The fifth part represents the participant’s race, using ‘mly’ for Malays, ‘chi’ for Chinese, ‘ind’ for Indians, and ‘oth’ for others. The sixth part indicates the recording of the 5-s vowel /a/. For example, a voice sample named ‘in-156-28-m-ind-5a’ belongs to a 28-year-old Indian male with an inflammatory condition, and the file contains the 5-s vowel /a/. All the renamed voice recordings were stored in the MVPD.
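The six-part naming scheme above maps directly onto a small parser. The following is an illustrative sketch (the function name and dictionaries are ours, not part of the MVPD specification):

```python
# Hypothetical parser for the six-part MVPD file-name scheme described above.
DISORDERS = {"ml": "malignant", "pm": "premalignant", "bn": "benign",
             "in": "inflammatory", "fc": "functional", "ne": "neurogenic",
             "no": "normal"}
RACES = {"mly": "Malay", "chi": "Chinese", "ind": "Indian", "oth": "other"}

def parse_mvpd_name(name: str) -> dict:
    """Split an MVPD file name such as 'in-156-28-m-ind-5a' into its fields."""
    disorder, subject_id, age, gender, race, task = name.split("-")
    return {
        "disorder": DISORDERS[disorder],
        "subject_id": subject_id,
        "age": int(age),
        "gender": "male" if gender == "m" else "female",
        "race": RACES[race],
        "task": task,  # '5a' = 5-second sustained vowel /a/
    }
```

A file named ‘in-156-28-m-ind-5a’ would thus be decoded as an inflammatory-group recording of participant 156, a 28-year-old Indian male.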

Evaluation of the Malaysian Voice Pathology Database using an Online Sequential Extreme Learning Machine

The voice pathology detection and classification system using the OSELM technique involves three main phases. The first phase involves the collection of data and the creation of the proposed MVPD database. The second phase covers the extraction of features from the voice signals. The third phase comprises the detection and classification stages. Figure 2 shows the flow of voice pathology detection and classification.

Fig. 2

Flowchart of voice pathology detection and classification using OSELM

Mel-frequency cepstral coefficient

The Mel-Frequency Cepstral Coefficient (MFCC) technique is a tool for feature extraction in speech processing. It is widely used in automatic speech and speaker recognition systems. The process of the MFCC technique includes several steps, such as pre-emphasis, framing, windowing, Fast Fourier Transform (FFT), mel-filter bank, and Discrete Cosine Transform (DCT) [20]. The diagram of the MFCC feature extraction process is shown in Fig. 3.

Fig. 3

Feature extraction processes based on MFCC [13]

In the pre-processing step, the analog signal is converted into a digital signal, and a pre-emphasis filter boosts the signal energy at higher frequencies, as in the following equation:

$$S^{\prime}_{n} = S_{n} - 0.95 \times S_{n - 1}$$

(1)

where $S^{\prime}_{n}$ is the new (pre-emphasized) sample value, $S_{n}$ is the original sample value, and $n$ refers to the sample number. The utterance is then separated into frames, and a Hamming window is applied to each frame. The FFT is applied to each windowed frame, converting the time-domain signal into the frequency domain. The frequency is further converted from Hertz to mel using the following equation:

$$f_{mel} = 2595 \times \log_{10} \left( {1 + \frac{f}{700}} \right)$$

(2)

Lastly, the DCT is used to convert the log mel spectrum back into the time domain. The result of this conversion is called the mel-frequency cepstral coefficient [13].
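The pipeline of pre-emphasis, framing, windowing, FFT, mel filter bank, and DCT can be sketched in NumPy/SciPy. This is a minimal illustration only; the frame length (400 samples), hop (160 samples), sampling rate (16 kHz), and filter count (26) are assumptions of ours, not parameters reported by the study:

```python
import numpy as np
from scipy.fftpack import dct

def preemphasis(signal, alpha=0.95):
    """Eq. (1): s'[n] = s[n] - 0.95 * s[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_and_window(signal, frame_len=400, hop=160):
    """Split the signal into overlapping frames and apply a Hamming window."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return frames * np.hamming(frame_len)

def hz_to_mel(f):
    """Eq. (2): mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=26, n_fft=512, sr=16000):
    """Triangular filters spaced evenly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        fb[i - 1, bins[i - 1]:bins[i]] = np.linspace(0, 1, bins[i] - bins[i - 1], endpoint=False)
        fb[i - 1, bins[i]:bins[i + 1]] = np.linspace(1, 0, bins[i + 1] - bins[i], endpoint=False)
    return fb

def mfcc(signal, n_coeffs=13, n_fft=512):
    """Pre-emphasis -> framing/windowing -> FFT -> mel filter bank -> log -> DCT."""
    frames = frame_and_window(preemphasis(signal))
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    log_mel = np.log(power @ mel_filterbank(n_fft=n_fft).T + 1e-10)
    return dct(log_mel, type=2, axis=1, norm="ortho")[:, :n_coeffs]

# One second of synthetic audio at 16 kHz -> a (frames x coefficients) matrix
features = mfcc(np.random.randn(16000))
```

Each row of `features` is then one MFCC vector that can be fed to the classifier’s input layer.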

Online Sequential Extreme Learning Machine

OSELM is considered a fast algorithm, and it is able to learn from the training data through a chunk-by-chunk mechanism with fixed or varying chunk lengths. The OSELM can be used to predict an unknown input. The OSELM algorithm has three layers of nodes: the input layer, the hidden layer, and the output layer. The input layer receives the extracted features, the hidden layer holds the biases, and the output layer produces the final classes of the algorithm. The output matrix (H) of the hidden layer is calculated using the following equation:

$$H = g\left( {W \cdot X + B} \right)$$

(3)

where W indicates the input weights that link the input layer to the hidden layer, X refers to the features extracted by MFCC in the input layer, and B indicates the biases of the hidden layer. The input weights (W) and hidden biases (B) are randomly generated in the range between − 1 and 1. For $N$ arbitrary distinct samples $\left( {x_{i} ,t_{i} } \right)$, where $x_{i} \in \mathbb{R}^{d}$ and $t_{i} \in \mathbb{R}^{m}$, single layer feedforward neural networks (SLFNs) with $n$ hidden nodes and the activation function $g(x)$ can be mathematically modeled using the following equation:

$$f_{n} \left( {x_{j} } \right) = \mathop \sum \limits_{i = 1}^{n} \beta_{i} \, g\left( {w_{i} \cdot x_{j} + b_{i} } \right) = t_{j} ,\quad j = 1,2, \ldots ,N$$

(4)

Further, Eq. (4) can be compacted and rewritten as follows:

$$H\beta = T$$

(5)

where:

$$H = \begin{pmatrix} g\left( {w_{1} \cdot x_{1} + b_{1} } \right) & \cdots & g\left( {w_{n} \cdot x_{1} + b_{n} } \right) \\ \vdots & \ddots & \vdots \\ g\left( {w_{1} \cdot x_{N} + b_{1} } \right) & \cdots & g\left( {w_{n} \cdot x_{N} + b_{n} } \right) \end{pmatrix}_{N \times n} ,$$

$$\beta = \begin{bmatrix} \beta_{1}^{T} \\ \vdots \\ \beta_{n}^{T} \end{bmatrix}_{n \times m} ,\quad T = \begin{bmatrix} t_{1}^{T} \\ \vdots \\ t_{N}^{T} \end{bmatrix}_{N \times m}$$

The output weights ($\hat{\beta}$) are then estimated according to the following equation:

$$\hat{\beta} = H^{\dagger} T$$

(6)

where $H^{\dagger}$ is the Moore–Penrose generalized inverse (pseudoinverse) of the hidden layer output matrix H, and it is calculated as follows:

$$H^{\dagger} = \left( {H^{T} H} \right)^{ - 1} H^{T}$$

(7)
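As a minimal NumPy illustration of this batch pseudoinverse solution, the sketch below generates random input weights and biases, computes the hidden-layer output, and solves for the output weights. The sigmoid activation, dimensions, and synthetic data are illustrative assumptions of ours:

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_hidden, m, N = 13, 50, 2, 200       # features, hidden nodes, classes, samples
X = rng.standard_normal((N, d))          # stand-in for MFCC feature vectors
T = np.eye(m)[rng.integers(0, m, N)]     # one-hot class targets

W = rng.uniform(-1, 1, (d, n_hidden))    # random input weights in [-1, 1]
B = rng.uniform(-1, 1, n_hidden)         # random hidden biases in [-1, 1]

H = 1.0 / (1.0 + np.exp(-(X @ W + B)))   # hidden-layer output with sigmoid g
beta = np.linalg.pinv(H) @ T             # output weights via the pseudoinverse of H

pred = H @ beta                          # network output approximating T
```

`np.linalg.pinv` computes the Moore–Penrose inverse directly, which is numerically safer than forming $(H^{T}H)^{-1}H^{T}$ explicitly.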

OSELM is executed to learn the training samples successively and incrementally. The learning process of OSELM consists of two steps: the initialization step and the sequential learning step. In the initialization step, the output matrix of the hidden layer $H_{0}$ and the initial output weights $\beta_{0}$ are calculated using the equations below:

$$H_{0} = g\left( {W \cdot X_{0} + B} \right)$$

(8)

$$P_{0} = \left( {H_{0}^{T} H_{0} } \right)^{ - 1}$$

(9)

$$\beta_{0} = P_{0} H_{0}^{T} T_{0}$$

(10)

In the sequential learning step, the output matrix of the hidden layer $H_{k + 1}$ is computed for the new chunk of samples in the same way as in Eq. (8). The output weight matrix $\beta_{k + 1}$ is then updated according to the following equations:

$$P_{k + 1} = P_{k} - P_{k} H_{k + 1}^{T} \left( {I + H_{k + 1} P_{k} H_{k + 1}^{T} } \right)^{ - 1} H_{k + 1} P_{k}$$

(11)

$$\beta_{k + 1} = \beta_{k} + P_{k + 1} H_{k + 1}^{T} \left( {T_{k + 1} - H_{k + 1} \beta_{k} } \right)$$

(12)

Then $k$ is set to $k + 1$, and the process returns to Eqs. (8), (11), and (12) to train the next chunk. When all samples are trained, the OSELM can be used to predict an unknown input vector. In the OSELM algorithm, the input layer is initialized randomly before further calculations are performed to obtain the output layer and the final results. Figure 4 shows the architecture of the OSELM algorithm, where the final classes are labeled as T0 and T1, which refer to pathological and healthy voices, respectively.

Fig. 4

Diagram of the OSELM algorithm [13]
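The two-step OSELM procedure above can be sketched in NumPy. This is a minimal sketch under our own assumptions (sigmoid activation, synthetic data, illustrative chunk sizes and dimensions), not the study's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_hidden, m = 13, 40, 2               # features, hidden nodes, classes

W = rng.uniform(-1, 1, (d, n_hidden))    # random input weights
B = rng.uniform(-1, 1, n_hidden)         # random hidden biases

def hidden(X):
    """Hidden-layer output H = g(W.X + B) with a sigmoid activation, Eq. (8)."""
    return 1.0 / (1.0 + np.exp(-(X @ W + B)))

# Initialization step on the first chunk (X0, T0)
X0 = rng.standard_normal((100, d))
T0 = np.eye(m)[rng.integers(0, m, 100)]
H0 = hidden(X0)
P = np.linalg.inv(H0.T @ H0)             # P0 = (H0^T H0)^-1
beta = P @ H0.T @ T0                     # beta0 = P0 H0^T T0

# Sequential learning step, chunk by chunk (Eqs. (11) and (12))
for _ in range(5):
    Xk = rng.standard_normal((20, d))
    Tk = np.eye(m)[rng.integers(0, m, 20)]
    Hk = hidden(Xk)
    I = np.eye(Hk.shape[0])
    P = P - P @ Hk.T @ np.linalg.inv(I + Hk @ P @ Hk.T) @ Hk @ P
    beta = beta + P @ Hk.T @ (Tk - Hk @ beta)
```

Because each update reuses only `P` and `beta`, no previously seen chunk needs to be stored, which is what makes the algorithm suitable for online learning.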

To standardize and make our results comparable with other studies, we allocated 80% of the voice samples for training the OSELM algorithm, and the remaining 20% was used for testing the OSELM algorithm [13].
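An 80/20 split of the samples can be expressed as a small helper; the function name and fixed seed below are ours, added only for illustration:

```python
import numpy as np

def split_80_20(features, labels, seed=42):
    """Shuffle the samples and split them 80% training / 20% testing."""
    n = len(features)
    idx = np.random.default_rng(seed).permutation(n)
    cut = int(0.8 * n)
    train, test = idx[:cut], idx[cut:]
    return features[train], labels[train], features[test], labels[test]
```

Shuffling before splitting keeps the two subsets representative of both voice groups.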
