Sneaky emotions: impact of data partitions in affective computing experiments with brain-computer interfacing

Recently, considerable research effort has been devoted to Affective Computing [29, 36] in general, and to Brain-Computer Interfaces (BCI) [3, 40] in particular, within the context of emotion recognition with Machine Learning (ML) models. Specifically, researchers have proposed many approaches to collect, analyze, and model electroencephalogram (EEG) signals, with promising results in terms of classification performance; e.g. [1, 4, 5, 24, 25, 30, 39]. Furthermore, with the advent of Deep Learning, more advanced ML models have been proposed over the last few years, sometimes with impressively high recognition performance being reported. However, unlike in, for example, the Computer Vision community (e.g. [9, 23, 26, 28]), there is a lack of shared protocols and benchmarking practices in the BCI community, which makes the proposed approaches hardly comparable and neither promotes nor ensures the correctness of a given model or technique. Furthermore, the described experimental methodology quite often lacks details or is ambiguous, which leaves us wondering to what extent the reported performance results have been achieved under fair experimental conditions. Ultimately, this status quo does not help researchers build on previous work or select the most adequate modeling technique. Therefore, raising awareness of these issues can contribute to improved research practices as well as clearer and more realistic expectations of the potential and current limitations of BCI-based emotion recognition.

Certainly, emotion recognition using BCI signals is a challenging problem, especially when it comes to understanding affective responses towards dynamic contents such as videos, mainly because of the high inter-subject and intra-subject variability [37] and the dynamic nature of videos [18]. Some of these difficulties have lately been addressed with techniques such as contrastive learning [34] or domain adaptation [8], which follow the common idea of explicitly bringing together learned representations of brain signals corresponding to similar emotional responses, even when they come from different subjects.

In the literature, three data regimes are typically considered in affective modeling problems: subject-dependent, subject-independent, and cross-subject. Subject-dependent is considered the most favorable condition, since a personalized ML model is trained on subject-specific data and only data from the very same subject is used for testing the model; the highest performance is usually achieved under this condition. In the subject-independent case, a single model is learned with data from several subjects, whose data are combined during training and testing. Subject-independent is considered more challenging but also more realistic than the subject-dependent regime, and the reported model performance is correspondingly lower. However, how much data from each test subject is used in training is critical to understand whether the achieved performance reflects the generalization ability of the proposed ML model or merely the amount of test-subject data seen during model training. Finally, the cross-subject scenario is considered the hardest and most useful in practice, since the ML models are tested on data from subjects that were never seen during model training. The sketch below makes the three regimes concrete.
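To illustrate the three regimes (this is our own minimal sketch with scikit-learn splitters, not the experimental code of any cited work; `X` and `subjects` are assumed placeholders for per-trial features and subject IDs):

```python
import numpy as np
from sklearn.model_selection import KFold, GroupKFold

rng = np.random.default_rng(0)
n_trials, n_features, n_subjects = 400, 32, 10
X = rng.normal(size=(n_trials, n_features))          # one row per trial
subjects = rng.integers(0, n_subjects, size=n_trials)  # subject ID per trial

# Subject-dependent: one personalized model per subject, trained and
# tested on that subject's own trials only.
for s in range(n_subjects):
    idx = np.flatnonzero(subjects == s)
    for train, test in KFold(n_splits=5).split(idx):
        pass  # fit/evaluate a personalized model on idx[train], idx[test]

# Subject-independent: a single model; random folds may place trials of
# the same subject in both the training and the test partitions.
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    pass  # some data of each test subject is typically seen in training

# Cross-subject: folds are grouped by subject, so test subjects are
# never seen during training.
for train, test in GroupKFold(n_splits=5).split(X, groups=subjects):
    assert not set(subjects[train]) & set(subjects[test])
```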

Another critical factor that makes emotion recognition using BCI signals challenging is the size of the datasets. BCI datasets are usually small, due to the cost of acquiring these signals. This has an impact on the kind of ML models that can be used, since, for example, (deep) neural networks typically require many training instances to avoid overfitting. To alleviate this issue, researchers have considered different temporal segments (or chunks) of the BCI signals as independent data points for ML model development. While this certainly increases the number of training and testing samples, it introduces a potential data leakage issue, because neighboring segments are expected to be similar: ML models end up being tested on samples that are very similar to those seen during training. This problem is further exacerbated when the segments overlap, as the sketch below illustrates.
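As an illustration of this leakage (a minimal sketch under assumed window parameters, not the actual preprocessing of any cited work), consider segmenting a trial of 8064 samples with 50%-overlapping windows and then splitting the segments at random:

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples, win, hop = 8064, 512, 256                 # 50% window overlap
starts = np.arange(0, n_samples - win + 1, hop)      # segment start indices

# Random segment-level split, as commonly done to enlarge BCI datasets.
train, test = train_test_split(np.arange(len(starts)),
                               test_size=0.25, random_state=0)

# Count test segments whose immediate neighbor (which shares half of its
# raw samples with them) ended up in the training partition.
train_set = set(train)
leaky = sum((t - 1 in train_set) or (t + 1 in train_set) for t in test)
print(f"{leaky}/{len(test)} test segments overlap a training segment")
```

With a random split, nearly every test segment has an overlapping neighbor in the training partition, so the two partitions are far from independent.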

In this paper, we provide a rigorous analysis of these data partitioning issues. We introduce the “data transfer rate” construct (i.e., how much of the test data is seen during model training) and use it to examine data partitioning effects under several conditions. As a use case, we consider EEG signals and videos as input stimuli. First, we study subject-independent data splits, which are relevant for generalized ML models of affective decoding. Second, we study video-independent data splits, which are relevant for affective annotation of multimedia contents. Third, we study time-based data splits, which are relevant for preprocessing and feature extraction in ML. Taken together, our results show that (1) for affective decoding, it is hard to achieve recognition performance above the baseline (random classification) unless some data of the test subjects are included in the training partition; (2) for affective annotation, having data from the same subject in the training and test partitions, even when they correspond to different videos, slightly increases performance; and (3) later signal segments are generally more discriminative, but it is the number of segments (data points) that matters the most for improving performance. Our findings have implications not only for how BCI signals are managed, but also for how experimental conditions and results are reported in academic papers.
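As a rough formalization (the helper below is our hypothetical, subject-level reading of the construct, which the text describes only informally), the data transfer rate can be computed as the fraction of the test subjects' trials that appear in the training partition:

```python
import numpy as np

def data_transfer_rate(subjects, train_idx, test_idx):
    """Fraction of the test subjects' data seen during training
    (hypothetical subject-level reading of "data transfer rate")."""
    test_subjects = np.unique(subjects[test_idx])
    owned_by_test = np.isin(subjects, test_subjects)            # all their trials
    seen_in_train = np.isin(subjects[train_idx], test_subjects) # leaked into training
    return seen_in_train.sum() / owned_by_test.sum()

# A cross-subject split yields 0.0, while a fully shuffled
# subject-independent split yields roughly the training fraction.
```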

1.1 Related work

The following literature overview is not meant to be exhaustive, given the large body of research on emotion recognition with BCI devices, but to illustrate the range of reported model performances in order to contextualize the results of our analysis. As indicated before, we consider EEG signals and videos as input stimuli. We focus on a very popular dataset (DEAP) [19] and on the most popular ML task: binary classification of valence [22, 32, 33]. Valence is a positive or negative quantification of affective appraisal, i.e., the degree to which an emotion has a pleasant or unpleasant quality [12].

In subject-independent experiments, \(89.83\%\) accuracy is reported by Galvão et al. [13] using a k-NN regressor in a 10-fold cross-validation setting. Keelawat et al. [17] tested Convolutional Neural Networks (CNNs) of 3–7 layers and achieved \(86.87\%\) accuracy with 6 layers and 10-fold cross-validation, and \(68.75\%\) accuracy with 4 layers and leave-one-subject-out validation. Yin et al. [39] combined graph-based CNNs and long short-term memory (LSTM) cells, achieving \(84.81\%\) accuracy. Huang et al. [16] developed a CNN that exploited the differences in patterns between the left and right brain hemispheres, achieving \(68.14\%\) accuracy. Du et al. [8] applied attention to the output of an LSTM for the automatic selection of the emotion-relevant EEG channels, and obtained \(69.06\%\) accuracy. Classification accuracy higher than \(99\%\) is reported with a combination of a Deep CNN (DCNN) and a Support Vector Machine [30]. With a spatio-temporal-spectral network, an accuracy of \(69.38\%\) is obtained [21]. Finally, Xu et al. [38] reported an accuracy of \(67.36\%\) using a combination of Gated Recurrent Unit (GRU) cells and a CNN.

Towards the ideal scenario of calibration-free emotion recognition, where no brain data from a target subject would be required in advance, a few-shot learning study by Bhosale et al. [4] reported average few-shot classification accuracy ranging from \(67.24\%\) (under 5-shot and random sampling) to \(78.12\%\) (under 25-shot and subject-dependent sampling). In a zero-calibration setup, accuracy ranged from \(62.98\%\) (5-shot, subject-dependent) to \(71.68\%\) (25-shot, subject-independent).

In cross-subject experiments, an average accuracy of \(79.99\%\) has been reported by Gupta et al. [14]. Liu et al. [24] explored domain adaptation through subject clustering, achieving an accuracy of \(73.9\%~(\pm 13.54\%)\).

Table 1. Binary valence classification performance on the DEAP dataset over the last 5 years

While these results provide a rough idea of the performance range of state-of-the-art methods, they also highlight a significant variability between them and an unclear trend over the years (Table 1). As a consequence, it is difficult to understand the relationship between model complexity and achieved performance, and hard to judge whether performance differences are attributable to improvements in data preprocessing or feature extraction techniques, to the particular ML approach, or to the data splits used. To shed light on this matter, in this work we hold the data processing and the ML model constant, and conduct a careful analysis of the relationship between data splits and recognition performance.
