On the pitfalls of Batch Normalization for end-to-end video learning: A study on surgical workflow analysis

Batch Normalization (BatchNorm/BN) (Ioffe and Szegedy, 2015) is a highly effective regularizer in visual recognition tasks and is ubiquitous in modern Convolutional Neural Networks (CNNs) (He et al., 2016, Szegedy et al., 2016, Tan and Le, 2019). It is, however, also considered a source of silent performance drops and bugs due to its unique property of depending on other samples in the batch and the assumptions tied to this (Brock et al., 2021, Wu and He, 2018, Wu and Johnson, 2021). Most notably, BatchNorm assumes that batches are a good approximation of the training data, and it only performs well when batches are large enough and sampled i.i.d. (Ioffe and Szegedy, 2015, Wu and Johnson, 2021).
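This cross-sample dependence can be made concrete with a minimal NumPy re-implementation of train-mode BatchNorm (a sketch with illustrative values; the learnable affine parameters are omitted): changing a different sample in the batch changes the normalized output of a sample that itself stayed fixed.

```python
import numpy as np

def batchnorm_train(x, eps=1e-5):
    # Normalize each feature with statistics computed over the batch axis,
    # as BatchNorm does in training mode (affine scale/shift omitted).
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch = rng.normal(size=(4, 3))        # 4 samples, 3 features
out_a = batchnorm_train(batch)

batch_b = batch.copy()
batch_b[3] += 10.0                     # perturb a *different* sample; sample 0 is untouched
out_b = batchnorm_train(batch_b)

# Sample 0's normalized representation still changes, because the
# batch statistics changed: the cross-sample dependence discussed above.
print(np.allclose(out_a[0], out_b[0]))  # False
```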

This generally does not hold for sequence learning, where batches contain highly correlated, sequential samples, and has led to the use of alternatives such as LayerNorm (LN) (Ba et al., 2016) in NLP (Shen et al., 2020, Vaswani et al., 2017). In video learning, BN has been studied less (Cai et al., 2021, Wu and He, 2018), despite the use of BN-based CNNs.
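The contrast with LayerNorm can be sketched the same way: LayerNorm normalizes each sample over its own feature axis, so its output is independent of which other samples happen to share the batch (minimal NumPy sketch with illustrative values, affine parameters omitted).

```python
import numpy as np

def layernorm(x, eps=1e-5):
    # Normalize each sample over its own feature axis (LayerNorm),
    # so the result never depends on other samples in the batch.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
seq = rng.normal(size=(4, 3))          # e.g. 4 correlated sequence elements
out_a = layernorm(seq)

seq_b = seq.copy()
seq_b[3] += 10.0                       # perturb a different sample
out_b = layernorm(seq_b)

# Unlike with BatchNorm, sample 0 is unaffected.
print(np.allclose(out_a[0], out_b[0]))  # True
```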

In natural-video tasks, CNNs are generally used only as off-the-shelf, pretrained extractors of image- or clip-wise features (e.g. Carreira and Zisserman (2017)). Only the temporal model, which typically does not contain BN, is trained to aggregate features over time (Abu Farha et al., 2020, Farha and Gall, 2019, Huang et al., 2020, Ishikawa et al., 2021, Ke et al., 2019, Sener et al., 2020, Wang et al., 2020, Yi et al., 2021). However, in specialized small-data domains such as surgical video, well-pretrained CNNs may not be available (Czempiel et al., 2022, Zhang et al., 2022), requiring CNNs to be finetuned, either through 2-stage (Czempiel et al., 2020) or end-to-end (E2E) training (Jin et al., 2017). The latter seems preferable as it enables joint learning of visual and temporal features, especially since spatio-temporal feature extractors (e.g. 3D CNNs) have not been effective on small-scale surgical datasets (Czempiel et al., 2022, Zhang et al., 2022). However, BN layers in CNNs pose obstacles for end-to-end learning.

We hypothesize that BN’s problems with correlated, sequential samples have silently steered research in video-based surgical workflow analysis (SWA) in a sub-optimal direction. The focus has shifted towards developing sophisticated temporal models that operate on extracted image features, as in the natural-video domain, replacing end-to-end learning with complex multi-stage training procedures in which each component (CNN, LSTM (Hochreiter and Schmidhuber, 1997), TCN (Farha and Gall, 2019), Transformer (Vaswani et al., 2017), etc.) is trained individually (Bano et al., 2020, Gao et al., 2021, Kannan et al., 2019, Marafioti et al., 2021, Yuan et al., 2021). We argue that even simple CNN–LSTM models can often outperform these methods when BN-free backbones are used and the model is trained end to end.
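The shape of such a model can be illustrated with a minimal PyTorch sketch (not the exact architecture evaluated here; all layer sizes are illustrative assumptions): a small BN-free backbone, here using GroupNorm as a batch-independent normalizer, feeds a causal LSTM that aggregates features over time and emits per-frame predictions.

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    """Sketch of an end-to-end CNN-LSTM for frame-wise video labeling."""

    def __init__(self, n_classes=7, feat_dim=32):
        super().__init__()
        self.backbone = nn.Sequential(            # BN-free visual feature extractor
            nn.Conv2d(3, 16, 3, stride=2, padding=1),
            nn.GroupNorm(4, 16),                  # batch-independent normalization
            nn.ReLU(),
            nn.Conv2d(16, feat_dim, 3, stride=2, padding=1),
            nn.GroupNorm(4, feat_dim),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.lstm = nn.LSTM(feat_dim, 64, batch_first=True)
        self.head = nn.Linear(64, n_classes)

    def forward(self, clips):                     # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1)).flatten(1)  # (B*T, feat_dim)
        hidden, _ = self.lstm(feats.view(b, t, -1))            # causal temporal aggregation
        return self.head(hidden)                  # per-frame logits: (B, T, n_classes)

model = CNNLSTM()
logits = model(torch.randn(2, 5, 3, 64, 64))      # 2 clips of 5 frames each
print(logits.shape)                               # torch.Size([2, 5, 7])
```

Because gradients flow from the temporal head back into the backbone, visual and temporal features are learned jointly, which is exactly what multi-stage pipelines give up.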

We investigate BatchNorm’s pitfalls in end-to-end learning on two surgical workflow analysis (SWA) tasks: surgical phase recognition (Twinanda et al., 2016) and instrument anticipation (Rivoir et al., 2020). We choose these tasks for two reasons. First, the lack of well-pretrained feature extractors and the ineffectiveness of 3D CNNs make end-to-end approaches necessary in SWA, so BN issues are most relevant here. Second, SWA is one of the most active research areas for online video understanding, where models are constrained to access only past frames, which causes additional BN problems regarding leakage of future information.
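The leakage problem can be demonstrated with the same minimal NumPy sketch of train-mode BatchNorm (illustrative values, affine parameters omitted): when a batch holds consecutive frames of one video, perturbing only the last frame changes the normalized features of the first frame, i.e. information flows backwards in time through the batch statistics.

```python
import numpy as np

def batchnorm_train(x, eps=1e-5):
    # Train-mode BatchNorm over the batch axis (affine scale/shift omitted).
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(1)
clip = rng.normal(size=(8, 3))         # 8 consecutive frames batched together
out_a = batchnorm_train(clip)

clip_b = clip.copy()
clip_b[7] += 10.0                      # change only the *last* frame
out_b = batchnorm_train(clip_b)

# Frame 0's feature changes although only a future frame differs:
# future information leaks into past predictions via the batch statistics.
print(np.allclose(out_a[0], out_b[0]))  # False
```

In an online setting, where frame 0's prediction must not depend on frames that have not happened yet, this amounts to the "cheating" behavior analyzed below.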

Our contributions can be summarized as follows:

1. We challenge the predominance of multi-stage approaches in surgical workflow analysis (SWA) and show that awareness of BN’s pitfalls is crucial for effective end-to-end learning.

2. We provide a detailed literature review showing BN’s impact on existing training strategies in surgical workflow analysis.

3. We analyze when BN issues occur, including problems specific to online tasks such as “cheating” in anticipation.

4. Leveraging these insights, we show that even simple, end-to-end CNN–LSTM models can be highly effective when BN is avoided and strategies which maximize temporal context are employed.

5. These CNN–LSTMs beat the state of the art on three surgical workflow benchmarks.

6. We reproduce our findings on natural-video datasets to show our study’s potential for wider impact.
