Alejandra Pérez, Santiago Rodríguez, Nicolás Ayobi, Nicolás Aparicio, Eugénie Dessevres, Pablo Arbeláez
MICCAI (2024)
Abstract
Phase recognition in surgical videos is crucial for enhancing
computer-aided surgical systems as it enables automated understanding
of sequential procedural stages. Existing methods often rely on fixed
temporal windows for video analysis to identify dynamic surgical phases.
Thus, they struggle to simultaneously capture the short-, mid-, and
long-term information necessary to fully understand complex surgical
procedures. To address this limitation, we propose Multi-Scale
Transformers for Surgical Phase Recognition (MuST), a novel
Transformer-based approach that combines a Multi-Term Frame Encoder with
a Temporal Consistency Module to capture information across multiple
temporal scales of a surgical video. Our Multi-Term Frame Encoder
computes interdependencies across a hierarchy of temporal scales by
sampling sequences at increasing strides around the frame of interest.
Furthermore, we employ
a long-term Transformer encoder over the frame embeddings to further
enhance long-term reasoning. MuST achieves higher performance than
previous state-of-the-art methods on three different public benchmarks.
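
The multi-scale sampling described above can be illustrated with a minimal sketch. It only shows how frame indices might be gathered at increasing strides around a frame of interest. The function name, the window length, the stride values, and the clamping at video boundaries are all assumptions made for illustration and are not taken from the paper.

```python
def multi_term_indices(center, num_frames, window=4, strides=(1, 2, 4, 8)):
    """Return one list of frame indices per stride, centered on `center`.

    Hypothetical helper: `window` frames are taken per temporal scale, and
    indices that fall outside the video are clamped to the nearest valid
    frame (an assumed boundary policy).
    """
    if not 0 <= center < num_frames:
        raise ValueError("center must be a valid frame index")
    half = window // 2
    sequences = []
    for stride in strides:
        indices = []
        for k in range(window):
            # Offset relative to the frame of interest, scaled by the stride.
            idx = center + (k - half) * stride
            # Clamp to the valid range [0, num_frames - 1].
            indices.append(min(max(idx, 0), num_frames - 1))
        sequences.append(indices)
    return sequences


if __name__ == "__main__":
    # Small strides cover short-term context, large strides long-term context.
    for stride, seq in zip((1, 2, 4, 8), multi_term_indices(100, 1000)):
        print(stride, seq)
```

Each returned sequence has the same length, so every temporal scale yields the same number of frames while covering a progressively wider span of the video.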