Proceedings of the International Conference on Digital Manufacturing –
Volume 2
As illustrated in Figure 22, the input comprises high-resolution Pap
smear images that are first partitioned into non-overlapping patches.
These patches undergo linear embedding, preparing them for
subsequent processing. The embeddings are then passed sequentially
through a hierarchical Swin Transformer backbone structured into
four stages. Each stage contains multiple Swin Transformer blocks
that combine masked attention mechanisms with strategic
patch-merging operations. This hierarchical design captures both
the local and global morphological features crucial for
distinguishing between the different cervical cell classes.
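The patch-partition and linear-embedding step described above can be sketched in a few lines. The sketch below is a minimal NumPy illustration, not the paper's implementation; the 224×224 input size, 4×4 patch size, and 96-dimensional embedding are assumptions taken from the standard Swin Transformer configuration, and the projection weights are random purely for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": 224x224 RGB (sizes assumed from the standard Swin-T config).
H = W = 224
patch, channels, embed_dim = 4, 3, 96

image = rng.standard_normal((H, W, channels))

# Partition into non-overlapping 4x4 patches, each flattened
# to 4*4*3 = 48 raw values.
patches = image.reshape(H // patch, patch, W // patch, patch, channels)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * channels)

# Linear embedding: one projection matrix shared by all patches
# (random here; learned in the actual model).
W_embed = rng.standard_normal((patch * patch * channels, embed_dim))
tokens = patches @ W_embed

print(patches.shape)  # (3136, 48): a 56x56 grid of flattened patches
print(tokens.shape)   # (3136, 96): one 96-d token per patch
```

The resulting token sequence is what the four-stage backbone consumes; each patch-merging step later concatenates 2×2 neighbouring tokens and projects them down, halving the spatial resolution per stage.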
Following feature extraction, Mask2Former is integrated as a
visualisation tool that highlights and delineates the learned
discriminative regions within cervical cell images. This capability
enhances model interpretability by associating specific cellular
structures with predicted categories, providing clinicians with
intuitive visual insight into model predictions and supporting
clinical decision-making.
Standard Transformer Architecture
The traditional Transformer encoder comprises a stack of N
identical layers. As shown in Figure 23, each layer consists of two
main components: multi-head self-attention (MSA) and a multi-
layer perceptron (MLP). Layer Normalisation (LN) is applied
before each sub-module, and a residual connection is added around
each to preserve gradient flow. The computations within the l-th
encoder block are defined in Equations 1 and 2:
$\hat{z}_l = \text{MSA}(\text{LN}(z_{l-1})) + z_{l-1}$ (1)

$z_l = \text{MLP}(\text{LN}(\hat{z}_l)) + \hat{z}_l$ (2)
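Equations 1 and 2 can be traced end-to-end in a compact NumPy sketch. This is an illustrative pre-LN encoder block under assumed toy dimensions (16 tokens, 32-d features, 4 heads) with random weights; it uses ReLU in the MLP for brevity where practical Transformers typically use GELU, and omits dropout and biases.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    # LN over the feature dimension
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def msa(x, Wq, Wk, Wv, Wo, heads):
    # Multi-head self-attention over all n tokens (global, hence O(n^2))
    n, d = x.shape
    dh = d // heads
    q = (x @ Wq).reshape(n, heads, dh).transpose(1, 0, 2)
    k = (x @ Wk).reshape(n, heads, dh).transpose(1, 0, 2)
    v = (x @ Wv).reshape(n, heads, dh).transpose(1, 0, 2)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)
    return out @ Wo

def mlp(x, W1, W2):
    return np.maximum(x @ W1, 0) @ W2  # ReLU here; GELU in practice

def encoder_block(z, p, heads=4):
    z_hat = msa(layer_norm(z), p["Wq"], p["Wk"], p["Wv"], p["Wo"], heads) + z  # Eq. (1)
    return mlp(layer_norm(z_hat), p["W1"], p["W2"]) + z_hat                    # Eq. (2)

n, d = 16, 32  # toy sizes, assumed for illustration
params = {k: rng.standard_normal((d, d)) * 0.02 for k in ("Wq", "Wk", "Wv", "Wo")}
params["W1"] = rng.standard_normal((d, 4 * d)) * 0.02
params["W2"] = rng.standard_normal((4 * d, d)) * 0.02

z = rng.standard_normal((n, d))
out = encoder_block(z, params)
print(out.shape)  # (16, 32): same shape as the input, thanks to the residuals
```

The residual additions in both equations guarantee the output shape matches the input, which is what lets N such blocks be stacked.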
However, standard MSA computes attention between all tokens
globally, so its computational complexity grows quadratically with
the number of tokens. This is inefficient for high-resolution image
processing tasks such as cervical cell image
classification. To address these limitations, the Swin Transformer