
Proceedings of the International Conference on Digital Manufacturing – Volume 2

As illustrated in Figure 22, the input comprises high-resolution Pap smear images that are first partitioned into non-overlapping patches. These patches undergo linear embedding, preparing them for subsequent processing. The embeddings are then passed sequentially through a hierarchical Swin Transformer backbone structured into four stages. Each stage features multiple Swin Transformer blocks incorporating masked attention mechanisms and strategic patch-merging operations. This hierarchical design effectively captures both the local and global morphological features crucial for distinguishing between different cervical cell classes. Following feature extraction, Mask2Former is integrated as a visualisation tool: it highlights and delineates the learned discriminative regions within cervical cell images. This visualisation capability significantly enhances model interpretability by clearly associating specific cellular structures with predicted categories, thus providing clinicians with intuitive visual insights into model predictions and aiding clinical decision-making.
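The front end of the pipeline described above — partitioning an image into non-overlapping patches and projecting each flattened patch through a learned linear layer — can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the patch size (4), image size (64×64×3), and embedding dimension (96) are assumed values chosen to mirror typical Swin Transformer defaults, and the projection matrix stands in for a trained embedding layer.

```python
import numpy as np

def patchify(image, patch=4):
    """Split an H x W x C image into non-overlapping patch x patch
    blocks, each flattened to a vector of length patch*patch*C."""
    H, W, C = image.shape
    blocks = image.reshape(H // patch, patch, W // patch, patch, C)
    blocks = blocks.transpose(0, 2, 1, 3, 4)          # group by grid position
    return blocks.reshape(-1, patch * patch * C)       # (num_patches, patch_dim)

rng = np.random.default_rng(0)
img = rng.standard_normal((64, 64, 3))                 # stand-in for a Pap smear image
tokens = patchify(img)                                 # (256, 48): 16x16 grid of patches
W_embed = rng.standard_normal((48, 96))                # stand-in for the learned projection
embeddings = tokens @ W_embed                          # (256, 96): one token per patch
```

The resulting token sequence is what the four-stage backbone consumes; patch merging inside the backbone then repeatedly downsamples this grid while widening the channel dimension.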

               Standard Transformer Architecture

The traditional transformer encoder comprises a stack of N identical layers. As shown in Figure 23, each layer consists of two main components: multi-head self-attention (MSA) and a multi-layer perceptron (MLP). Layer Normalisation (LN) is applied before each sub-module, and a residual connection is placed around each sub-module to preserve gradient flow. The computations within the l-th encoder block are defined in Equations (1) and (2):

    ẑ_l = MSA(LN(z_{l−1})) + z_{l−1}          (1)

    z_l = MLP(LN(ẑ_l)) + ẑ_l                  (2)
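Equations (1) and (2) can be traced directly in code. The following is a minimal NumPy sketch of one encoder block under simplifying assumptions: single-head attention rather than multi-head, a two-layer ReLU MLP, and illustrative dimensions; the weight matrices stand in for learned parameters.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalise each token vector to zero mean, unit variance (LN)
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(x, Wq, Wk, Wv):
    # single-head self-attention (MSA reduced to one head for brevity)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])   # all-pairs token similarities
    return softmax(scores) @ v

def encoder_block(z, Wq, Wk, Wv, W1, W2):
    z_hat = attention(layer_norm(z), Wq, Wk, Wv) + z          # Eq. (1)
    z_out = np.maximum(layer_norm(z_hat) @ W1, 0) @ W2 + z_hat  # Eq. (2)
    return z_out

rng = np.random.default_rng(0)
d = 8                                          # illustrative token dimension
z = rng.standard_normal((16, d))               # 16 tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
W1, W2 = rng.standard_normal((d, 4 * d)), rng.standard_normal((4 * d, d))
z_next = encoder_block(z, Wq, Wk, Wv, W1, W2)  # same shape as the input tokens
```

Note that the `scores` matrix in `attention` is of size (tokens × tokens) — this all-pairs computation is exactly the quadratic cost discussed next.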

However, standard MSA computes relationships between all tokens globally, resulting in computational complexity that is quadratic in the number of tokens. This is inefficient for high-resolution image processing tasks such as cervical cell image classification. To address these limitations, the Swin Transformer


