Multiscale Feature Learning Using Co-Tuplet Loss for Offline Handwritten Signature Verification

Fu-Hsien Huang and Hsin-Min Lu

The authors are extremely grateful for the work of Chia-Chun Ku during the early stages of the research. (Corresponding author: Hsin-Min Lu.)
Fu-Hsien Huang is with the Department of Information Management, National Taiwan University, Taipei 106, Taiwan (e-mail: [email protected]).
Hsin-Min Lu is with the Department of Information Management, and the Center for Research in Econometric Theory and Applications, National Taiwan University, Taipei 106, Taiwan (e-mail: [email protected]).
Abstract

Handwritten signature verification, crucial for legal and financial institutions, faces challenges including inter-writer similarity, intra-writer variations, and limited signature samples. To address these, we introduce a MultiScale Signature feature learning Network (MS-SigNet) with a novel metric learning loss called the co-tuplet loss, designed for offline handwritten signature verification. MS-SigNet learns both global and regional signature features from multiple spatial scales, enhancing feature discrimination. This approach effectively distinguishes genuine signatures from skilled forgeries by capturing overall strokes and detailed local differences. The co-tuplet loss, focusing on multiple positive and negative examples, overcomes the limitations of typical metric learning losses by addressing inter-writer similarity and intra-writer variations and emphasizing informative examples. We also present HanSig, a large-scale Chinese signature dataset (available at https://github.com/hsinmin/HanSig) to support robust system development. Experimental results on four benchmark datasets in different languages demonstrate the promising performance of our method in comparison to state-of-the-art approaches.

I Introduction

Handwritten signature verification aims to recognize individuals’ signatures for identity verification. This biometric verification approach is commonly accepted by government agencies and financial institutions [1]. Handwritten signature verification systems can be classified into two categories based on the signature acquisition device used: online and offline signature verification. Online verification involves capturing dynamic characteristics of the signing process, such as velocity and pressure. In contrast, offline verification refers to the static verification of scanned digital signatures. Since offline verification lacks dynamic characteristics, distinguishing between genuine and forged signatures is inherently more challenging. Moreover, discriminating between genuine signatures and skilled forgeries is difficult due to the high level of imitation similarity (inter-writer similarity). Additionally, practical factors such as significant variations within an individual’s signatures (intra-writer variations or intra-personal variability) and the limited number of available signature samples further complicate the implementation of automatic verification systems [2].

Early offline signature verification systems primarily relied on manual feature extraction methods [3, 4]. To address the challenge of intra-writer variations, capturing regional information from local signature regions has been proposed to provide details for static verification [5, 6, 7]. However, the traditional process involves sequentially applying independent steps of manual regional feature extraction, region similarity estimation, and similarity verification. These methods overlook the interdependencies between feature extraction and similarity measurement, resulting in suboptimal performance. In recent years, some studies [2, 8, 9, 10, 11, 12] have proposed the adoption of convolutional neural network (CNN)-based metric learning methods to integrate similarity measurement into automatic feature learning, overcoming the limitations of manual feature extraction. However, existing methods either learn from entire signature images or local regions, failing to exploit the complementary nature of global and regional information. Additionally, most of these methods trained with typical metric learning losses, such as contrastive loss [13] and triplet loss [14, 15], tend to suffer from slow convergence and bad local minima due to the pair or triplet sampling problem [16, 17].

To overcome the limitations of previous studies, we propose a MultiScale Signature feature learning Network (MS-SigNet) with a new metric learning loss called co-tuplet loss for offline signature verification. MS-SigNet simultaneously considers global and regional information in handwritten signatures, dividing deep feature maps into dual-orientation regions to learn local differences. Learning global representations aims to capture the overall information on signature strokes and configuration. However, given the high similarity between genuine signatures and skilled forgeries, it is also necessary to learn regional representations to explore local details. The proposed MS-SigNet can capture information from various spatial scales and integrate it to generate discriminative features. Additionally, we address the thin and sparse nature of signature strokes in images by aggregating information learned at different levels. We also introduce an attention module that guides the network to focus on important information by considering interactions between multiscale features. The design of our proposed signature verification system is based on the writer-independent (WI) approach. In contrast to the writer-dependent (WD) approach, WI has the advantages of leveraging information across signatures from different writers and requiring no system updates for new writers.

Figure 1: Comparison between previous frameworks and our proposed framework for offline handwritten signature verification.

For effective similarity measurement, we propose the co-tuplet loss, a novel metric learning loss, to learn the distance metric for handwritten signature verification. The proposed co-tuplet loss aims to transform input features into a feature space where genuine signatures from the same writer are close to each other while corresponding forgeries are far away from genuine ones. Unlike the typical triplet loss [14, 15], the proposed co-tuplet loss simultaneously considers multiple genuine signatures and multiple forgeries to learn similarity metrics. This effectively addresses issues related to intra-writer variability and inter-writer similarity. Additionally, we emphasize the importance of batch construction and example selection to focus the training process on informative examples. It is worth noting that our approach combines the feature learning and similarity measurement steps and can be optimized end-to-end. Fig. 1 depicts the comparison between previous frameworks and our proposed framework for offline handwritten signature verification.

For training and evaluation, we utilize four offline handwritten signature datasets in different languages. Three of these datasets are publicly available: CEDAR [18], BHSig-Bengali [19], and BHSig-Hindi [19]. To address the lack of large-scale public offline Chinese signature datasets, we create the HanSig dataset, consisting of 35,400 signature samples from 238 writers. Experimental results demonstrate the promising performance of our proposed MS-SigNet with co-tuplet loss compared to state-of-the-art methods.

In summary, our contributions can be listed as follows:

  • We propose a multiscale feature learning method to generate discriminative features for offline handwritten signature verification. To the best of our knowledge, this is the first study that adopts deep end-to-end learning to automatically learn information at multiple spatial scales and integrate it for signature verification.

  • We propose a new metric learning loss that enhances discriminative capability and facilitates better convergence of our network. This loss enables the training process to pay attention to informative examples and effectively tackles challenges associated with intra-writer signing variation and inter-writer similarity, resulting in improved performance.

  • Considering that few large-scale Chinese signature datasets are publicly available, we present the HanSig dataset, a large-scale offline Chinese signature dataset. Such datasets, which consider writers’ signing variations, are crucial for developing robust verification systems for this script.

II Related Research

II-A Offline handwritten signature verification

Given the wide use of offline handwritten signatures, many new approaches for offline signature verification have been developed in the last ten years [20]. Most early work [3, 4] relied on manual feature extraction methods to capture signature stroke variations. However, the feature extraction process is easily disturbed by noise, leading to a limited capacity to extract complex features [21]. Recently, there has been a growing interest in utilizing automatic feature extraction methods, particularly CNNs, to learn representations from signature images. These methods have effectively overcome the limitations of manual feature extraction. Several studies [22, 23, 24] employed CNNs as feature extractors, followed by training separate classifiers for forgery detection. Wei et al. [25] introduced a four-stream CNN to focus on the sparse stroke information.

For the improvement of offline signature verification, several studies have concentrated on regional information and local details to capture static properties. To address intra-writer variations, Pirlo and Impedovo [5] and Malik et al. [6] discovered that stable signature regions exhibit similar patterns among signatures from the same signer. Sharif et al. [7] combined global features with local features from 16 image parts. While these methods provided additional information for static signature verification, the separation of manual feature extraction and similarity measurement did not guarantee optimal performance. Liu et al. [2] proposed a region-based deep learning network that solely used local regions as inputs to obtain signature features.

In contrast to previous works, we propose an offline signature verification system that automatically learns feature representations from both the entire image and local regions. Unlike Liu et al. [2], our method combines global and regional information and divides deep feature maps into dual-orientation regions. This enables us to aggregate features from multiple scales and improve robustness against misalignment issues. Moreover, our system integrates similarity measurement with feature learning, thereby enhancing the entire training process compared to previous methods [5, 6, 7, 22, 23, 24].

II-B Metric learning-based methods

Recently, there has been an increased focus on employing metric learning-based methods to learn similarity and dissimilarity for feature representations. The objective of these methods is to learn a good distance metric that transforms input features into a new feature space, where instances belonging to the same class are close together and those from different classes are far apart [26, 27]. Commonly used metric learning functions for learning pairwise similarities include contrastive loss [13] and triplet loss [14, 15], among others. Deep learning methods that integrate metric learning into feature learning have found wide applications in various domains, including face recognition [28] and person re-identification [27].

In the field of offline handwritten signature systems, metric learning-based methods have also shown promising results. Soleimani et al. [26] and Rantzsch et al. [8] were among the early researchers who introduced metric learning into signature verification. Dey et al. [9], Xing et al. [10], and Liu et al. [2] employed the Siamese network [29] for metric learning. Some studies have proposed improvements to existing metric learning structures to enhance the robustness of offline signature verification. For instance, Maergner et al. [11] combined a triplet loss-based CNN with the graph edit distance approach. Wan and Zou [12] and Zhu et al. [30] respectively developed a dual triplet loss and a point-to-set (P2S) metric to improve discrimination between genuine signatures and skilled forgeries.

Previous research on handwritten signature verification mainly employed typical metric learning losses or developed improved losses based on similar concepts. However, these losses often suffer from unstable and slow convergence due to the inherent sampling problem [16, 17]. To address these limitations of typical metric learning losses, we propose a new metric learning loss. This loss shares similarities with previous tuplet-based losses such as the multi-class N-pair loss [16] and the tuplet margin loss [31]. However, we introduce a unique example selection and mining strategy specifically tailored for the signature verification task to facilitate better convergence.

Figure 2: Overall architecture of MS-SigNet. Each set of the corresponding features generated from the proposed network is trained with individual co-tuplet losses.

II-C Main public offline signature datasets

According to Hameed et al. [21], the CEDAR [18], GPDS [32], and MCYT-75 [33] datasets are among the most commonly adopted Western signature datasets. UTSig [34] is a frequently used Persian signature dataset. Additionally, the BHSig260 dataset [19] offers two subsets comprising signatures in Bengali and Hindi languages. However, when it comes to offline Chinese handwritten signatures, there is a scarcity of publicly available datasets. Currently, the SigComp2011 [35] and ChiSig [36] datasets are the only existing public datasets for offline Chinese signatures. SigComp2011 has only 1,177 signature samples. In comparison, ChiSig is a new dataset that contains a more substantial number of samples, with a total of 10,242 signatures.

Considering that the characteristics of handwritten signatures differ across languages and scripts due to their unique writing styles [25], it is impractical to train a Chinese signature verification system using Western datasets. Moreover, the development of a realistic signature verification system requires the consideration of signature variability to avoid overfitting [20]. Therefore, we are motivated to create a new offline Chinese signature dataset that contains more samples and incorporates signature variability for each writer.

III Proposed Method

In this section, we introduce the MultiScale Signature feature learning Network (MS-SigNet), the handwritten signature verification method proposed in this study. Additionally, we introduce a novel metric learning loss called co-tuplet loss, which aims to improve the discriminative capability of the learned features for signature verification. Finally, we elaborate on the decision-making process in our signature verification system.

III-A Overall architecture

Fig. 2 illustrates the overall architecture of MS-SigNet. To integrate the discriminative signature information of different spatial scales, we propose to automatically learn robust feature representations from both the global and dual-orientation regional branches, which sets our approach apart from existing methods that rely on manual regional feature extraction [5, 6, 7] or focus solely on local regions [2]. We modify the structure of SigNet-F [22] as the CNN backbone to build our branches and modules. We add a rectified linear unit (ReLU) activation function and batch normalization (BN) after each convolutional layer to address issues such as vanishing gradients and overfitting. Consider a set of genuine signature images from a specific writer and their corresponding forged counterparts. We refer to this set as a “signature tuplet” in the subsequent discussion. Each input image, denoted as $x$, belonging to this signature tuplet passes through the base part to sequentially generate the output feature maps $F1$, $F2$, $F3$, and $F4$. After the Conv4 layer, the network splits into the global and regional branches and learns to generate feature maps $F51G$ and $F51R$ in the respective branches. By jointly optimizing the global and regional branches, we can enhance the feature learning capability of the base part during the training process. Finally, by training each set of global and regional features with our proposed co-tuplet loss, we obtain the discriminative global and regional embeddings.

III-B Multilevel feature fusion

We observed that the signature strokes in the images are thin and sparse compared with general object images. In a typical CNN structure, the low-level features generated from the early layers contain more detailed information. However, information loss of stroke details is inevitable after several convolution and downsampling operators. In order to retain the detailed information of signature strokes, we propose a multilevel feature fusion mechanism that combines low-level features with high-level features and then passes the aggregated information to subsequent layers. Here, the feature maps $F2$ and $F3$ in the base part and $F51G$ in the global branch are fused to obtain the feature maps $F52G$. Likewise, $F2$ and $F3$ in the base part and $F51R$ in the regional branch are fused to generate $F52R$. We can express the fusion operations as follows:

\begin{align}
F52G &= \varepsilon^{3\times 3}(F2) \circ \varepsilon^{3\times 3}(F3) \circ F51G, \tag{1} \\
F52R &= \varepsilon^{3\times 3}(F2) \circ \varepsilon^{3\times 3}(F3) \circ F51R, \tag{2}
\end{align}

where $\varepsilon^{3\times 3}$ is the convolution operation with a kernel size of $3\times 3$ and a stride of 2 to transfer $F2\in\mathbb{R}^{C\times H'\times W'}$ and $F3\in\mathbb{R}^{C'\times H'\times W'}$ to the same shape as $F51G$ and $F51R\in\mathbb{R}^{C\times H\times W}$. The operator $\circ$ denotes the element-wise multiplication.

In contrast to the fusion of only the base part features, we propose a different fusion mechanism that utilizes both the global and regional features for their respective branches. Thus, the detailed information of signature strokes can complement the high-level features in each branch. In addition, instead of the commonly used concatenation, we employ a multiplicative operation as the fusion strategy. Multiplication has advantages over concatenation as it allows the gradients of each layer to be correlated with the gradients of the other layers during gradient computation [37]. By using multiplication, features at different levels can depend on and interact with each other during the training process.
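The multiplicative fusion in Eqs. (1) and (2) can be illustrated with a minimal sketch. It assumes that the low-level maps have already been projected to the shape of the branch feature map, so the strided 3×3 convolutions are omitted. The helper name `fuse_multiplicative` and the toy single-channel values are hypothetical and are not taken from the original implementation.

```python
from functools import reduce
from operator import mul


def fuse_multiplicative(*feature_maps):
    """Element-wise product of equally shaped 2-D maps (nested lists).

    Stand-in for Eq. (1)/(2): the inputs are assumed to be already
    projected to a common H x W shape, so no convolution is applied here.
    """
    shapes = {(len(m), len(m[0])) for m in feature_maps}
    if len(shapes) != 1:
        raise ValueError("all feature maps must share the same shape")
    h, w = shapes.pop()
    return [
        [reduce(mul, (m[i][j] for m in feature_maps), 1.0) for j in range(w)]
        for i in range(h)
    ]


# Toy maps standing in for the projected F2, the projected F3, and F51G.
f2_proj = [[1.0, 2.0], [0.0, 1.0]]
f3_proj = [[0.5, 1.0], [3.0, 2.0]]
f51g = [[2.0, 2.0], [1.0, 4.0]]

f52g = fuse_multiplicative(f2_proj, f3_proj, f51g)
```

Because each fused value is a product, its partial derivative with respect to one input equals the product of the other inputs at that position, which is one way to read the statement that gradients at different levels become coupled.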

III-C Global-regional channel attention

In order to extract essential global and regional feature representations, we propose a global-regional channel attention (GRCA) module to guide our model in focusing on specific signature information. GRCA simultaneously learns the attention weights for global and regional features by considering their interactions and relative importance. Drawing inspiration from previous attention mechanisms [37, 38], we design GRCA tailored for our two-branch structure to facilitate the signature verification task.

As shown in Fig. 3, we perform global average pooling (GAP) on the feature maps $F52G\in\mathbb{R}^{C\times H\times W}$ in the global branch and $F52R\in\mathbb{R}^{C\times H\times W}$ in the regional branch to compress the spatial information of each channel into one channel descriptor. We obtain two channel descriptors $D1G\in\mathbb{R}^{C\times 1\times 1}$ and $D1R\in\mathbb{R}^{C\times 1\times 1}$:

\begin{align}
D1G^{c} &= \phi(\mathcal{Z}^{c}_{g}) = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\mathcal{Z}^{c}_{g}(i,j), \tag{3} \\
D1R^{c} &= \phi(\mathcal{Z}^{c}_{r}) = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\mathcal{Z}^{c}_{r}(i,j), \tag{4}
\end{align}

where $\phi$ is the GAP operation, $\mathcal{Z}^{c}_{g}$ is the $c$-th channel of $F52G$, and $\mathcal{Z}^{c}_{r}$ is the $c$-th channel of $F52R$, for $c=1,\ldots,C$. For attention map learning, we first use a convolutional layer with a kernel size of $1\times 1$ followed by a ReLU activation function to convert the channel descriptors $D1G$ into $D2G\in\mathbb{R}^{V\times 1\times 1}$ and $D1R$ into $D2R\in\mathbb{R}^{V\times 1\times 1}$. We set $V<C$, such that the dimension reduction operation can reduce computational and parameter overhead. Subsequently, we combine $D2G$ and $D2R$ to obtain the fused descriptors $DF\in\mathbb{R}^{V\times 1\times 1}$ using the multiplicative operation. The fusion of $D2G$ and $D2R$ enables the simultaneous generation of channel-wise attention for global and regional features based on their relative importance.

Figure 3: Detailed structure of the GRCA module.

Next, we perform a dimension recovery operation for regional and global branches using convolution with a kernel size of $1\times 1$, and we obtain the normalized global and regional attention maps $M_{g}$ and $M_{r}\in\mathbb{R}^{C\times 1\times 1}$ using the sigmoid function. The entire process of attention map learning can be expressed as follows:

\begin{align}
M_{g} &= \sigma(\varepsilon^{1\times 1}(\gamma(\varepsilon^{1\times 1}(D1G)) \circ \gamma(\varepsilon^{1\times 1}(D1R)))), \tag{5} \\
M_{r} &= \sigma(\varepsilon^{1\times 1}(\gamma(\varepsilon^{1\times 1}(D1G)) \circ \gamma(\varepsilon^{1\times 1}(D1R)))), \tag{6}
\end{align}

where $\varepsilon^{1\times 1}$ is the convolution operation with a kernel size of $1\times 1$, $\gamma$ represents the ReLU function, and $\sigma$ represents the sigmoid function. We finally multiply each channel of $F52G$ by the corresponding weight value of $M_{g}$ to obtain $F52G'\in\mathbb{R}^{C\times H\times W}$ and perform the same operation on $F52R$ and $M_{r}$ to obtain $F52R'\in\mathbb{R}^{C\times H\times W}$.
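The attention computation in Eqs. (3)-(6) can be sketched in a few lines of plain Python. This is an illustrative reading rather than the authors' implementation: feature maps are nested lists of shape C×H×W, each 1×1 convolution on a C×1×1 descriptor is written as a matrix-vector product without bias, and the two dimension-recovery layers are assumed to have separate weights for the global and regional branches. All function names and weight values are hypothetical.

```python
import math


def gap(feature_map):
    """Global average pooling: C x H x W nested list -> list of C means."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def linear(weights, vector):
    """A 1x1 convolution on a C x 1 x 1 descriptor is a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def relu(vector):
    return [max(0.0, v) for v in vector]


def sigmoid(vector):
    return [1.0 / (1.0 + math.exp(-v)) for v in vector]


def scale_channels(feature_map, weights):
    """Multiply every value of channel c by the attention weight for c."""
    return [
        [[w * value for value in row] for row in channel]
        for channel, w in zip(feature_map, weights)
    ]


def grca(f_global, f_regional, w_reduce_g, w_reduce_r, w_recover_g, w_recover_r):
    """Sketch of Eqs. (3)-(6); bias terms are omitted for brevity."""
    d2g = relu(linear(w_reduce_g, gap(f_global)))    # C -> V
    d2r = relu(linear(w_reduce_r, gap(f_regional)))  # C -> V
    fused = [g * r for g, r in zip(d2g, d2r)]        # multiplicative fusion
    m_g = sigmoid(linear(w_recover_g, fused))        # V -> C (assumed own weights)
    m_r = sigmoid(linear(w_recover_r, fused))        # V -> C (assumed own weights)
    return scale_channels(f_global, m_g), scale_channels(f_regional, m_r), m_g, m_r


# Toy input: C = 2 channels, H = W = 2, reduced dimension V = 1.
f52g = [[[1.0, 3.0], [5.0, 7.0]], [[2.0, 2.0], [2.0, 2.0]]]
f52r = [[[0.0, 2.0], [2.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]]]
w_reduce_g = [[0.5, 0.5]]
w_reduce_r = [[1.0, -1.0]]
w_recover_g = [[1.0], [-1.0]]
w_recover_r = [[0.0], [2.0]]

f52g_att, f52r_att, m_g, m_r = grca(
    f52g, f52r, w_reduce_g, w_reduce_r, w_recover_g, w_recover_r
)
```

Both attention maps are computed from the same fused descriptor, so a change in either branch's channel statistics affects the weights applied to both branches.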

III-D Multiscale feature learning

We propose automatic multiscale feature learning to address the limitations of learning from a single branch. Here, “multiscale” refers to multiple scales (i.e., sizes) of global and dual-orientation regional feature maps. By learning complementary feature representations from various spatial scales, our method can effectively handle the challenges posed by high inter-writer similarity and intra-writer variations. To capture signature stroke information from the entire image, we first conduct the GAP operation over the feature maps $F52G'\in\mathbb{R}^{C\times H\times W}$ in the global branch, as shown in Fig. 2. Subsequently, we use a fully connected (FC) layer with the output dimension of 1,024 to generate the global embedding $f_{g}$ for a signature image. Instead of using multiple FC layers, we adopt GAP followed by a single FC layer to reduce the model parameters and to address overfitting issues. Additionally, we apply L2-Normalization before the output layer to mitigate the impact of scale variability in the data, which enhances training stability and boosts performance. The process of global feature learning can be formulated as follows:

\begin{align}
\phi(\mathcal{U}^{c}_{g}) &= \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\mathcal{U}^{c}_{g}(i,j), \tag{7} \\
f_{g} &= \eta([\phi(\mathcal{U}^{1}_{g}), \phi(\mathcal{U}^{2}_{g}), \ldots, \phi(\mathcal{U}^{C}_{g})]), \tag{8}
\end{align}

where $\phi$ is the GAP operation, $\eta$ is the FC layer, and $\mathcal{U}^{c}_{g}$ is the $c$-th channel of $F52G'$, for $c=1,\ldots,C$.
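A minimal sketch of the global embedding path in Eqs. (7) and (8) is given below. It assumes that L2 normalization is applied to the pooled vector immediately before the FC layer, which is one possible reading of "before the output layer". The toy output dimension of 3 stands in for the 1,024 used in the paper, and all names and weights are hypothetical.

```python
import math


def gap(feature_map):
    """Eq. (7): average each channel of a C x H x W nested list."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean length (eps guards against zero)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / max(norm, eps) for v in vector]


def fully_connected(weights, bias, vector):
    """Single FC layer: one weight row and one bias value per output unit."""
    return [
        sum(w * v for w, v in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


def global_embedding(feature_map, weights, bias):
    pooled = gap(feature_map)          # C-dimensional descriptor
    pooled = l2_normalize(pooled)      # assumed placement of L2 normalization
    return fully_connected(weights, bias, pooled)  # Eq. (8)


# Toy attended feature map with C = 2 and channel means 3.0 and 4.0.
f52g_att = [[[1.0, 5.0], [2.0, 4.0]], [[4.0, 4.0], [4.0, 4.0]]]
fc_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fc_bias = [0.0, 0.0, 0.0]

f_g = global_embedding(f52g_att, fc_weights, fc_bias)
```

With GAP in front of a single FC layer, the number of FC parameters depends on the channel count C rather than on C×H×W, which is the parameter saving the text refers to.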

Inspired by Pirlo and Impedovo [5], we propose dual-orientation regional feature learning to complement global feature learning. Rather than using a fixed-sized sliding window along one direction of an input image for region segmentation, we propose to divide the deep feature maps into regions with different scales along horizontal and vertical orientations. This dual-orientation regional feature learning can effectively capture more localized differences and dissimilarities between genuine and skilled-forged signatures. Our method does not require input segmentation and does not increase the number of model inputs, resulting in improved efficiency, particularly for larger datasets.
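The idea of dividing a deep feature map into overlapping regions along both orientations can be sketched as follows. The window lengths, the evenly spaced window placement, and the helper names are assumptions made for illustration. The sketch only shows how three overlapping slices per orientation could be taken from a C×H×W map stored as nested lists.

```python
def window_bounds(length, region_length, num_regions=3):
    """Start/stop indices of evenly spaced windows covering [0, length).

    Adjacent windows overlap whenever region_length exceeds the spacing
    between consecutive start positions.
    """
    if num_regions < 2 or not 0 < region_length <= length:
        raise ValueError("need num_regions >= 2 and 0 < region_length <= length")
    step = (length - region_length) / (num_regions - 1)
    starts = [round(k * step) for k in range(num_regions)]
    return [(s, s + region_length) for s in starts]


def vertical_regions(feature_map, region_width, num_regions=3):
    """Slice along the width axis, left to right: each region is C x H x W''."""
    width = len(feature_map[0][0])
    return [
        [[row[a:b] for row in channel] for channel in feature_map]
        for a, b in window_bounds(width, region_width, num_regions)
    ]


def horizontal_regions(feature_map, region_height, num_regions=3):
    """Slice along the height axis, top to bottom: each region is C x H'' x W."""
    height = len(feature_map[0])
    return [
        [channel[a:b] for channel in feature_map]
        for a, b in window_bounds(height, region_height, num_regions)
    ]


# Toy map with C = 1, H = 4, W = 6; the value at (r, c) is 10 * r + c.
toy_map = [[[10 * r + c for c in range(6)] for r in range(4)]]

v_regions = vertical_regions(toy_map, region_width=4)
h_regions = horizontal_regions(toy_map, region_height=2)
```

Since the slicing is applied to feature maps rather than to the input image, the backbone is evaluated once per signature regardless of how many regions are used.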

We first divide the feature maps $F52R^{\prime}\in\mathbb{R}^{C\times H\times W}$ into three overlapping vertical regions $F_{r_1}$, $F_{r_2}$, and $F_{r_3}\in\mathbb{R}^{C\times H\times W^{\prime\prime}}$ $(W^{\prime\prime}<W)$ from left to right, as shown in the regional branch of Fig. 2.
We also divide the feature maps $F52R^{\prime}$ into three overlapping horizontal regions $F_{r_4}$, $F_{r_5}$, and $F_{r_6}\in\mathbb{R}^{C\times H^{\prime\prime}\times W}$ $(H^{\prime\prime}<H)$ from top to bottom. To address the potential misalignment, we make adjacent regions overlap each other. The process of region division can be expressed as follows:

$$F_{r_m} = \psi_{H}(F52R^{\prime}, m), \quad m\in\{1,2,3\}, \tag{9}$$

$$F_{r_n} = \psi_{V}(F52R^{\prime}, n), \quad n\in\{4,5,6\}, \tag{10}$$

where $\psi_{H}$ denotes the feature map division operation in the horizontal orientation, and $\psi_{V}$ is the same operation in the vertical orientation. Specifically, we empirically set $W^{\prime\prime}=13$ with an overlap of 7 pixels in width between adjacent regions $F_{r_m}$, and $H^{\prime\prime}=8$ with an overlap of 4 pixels in height between adjacent regions $F_{r_n}$.
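The region division can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `split_regions`, the feature-map size (C=4, H=16, W=25), and the choice to start the first window at the left or top edge with a uniform stride of window size minus overlap are all assumptions made for illustration.

```python
import numpy as np

def split_regions(feature_maps, size, overlap, axis, n_regions=3):
    """Cut `n_regions` overlapping windows of length `size` along `axis`."""
    stride = size - overlap  # consecutive windows start `stride` entries apart
    needed = size + (n_regions - 1) * stride
    if feature_maps.shape[axis] < needed:
        raise ValueError(f"axis {axis} needs at least {needed} entries")
    regions = []
    for r in range(n_regions):
        window = [slice(None)] * feature_maps.ndim
        window[axis] = slice(r * stride, r * stride + size)
        regions.append(feature_maps[tuple(window)])
    return regions

# Hypothetical C x H x W feature maps (sizes assumed, not taken from the paper).
C, H, W = 4, 16, 25
feature_maps = np.arange(C * H * W, dtype=np.float32).reshape(C, H, W)

# Windows of width W'' = 13 with a 7-column overlap (F_r1, F_r2, F_r3).
width_regions = split_regions(feature_maps, size=13, overlap=7, axis=2)
# Windows of height H'' = 8 with a 4-row overlap (F_r4, F_r5, F_r6).
height_regions = split_regions(feature_maps, size=8, overlap=4, axis=1)
```

With these assumed sizes, three windows cover the full width (13 + 2 × 6 = 25) and the full height (8 + 2 × 4 = 16) exactly.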

Finally, we obtain six regions with varying scales and conduct GAP operations over each of them, followed by an FC layer to generate 1024-dimensional regional embeddings, $f_{r_m}$, $m\in\{1,2,3\}$, and $f_{r_n}$, $n\in\{4,5,6\}$, for each signature image. The process of generating regional embeddings can be formulated as follows:

$$\phi(\mathcal{U}^{c}_{r_m}) = \frac{1}{HW}\sum^{H}_{i=1}\sum^{13}_{j=1}\mathcal{U}^{c}_{r_m}(i,j), \tag{11}$$

$$f_{r_m} = \eta([\phi(\mathcal{U}^{1}_{r_m}),\phi(\mathcal{U}^{2}_{r_m}),\ldots,\phi(\mathcal{U}^{C}_{r_m})]), \tag{12}$$

$$\phi(\mathcal{U}^{c}_{r_n}) = \frac{1}{HW}\sum^{8}_{i=1}\sum^{W}_{j=1}\mathcal{U}^{c}_{r_n}(i,j), \tag{13}$$

$$f_{r_n} = \eta([\phi(\mathcal{U}^{1}_{r_n}),\phi(\mathcal{U}^{2}_{r_n}),\ldots,\phi(\mathcal{U}^{C}_{r_n})]), \tag{14}$$

where $\phi$ is the GAP operation, $\eta$ is the FC layer, $\mathcal{U}^{c}_{r_m}$ is the $c$-th channel of $F_{r_m}$, and $\mathcal{U}^{c}_{r_n}$ is the $c$-th channel of $F_{r_n}$, for $c=1,\ldots,C$. Similar to the global embedding, we apply L2-Normalization before the output layer for generating regional embeddings. We demonstrate the performance improvement achieved through multiscale feature learning and visualize the differences in their effects in Section IV.
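The path from one region to its embedding can be sketched as follows. This is only an illustration under stated assumptions: the function name `regional_embedding`, the channel count, and the random FC parameters are hypothetical; the L2 normalization is applied to the FC output; and the pooling averages over the region's own extent, which differs from the $\frac{1}{HW}$ factor in Eqs. (11) and (13) only by a constant scale.

```python
import numpy as np

def regional_embedding(region, fc_weight, fc_bias):
    """Map one C x h x w region to an L2-normalized embedding."""
    pooled = region.mean(axis=(1, 2))         # GAP (phi): one scalar per channel
    embedding = fc_weight @ pooled + fc_bias  # FC layer (eta)
    norm = np.linalg.norm(embedding)
    return embedding / max(norm, 1e-12)       # guard against division by zero

rng = np.random.default_rng(0)
C, EMBED_DIM = 8, 1024                        # C is assumed; 1024 follows the text
region = rng.standard_normal((C, 16, 13))     # a hypothetical width-wise region
fc_weight = rng.standard_normal((EMBED_DIM, C)) * 0.1  # untrained stand-in weights
fc_bias = np.zeros(EMBED_DIM)

f_r = regional_embedding(region, fc_weight, fc_bias)
```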

III-E Co-tuplet loss

III-E1 Limitations of typical loss functions

Previous automatic signature verification methods [9, 11, 12, 22, 23, 24, 39, 40] have employed the classification loss and the typical metric learning loss to learn similarity measurement. However, the methods trained with the classification loss, such as the categorical cross-entropy loss, can only differentiate between genuine signatures of different writers, limiting their application to random forgery detection or preliminary feature extraction for a WD classifier. In contrast, typical metric learning loss functions, such as contrastive loss [13] and triplet loss [14, 15], use a single randomly selected negative example in each update, often resulting in unstable and slow convergence [16, 17]. To address this problem, the tuplet-based loss functions [16, 31] consider multiple negative examples from different classes and aim to increase the inter-class distance by pushing them apart in each update.

In contrast to general object recognition tasks focusing on distinguishing between different classes, handwritten signature verification requires discriminating between positive examples (i.e., genuine signatures) and their corresponding negative examples (i.e., skilled forgeries) within each writer. The challenge lies in the presence of potentially high intra-writer variability in genuine signatures and high inter-writer similarity between genuine and forged signatures. This implies that the distances between most genuine-genuine pairs may be large, while the distances between genuine-forged pairs tend to be small. Using only one positive example in each update is insufficient for considering the distances of multiple positive examples, thus failing to address the intra-writer distance variation effectively. Similarly, relying on a single negative example in each update does not adequately account for the inter-writer similarity between a positive example and the remaining negative examples.

To address the challenges in handwritten signature verification, we propose a new tuplet-based metric learning loss called the co-tuplet loss. It retains the key property of tuplet-based loss functions, namely the inclusion of multiple negative examples, to learn an appropriate distance metric between genuine and forged signatures. In addition, it employs multiple positive examples, enabling the model to learn the relationship among genuine signatures within each writer, which was previously missing in tuplet-based loss functions [16, 31].

III-E2 Proposed loss function

We define a tuplet as $\{x_a, x_{p_1}, x_{p_2}, \ldots, x_{p_k}, x_{n_1}, x_{n_2}, \ldots, x_{n_k}\}$, where $x_a$, $x_p$, and $x_n$ refer to the anchor, positive, and negative examples, respectively. The integer $k$ is the number of positive and negative examples in a mini-batch. In this study, the anchor and positive examples are genuine signatures signed by a writer, while the negative examples are corresponding skilled forgeries. We aim to shorten the intra-writer distance and enlarge the inter-writer distance in the embedding space. To achieve this, we jointly consider the distances of multiple positive and negative examples from the same anchor. The formulation of the co-tuplet loss is as follows:

$$\mathcal{L}_{ct}=\log\Big[1+\sum_{i\in\mathcal{S}(\mathcal{P})}\exp(d^{+}_{i}-d^{-}_{h})+\sum_{j\in\mathcal{S}(\mathcal{N})}\exp(d^{+}_{h}-d^{-}_{j})\Big], \tag{15}$$

where $\mathcal{S}(\mathcal{P})$ and $\mathcal{S}(\mathcal{N})$ are the sets of positive and negative example indices for which the positive and negative examples satisfy our mining strategy described in the next subsection; $d$ indicates the squared Euclidean distance used as the distance metric; and $d^{+}_{i}$, $d^{-}_{j}$, $d^{+}_{h}$, and $d^{-}_{h}$ are defined as follows:

$$d^{+}_{i} = \|f(x_{a})-f(x_{p_{i}})\|^{2}_{2}, \tag{16}$$

$$d^{-}_{j} = \|f(x_{a})-f(x_{n_{j}})\|^{2}_{2}, \tag{17}$$

$$d^{+}_{h} = \max_{\ell=1\ldots k}\|f(x_{a})-f(x_{p_{\ell}})\|^{2}_{2}, \tag{18}$$

$$d^{-}_{h} = \min_{\ell=1\ldots k}\|f(x_{a})-f(x_{n_{\ell}})\|^{2}_{2}, \tag{19}$$

where $f(\cdot)$ represents the feature embedding of an input example. Among the positive examples in a mini-batch, $d^{+}_{h}$ is the distance between the anchor and the hardest positive example. Similarly, among the negative examples in a mini-batch, $d^{-}_{h}$ is the distance between the anchor and the hardest negative example.
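As a rough illustration of Eqs. (15) to (19), the loss for a single tuplet can be computed from embeddings as in the sketch below. It is not the authors' code: the function names and the toy two-dimensional embeddings are hypothetical, and the optional `pos_mask` and `neg_mask` arguments are assumed stand-ins for the index sets $\mathcal{S}(\mathcal{P})$ and $\mathcal{S}(\mathcal{N})$ (all examples are used when they are omitted).

```python
import numpy as np

def squared_distances(anchor, others):
    """Squared Euclidean distance from the anchor to each row of `others`."""
    return np.sum((others - anchor) ** 2, axis=1)

def co_tuplet_loss(anchor, positives, negatives, pos_mask=None, neg_mask=None):
    d_pos = squared_distances(anchor, positives)  # d_i^+, Eq. (16)
    d_neg = squared_distances(anchor, negatives)  # d_j^-, Eq. (17)
    d_pos_hard = d_pos.max()                      # d_h^+, Eq. (18)
    d_neg_hard = d_neg.min()                      # d_h^-, Eq. (19)
    if pos_mask is None:
        pos_mask = np.ones(d_pos.shape, dtype=bool)
    if neg_mask is None:
        neg_mask = np.ones(d_neg.shape, dtype=bool)
    pull = np.exp(d_pos[pos_mask] - d_neg_hard).sum()  # pulling part
    push = np.exp(d_pos_hard - d_neg[neg_mask]).sum()  # pushing part
    return float(np.log1p(pull + push))                # log(1 + pull + push)

# Toy embeddings: one forgery (distance 1) sits closer than one genuine (distance 4).
anchor = np.array([0.0, 0.0])
positives = np.array([[1.0, 0.0], [0.0, 2.0]])
negatives = np.array([[3.0, 0.0], [0.0, 1.0]])
loss = co_tuplet_loss(anchor, positives, negatives)
```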

Figure 4: Example of the distance learning process of the common triplet loss, hardest triplet loss, and co-tuplet loss.

The proposed loss comprises two parts: a pulling part and a pushing part. We design the pulling part to decrease the intra-writer distance by pulling the selected positive examples closer to the anchor using $d^{-}_{h}$ as the reference distance. In addition, we use the pushing part to increase the inter-writer distance by pushing the selected negative examples farther from the anchor using $d^{+}_{h}$ as the reference distance. Comparing each example against a single reference distance, rather than against every example of the opposite type, reduces the computational complexity from quadratic to linear compared to pairwise comparisons between positive and negative examples. Given that the proposed tuplet-based loss relies on the co-existence of multiple positive and negative examples to learn distance metrics, we refer to it as the “co-tuplet loss.”

Fig. 4 provides a visual representation of the differences in the distance learning process between the triplet loss, the hardest triplet loss, and the co-tuplet loss. The triplet loss only partially utilizes the distance information of batch examples, as many easy triplets that already satisfy the triplet constraint do not contribute to the training process. Similarly, the hardest triplet loss, which focuses on the hardest positive and negative examples, neglects optimizing the distances from the anchor to the remaining positive and negative examples. In contrast, the proposed co-tuplet loss simultaneously considers multiple positive and negative examples in a mini-batch for distance optimization. This allows the signature verification system to learn to pull genuine signatures belonging to the same writer close together, while pushing forgeries far away from them.

III-E3 Batch construction and constraint mining strategy

To construct a training mini-batch, we first randomly select $w$ genuine signatures without replacement as anchor examples. For each anchor, $k$ genuine signatures are randomly sampled (excluding the anchor itself) as positive examples from the same writer. Similarly, $k$ negative examples are randomly sampled from the corresponding forgeries. Together, the anchor, positive examples, and negative examples form a signature tuplet. We repeat this process to generate $w$ tuplets for the training mini-batch.
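The batch construction can be sketched in plain Python as follows. The function name `build_minibatch`, the dictionary layout, and the identifier strings are hypothetical. The sketch also assumes that anchors are drawn from the pooled genuine signatures of all writers and that positives and negatives are drawn without replacement within a tuplet; the text does not specify either detail.

```python
import random

def build_minibatch(genuine, forged, w, k, rng):
    """Build `w` tuplets, each with one anchor, `k` positives, and `k` negatives.

    `genuine` and `forged` map a writer id to a list of signature ids.
    """
    pool = [(writer, sig) for writer, sigs in genuine.items() for sig in sigs]
    anchors = rng.sample(pool, w)  # anchors drawn without replacement
    tuplets = []
    for writer, anchor in anchors:
        candidates = [s for s in genuine[writer] if s != anchor]  # exclude anchor
        tuplets.append({
            "writer": writer,
            "anchor": anchor,
            "positives": rng.sample(candidates, k),
            "negatives": rng.sample(forged[writer], k),
        })
    return tuplets

# Hypothetical toy data: 3 writers with 6 genuine and 6 forged signatures each.
genuine = {f"writer{i}": [f"writer{i}_genuine{j}" for j in range(6)] for i in range(3)}
forged = {f"writer{i}": [f"writer{i}_forged{j}" for j in range(6)] for i in range(3)}
batch = build_minibatch(genuine, forged, w=4, k=3, rng=random.Random(0))
```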

To select training examples, we propose a constraint mining strategy that focuses the learning process on informative signature examples rather than trivial ones. To the best of our knowledge, this mining strategy is not considered in existing tuplet-based losses [16, 31]. We observe that very easy positive and negative examples still contribute to the loss values in Eq. (15), even though they carry only redundant information for embedding learning. Hence, we employ the constraint mining strategy to select informative examples. The selected positive and negative examples must satisfy the following constraints:

$$d^{+}_{i} \geq d^{-}_{h}-\delta, \tag{20}$$

$$d^{-}_{j} \leq d^{+}_{h}+\delta, \tag{21}$$

where $\delta>0$ is a constraint margin. Here an appropriate selection of $\delta$ ensures that the optimization process does not concentrate on useless information and facilitates our model in learning the accurate mapping of intra-writer and inter-writer distances.
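A minimal sketch of the constraints in Eqs. (20) and (21), operating on precomputed distances, is given below. The function name `mine_constraints`, the distance values, and the margin `delta=0.5` are illustrative assumptions, not settings reported in the paper.

```python
import numpy as np

def mine_constraints(d_pos, d_neg, delta):
    """Return boolean masks that mark the informative positives and negatives."""
    d_pos_hard = d_pos.max()  # d_h^+: farthest positive from the anchor
    d_neg_hard = d_neg.min()  # d_h^-: closest negative to the anchor
    pos_mask = d_pos >= d_neg_hard - delta  # Eq. (20)
    neg_mask = d_neg <= d_pos_hard + delta  # Eq. (21)
    return pos_mask, neg_mask

# Hypothetical squared distances from one anchor to its positives and negatives.
d_pos = np.array([0.2, 0.9, 1.5])
d_neg = np.array([1.2, 1.9, 3.5])
pos_mask, neg_mask = mine_constraints(d_pos, d_neg, delta=0.5)
```

In this toy case, the positive at distance 0.2 and the negative at distance 3.5 are already well placed relative to the reference distances, so both are excluded from the loss.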

III-E4 Gradient computation

We can obtain the gradient of the proposed co-tuplet loss $\mathcal{L}_{ct}$ with respect to the model parameters $\boldsymbol{\theta}$:

$$
\begin{aligned}
\frac{\partial\mathcal{L}_{ct}}{\partial\boldsymbol{\theta}}
&= \frac{1}{q}\Big[\sum_{i\in\mathcal{S}(\mathcal{P})}\exp(d^{+}_{i}-d^{-}_{h})\frac{\partial(d^{+}_{i}-d^{-}_{h})}{\partial\boldsymbol{\theta}} \\
&\quad + \sum_{j\in\mathcal{S}(\mathcal{N})}\exp(d^{+}_{h}-d^{-}_{j})\frac{\partial(d^{+}_{h}-d^{-}_{j})}{\partial\boldsymbol{\theta}}\Big] \\
&\equiv \frac{1}{q}\Big[\sum_{i\in\mathcal{S}(\mathcal{P})}w_{1,i}\frac{\partial(d^{+}_{i}-d^{-}_{h})}{\partial\boldsymbol{\theta}}+\sum_{j\in\mathcal{S}(\mathcal{N})}w_{2,j}\frac{\partial(d^{+}_{h}-d^{-}_{j})}{\partial\boldsymbol{\theta}}\Big],
\end{aligned}
\tag{22}
$$

where

$$q=1+\sum_{i\in\mathcal{S}(\mathcal{P})}\exp(d^{+}_{i}-d^{-}_{h})+\sum_{j\in\mathcal{S}(\mathcal{N})}\exp(d^{+}_{h}-d^{-}_{j}), \tag{23}$$

and $w_{1,i}=\exp(d^{+}_{i}-d^{-}_{h})$ and $w_{2,j}=\exp(d^{+}_{h}-d^{-}_{j})$ denote the weights of the pulling and pushing parts, respectively. As observed from the above gradient, both $w_{1,i}$ and $w_{2,j}$ are exponentially up-weighted with hard examples and exponentially down-weighted with easy ones.
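The weighting behaviour can be checked numerically with the sketch below, which treats the loss as a function of the distances only. The function names and distance values are hypothetical, all examples are assumed to be selected (no mining), and the finite-difference check is applied to a positive that is not the hardest one, so that $d^{+}_{h}$ stays fixed while it is perturbed.

```python
import numpy as np

def loss_from_distances(d_pos, d_neg):
    """Co-tuplet loss of Eq. (15) with every example selected."""
    pull = np.exp(d_pos - d_neg.min()).sum()
    push = np.exp(d_pos.max() - d_neg).sum()
    return float(np.log1p(pull + push))

def gradient_weights(d_pos, d_neg):
    """Return w_{1,i}, w_{2,j}, and q as they appear in Eqs. (22) and (23)."""
    w1 = np.exp(d_pos - d_neg.min())  # pulling weights, larger for far positives
    w2 = np.exp(d_pos.max() - d_neg)  # pushing weights, larger for near negatives
    q = 1.0 + w1.sum() + w2.sum()
    return w1, w2, q

d_pos = np.array([0.3, 0.8, 1.4])  # hypothetical anchor-positive distances
d_neg = np.array([1.0, 1.6, 2.5])  # hypothetical anchor-negative distances
w1, w2, q = gradient_weights(d_pos, d_neg)

# Finite-difference derivative of the loss with respect to d_pos[0].
eps = 1e-6
bumped = d_pos.copy()
bumped[0] += eps
numeric = (loss_from_distances(bumped, d_neg) - loss_from_distances(d_pos, d_neg)) / eps
analytic = w1[0] / q  # coefficient of the d_pos[0] term in Eq. (22)
```

Under these assumptions, `numeric` and `analytic` agree up to finite-difference error, and the weights grow with the positive distance and shrink with the negative distance.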

In comparison to typical metric learning losses, our co-tuplet loss offers the advantage of emphasizing informative examples over uninformative ones by assigning unequal weights to examples based on the distance difference. Since skilled-forged signatures often closely resemble genuine signatures for each writer, this weighting scheme promotes the learning of more discriminative features for handwritten signature verification. Furthermore, the co-tuplet loss takes into account moderate examples (i.e., the examples of normal cases) for distance learning. It avoids overweighting extremely hard examples, leading to a more stable optimization process.

III-F Signature verification decision

In our approach, we use individual co-tuplet losses to train each set of the corresponding features instead of a single loss for the concatenation of generated multiscale features. This training strategy enables the model to learn specific information from each part of the signature strokes. For joint multiscale feature learning, we define the overall objective function as follows:

$$\mathcal{L}_{ct,T}=\mathcal{L}_{ct,g}+\lambda\sum^{6}_{i=1}\mathcal{L}_{ct,r_{i}}, \tag{24}$$

where $\lambda$ is a hyperparameter used to control the weight of the regional losses.
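Eq. (24) amounts to a weighted sum of seven loss values, as the short sketch below shows. The function name `total_loss` and all numeric values, including `lam=0.1`, are placeholders rather than settings from the paper.

```python
def total_loss(global_loss, regional_losses, lam):
    """Combine the global loss with the six regional losses as in Eq. (24)."""
    if len(regional_losses) != 6:
        raise ValueError("expected one loss per region r_1, ..., r_6")
    return global_loss + lam * sum(regional_losses)

# Placeholder loss values for one training step.
objective = total_loss(global_loss=1.0, regional_losses=[0.5] * 6, lam=0.1)
```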

In the verification stage, namely the test stage, we integrate various spatial information by concatenating the global and regional embeddings into the final embedding for each input image. To make the verification decision, we use a distance threshold $d_{thr}$ to decide whether a given signature pair $\{x_{i},x_{j}\}$ is positive or negative. A positive verification decision indicates that the questioned signature $x_{j}$ is accepted as genuine with respect to the reference signature $x_{i}$. Conversely, $x_{j}$ with a negative decision is regarded as forged with respect to $x_{i}$. We define the final signature verification decision as follows:

$$\text{Decision}=\begin{cases}\text{positive}, & \text{if}\ d(x_{i},x_{j})\leq d_{thr}\\ \text{negative}, & \text{if}\ d(x_{i},x_{j})>d_{thr},\end{cases} \tag{25}$$

where $d(x_{i},x_{j})$ is the squared Euclidean distance between the final embeddings of $x_{i}$ and $x_{j}$.
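The decision rule of Eq. (25) can be sketched as follows. The function names, the toy four-dimensional embeddings, and the threshold `d_thr=1.0` are assumptions chosen for illustration. How the threshold is selected in practice is not covered by this sketch.

```python
import numpy as np

def final_embedding(global_emb, regional_embs):
    """Concatenate the global embedding with the six regional embeddings."""
    return np.concatenate([global_emb] + list(regional_embs))

def verify(reference, questioned, d_thr):
    """Apply Eq. (25) to the final embeddings of a signature pair."""
    distance = float(np.sum((reference - questioned) ** 2))  # squared Euclidean
    return "positive" if distance <= d_thr else "negative"

# Toy embeddings: one global part and six regional parts of 4 dimensions each.
reference = final_embedding(np.ones(4), [np.zeros(4)] * 6)
close_query = reference + 0.1  # squared distance 28 * 0.01 = 0.28
far_query = reference + 1.0    # squared distance 28

decision_close = verify(reference, close_query, d_thr=1.0)
decision_far = verify(reference, far_query, d_thr=1.0)
```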

IV Experiments

In this section, we first describe the benchmark datasets. Following that, we elaborate on the data preprocessing, implementation particulars, and the evaluation metrics used for signature verification. Lastly, we present the experimental results and compare them with various state-of-the-art methods.

IV-A Datasets

The CEDAR dataset [18] consists of English signatures from 55 different writers. Each writer contributed 24 genuine signatures for a specific name and had 24 skilled forgeries generated by forgers. To compare the proposed method with state-of-the-art methods, we follow most previous studies [2, 10, 25, 41, 42] to randomly divide the writers into a training set that includes signatures from 50 writers, and a test set that contains signatures from five writers. We further randomly reserve five writers from the training set for validation. For each writer in the test set, we use one genuine signature as the reference signature and another genuine signature as the questioned signature to form 276 ($24 \times 23 / 2$) positive pairs. We form negative pairs by using one genuine signature as the reference and one forged signature as the questioned signature. We also follow previous studies to balance the positive and negative pairs and randomly select 276 negative pairs. The final test data consists of 2,760 signature pairs.
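The pair construction described above can be sketched as follows. The function name, the label encoding (1 for positive, 0 for negative), and the fixed random seed are assumptions for illustration; the source does not specify how the random negative pairs are drawn beyond balancing them with the positives.

```python
import itertools
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str, int]  # (reference, questioned, label)


def build_test_pairs(genuine: Sequence[str], forged: Sequence[str],
                     seed: int = 0) -> Tuple[List[Pair], List[Pair]]:
    """Form all genuine-genuine pairs and an equal number of genuine-forged pairs."""
    # Every unordered pair of genuine signatures is a positive pair.
    positives = [(ref, q, 1) for ref, q in itertools.combinations(genuine, 2)]
    # Every (genuine reference, forged questioned) combination is a candidate negative.
    candidates = [(ref, q, 0) for ref in genuine for q in forged]
    # Randomly keep as many negatives as there are positives.
    rng = random.Random(seed)
    negatives = rng.sample(candidates, len(positives))
    return positives, negatives


# One CEDAR-style writer: 24 genuine signatures and 24 skilled forgeries.
genuine_ids = [f"genuine_{i}" for i in range(24)]
forged_ids = [f"forged_{i}" for i in range(24)]
pos_pairs, neg_pairs = build_test_pairs(genuine_ids, forged_ids)
print(len(pos_pairs), len(neg_pairs))
```

With five test writers, this yields 5 × (276 + 276) = 2,760 pairs, which matches the total stated above.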

The BHSig-Bengali dataset [19] comprises Bengali signatures from 100 different writers. Each writer contributed 24 genuine signatures for a specific name, along with 30 skilled forgeries. Following the data splitting scheme in [9, 25], we randomly select signatures from 50 writers to form the training set, while the remaining writers’ signatures are used for the test set. We further reserve five writers’ signatures from the training set for validation. Similarly, for each writer in the test set, we form 276 positive pairs and 276 negative pairs. The final test data comprises 27,600 signature pairs.

The BHSig-Hindi dataset [19] contains Hindi signatures from 160 writers. Each writer contributed 24 genuine signatures for a specific name, accompanied by 30 skilled forgeries. Consistent with most previous studies [9, 25, 43], we randomly select signatures from 100 writers to form the training set, while the remaining writers’ signatures are used for the test set. Within the training set, we randomly select signatures from five writers to create the validation set. We follow a similar procedure to generate 276 positive and 276 negative test pairs for each writer. The final test data comprises 33,120 signature pairs.

We construct HanSig, a new Chinese signature dataset, to facilitate the development of signature verification systems. We first collected 554,723 names from the Joint College Entrance Exam admissions lists between 1996 and 2002. Next, we generated 885 candidate names based on the frequency distributions of names in the real world. Generating candidate names, rather than using real ones, avoids potential legal concerns related to personal information. We collected signatures of these candidate names from 238 writers. To introduce more signing variations in the genuine signatures, each name was signed 20 times in three different styles: neat, normal, and stylish. We then asked forgers to practice and skillfully imitate the genuine signatures to create skilled forgeries. Overall, HanSig consists of 17,700 genuine signatures and an equal number of skilled forgeries. Fig. 5 provides examples of the collected signatures in different styles and of the genuine and forged signatures.

(a) Examples of collected signatures in three styles

(b) Examples of collected genuine and forged signatures

Figure 5: Examples of signature images in HanSig. The left, middle, and right images of (a) are the collected signatures written in neat, normal, and stylish styles, respectively. The first row of (b) shows the collected genuine signatures, and the second row of (b) shows the corresponding forged signatures.

The HanSig dataset has several valuable characteristics: (1) The generation of names for signatures addresses concerns regarding personal information and privacy while still preserving the distributional characteristics of real names. (2) HanSig incorporates the real-world property of intra-writer variability by including multiple signature styles. (3) HanSig surpasses existing public Chinese signature datasets in terms of the number of signature samples, providing robust training for signature verification systems. (4) HanSig is advantageous for both random and skilled forgery verification tasks.

We randomly split HanSig into a training set and a test set. The training set comprises 795 names signed by 213 writers, while the test set includes 90 names signed by 25 writers. From the training set, 20 writers’ signatures (78 names) are randomly selected for validation. For each name in the test set, we follow the same procedure used for CEDAR and BHSig to form 190 ($20 \times 19 / 2$) positive pairs and 190 negative pairs. The final test data of HanSig consists of 34,200 signature pairs.
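As a quick consistency check, the test-set sizes reported for the four datasets follow from the number of test writers (or names, for HanSig) and the number of genuine signatures per writer or name. The helper name below is hypothetical; the inputs are the figures given in this subsection.

```python
from math import comb


def expected_test_pairs(num_units: int, genuine_per_unit: int) -> int:
    """Balanced test pairs: all genuine-genuine pairs plus as many negative pairs."""
    positives_per_unit = comb(genuine_per_unit, 2)
    return num_units * 2 * positives_per_unit


# (number of test writers or names, genuine signatures per writer or name)
test_sets = {
    "CEDAR": (5, 24),
    "BHSig-Bengali": (50, 24),
    "BHSig-Hindi": (60, 24),
    "HanSig": (90, 20),
}
totals = {name: expected_test_pairs(*cfg) for name, cfg in test_sets.items()}
for name, total in totals.items():
    print(f"{name}: {total} test pairs")
```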

IV-B Data preprocessing

To mitigate the influence of background and position variations in the signature images, we perform several data preprocessing steps without deforming the structure of the signatures. First, we convert the signature images to grayscale. Next, we apply Otsu’s algorithm [44] to set all background pixel values to 255 while keeping the signature pixels unchanged. This step removes noise in image backgrounds and is critical for datasets (e.g., CEDAR) that have distinct backgrounds in genuine and forged signature images. We also center-crop the signature images and remove the excess blank margins around the signatures to eliminate potential misalignment issues caused by signature position variations. Finally, the images are resized to the input size of the network using bilinear interpolation, and the pixel values of the signature images are normalized to a range between 0 and 1.
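The following pure-Python sketch illustrates the background whitening, cropping, and normalization steps on a tiny grayscale image. It is an illustration under stated assumptions, not the authors' pipeline: the image is a list of rows of 8-bit values with dark ink on a light background, cropping is interpreted as taking the bounding box of the ink pixels, and the bilinear resize to the network input size is omitted. The function names are hypothetical.

```python
from typing import List


def otsu_threshold(pixels: List[int]) -> int:
    """Return the threshold t maximizing between-class variance (class 0: p <= t)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue  # no pixels in the dark class yet
        if w1 == 0:
            break  # no pixels left in the bright class
        mu0, mu1 = sum0 / w0, (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def preprocess(image: List[List[int]]) -> List[List[float]]:
    """Whiten the background, crop to the ink bounding box, and scale to [0, 1]."""
    t = otsu_threshold([p for row in image for p in row])
    # Pixels brighter than the threshold are background and become 255;
    # ink pixels keep their original gray values.
    cleaned = [[p if p <= t else 255 for p in row] for row in image]
    rows = [r for r, row in enumerate(cleaned) if any(p < 255 for p in row)]
    cols = [c for c in range(len(cleaned[0]))
            if any(row[c] < 255 for row in cleaned)]
    if rows and cols:
        cleaned = [row[cols[0]:cols[-1] + 1]
                   for row in cleaned[rows[0]:rows[-1] + 1]]
    return [[p / 255 for p in row] for row in cleaned]


# Toy 4x5 image: three dark ink pixels on a noisy light background.
toy = [
    [250, 252, 251, 250, 249],
    [250, 30, 40, 251, 250],
    [251, 35, 250, 250, 252],
    [250, 250, 251, 249, 250],
]
result = preprocess(toy)
print(len(result), len(result[0]))
```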

IV-C Implementation details and evaluation metrics

In our model, we experimentally set $C = 256$, $H = 16$, $W = 25$, and $V = 32$. We construct a training mini-batch using $w = 18$ for CEDAR, HanSig, and BHSig-Bengali and $w = 20$ for BHSig-Hindi. Furthermore, we set $k = 5$ for all datasets. The constraint margin $\delta$ for the constraint mining strategy is empirically set to 0.2 for CEDAR and 0.3 for HanSig, BHSig-Bengali, and BHSig-Hindi. In our experiments, we empirically set $\lambda = 1$ in the objective function as the weight of the regional losses. We use the Adam optimizer with an initial learning rate of 0.001, and the learning rate decays by a factor of 0.5 every 15 epochs. We train our models for a maximum of 80 epochs and use the validation set for early stopping. All our experiments are implemented using the PyTorch framework.
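The step decay described above can be written as a small function. This sketch assumes zero-indexed epochs and that the decay is applied at epochs 15, 30, 45, and so on; the function name is hypothetical. In PyTorch, a schedule of this shape is available as `torch.optim.lr_scheduler.StepLR` with `step_size=15` and `gamma=0.5`, although the source does not state which scheduler the authors used.

```python
def learning_rate(epoch: int, base_lr: float = 0.001,
                  gamma: float = 0.5, step: int = 15) -> float:
    """Learning rate at a zero-indexed epoch under step decay."""
    return base_lr * gamma ** (epoch // step)


# Learning rate at the start of each decay interval within the 80-epoch budget.
for epoch in range(0, 80, 15):
    print(epoch, learning_rate(epoch))
```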

We report the performance using the common evaluation metrics of signature verification: False Reject Rate (FRR), False Accept Rate (FAR), Equal Error Rate (EER), and Area Under the Curve (AUC). FRR refers to the proportion of genuine signatures mistakenly rejected as forgeries. FAR is the proportion of forgeries mistakenly accepted as genuine signatures. EER is the error rate when FRR is equal to FAR, with the adjustment of the decision threshold $d_{thr}$.
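To make these definitions concrete, the sketch below computes FRR and FAR at a given threshold and approximates the EER by sweeping the threshold over the observed distances. The function names and the toy distances are assumptions; when no threshold gives exactly equal FRR and FAR, the sketch reports the mean of the two at the threshold where they are closest, which is one common approximation and not necessarily the one used in the paper.

```python
from typing import List, Sequence, Tuple


def frr_far(pos_dists: Sequence[float], neg_dists: Sequence[float],
            d_thr: float) -> Tuple[float, float]:
    """FRR: genuine pairs rejected (d > d_thr). FAR: forged pairs accepted (d <= d_thr)."""
    frr = sum(d > d_thr for d in pos_dists) / len(pos_dists)
    far = sum(d <= d_thr for d in neg_dists) / len(neg_dists)
    return frr, far


def equal_error_rate(pos_dists: Sequence[float],
                     neg_dists: Sequence[float]) -> Tuple[float, float]:
    """Return (approximate EER, threshold) where |FRR - FAR| is smallest."""
    best_gap, best_eer, best_thr = None, 0.0, 0.0
    for thr in sorted(set(pos_dists) | set(neg_dists)):
        frr, far = frr_far(pos_dists, neg_dists, thr)
        gap = abs(frr - far)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer, best_thr = gap, (frr + far) / 2, thr
    return best_eer, best_thr


# Toy distances (hypothetical): genuine pairs tend to be closer than forged pairs.
positive_distances: List[float] = [0.1, 0.2, 0.3, 0.9]
negative_distances: List[float] = [0.25, 0.8, 1.0, 1.2]
eer, thr = equal_error_rate(positive_distances, negative_distances)
print(eer, thr)
```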

IV-D Method evaluation and analysis

To assess the performance of the proposed MS-SigNet combined with the co-tuplet loss, we conduct experiments comparing it to a simple baseline model and alternative combinations of losses and models. The evaluation aims to determine the effectiveness of the co-tuplet loss compared to the triplet loss [14, 15] and to understand the relative performance of the MS-SigNet. As a simple baseline, we adopt a VGG-16 [45] pretrained on the ImageNet dataset as the feature extractor. Additionally, we train the MS-SigNet with the triplet loss and evaluate its performance for the comparison with the co-tuplet loss. Furthermore, alternative combinations include training the VGG-16 with either the co-tuplet loss or the triplet loss under the same experimental settings.

TABLE I: Performance comparison between different combinations of models and losses (evaluation metrics in %).

| Dataset | Method | FRR | FAR | EER | AUC |
|---|---|---|---|---|---|
| CEDAR | Simple Baseline (Pretrained VGG-16) | 16.67 | 27.75 | 23.08 | 84.82 |
| CEDAR | VGG-16 with triplet loss | 17.32 | 18.48 | 17.93 | 87.62 |
| CEDAR | VGG-16 with co-tuplet loss | 17.90 | 15.00 | 16.45 | 90.15 |
| CEDAR | MS-SigNet with triplet loss | 7.75 | 5.94 | 6.92 | 98.10 |
| CEDAR | MS-SigNet with co-tuplet loss | 3.55 | 3.33 | 3.51 | 99.47 |
| BHSig-Bengali | Simple Baseline (Pretrained VGG-16) | 11.71 | 22.23 | 17.04 | 91.07 |
| BHSig-Bengali | VGG-16 with triplet loss | 12.70 | 12.63 | 12.69 | 94.80 |
| BHSig-Bengali | VGG-16 with co-tuplet loss | 14.44 | 9.08 | 11.96 | 95.75 |
| BHSig-Bengali | MS-SigNet with triplet loss | 9.41 | 5.21 | 7.54 | 98.03 |
| BHSig-Bengali | MS-SigNet with co-tuplet loss | 6.20 | 5.93 | 6.12 | 98.64 |
| BHSig-Hindi | Simple Baseline (Pretrained VGG-16) | 17.28 | 17.71 | 17.52 | 90.64 |
| BHSig-Hindi | VGG-16 with triplet loss | 14.05 | 15.63 | 14.94 | 92.74 |
| BHSig-Hindi | VGG-16 with co-tuplet loss | 11.91 | 15.94 | 14.04 | 93.86 |
| BHSig-Hindi | MS-SigNet with triplet loss | 9.16 | 8.58 | 8.90 | 96.94 |
| BHSig-Hindi | MS-SigNet with co-tuplet loss | 6.56 | 6.76 | 6.68 | 98.28 |
| HanSig | Simple Baseline (Pretrained VGG-16) | 32.43 | 19.66 | 26.31 | 80.94 |
| HanSig | VGG-16 with triplet loss | 15.60 | 22.40 | 19.07 | 89.47 |
| HanSig | VGG-16 with co-tuplet loss | 14.20 | 16.21 | 15.26 | 92.60 |
| HanSig | MS-SigNet with triplet loss | 9.99 | 10.82 | 10.44 | 95.92 |
| HanSig | MS-SigNet with co-tuplet loss | 7.69 | 11.85 | 9.93 | 96.38 |

Table I reports the experimental results. The MS-SigNet trained with the co-tuplet loss achieves the best EER and AUC on all datasets. Compared to the simple baseline, it improves the EER by up to 19.57 percentage points and the AUC by up to 15.44 percentage points. Replacing the co-tuplet loss with the triplet loss for training the MS-SigNet worsens both EER and AUC. Similarly, training the VGG-16 with the co-tuplet loss yields better EER and AUC than training it with the triplet loss. Furthermore, when trained with the same loss function, the MS-SigNet consistently outperforms its VGG-16 counterpart. When trained with the co-tuplet loss and the triplet loss, the MS-SigNet exhibits EER improvements ranging from 5.33 to 12.94 and from 5.15 to 11.01 percentage points over the VGG-16, respectively. Overall, the results validate the effectiveness of the MS-SigNet coupled with the co-tuplet loss across different datasets.
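The percentage-point gaps quoted in this paragraph can be reproduced directly from the EER column of Table I. The snippet below is only a worked arithmetic check; the variable and function names are hypothetical, and the values are copied from the table.

```python
# EER (%) from Table I, in the order:
# (baseline, VGG-16 + triplet, VGG-16 + co-tuplet, MS-SigNet + triplet, MS-SigNet + co-tuplet)
EER = {
    "CEDAR": (23.08, 17.93, 16.45, 6.92, 3.51),
    "BHSig-Bengali": (17.04, 12.69, 11.96, 7.54, 6.12),
    "BHSig-Hindi": (17.52, 14.94, 14.04, 8.90, 6.68),
    "HanSig": (26.31, 19.07, 15.26, 10.44, 9.93),
}


def gaps(worse_idx: int, better_idx: int) -> dict:
    """EER difference in percentage points between two table columns, per dataset."""
    return {name: round(v[worse_idx] - v[better_idx], 2) for name, v in EER.items()}


vs_baseline = gaps(0, 4)      # baseline vs. MS-SigNet + co-tuplet
cotuplet_gain = gaps(2, 4)    # VGG-16 vs. MS-SigNet, both with co-tuplet loss
triplet_gain = gaps(1, 3)     # VGG-16 vs. MS-SigNet, both with triplet loss

print("max gain over baseline:", max(vs_baseline.values()))
print("co-tuplet range:", min(cotuplet_gain.values()), max(cotuplet_gain.values()))
print("triplet range:", min(triplet_gain.values()), max(triplet_gain.values()))
```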

IV-E Ablation studies

In this subsection, we conduct ablation studies to evaluate the importance of each module/branch in our network and the validity of selected operations. To assess the contribution of each module/branch in our proposed MS-SigNet, we remove specific components from the original framework. The results of the ablation studies are reported in Table II.

The findings indicate that removing multilevel feature fusion degrades performance across all datasets, most notably with an EER increase of 3.23 percentage points on CEDAR. This suggests that our proposed fusion mechanism effectively mitigates information loss caused by layer transmission and retains detailed signature stroke information for both global and regional branches. Additionally, the GRCA module lowers the EER by 0.28 to 1.27 percentage points. This highlights the effectiveness of our attention mechanism in focusing on important channel information. The results in Table II also indicate that both the global and regional branches contribute to enhancing signature verification performance. The global branch provides more performance gains for CEDAR, while the regional branch makes a greater contribution to the other datasets. These results suggest that multiscale feature learning is crucial for exploiting the respective advantages of the two branches.

To evaluate the validity of the selected operations in our approach, we compare the performances on CEDAR using different operations while maintaining the same experimental settings. Regarding multilevel feature fusion, the original “multiplication” is compared with the conventional “concatenation” as the fusion strategy. The results in Table III demonstrate that using “multiplication” improves the EER by 4.32 percentage points compared to using “concatenation.” This indicates the superiority of the multiplicative operation, which promotes interaction between low-level and high-level features during training and contributes to enhanced performance. For model training, the original approach trains each set of corresponding features with an individual co-tuplet loss. In the alternative approach, the features are concatenated and trained with only one loss. The results in Table III reveal that training “with individual losses” outperforms training “with only one loss” by 5.04 percentage points in EER. This suggests that the original operation allows the model to learn more discriminative representations and leads to improved performance.

TABLE II: Ablation study to evaluate the signature verification performance without each module/branch in the proposed framework (EER in %). Values in the parentheses indicate the performance degradation (EER increase in %) compared with using each module/branch.

| Without module/branch | CEDAR | BHSig-Bengali | BHSig-Hindi | HanSig |
|---|---|---|---|---|
| Multilevel feature fusion | 6.74 (3.23) | 7.29 (1.17) | 9.84 (3.16) | 11.02 (1.09) |
| GRCA | 4.78 (1.27) | 6.59 (0.47) | 6.96 (0.28) | 10.48 (0.55) |
| Global branch | 6.70 (3.19) | 7.34 (1.22) | 7.19 (0.51) | 10.70 (0.77) |
| Regional branch | 5.14 (1.63) | 8.26 (2.14) | 10.85 (4.17) | 10.76 (0.83) |

TABLE III: Performance comparison between different operations on CEDAR (EER in %).

| | Operation | EER |
|---|---|---|
| Multilevel feature fusion | Concatenation | 7.83 |
| Multilevel feature fusion | Multiplication (Ours) | 3.51 |
| Model training | With only one loss | 8.55 |
| Model training | With individual losses (Ours) | 3.51 |

IV-F Visualization analysis

IV-F1 Comparison between extracted features

We compare the 2D projections of extracted features from the simple baseline (pretrained VGG-16), MS-SigNet with triplet loss, and MS-SigNet with co-tuplet loss using the t-distributed stochastic neighbor embedding (t-SNE) algorithm [46]. To ensure clarity, we use signatures from a randomly selected subset of 50 names out of the 90 names in the HanSig test set. Fig. 6 (a) displays the feature space projection of the simple baseline. It shows that genuine and corresponding forged signatures of each name are clustered together. For instance, the red solid circle highlights a cluster from a single name where genuine signatures are visually inseparable from forged signatures represented by “o” and “x” marks, respectively. A similar pattern is observed for another name within the red dashed circle. This suggests that the simple baseline can differentiate between signatures of different names, distinguishing genuine signatures from random forgeries, but it struggles to separate genuine signatures from skilled forgeries in most cases.

(a) Simple baseline (pretrained VGG-16)

(b) MS-SigNet+triplet loss

(c) MS-SigNet+co-tuplet loss

Figure 6: 2D projections of the extracted features of 50 randomly selected names (each name has 20 genuine and 20 skilled-forged signature images) from the HanSig test set using t-SNE [46]. Each marker represents a signature sample: “o” represents the genuine signatures, and “x” represents the forged signatures. The signature samples belonging to different names are displayed in different colors. The red solid circles and dashed circles indicate the samples of two example names.

Fig. 6 (b) presents the feature space of MS-SigNet trained with the triplet loss. It demonstrates a better separation between genuine signatures and skilled forgeries compared to the simple baseline. However, some genuine signatures and their corresponding forgeries remain clustered together. For example, the two red dashed circles indicate separate genuine and forged signatures of a name, while the red solid circle shows that genuine and forged signatures are visually indistinguishable. Additionally, the triplet loss fails to pull genuine signatures of each name closer together. Fig. 6 (c) illustrates the feature space of MS-SigNet trained with the co-tuplet loss. It shows visually separable clusters of genuine signatures and skilled forgeries for each name. The two red solid circles highlight separate clusters of genuine and forged signatures for one name, while the two red dashed circles indicate a similar pattern for another name. These experimental results demonstrate the promising generalization ability of our proposed MS-SigNet with the co-tuplet loss to unseen data.

IV-F2 Comparison between global and regional branches

To highlight the benefits of multiscale feature learning, we provide visual explanations of the convolutional layers in both the global and regional branches. We generate heat maps using Grad-CAM [47, 48] for two genuine signature images and their corresponding forgeries from the test sets of BHSig-Bengali and HanSig. Fig. 7 displays the results obtained from Conv51G and Conv51R (the first convolutional layers of the global and regional branches).

(a) BHSig-Bengali

(b) HanSig

Figure 7: Visualizing the convolutional layer of the global branch and the regional branch using Grad-CAM [47, 48]. The images in the first column are the input images. The top two and bottom two images are genuine signatures and the corresponding skilled forgeries, respectively, from the test set of BHSig-Bengali and HanSig. The visualization results generated from Conv51G and Conv51R are shown in the second and third columns.

Conv51G exhibits attention to various areas of the entire image, capturing overall signature information such as outlines and stroke configuration. In comparison, Conv51R focuses more on signature strokes and highlights local details in specific signature regions. For instance, it emphasizes sharp curves in the upper region of the BHSig-Bengali images and slanting lines in the lower region of the HanSig images. Moreover, Conv51R emphasizes both small details (e.g., the end of a vertical line in genuine signatures of BHSig-Bengali) and larger detailed parts (e.g., the entire vertical line in forged signatures of BHSig-Bengali), enabling the model to learn fine-grained differences between signatures. These visualization results indicate that the global and regional branches have distinct focuses and capture different yet complementary signature information. Integrating information obtained from multiple spatial scales allows for the generation of more discriminative features in signature verification.

IV-G Performance comparison with state-of-the-art methods

We present a comparison between our proposed method and several state-of-the-art methods on the three public datasets. For our newly created HanSig, we provide results of several baseline methods for comparison under the same experimental settings. The results are summarized in Tables IV and V, respectively. Note that previous works might report different metrics for their methods. To ensure comparability across different studies, we focus on EER and AUC. In cases where previous works only report the Average Error Rate (AER), i.e., the average of FRR and FAR, we list the AER instead, since it is considered comparable to EER [2]. We also provide additional information about each compared method. Since WD and WI methods employ different training and evaluation schemes, we include WD methods only as references. It is important to mention that certain previous works did not perform noise removal of the image background during data preprocessing, as noted by [2]. However, the image backgrounds of genuine and forged signatures in CEDAR exhibit significant differences. Therefore, we exclude works that reported a 0% error rate without background removal.
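As a small worked example of why AER and EER are treated as comparable, the sketch below computes the AER from the FRR and FAR reported in Table I for MS-SigNet with the co-tuplet loss and compares it with the reported EER. The function and variable names are hypothetical; the numbers are copied from Table I.

```python
def average_error_rate(frr: float, far: float) -> float:
    """AER is the mean of the false reject rate and the false accept rate."""
    return (frr + far) / 2


# (FRR, FAR, EER) in % for MS-SigNet with co-tuplet loss, from Table I.
reported = {
    "CEDAR": (3.55, 3.33, 3.51),
    "BHSig-Bengali": (6.20, 5.93, 6.12),
    "BHSig-Hindi": (6.56, 6.76, 6.68),
    "HanSig": (7.69, 11.85, 9.93),
}

aer_gap = {}
for name, (frr, far, eer) in reported.items():
    aer = average_error_rate(frr, far)
    aer_gap[name] = abs(aer - eer)
    print(f"{name}: AER={aer:.2f}  EER={eer:.2f}  gap={aer_gap[name]:.2f}")
```

For these four rows the AER and the reported EER differ by less than 0.2 percentage points, even where FRR and FAR are unequal.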

TABLE IV: Comparison with existing methods on the CEDAR, BHSig-Bengali, and BHSig-Hindi datasets (evaluation metrics in %). Type refers to the WD or WI approach. #Ref indicates the number of genuine signatures as the references for the questioned signature to be compared with. Metric refers to whether metric learning is used for model training, with “Y” indicating yes and “N” indicating no. “*” refers to AER when FRR equals FAR; hence, it is the same as EER. “**” refers to AER, but FRR and FAR are of unequal value.

| Method | Type | #Ref | Metric | CEDAR EER | CEDAR AUC | BHSig-Bengali EER | BHSig-Bengali AUC | BHSig-Hindi EER | BHSig-Hindi AUC |
|---|---|---|---|---|---|---|---|---|---|
| Genetic algorithm [7] | WD | 12 | N | 4.67* | - | - | - | - | - |
| SigNet-F [22] | WD | 12 | N | 4.63 | - | - | - | - | - |
| Texture features [19] | WD | 12 | N | - | - | 33.82 | - | 24.47 | - |
| OC-SVM [49] | WD | 12 | Y | 5.60** | - | - | - | - | - |
| Micro deformations [39] | WD | 8 | N | - | - | 8.21 | - | 9.01 | - |
| Duplication model [51] | WD | 2 | N | - | - | 10.67 | 95.30 | 11.88 | 94.15 |
| Graph-based CNN [11] | WI | 10 | Y | 12.27 | - | - | - | - | - |
| P2S metric [30] | WI | 5 | Y | 9.29 | - | - | - | - | - |
| Morphology [41] | WI | 1 | N | 11.59 | - | - | - | - | - |
| Surroundedness [42] | WI | 1 | N | 8.33* | - | - | - | - | - |
| IDN [25] | WI | 1 | N | 3.62 | - | 4.68** | - | 6.96** | - |
| SigNet [9] | WI | 1 | Y | - | - | 13.89* | - | 15.36* | - |
| SURDS [40] | WI | 1 | Y | - | - | 12.66** | - | 10.50** | - |
| DeepHSV [43] | WI | 1 | Y | - | - | 11.92 | 95.50 | 13.34 | 94.00 |
| Siamese network [10] | WI | 1 | Y | 8.50 | - | - | - | - | - |
| MSDN [2] | WI | 1 | Y | 6.74 | - | - | - | - | - |
| MS-SigNet+co-tuplet loss | WI | 1 | Y | 3.51 | 99.47 | 6.12 | 98.64 | 6.68 | 98.28 |

TABLE V: Comparison with baseline methods on the HanSig dataset (evaluation metrics in %).

| Method | FRR | FAR | EER | AUC |
|---|---|---|---|---|
| SigNet [9] | 32.43 | 19.66 | 26.31 | 80.94 |
| Pretrained VGG-16 [45] | 19.18 | 36.92 | 28.86 | 78.34 |
| Pretrained ResNet-18 [50] | 27.57 | 37.05 | 32.42 | 73.77 |
| MS-SigNet+co-tuplet loss | 7.69 | 11.85 | 9.93 | 96.38 |

Table IV demonstrates that our MS-SigNet with the co-tuplet loss achieves superior performance compared to other methods on CEDAR. Our method outperforms OC-SVM [49], graph-based CNN [11], P2S metric [30], Siamese network [10], and MSDN [2], which also employ metric learning for signature verification. Notably, our approach surpasses MSDN [2], which solely utilizes local regions from input segmentation for feature learning. This result highlights the efficacy of our proposed multiscale feature learning.

Table IV highlights the competitive result of our proposed MS-SigNet with the co-tuplet loss on BHSig-Bengali. Our method surpasses SigNet [9], SURDS [40], and DeepHSV [43] by 7.77, 6.54, and 5.80 percentage points in terms of EER (or AER where EER is not reported), respectively. These three methods employ typical metric learning losses mentioned in Section III-E1 for signature verification. IDN [25] achieves an AER of 4.68%, which is lower than the 6.12% EER of our method on BHSig-Bengali. However, our proposed method outperforms IDN on the CEDAR and BHSig-Hindi datasets.

The proposed MS-SigNet and co-tuplet loss also perform well on BHSig-Hindi. Among the WI methods, our method achieves an EER of 6.68%, which is substantially lower than the error rates of SigNet [9] (15.36%), SURDS [40] (10.50%), and DeepHSV [43] (13.34%), and slightly lower than the AER of IDN [25] (6.96%). This comparison with other state-of-the-art methods showcases the competitiveness and effectiveness of our proposed approach.

Since HanSig is a newly created dataset, we compare the proposed method against three baselines on this dataset. The first baseline is SigNet [9], which utilizes a Siamese network architecture for offline signature verification. We use the source code provided by the authors and ensure that the data subsets used for SigNet are consistent with our method. The second and third baselines are VGG-16 [45] and ResNet-18 [50] pretrained on the ImageNet dataset, respectively. As shown in Table V, our proposed method demonstrates superior performance on HanSig when compared to the three baseline methods.

V Conclusion

In this study, we propose a multiscale feature learning network and a new metric learning loss to build an automatic handwritten signature verification system. The proposed MultiScale Signature feature learning Network (MS-SigNet) captures complementary signature information from multiple spatial scales. It can integrate the information to generate discriminative features for static signature verification. The multilevel feature fusion and global-regional channel attention (GRCA) modules designed for the two-branch structure provide further performance gains. To enhance the discriminative capability of our verification system, we propose the co-tuplet loss, a novel metric learning loss function. Experimental results demonstrate that our MS-SigNet with the co-tuplet loss surpasses the state-of-the-art methods on various benchmark datasets, showcasing its effectiveness in signature verification across different languages. While our results are promising, further improvements could be made by developing alternative methods to integrate multiple signature information.

References

  • [1] A. K. Jain, A. Ross, and S. Prabhakar, “An introduction to biometric recognition,” IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 1, pp. 4–20, 2004.
  • [2] L. Liu, L. Huang, F. Yin, and Y. Chen, “Offline signature verification using a region based deep metric learning network,” Pattern Recognit., vol. 118, p. 108009, 2021.
  • [3] J. Vargas, M. Ferrer, C. Travieso, and J. B. Alonso, “Off-line signature verification based on grey level information using texture features,” Pattern Recognit., vol. 44, no. 2, pp. 375–385, 2011.
  • [4] G. Pirlo and D. Impedovo, “Verification of static signatures by optical flow analysis,” IEEE Trans. Hum.-Mach. Syst., vol. 43, no. 5, pp. 499–505, 2013.
  • [5] ——, “Cosine similarity for analysis and verification of static signatures,” IET Biom., vol. 2, no. 4, pp. 151–158, 2013.
  • [6] M. I. Malik, M. Liwicki, A. Dengel, S. Uchida, and V. Frinken, “Automatic signature stability analysis and verification using local features,” in Proc. 14th Int. Conf. Front. Handwrit. Recognit., Hersonissos, Greece, 2014, pp. 621–626.
  • [7] M. Sharif, M. A. Khan, M. Faisal, M. Yasmin, and S. L. Fernandes, “A framework for offline signature verification system: Best features selection approach,” Pattern Recognit. Lett., vol. 139, pp. 50–59, 2020.
  • [8] H. Rantzsch, H. Yang, and C. Meinel, “Signature embedding: Writer independent offline signature verification with deep metric learning,” in Proc. 12th Int. Symp. Vis. Comput., vol. 10073, Las Vegas, NV, USA, 2016, pp. 616–625.
  • [9] S. Dey, A. Dutta, J. I. Toledo, S. K. Ghosh, J. Lladós, and U. Pal, “SigNet: Convolutional Siamese network for writer independent offline signature verification,” 2017, arXiv:1707.02131.
  • [10] Z.-J. Xing, F. Yin, Y.-C. Wu, and C.-L. Liu, “Offline signature verification using convolution Siamese network,” in Proc. Int. Conf. Graph. Image Process., vol. 10615, Qingdao, China, 2018, p. 106151I.
  • [11] P. Maergner, V. Pondenkandath, M. Alberti, M. Liwicki, K. Riesen, R. Ingold, and A. Fischer, “Combining graph edit distance and triplet networks for offline signature verification,” Pattern Recognit. Lett., vol. 125, pp. 527–533, 2019.
  • [12] Q. Wan and Q. Zou, “Learning metric features for writer-independent signature verification using dual triplet loss,” in Proc. 25th Int. Conf. Pattern Recognit., Milan, Italy, 2021, pp. 3853–3859.
  • [13] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), vol. 2, New York, NY, USA, 2006, pp. 1735–1742.
  • [14] M. Schultz and T. Joachims, “Learning a distance metric from relative comparisons,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 16, Whistler, BC, Canada, 2003, pp. 41–48.
  • [15] K. Q. Weinberger and L. K. Saul, “Distance metric learning for large margin nearest neighbor classification,” J. Mach. Learn. Res., vol. 10, no. 2, pp. 207–244, 2009.
  • [16] K. Sohn, “Improved deep metric learning with multi-class n-pair loss objective,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 29, Barcelona, Spain, 2016, pp. 1857–1865.
  • [17] X. Wang, X. Han, W. Huang, D. Dong, and M. R. Scott, “Multi-similarity loss with general pair weighting for deep metric learning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Long Beach, CA, USA, 2019, pp. 5017–5025.
  • [18] M. K. Kalera, S. Srihari, and A. Xu, “Offline signature verification and identification using distance statistics,” Int. J. Pattern Recognit. Artif. Intell., vol. 18, no. 07, pp. 1339–1360, 2004.
  • [19] S. Pal, A. Alaei, U. Pal, and M. Blumenstein, “Performance of an off-line signature verification method based on texture features on a large Indic-script signature dataset,” in Proc. 12th IAPR Int. Work. Doc. Anal. Syst., Santorini, Greece, 2016, pp. 72–77.
  • [20] M. Diaz, M. A. Ferrer, D. Impedovo, M. I. Malik, G. Pirlo, and R. Plamondon, “A perspective analysis of handwritten signature technology,” ACM Comput. Surv., vol. 51, no. 6, pp. 1–39, 2019.
  • [21] M. M. Hameed, R. Ahmad, M. L. M. Kiah, and G. Murtaza, “Machine learning-based offline signature verification systems: A systematic review,” Signal Process. Image Commun., vol. 93, p. 116139, 2021.
  • [22] L. G. Hafemann, R. Sabourin, and L. S. Oliveira, “Learning features for offline handwritten signature verification using deep convolutional neural networks,” Pattern Recognit., vol. 70, pp. 163–176, 2017.
  • [23] L. G. Hafemann, L. S. Oliveira, and R. Sabourin, “Fixed-sized representation learning from offline handwritten signatures of different sizes,” Int. J. Doc. Anal. Recognit., vol. 21, no. 3, pp. 219–232, 2018.
  • [24] S. Bonde, P. Narwade, and R. Sawant, “Offline signature verification using convolutional neural network,” in Proc. IEEE Int. Conf. Signal Process. Comput., Noida, India, 2020, pp. 119–127.
  • [25] P. Wei, H. Li, and P. Hu, “Inverse discriminative networks for handwritten signature verification,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Long Beach, CA, USA, 2019, pp. 5764–5772.
  • [26] A. Soleimani, B. N. Araabi, and K. Fouladi, “Deep multitask metric learning for offline signature verification,” Pattern Recognit. Lett., vol. 80, pp. 84–90, 2016.
  • [27] Y. Zhao, C. Shen, X. Yu, H. Chen, Y. Gao, and S. Xiong, “Learning deep part-aware embedding for person retrieval,” Pattern Recognit., vol. 116, p. 107938, 2021.
  • [28] F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A unified embedding for face recognition and clustering,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), vol. 1, Boston, MA, USA, 2015, pp. 815–823.
  • [29] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, “Signature verification using a ‘Siamese’ time delay neural network,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 6, Denver, CO, USA, 1993, pp. 737–744.
  • [30] Y. Zhu, S. Lai, Z. Li, and L. Jin, “Point-to-set similarity based deep metric learning for offline signature verification,” in Proc. 17th Int. Conf. Front. Handwrit. Recognit., Dortmund, Germany, 2020, pp. 282–287.
  • [31] B. Yu and D. Tao, “Deep metric learning with tuplet margin loss,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Seoul, Korea, 2019, pp. 6490–6499.
  • [32] M. A. Ferrer, M. Diaz-Cabrera, and A. Morales, “Static signature synthesis: A neuromotor inspired approach for biometrics,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 667–680, 2015.
  • [33] J. Ortega-Garcia, J. Fierrez-Aguilar, D. Simon, J. Gonzalez, M. Faundez-Zanuy, V. Espinosa, A. Satue, I. Hernaez, J.-J. Igarza, and C. Vivaracho, “MCYT baseline corpus: A bimodal biometric database,” IEE Proc.-Vis. Image Signal Process., vol. 150, no. 6, pp. 395–401, 2003.
  • [34] A. Soleimani, K. Fouladi, and B. N. Araabi, “UTSig: A Persian offline signature dataset,” IET Biom., vol. 6, no. 1, pp. 1–8, 2016.
  • [35] M. Liwicki, M. I. Malik, C. E. Van Den Heuvel, X. Chen, C. Berger, R. Stoel, M. Blumenstein, and B. Found, “Signature verification competition for online and offline skilled forgeries (SigComp2011),” in Proc. Int. Conf. Doc. Anal. Recognit. (ICDAR), Beijing, China, 2011, pp. 1480–1484.
  • [36] K. Yan, Y. Zhang, H. Tang, C. Ren, J. Zhang, G. Wang, and H. Wang, “Signature detection, restoration, and verification: A novel Chinese document signature forgery detection benchmark,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), New Orleans, LA, USA, 2022, pp. 5163–5172.
  • [37] J. Wei, Q. Wang, Z. Li, S. Wang, S. K. Zhou, and S. Cui, “Shallow feature matters for weakly supervised object localization,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Nashville, TN, USA, 2021, pp. 5989–5997.
  • [38] X. Qin, Z. Wang, Y. Bai, X. Xie, and H. Jia, “FFA-Net: Feature fusion attention network for single image dehazing,” in Proc. 34th AAAI Conf. Artif. Intell., vol. 34, New York, NY, USA, 2020, pp. 11908–11915.
  • [39] Y. Zheng, B. K. Iwana, M. I. Malik, S. Ahmed, W. Ohyama, and S. Uchida, “Learning the micro deformations by max-pooling for offline signature verification,” Pattern Recognit., vol. 118, p. 108008, 2021.
  • [40] S. Chattopadhyay, S. Manna, S. Bhattacharya, and U. Pal, “SURDS: Self-supervised attention-guided reconstruction and dual triplet loss for writer independent offline signature verification,” in Proc. 26th Int. Conf. Pattern Recognit., Montreal, QC, Canada, 2022, pp. 1600–1606.
  • [41] R. Kumar, L. Kundu, B. Chanda, and J. Sharma, “A writer-independent off-line signature verification system based on signature morphology,” in Proc. Int. Conf. Intell. Interact. Technol. Multimed. (IITM), New York, NY, USA, 2010, pp. 261–265.
  • [42] R. Kumar, J. Sharma, and B. Chanda, “Writer-independent off-line signature verification using surroundedness feature,” Pattern Recognit. Lett., vol. 33, no. 3, pp. 301–308, 2012.
  • [43] C. Li, F. Lin, Z. Wang, G. Yu, L. Yuan, and H. Wang, “DeepHSV: User-independent offline signature verification using two-channel CNN,” in Proc. Int. Conf. Doc. Anal. Recognit. (ICDAR), Sydney, NSW, Australia, 2019, pp. 166–171.
  • [44] N. Otsu, “A threshold selection method from gray-level histograms,” IEEE Trans. Syst. Man Cybern. Syst., vol. 9, no. 1, pp. 62–66, 1979.
  • [45] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. Int. Conf. Learn. Represent. (ICLR), San Diego, CA, USA, 2015. [Online]. Available: https://doi.org/10.48550/arXiv.1409.1556
  • [46] L. Van der Maaten and G. Hinton, “Visualizing data using t-SNE,” J. Mach. Learn. Res., vol. 9, pp. 2579–2605, 2008. [Online]. Available: http://jmlr.org/papers/v9/vandermaaten08a.html
  • [47] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual explanations from deep networks via gradient-based localization,” Int. J. Comput. Vis., vol. 128, no. 2, pp. 336–359, 2020.
  • [48] J. Gildenblat. (2021) PyTorch library for CAM methods. [Online]. Available: https://github.com/jacobgil/pytorch-grad-cam
  • [49] Y. Guerbai, Y. Chibani, and B. Hadjadji, “The effective use of the one-class SVM classifier for handwritten signature verification based on writer-independent parameters,” Pattern Recognit., vol. 48, no. 1, pp. 103–113, 2015.
  • [50] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, NV, USA, 2016, pp. 770–778.
  • [51] M. Diaz, M. A. Ferrer, and R. Sabourin, “Approaching the intra-class variability in multi-script static signature evaluation,” in Proc. 23rd Int. Conf. Pattern Recognit., Cancun, Mexico, 2016, pp. 1147–1152.