Article

CECL-Net: Contrastive Learning and Edge-Reconstruction-Driven Complementary Learning Network for Image Forgery Localization

1 School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510642, China
2 China Telecom Stocks Co., Ltd., Beijing 100033, China
3 School of Computer, Beijing University of Posts and Telecommunications, Beijing 100876, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(19), 3919; https://doi.org/10.3390/electronics13193919
Submission received: 2 September 2024 / Revised: 28 September 2024 / Accepted: 29 September 2024 / Published: 3 October 2024
(This article belongs to the Special Issue Image Processing Based on Convolution Neural Network)

Abstract

While most current image forgery localization (IFL) deep learning models focus primarily on the foreground of tampered images, they often neglect the essential complementary background semantic information. This oversight tends to create significant gaps in these models’ ability to thoroughly interpret and understand a tampered image, thereby limiting their effectiveness in extracting critical tampering traces. Given the above, this paper presents a novel contrastive learning and edge-reconstruction-driven complementary learning network (CECL-Net) for image forgery localization. CECL-Net enhances the understanding of tampered images by employing a complementary learning strategy that leverages foreground and background features, where a unique edge extractor (EE) generates precise edge artifacts, and edge-guided feature reconstruction (EGFR) utilizes the edge artifacts to reconstruct a fully complementary set of foreground and background features. To carry out the complementary learning process more efficiently, we also introduce a pixel-wise contrastive supervision (PCS) method that attracts consistent regions in features while repelling different regions. Moreover, we propose a dense fusion (DF) strategy that utilizes multi-scale and mutual attention mechanisms to extract more discriminative features and improve the representational power of CECL-Net. Experiments conducted on two benchmark datasets, one Artificial Intelligence (AI)-manipulated dataset and two real challenge datasets, indicate that our CECL-Net outperforms seven state-of-the-art models on three evaluation metrics.

1. Introduction

Given the widespread accessibility of user-friendly image editing software such as Adobe Photoshop (https://www.adobe.com/products/photoshop.html), Corel Draw (https://www.coreldraw.com/cn/), and GIMP (https://www.gimp.org), image manipulation has become a commonplace activity on social media platforms. Users often create and share amusing edited photos on these platforms, facilitating interaction and enhancing everyday life experiences. However, this ease of image manipulation also opens the door to the potential misuse of this technology, increasing the risk of disinformation crimes committed with malicious intent. Therefore, in the field of image forensics today, accurately identifying tampered pixels has become an urgent research focus.
Earlier studies tend to focus on a specific type of image tampering, such as splicing [1,2,3,4], copy–move [5,6,7,8,9,10], and inpainting [11,12,13,14,15,16]. However, these methods achieve high localization accuracy on only one type of dataset, and their practicality is limited. Thanks to the development of Convolutional Neural Networks (CNNs), more and more general image-tampering localization methods have been proposed. These methods can detect or locate multiple tampering types. For example, Zhou et al. [17] developed a double-stream network model (RGB-N Net) for end-to-end detection. This model combines the steganalysis rich model with Faster R-CNN, adding a noise stream to the original Faster R-CNN to provide auxiliary noise features. Wu et al. [18] proposed a unified end-to-end deep neural network called ManTra-Net. This model treats image forgery localization (IFL) as an anomaly detection task and can locate various image-tampering operations. Li et al. [19] presented an efficient end-to-end high-confidence image-manipulation localization network. This network fuses multi-scale adjacent features extracted from RGB streams and introduces morphological operations to extract multi-scale edge information. Xia et al. [20] classified features into primary, intermediate, and advanced levels and introduced two fusion modules to achieve superior fusion results. Dong et al. [21] introduced a multi-view and multi-scale supervised network (MVSS-Net). This network primarily consists of an edge-supervision branch and a noise-sensitive branch, using the noise view, the boundary image, and the real image for feature learning. Liu et al. [22] proposed a progressive spatio-channel correlation network (PSCC-Net). This network employs a progressive mechanism to predict manipulation masks at all scales and uses a spatio-channel correlation module to guide feature extraction via spatial and channel attention. Niloy et al. [23] proposed a contrastive learning method (CFL-Net) for IFL. This method leverages the difference in feature distribution between the untampered and manipulated regions of each image sample and does not focus on specific forgery footprints.
Conventional IFL methods mainly focus on the tampered image’s foreground details, often disregarding the complementary background context. These methods tend to depend heavily on image-tampering features, neglecting the distinct feature distributions of tampered and genuine regions. As a result, they fail to fully comprehend tampered images, which reduces their localization effectiveness when dealing with complex forms of image tampering and subtle tampering traces. Moreover, while high-level features enriched with semantic information are crucial for recognizing image attributes, many methods overemphasize them. This imbalance neglects the critical role of features extracted at each stage of the backbone network, impeding accurate pixel-level localization.
To address the above limitations, this paper presents a contrastive learning and edge-reconstruction-driven complementary learning network (CECL-Net). CECL-Net enhances the model’s understanding of the overall information within tampered images by engaging in complementary learning through a set of complementary foreground and background features. The foreground semantics furnish the model with essential information regarding the location of the tamper region, whereas the background semantics offer auxiliary cues from areas unaffected by tampering. Making the model pay special attention to the tampered and untampered regions independently provides it with a comprehensive understanding of the image, thereby enhancing its analytical capabilities. We propose a novel pixel-wise contrastive supervision method and introduce an edge-guided feature-reconstruction module to facilitate complementary learning. As demonstrated in Figure 1, the tampering localization accuracy of our network model is better than that of the traditional method, which only uses foreground features for supervised learning. Additionally, we introduce an innovative dense fusion strategy to combine features of different levels. This strategy provides context-rich, globally aware features for effective complementary learning.
The contributions of this study are summarized as follows:
  • We propose a contrastive learning and edge-reconstruction-driven complementary learning (CECL-Net) model for image forgery localization. An innovative complementary learning strategy is proposed to enrich the model’s interpretation of tampered images by synergistically analyzing foreground and background features, thereby improving its forensic capabilities. The codes of CECL-Net are publicly available at https://github.com/dream-sky1213/CECL-Net (accessed on 28 September 2024).
  • An improved pixel-wise contrastive supervision strategy and a novel edge-guided feature-reconstruction module are proposed to drive complementary learning. A dense fusion module is proposed to smoothly fuse the multi-layer feature information extracted at each stage of the backbone network, aiming to yield global contextual information that is comprehensive and feature-abundant.
  • Our proposed CECL-Net introduces a significant advancement in IFL, addressing key challenges and offering improved performance. The accompanying experiments validate the effectiveness of CECL-Net, highlighting its practical applicability and contributions to the field.
The rest of this paper is organized as follows: Section 2 provides a detailed description of complementary learning and contrastive learning. Section 3 elaborates on the proposed CECL-Net. Section 4 reports and analyzes the experimental results. Section 5 summarizes this study.

2. Related Works

2.1. Complementary Learning

Numerous studies [24,25,26,27,28] have aimed to enhance models’ representational and generalization capabilities by learning complementary features. However, most complementary learning methods employ traditional multi-branch parallel structures to extract these features, leading to larger and more complex network models. UP-Net [26] learns through two structurally identical branches: the initial-prediction branch uses multi-level contextual information to identify inconsistencies at different scales, while the final-prediction branch optimizes abstract coding features, integrating different levels of coding features to delineate finer tampered objects. Furthermore, U-Net [25] directs the model’s attention to the foreground object of the image through an attention mechanism between the instance-segmentation branch and the semantic-segmentation branch, strategically incorporating the foreground information into the background. DML [27] promotes mutual benefits among neural network models through inter-network learning or teaching. To construct a more extensive network, Dual-Net [24] employs two parallel CNNs for complementary feature learning, focusing on global and local features, respectively, and merges them effectively using an adaptive attention mechanism to enhance image classification performance. MC-Net [28] utilizes a two-branch network to operate on the foreground and background features and proposes a mutual attention module that blends foreground and background information, advancing communication across the two branches. Unlike these existing methods, we implement complementary learning by conducting pixel-wise contrastive supervision and edge-guided feature reconstruction on a pair of fully complementary foreground and background features.

2.2. Contrastive Learning

Contrastive learning was originally applied widely in unsupervised learning. Unlike traditional loss functions, which mainly focus on the error between the predicted result and the target value, its main idea is to learn by comparing different data samples [29,30]. By minimizing the contrastive loss, semantically similar or related samples are drawn closer within the feature-representation space, while dissimilar or unrelated samples are pushed apart.
Recent studies [31,32,33,34] have introduced contrastive learning into supervised learning to enhance the model’s ability to handle complex tasks. Sun et al. [33] proposed a Dual Contrastive Learning (DCL) framework for face-forgery detection. DCL creates different views as positive samples through specific data transformations and conducts contrastive learning at varying granularities to explore the correlations among samples and inconsistencies within samples. Niloy et al. [23] proposed a contrastive learning method (CFL-Net) that leverages the difference in feature distribution between the untampered and manipulated regions of each image sample and does not focus on specific forgery footprints. Unlike CFL-Net, we simultaneously apply contrastive learning to the feature extraction of both the foreground and background in tampered images, driving the model towards complementary learning. By minimizing the contrastive loss, we differentiate the feature distribution between the tampered and authentic regions while consolidating the internal feature distribution of the foreground and background in the tampered image. This strategy significantly enhances the model’s ability to understand and adapt to complete data and further improves the efficiency of complementary learning.

2.3. Image-Forgery Localization

IFL represents a pivotal area of inquiry within media forensics and computer vision communities. It primarily addresses the detection of alterations such as splicing, copy–move, and content removal in images. Currently, deep learning models for IFL have demonstrated outstanding performance. Within these methods, two key technologies play a crucial role: cross-domain feature fusion [18,21,23,35,36] and edge assistance [4,21,37].
Incorporating noise views or frequency insights has proven beneficial for detecting subtle tampering traces that are often imperceptible in standard RGB views. ManTra-Net [18] directly feeds the noise map, together with the input image, to a deep neural network. Both MVSS-Net [21] and EMT-Net [35] use noise branches and edge-supervision branches to learn semantic-agnostic, and thus more generalizable, features. CFL-Net [23] uses two encoders to extract RGB features and noise features, respectively. More recently, HiFi-Net [36] extracts features of forged images by applying the Laplacian of Gaussian (LoG) transform to the CNN feature maps in its feature extractor, exploiting image-generation artifacts present in both the RGB and frequency domains.
Empirical evidence suggests that the location information of boundary artifacts significantly aids in detecting tampered areas. To exploit this, methods incorporating boundary supervision have been developed to identify forgery traces around tampered zones. MVSS-Net [21] utilizes Sobel filters to establish edge-supervision branches, which yield more focused feature responses near the forged regions. In TA-Net [37], edge information is enhanced through the Operator-Induced Module (OIM) to boost the network’s ability to perceive the boundaries of image-manipulation areas. In this paper, we utilize edges to drive the model towards complementary learning, thereby enhancing the model’s attention to the subtle edges of tampered objects.

3. Methods

3.1. Network Overview

Motivation: From a biological perspective, our visual system does not solely focus on foreground objects when processing observable inputs. Instead, it simultaneously handles foreground and background information [38,39]. Background information, such as other objects, colors, and light in the scene, provides a context that assists our understanding and interpretation of foreground objects. When we process both foreground and background objects concurrently, we can determine the boundaries of an object by comparing its color, texture, and other characteristics with those of the background, enabling a more accurate segmentation of the foreground object. Notably, some animals, equipped with high survival intelligence, use colors or textures in their environment to camouflage themselves, blurring edge information to blend into the background [40]. For instance, geckos can change their skin color to merge with the environment, while octopuses can mimic the texture and shape of surrounding objects to avoid being detected by predators. Therefore, understanding and mastering edge knowledge can enhance our ability to uncover and identify these concealed objects.
Introduction of CECL-Net: The structural overview of CECL-Net is shown in Figure 2. CECL-Net introduces an innovative complementary learning strategy based on the abovementioned motivation. More specifically, DF utilizes features extracted from various layers of the backbone network to obtain complementary foreground and background features rich in global information. Subsequently, complementary learning is driven by separately conducting a pixel-wise contrastive supervision for each complementary feature and utilizing edge information extracted by EE for feature reconstruction.

3.2. Pyramid Vision Transformer

Pyramid Vision Transformer (PVT) is a transformer model [41] characterized by a visual pyramid structure. This design equips PVT with the ability to generate multi-scale, high-resolution feature maps, enhancing its effectiveness in processing visual information. Additionally, PVT introduces a Spatial Reduction Attention (SRA) layer, which significantly reduces the model’s computation and memory requirements, thereby increasing PVT’s efficiency in practical applications. PVT consists of four stages whose output strides range from 4 to 32. For an input $I$ of size $H \times W \times 3$, each stage of PVT generates a stage-wise feature $F_b^i$ of size $H/2^{i+1} \times W/2^{i+1} \times C_i$, where $i = 1, 2, 3, 4$. In this study, we take a PVT pretrained on the ImageNet-1K dataset as the backbone.
In subsequent work, we feed $\{F_b^1, \dots, F_b^4\}$ into the DF for feature fusion (Section 3.3), while $\{F_b^1, F_b^4\}$ are sent to the edge extractor for edge-artifact extraction.
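For concreteness, the following PyTorch sketch illustrates only the shapes of the four stage-wise features for a $416 \times 416$ input (the training resolution used later). The plain strided convolutions stand in for the actual PVT attention stages, and the channel widths (64, 128, 320, 512) follow the common PVT configuration; both are assumptions made for illustration, not the released backbone.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for the four PVT stages: the first stage has stride 4 and
# each later stage halves the resolution, giving output strides 4, 8, 16 and 32.
stages = nn.ModuleList([
    nn.Conv2d(3,   64,  7, stride=4, padding=3),
    nn.Conv2d(64,  128, 3, stride=2, padding=1),
    nn.Conv2d(128, 320, 3, stride=2, padding=1),
    nn.Conv2d(320, 512, 3, stride=2, padding=1),
])

x = torch.randn(1, 3, 416, 416)
feats = []                                   # F_b^1 ... F_b^4
for stage in stages:
    x = stage(x)
    feats.append(x)
print([tuple(f.shape) for f in feats])
# [(1, 64, 104, 104), (1, 128, 52, 52), (1, 320, 26, 26), (1, 512, 13, 13)]
# {F_b^1..F_b^4} feed DF; {F_b^1, F_b^4} additionally feed the edge extractor EE.
```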

3.3. Dense Fusion

It is recognized that shallow features derived from a backbone network primarily contain micro-information, such as edges, colors, and textures. In contrast, deeper features encompass more macro-level semantic information. DF is designed to fuse various hierarchical features extracted by a backbone network seamlessly and efficiently, resulting in the extraction of a more discriminative and refined complementary feature set. As depicted in Figure 3, DF is composed of two stages. In the first stage, we devised a multi-scale feature extractor to extract abundant contextual information. Subsequently, we designed a multi-layer dense interaction architecture employing mutual attention mechanisms. This design delves deeper into the crucial inter-relationships between hierarchical cross-scale features and fuses them smoothly.
(1) Multi-scale feature extractor: The first stage of DF focuses on multi-scale feature mining. Given that the PVT-based backbone network may not extract abundant contextual information, obtaining multi-scale features solely through the convolutional processing of stage features would be challenging. To overcome this, we drew inspiration from the Inception and Res2Net [42] modules and developed a multi-scale feature extractor (MFE) within the DF. The MFE applies dilated convolutions at progressively increasing dilation rates, thereby gradually widening the contextual understanding. More specifically, we first reduce the number of channels of the original feature, $F_b^i$, using a $1 \times 1$ convolution to facilitate subsequent processing. We utilize four branches to capture features at different scales, sequentially processed with dilated convolutions at dilation rates of $\{3, 5, 7\}$. Each branch is equipped with an asymmetric convolution that matches the size of the dilated convolution, reducing the computational effort. The output of each branch is added to the input of the next branch, and residual connections are used to successively enlarge the receptive field. The general form of the operation is defined as follows:
$$D_{out_k}^i = \begin{cases} Conv_{3\times1}\big(Conv_{1\times3}(Conv(F_b^i))\big), & k = 1 \\ Conv_s\big(Conv(F_b^i) + D_{out_{k-1}}^i\big), & k = 2, 3, 4, \end{cases}$$
where $F_b^i$ denotes the $i$-th feature map produced by the backbone network, $k$ is the branch number, $D_{out_k}^i$ represents the output of the $k$-th branch, and $+$ refers to element-wise addition. $Conv_s$ denotes the stacked convolutional layers mentioned above, while $Conv_{3\times1}(\cdot)$ and $Conv_{1\times3}(\cdot)$ refer to $3 \times 1$ and $1 \times 3$ convolution operations, respectively. Finally, we add the outputs of the four branches and apply $CBR_{3\times3}$ ($3 \times 3$ convolution + BatchNorm + ReLU) to obtain the output feature $F_{bd}^i$, $i \in \{1, 2, 3, 4\}$, embedded with multi-scale information, which is computed as follows:
$$F_{bd}^i = CBR_{3\times3}\Big(\mathrm{Add}_{k=1}^{4}\big(D_{out_k}^i\big)\Big),$$
where $\mathrm{Add}_{k=1}^{4}$ refers to the element-wise addition of all four branches.
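A minimal PyTorch sketch of the MFE described above is given below. The channel width, the exact kernel sizes of the asymmetric convolutions, and the pairing of asymmetric kernels with the dilation rates {3, 5, 7} are assumptions, since the text does not fix them precisely.

```python
import torch
import torch.nn as nn

def cbr(c_in, c_out, k=3, d=1):
    """Conv + BatchNorm + ReLU; padding keeps the spatial size."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=d * (k - 1) // 2, dilation=d, bias=False),
        nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class MFE(nn.Module):
    """Sketch of the multi-scale feature extractor (first DF stage)."""
    def __init__(self, c_in, c_mid=64):
        super().__init__()
        self.reduce = nn.Conv2d(c_in, c_mid, 1)            # 1x1 channel reduction
        # branch 1: plain asymmetric 3x1 / 1x3 convolutions (the k = 1 case of the equation)
        self.branch1 = nn.Sequential(
            nn.Conv2d(c_mid, c_mid, (1, 3), padding=(0, 1)),
            nn.Conv2d(c_mid, c_mid, (3, 1), padding=(1, 0)))
        # branches 2-4: asymmetric convolution followed by a dilated 3x3 convolution
        self.branches = nn.ModuleList()
        for r in (3, 5, 7):                                # dilation rates from the text
            self.branches.append(nn.Sequential(
                nn.Conv2d(c_mid, c_mid, (1, r), padding=(0, r // 2)),
                nn.Conv2d(c_mid, c_mid, (r, 1), padding=(r // 2, 0)),
                cbr(c_mid, c_mid, 3, d=r)))
        self.fuse = cbr(c_mid, c_mid, 3)                   # final CBR_3x3

    def forward(self, f_b):                                # f_b: backbone feature F_b^i
        x = self.reduce(f_b)
        d_out = [self.branch1(x)]
        for branch in self.branches:                       # output of branch k feeds branch k+1
            d_out.append(branch(x + d_out[-1]))
        return self.fuse(sum(d_out))                       # F_bd^i: add all branches, then CBR
```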
(2) Dense interaction of multi-layer features: The second stage of DF is the dense interaction and fusion of multi-layer features. To more effectively reveal the intrinsic relationships between hierarchical features and to realize a smoother, more coherent integration of features at each stage, we designed a mutual attention mechanism that facilitates in-depth interactions between the cross-level semantic features $\{F_{br}^1, \dots, F_{br}^4\}$. Specifically, we upsample each layer’s feature to the same size as $F_{br}^1$ and obtain the attention map of the designated level, denoted as $W_{fa}^i = \sigma(\mathrm{UP}(F_{br}^i))$, where $\mathrm{UP}(\cdot)$ denotes the upsampling operation and $\sigma$ represents the sigmoid function. Then, we interact each attention map with the upsampled features from the other layers through element-wise multiplication. After the interaction, the features of each layer are added to obtain smooth features. Finally, we concatenate the processed features from each layer and apply a $3 \times 3$ convolution to derive the final fusion feature, $F_f$, which can be represented as follows:
$$\begin{cases} F_{wr}^1 = F_{br}^1 + W_{fa}^1 \times (F_r^2 + F_r^3 + F_r^4) \\ F_{wr}^2 = F_r^2 + W_{fa}^2 \times (F_{br}^1 + F_r^3 + F_r^4) \\ F_{wr}^3 = F_r^3 + W_{fa}^3 \times (F_{br}^1 + F_r^2 + F_r^4) \\ F_{wr}^4 = F_r^4 + W_{fa}^4 \times (F_{br}^1 + F_r^2 + F_r^3), \end{cases}$$
$$F_f = Conv_{3\times3}\big(\mathrm{Concat}(F_{wr}^1, F_{wr}^2, F_{wr}^3, F_{wr}^4)\big),$$
where $F_r^i$ denotes the upsampled feature of each layer, $\times$ refers to element-wise multiplication, and $Conv_{3\times3}(\cdot)$ refers to a $3 \times 3$ convolution operation. The background feature is obtained by inverting the fusion feature $F_f$, i.e., $F_b + F_f = 1$. Subsequently, we will leverage these complementary foreground and background features to realize complementary learning.
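The second DF stage can be sketched as follows. The shared channel width and the bilinear upsampling are assumptions; following the text, the background feature is obtained by inverting the fused foreground feature.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseInteraction(nn.Module):
    """Sketch of the mutual-attention fusion over the four MFE outputs."""
    def __init__(self, c=64):
        super().__init__()
        self.fuse = nn.Conv2d(4 * c, c, 3, padding=1)      # Conv_3x3 after concatenation

    def forward(self, feats):                              # feats: [F_br^1, ..., F_br^4]
        ref = feats[0]                                     # highest-resolution feature F_br^1
        ups = [feats[0]] + [F.interpolate(f, size=ref.shape[-2:], mode='bilinear',
                                          align_corners=False) for f in feats[1:]]
        attn = [torch.sigmoid(u) for u in ups]             # W_fa^i = sigma(UP(F_br^i))
        fused = []
        for i in range(4):                                 # F_wr^i = F_r^i + W_fa^i * sum(others)
            others = sum(u for j, u in enumerate(ups) if j != i)
            fused.append(ups[i] + attn[i] * others)
        f_f = self.fuse(torch.cat(fused, dim=1))           # foreground fusion feature F_f
        f_b = 1.0 - f_f                                    # complementary background (F_b + F_f = 1)
        return f_f, f_b
```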

3.4. Edge Extractor

Accurate prior information about edges is crucial in IFL. Leveraging object boundary data allows us to effectively pinpoint tampered regions, even when the signs of manipulation are subtle. As shown in Figure 4, we developed EE to obtain edge artifacts that guide the subsequent complementary feature reconstruction. In EE, we merge the detail-oriented low-level feature, $f_b^1$, with the spatially informative high-level feature, $f_b^4$, to accurately represent the edge information associated with the object. Meanwhile, a multi-scale technique is utilized to preserve the integrity of fine edge details.
Specifically, two $1 \times 1$ convolution layers are first used to change the channels of $f_b^1$ and $f_b^4$ to 64 ($f_b^1$) and 256 ($f_b^4$), respectively. Then, we concatenate the feature $f_b^1$ with the upsampled $f_b^4$, followed by one $CBR_{3\times3}$, to obtain the initial fused edge feature $f_e^{ini}$:
$$f_e^{ini} = CBR_{3\times3}\big(\mathrm{Concat}(Conv_{1\times1}(f_b^1), \mathrm{UP}(Conv_{1\times1}(f_b^4)))\big).$$
Then, we feed the fused feature $f_e^{ini}$ into two dilated convolution groups with dilation rates $d \in \{1, 2\}$. This broadens the receptive field, enabling the full exploration of multi-scale features. Subsequently, we concatenate the outputs of the dilated convolutions and multiply the result with the learnable coefficient, $\vartheta$. We then obtain the multi-scale feature guide map, $f_{msg}$, via a $1 \times 1$ convolution. After this, we multiply $f_{msg}$ with the initial fused feature, $f_e^{ini}$, apply one $CBR_{3\times3}$, and add the result to the input feature, $f_b^1$, which is rich in low-level semantics. Finally, the output, $e_{out}$, is obtained through one $1 \times 1$ convolution layer and the sigmoid function:
$$f_{msg} = Conv\Big(\vartheta \cdot \mathrm{Concat}\big(CBR_{3\times3}^{d=1}(f_e^{ini}), CBR_{3\times3}^{d=2}(f_e^{ini})\big)\Big),$$
$$e_{out} = \sigma\Big(Conv\big(CBR_{3\times3}(f_{msg} \times f_e^{ini}) + f_b^1\big)\Big),$$
where $\vartheta$ denotes a learnable matrix, and $CBR_{3\times3}^{d=i}(\cdot)$ is a sequential operation combining a $3 \times 3$ dilated convolution with dilation rate $i$, batch normalization, and a ReLU function.
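A rough PyTorch sketch of EE follows. The internal channel width, the shape of the learnable coefficient $\vartheta$ (a scalar here), and the use of the channel-reduced low-level feature in the residual connection are assumptions made to keep the dimensions consistent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cbr(c_in, c_out, k=3, d=1):                            # Conv + BatchNorm + ReLU
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=d * (k - 1) // 2, dilation=d, bias=False),
        nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class EdgeExtractor(nn.Module):
    """Sketch of EE: fuses low-level f_b^1 and high-level f_b^4 into an edge map e_out."""
    def __init__(self, c1, c4, c=64):
        super().__init__()
        self.reduce1 = nn.Conv2d(c1, 64, 1)                # channels of f_b^1 -> 64
        self.reduce4 = nn.Conv2d(c4, 256, 1)               # channels of f_b^4 -> 256
        self.init_fuse = cbr(64 + 256, c, 3)               # -> f_e^ini
        self.dil1, self.dil2 = cbr(c, c, 3, d=1), cbr(c, c, 3, d=2)
        self.theta = nn.Parameter(torch.ones(1))           # learnable coefficient (scalar assumed)
        self.guide = nn.Conv2d(2 * c, c, 1)                # 1x1 conv -> f_msg
        self.refine = cbr(c, 64, 3)
        self.out = nn.Conv2d(64, 1, 1)

    def forward(self, f_b1, f_b4):
        low = self.reduce1(f_b1)
        high = F.interpolate(self.reduce4(f_b4), size=low.shape[-2:],
                             mode='bilinear', align_corners=False)
        f_ini = self.init_fuse(torch.cat([low, high], dim=1))
        f_msg = self.guide(self.theta * torch.cat([self.dil1(f_ini),
                                                   self.dil2(f_ini)], dim=1))
        e = self.refine(f_msg * f_ini) + low               # residual with the low-level feature
        return torch.sigmoid(self.out(e))                  # e_out
```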

3.5. Proposed Complementary Learning Strategy

(1) Complementary learning: In the field of IFL, accurately segregating the foreground from the background in tampered images is important. Nevertheless, discerning the foreground can prove challenging when an image has been meticulously altered. Consequently, we incorporated the analysis of the background information of the tampered image to aid in this distinction. To this end, we introduce an innovative contrastive learning and edge-reconstruction-driven complementary learning strategy that leverages the distinct qualities of foreground and background features to address the challenges of IFL. Specifically, our method employs pixel-wise contrastive supervision combined with edge-guided feature reconstruction, utilizing a paired set of intensively fused, fully complementary foreground and background features (see Section 3.3). This innovative approach involves the application of complementary masks that collaboratively supervise both the foreground and background features, which are enhanced through contrastive and reconstructive processes. By addressing the high interdependence between the initial foreground and background features, our technique directly facilitates the generation of precise foreground and background maps in the subsequent predictions. This is crucial for ensuring the accurate localization of image forgeries. Ultimately, we utilize the prediction map derived from the foreground features as the definitive result, allowing for a more reliable identification of tampered areas. Moreover, by concurrently emphasizing both foreground and background features, our method delivers a richer contextual understanding of the image. This enhanced context not only aids the model in comprehending the overall semantics of the image but also significantly bolsters its resilience against a wide array of tampering techniques. By effectively leveraging the relationship between foreground and background information, our approach improves the model’s ability to discern subtle discrepancies that may indicate forgery, leading to superior performance in IFL tasks. Next, we introduce an edge-guided feature reconstruction module and pixel-wise contrastive supervision used in complementary learning.
(2) Edge-guided feature reconstruction: Although both foreground and background features are rich in contextual information, their high correlation at the outset can result in data redundancy when they are directly used for supervised learning via their complementary attributes. To address this, a careful reconstruction of these complementary features is necessary to drive complementary learning. We therefore advocate an innovative edge-guided feature-reconstruction method that delves deeper into the representational insight of the two original complementary features. Leveraging the edge features shared by the foreground and background, we employ accurate edge priors to guide the dual reconstruction process. Specifically, we process a foreground feature, $F_f$, or a background feature, $F_b$, with dilated convolution groups at dilation rates $d \in \{1, 2, 3, 4\}$ and then splice the extracted edge feature map, $e_{out}$, into each processed feature to enhance the model’s perception of the edges of the tampered region. As shown in Figure 5, the Split–Insert (SI) concatenation operation we introduce consists of two parts, Split and Insert:
$$\mathrm{Step\ 1:}\ \mathrm{Split}(A) \rightarrow \{A_1, \dots, A_k\}, \qquad \mathrm{Step\ 2:}\ \mathrm{Insert}(\{A_1, B\}, \dots, \{A_k, B\}).$$
Our edge-guided feature-reconstruction strategy transcends conventional, perspective-limited processing by enriching edge definitions across various receptive fields. Specifically, we integrate edge priors extracted through EE into features obtained from different dilated convolution operations. This multifaceted approach boosts the model’s boundary sensitivity from multiple angles, simultaneously expanding its receptive field for a more nuanced detection capability. The above process can be described as follows:
$$e_{f_i}^f = \mathrm{SI}\big(CBR_{3\times3}^{d=i}(F_f), e_{out}\big), \quad i = 1, 2, 3, 4,$$
$$e_{f_i}^b = \mathrm{SI}\big(CBR_{3\times3}^{d=i}(F_b), e_{out}\big), \quad i = 1, 2, 3, 4,$$
where $e_{f_i}^f$ denotes the edge-enhanced foreground feature across divergent receptive fields, while $e_{f_i}^b$ denotes the background feature enhanced in the same manner. Subsequently, the edge-enhanced features from the various receptive fields are aggregated to yield the edge-enhanced foreground and background:
$$e_f^f = \mathrm{Add}_{i=1}^{4}\big(e_{f_i}^f\big),$$
$$e_f^b = \mathrm{Add}_{i=1}^{4}\big(e_{f_i}^b\big),$$
where $\mathrm{Add}_{i=1}^{4}$ denotes the element-wise addition of all four edge-enhanced features. Then, we introduce the multi-scale channel attention (MSCA) mechanism [43] to improve the extraction and refinement of multi-level and multi-scale information within the features $e_f^f$ and $e_f^b$. The MSCA mechanism excels at handling targets of differing scales. Unlike conventional multi-scale attention methods, MSCA utilizes point-wise convolutions across dual branches to compress and dilate the features along the channel dimension efficiently. This technique enables the network to emphasize large-scale objects within a broader context while focusing on small-scale objects within a narrower frame, effectively integrating multi-scale contextual information across channels. Therefore, we feed the fused feature, $e_f^f$ or $e_f^b$, into the MSCA component for further processing, and element-wise multiplication is adopted to fuse the output of MSCA with the corresponding input feature. Finally, a residual structure with $Conv_{3\times3}$ and $\sigma$ is used to obtain the final predicted map, $f_e^f$ or $f_e^b$. The above process can be described as follows:
$$f_e^f = \sigma\Big(Conv_{3\times3}\big(f_f + \mathrm{MSCA}(e_f^f) \times e_f^f\big)\Big),$$
$$f_e^b = \sigma\Big(Conv_{3\times3}\big(f_b + \mathrm{MSCA}(e_f^b) \times e_f^b\big)\Big),$$
where $\mathrm{MSCA}(\cdot)$ refers to the MSCA function mentioned above.
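The sketch below combines the Split–Insert operation, the four dilated branches, and a rough MS-CAM-style channel attention standing in for MSCA [43]. The $1 \times 1$ merge convolutions and the reduction ratio are assumptions added to keep channel counts consistent, so this should be read as an illustration rather than the released EGFR.

```python
import torch
import torch.nn as nn

def cbr(c_in, c_out, k=3, d=1):                            # Conv + BatchNorm + ReLU
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=d * (k - 1) // 2, dilation=d, bias=False),
        nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

def split_insert(feat, edge, groups=4):
    """Split the feature channels into groups and insert the edge map after each group."""
    out = []
    for chunk in torch.chunk(feat, groups, dim=1):
        out += [chunk, edge]
    return torch.cat(out, dim=1)                           # channels: C + groups

class MSCA(nn.Module):
    """Rough multi-scale channel attention: a local and a global point-wise branch."""
    def __init__(self, c, r=4):
        super().__init__()
        self.local = nn.Sequential(nn.Conv2d(c, c // r, 1), nn.ReLU(inplace=True),
                                   nn.Conv2d(c // r, c, 1))
        self.glob = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(c, c // r, 1),
                                  nn.ReLU(inplace=True), nn.Conv2d(c // r, c, 1))
    def forward(self, x):
        return torch.sigmoid(self.local(x) + self.glob(x))

class EGFR(nn.Module):
    """Sketch of edge-guided feature reconstruction for one stream (foreground or background)."""
    def __init__(self, c=64, groups=4):
        super().__init__()
        self.dil = nn.ModuleList([cbr(c, c, 3, d) for d in (1, 2, 3, 4)])
        self.merge = nn.ModuleList([nn.Conv2d(c + groups, c, 1) for _ in range(4)])
        self.msca = MSCA(c)
        self.head = nn.Conv2d(c, 1, 3, padding=1)
        self.groups = groups

    def forward(self, feat, e_out):                        # e_out assumed to match feat spatially
        ef = 0
        for dil, merge in zip(self.dil, self.merge):       # e_f_i, aggregated over i = 1..4
            ef = ef + merge(split_insert(dil(feat), e_out, self.groups))
        att = self.msca(ef)                                # channel attention weights
        return torch.sigmoid(self.head(feat + att * ef))   # predicted foreground/background map
```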
(3) Pixel-wise contrastive supervision: In IFL tasks, enhancing the contrast between the tampered and authentic regions and maintaining the uniform distribution of these two regions can significantly improve the model’s localization accuracy. Moreover, augmenting the contrast between the tampered and authentic regions of foreground and background features can enhance the initial feature reconstruction, thereby further boosting the efficiency of complementary learning. To this end, we propose a novel pixel-wise contrastive supervision (PCS) approach that treats the foreground and background as two distinct clusters.
This method uses a pair of complementary masks for contrastive supervision to learn their similarities and differences. To simplify subsequent computations, we resize $F_f$ or $F_b$ to $64 \times 64$ and assign a label to each pixel based on the corresponding ground-truth mask. We then construct positive and negative pairs between $F_f$ or $F_b$ and their corresponding ground truths, which are used for the ensuing contrastive learning. Given that [44] indicates that false negative pairs are not beneficial for contrastive learning, we disregard the interaction between authentic regions for the foreground features to mitigate the negative impact of false negative pairs. We only consider tampered pixels in the mask and tampered pixels in the features to be positive pairs. In contrast, tampered pixels in the mask paired with authentic pixels in the features, and authentic pixels in the mask paired with tampered pixels in the features, are considered negative pairs. (The design of positive and negative pairs for the background features is diametrically opposite to that of the foreground features.) This method is illustrated in Figure 6. For the foreground features, we propose the following contrastive loss, $L_{CL}^f$:
$$L_{CL}^f = -\frac{1}{|N|}\sum_i \log \frac{\exp\!\big(gt_i^t \cdot fg_i^t / \tau\big)}{\exp\!\big(gt_i^t \cdot fg_i^t / \tau\big) + \sum_j \exp\!\big(gt_i^t \cdot fg_j^a / \tau\big) + \sum_q \exp\!\big(gt_i^a \cdot fg_q^t / \tau\big)},$$
where the similarity between the feature and the mask is measured by a dot product. $gt_i^t$ and $gt_i^a$ denote a tampered pixel and an authentic pixel in the ground truth, respectively. $fg_i^t$ is the pixel in $F_f$ that has the same label as $gt_i^t$, forming a positive pair with $gt_i^t$ and a negative pair with $gt_i^a$. $fg_j^a$ is a pixel with the same label as $gt_i^a$ and forms a negative pair with $gt_i^t$. $N$ is the number of positive pairs in the foreground feature, and $\tau$ is the temperature hyperparameter. Similarly, for the background features, we propose the contrastive loss $L_{CL}^b$:
$$L_{CL}^b = -\frac{1}{|N|}\sum_i \log \frac{\exp\!\big(gt_i^a \cdot bg_i^a / \tau\big)}{\exp\!\big(gt_i^a \cdot bg_i^a / \tau\big) + \sum_j \exp\!\big(gt_i^a \cdot bg_j^t / \tau\big) + \sum_q \exp\!\big(gt_i^t \cdot bg_q^a / \tau\big)},$$
where $bg_j^t$ and $bg_i^a$ denote a tampered pixel and an authentic pixel in $F_b$, respectively. Our independent contrastive supervision approach in complementary learning delves deeper into the intrinsic attributes of the foreground and background, enhancing the model’s precision in discerning and differentiating their complementary data. This nuanced comprehension of feature complementarity and distinction equips the model to detect more intricate forms of tampering, including attempts that mimic the characteristics of the foreground or background to conceal manipulation.
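The loss above pairs ground-truth pixels with feature pixels through a dot product. The snippet below is a hedged stand-in that keeps the same attract/repel labelling but uses the more common feature-to-feature supervised InfoNCE formulation, with L2 normalization and pixel subsampling added purely to keep the example tractable; it is an illustration of the idea rather than the authors' exact pairing.

```python
import torch
import torch.nn.functional as F

def pcs_foreground_loss(feat, mask, tau=0.1, max_pixels=1024):
    """Illustrative pixel-wise contrastive loss for the foreground branch.
    feat: (B, C, 64, 64) resized foreground features; mask: (B, 1, 64, 64), 1 = tampered."""
    B, C, H, W = feat.shape
    f = F.normalize(feat, dim=1).permute(0, 2, 3, 1).reshape(-1, C)
    y = (mask.reshape(-1) > 0.5).float()
    idx = torch.randperm(f.size(0), device=f.device)[:max_pixels]   # subsample pixels
    f, y = f[idx], y[idx]
    anchors = y > 0.5                            # anchors are tampered pixels only, so
    if anchors.sum() == 0:                       # authentic-authentic (potential false-
        return f.sum() * 0.0                     # negative) pairs are never formed
    eye = torch.eye(len(y), dtype=torch.bool, device=f.device)
    sim = (f @ f.t() / tau).masked_fill(eye, -1e4)                   # dot-product similarity
    pos = (y[:, None] == y[None, :]).float().masked_fill(eye, 0.0)   # same-label pairs attract
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    loss = -(pos[anchors] * log_prob[anchors]).sum(1) / pos[anchors].sum(1).clamp(min=1)
    return loss.mean()
```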

3.6. Loss Function

Our model employs three types of supervision: mask supervision of the tampered object’s foreground and background ($f_{gt}$ and $b_{gt}$); supervision of the tampered object’s edge, $e_{gt}$; and the proposed pixel-wise contrastive supervision. Given that the number of tampered pixels and the edge pixels of tampered objects is typically small, we use dice losses [45] to counter the strong imbalance between positive and negative samples. It is worth noting that both mask supervision and contrastive supervision are applied to both the foreground and background objects in IFL. Therefore, the total loss is defined as follows:
$$L_{total} = \delta L_{Dice}^f(f_{out}, f_{gt}) + \lambda L_{Dice}^b(b_{out}, b_{gt}) + \mu L_{Dice}^e(e_{out}, e_{gt}) + L_{CL}^f + L_{CL}^b,$$
where $L_{Dice}^f(\cdot)$, $L_{Dice}^b(\cdot)$, and $L_{Dice}^e(\cdot)$ denote the foreground, background, and edge dice losses, respectively; $f_{gt}$ and $b_{gt}$ refer to the ground-truth masks of the foreground and background, respectively; $e_{gt}$ refers to the edge image; and $\delta$, $\lambda$, and $\mu$ are the weights of the foreground, background, and edge losses, respectively. During the training phase, we balance the contributions of the foreground and background information by adjusting the weights $\delta$ and $\lambda$.
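A compact sketch of the dice losses and the total objective, using the weights reported later in Section 4.1 ($\delta = 0.1$, $\lambda = 0.3$, $\mu = 0.6$); treating the background ground truth as the inverted foreground mask is an assumption consistent with the complementary setup.

```python
import torch

def dice_loss(pred, target, eps=1.0):
    """Soft dice loss for a predicted map in [0, 1] and a binary target of the same shape."""
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return (1.0 - (2.0 * inter + eps) / (union + eps)).mean()

def total_loss(f_out, b_out, e_out, f_gt, e_gt, l_cl_f, l_cl_b,
               delta=0.1, lam=0.3, mu=0.6):
    """L_total = delta*L_dice^f + lambda*L_dice^b + mu*L_dice^e + L_CL^f + L_CL^b."""
    b_gt = 1.0 - f_gt                      # background mask as the complement of the foreground
    return (delta * dice_loss(f_out, f_gt) + lam * dice_loss(b_out, b_gt)
            + mu * dice_loss(e_out, e_gt) + l_cl_f + l_cl_b)
```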

4. Experiments

4.1. Experimental Settings

(1) Datasets: We assessed the performance of the proposed CECL-Net on five datasets: two widely used public manipulation datasets (CASIA2 [46] and NIST [47]), one AI-based manipulation dataset (IPM [48]), and two realistic challenge datasets (IMD [49] and WILD [50]). For a fair comparison, we adopted the most popular training–testing splits for CASIA and NIST, following PSCC-Net [22]. For the IPM dataset, we used the 1000-image test set openly shared by the original authors, while training on a subset of only 6000 images, significantly fewer than the 15K training images provided by the original authors. For the two realistic challenge datasets, IMD and WILD, we adopted an 8:2 training–testing split. The specific allocation of images to these sets is outlined in Table 1. In this study, to focus exclusively on evaluating our model’s performance, we refrained from pretraining it on a synthetic manipulation dataset; remarkably, our model surpassed the baseline models without relying on any extensive synthetic data for pretraining.
(2) Implementation details: CECL-Net was implemented in PyTorch and optimized on a single NVIDIA GeForce RTX 4090 with the Adam [51] optimizer. The learning rate was set to $1 \times 10^{-4}$, the batch size to 8, and the maximum number of epochs to 100. All input images were resized to $416 \times 416$ during the training phase. In our loss function, the foreground loss weight is $\delta = 0.1$, the background loss weight is $\lambda = 0.3$, and the edge loss weight is $\mu = 0.6$. The detailed experimental settings are shown in Table 2.
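For reference, the reported optimizer and preprocessing settings translate to roughly the following; the model here is a trivial placeholder rather than CECL-Net itself, and the snippet only fixes the hyperparameters listed above.

```python
import torch
import torch.nn as nn
import torchvision.transforms as T

preprocess = T.Compose([T.Resize((416, 416)), T.ToTensor()])   # inputs resized to 416x416
model = nn.Conv2d(3, 1, 3, padding=1)                          # placeholder for CECL-Net
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)      # Adam, learning rate 1e-4
BATCH_SIZE, MAX_EPOCHS = 8, 100
```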
(3) Performance metrics: In this paper, we assessed the performance of our CECL-Net using three metrics: the F1 score (F1), Intersection over Union (IoU), and Area Under the receiver operating characteristic Curve (AUC). The F1 score combines precision and recall into a single measure. Its spectrum ranges from 0, signifying the worst possible performance, to 1, denoting optimal performance. IoU evaluates the overlap between predicted and actual results; a higher value represents a better segmentation accuracy of the model. AUC measures the overall model performance, with a higher value indicating a superior ability to distinguish between classes.
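The three metrics can be computed per image as in the sketch below (scikit-learn's roc_auc_score is used for the AUC); the 0.5 binarization threshold is an assumption, as the paper does not state it here.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def f1_iou_auc(pred_prob, gt_mask, thr=0.5):
    """Pixel-level F1, IoU and AUC for one image; pred_prob and gt_mask are
    HxW arrays (predicted probabilities and binary labels, 1 = tampered)."""
    pred = (pred_prob >= thr).astype(np.uint8)
    gt = gt_mask.astype(np.uint8)
    tp = np.logical_and(pred == 1, gt == 1).sum()
    fp = np.logical_and(pred == 1, gt == 0).sum()
    fn = np.logical_and(pred == 0, gt == 1).sum()
    f1 = 2 * tp / (2 * tp + fp + fn + 1e-8)
    iou = tp / (tp + fp + fn + 1e-8)
    auc = roc_auc_score(gt.ravel(), pred_prob.ravel()) if gt.min() != gt.max() else float('nan')
    return f1, iou, auc
```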

4.2. Experimental Results

(1) Quantitative evaluations: As shown in Table 3, we compared seven state-of-the-art (SOTA) algorithms against our proposed CECL-Net. CECL-Net consistently outperformed all other models across the three metrics on the five test sets, marking a remarkable advancement even over the second-best models. Our model demonstrates exceptional superiority on the two realistic challenge datasets in particular, underscoring its robust generalization ability. Taking the IPM test set as an illustration, CECL-Net yields improvements of 0.144, 0.145, and 0.072 in F1, IoU, and AUC, respectively, over the second-best of the seven SOTA results. Furthermore, our model retains its superior performance across most datasets even when the existing backbone is replaced with a SOTA backbone. The exception is that, when our backbone is changed to ResNet-50, our performance does not match the latest CFL-Net on certain datasets. We hypothesize that this may be due to CFL-Net’s incorporation of ASPP modules after its dual ResNet-50 architecture, which likely enhances the model’s encoding capabilities.
(2) Qualitative evaluation: Figure 7 visually compares seven popular IFL baselines and our CECL-Net. Here, we picked a challenging sample from each of the five datasets. Owing to the superiority of our proposed complementary learning strategy, CECL-Net can better deal with complex tampered images and can obtain prediction maps with a more refined edge contour of the tampered object. In the case of test images from the IMD dataset, only our method could accurately predict the exact outline of the guitar. Additionally, only ours could accurately predict the child spliced onto the image from the WILD dataset sample.
(3) Ablation study: In this study, we executed a sequence of meticulously structured ablation studies to assess the effectiveness of each proposed module. Additionally, we conceived an autonomous ablation study explicitly tailored to our innovative complementary learning strategy. We designed seven schemes to show the experimental results more intuitively, as shown in Table 4. EE, DF, EGFR, and PCS represent the abovementioned modules (See Section 3). BS represents the supervised learning method using a background feature that is complementary to the foreground feature. We use concatenation instead of the fusion operation and a 3 × 3 convolution instead of EGFR. The qualitative results of the ablation studies are depicted in Figure 8. Upon examining Figure 8 (Scheme B), which does not incorporate DF, it is evident that the target localization capability is significantly compromised due to the absence of a multi-scale feature-mining process and a smooth fusion strategy. This undesirable result underscores the importance of our proposed DF. Only utilizing foreground and background features to realize complementary supervised learning (Figure 8 (Scheme D)) or ignoring BS and solely employing PCS (Figure 8 (Scheme E)) enhances the prediction performance in comparison to using neither (Figure 8 (Scheme C)). However, their combined use (Figure 8 (Scheme G)) facilitates a precise localization, embodying their synergistic effect. The absence of EGFR (Figure 8 (Scheme F)) results in a degradation of the localization performance due to the high correlation between foreground and background features and the prior lack of proper fusion between edge information and features.
Compared with Scheme A in Figure 8, Scheme F does not significantly improve the accuracy of the localization results. To further verify the effectiveness of our proposed EE, we compared it with two widely used edge strategies on the NIST and IMD datasets. The first is the edge-extraction strategy adopted in MVSS-Net [21], in which an edge-supervision module gradually combines the features of different ResNet blocks for edge detection and a Sobel layer is introduced to enhance edge information. The second is the edge-extraction strategy in ET-Net [52], which directly splices the Sobel-processed features of each layer to extract edge features. As shown in Table 5, EE is superior to both compared methods; its F1, IoU, and AUC on the NIST dataset are 0.008, 0.021, and 0.009 higher, respectively, than those of the second-best edge-extraction method. The visual results in Figure 9 demonstrate that our method extracts edge details more precisely than the others, whose edge features either lack critical detail or contain excessive noise, negatively impacting subsequent edge guidance. Regarding the quantitative results, our method outperformed the other methods on all three metrics on the NIST and IMD datasets.
To validate the efficacy of our proposed pixel-level contrastive supervision strategy, we conducted a comparison with the contrastive learning strategy used in CFL-Net [23] on the NIST and IMD datasets. The contrastive learning strategy in CFL focuses on querying adjacent pixels to gain contextual knowledge but overlooks the impact of false negative pairs. As shown in Table 6, our contrastive supervision method is superior to the contrastive supervision method in CFL-Net. As illustrated in Figure 10, when confronted with tampered images that are challenging to detect, the CFL-Net’s contrastive learning strategy yields subpar results, whereas our method generates relatively more accurate predictions. Therefore, it is crucial to be vigilant about the adverse impacts of false negative pairs on pixel-wise contrastive supervision. Regarding the quantitative results, our method outperforms the method used in CFL-Net on all metrics of the NIST and IMD datasets.
(4) Computational complexity analyses: The proposed CECL-Net was compared with popular baselines (MVSS-Net++ and CFL-Net) that have publicly available source code, and the results are presented in Table 7. All methods uniformly use $416 \times 416$ forged images as input on a single NVIDIA GeForce RTX 4090 GPU; the training time is the total over 10 epochs on the NIST16 dataset, and the inference time is the average over 1000 randomly selected samples.
Compared to CFL-Net and MVSS-Net++, our model requires slightly more training and inference time than MVSS-Net++ but less than CFL-Net. CECL-Net employs DF for multi-stage feature fusion, which increases computational costs to some extent. However, both CFL-Net and MVSS-Net++ are two-branch networks, whereas CECL-Net’s design uses a single backbone network to extract foreground and background information from the image, significantly reducing the model’s computational costs. In summary, our method maintains commendable and practically usable efficiency while achieving the highest accuracy.
(5) Robustness evaluation: In this study, we comprehensively evaluated our model’s robustness on the NIST dataset, focusing on four common image-processing operations: JPEG compression, Gaussian blurring, image scaling, and image sharpening. The final experimental results are shown in Figure 11. Note that we used the F1 score to measure the performance of the various models. Next, we analyze the resistance to each operation in detail.
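For concreteness, the snippet below sketches how the four perturbations can be generated with OpenCV; the specific parameter ranges (e.g., JPEG quality factors or kernel sizes) are illustrative assumptions, not the exact settings used in Figure 11.

```python
import cv2
import numpy as np

def perturb(img, mode, level=1.0):
    """Apply one of the four robustness perturbations to a BGR uint8 image."""
    if mode == 'jpeg':                       # level: JPEG quality factor (e.g. 90 down to 50)
        ok, buf = cv2.imencode('.jpg', img, [int(cv2.IMWRITE_JPEG_QUALITY), int(level)])
        return cv2.imdecode(buf, cv2.IMREAD_COLOR)
    if mode == 'blur':                       # level: Gaussian kernel size (odd, e.g. 3, 5, 7)
        return cv2.GaussianBlur(img, (int(level), int(level)), 0)
    if mode == 'scale':                      # level: scaling factor (e.g. 0.5 to 2.0)
        return cv2.resize(img, None, fx=level, fy=level, interpolation=cv2.INTER_LINEAR)
    if mode == 'sharpen':                    # simple sharpening kernel
        kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=np.float32)
        return cv2.filter2D(img, -1, kernel)
    raise ValueError(mode)
```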
Robustness against JPEG compression: Generally, tampered images are subjected to JPEG compression to lessen the discrepancy between the tampered and untouched regions. For this reason, our model needs to exhibit robust resilience to JPEG compression. Figure 11a compares the resilience of the different methods to JPEG compression. MFI-Net demonstrates the highest detection performance and remains unaffected by JPEG compression. CECL-Net is prone to reduced performance under JPEG compression, with a notable decline observed once the compression ratio exceeds 70. Despite this, our model maintains performance levels surpassing most other methods after degradation. Consequently, it is crucial for future research to develop strategies that bolster the robustness of models against JPEG compression.
Robustness against Gaussian filtering: Tampering activities leave unavoidable marks, and individuals may attempt to mask them using Gaussian filtering. The resistance of eight different techniques to Gaussian filtering is depicted in Figure 11b. Based on the final experimental results, our proposed method particularly shines with its high-quality output while maintaining a stable resistance to Gaussian filtering. Consequently, compared to other methods, CECL-Net exhibits greater robustness to Gaussian filtering.
Robustness against image scaling: During the transmission of images, resizing is an inevitable process to ensure compatibility across diverse platforms. Consequently, implementing strategies to maintain image integrity against scaling effects is essential. Figure 11c illustrates the resistance of eight approaches to image scaling. CECL-Net proves to be highly resistant to image enlargements, showing only a slight dip in performance. Although CECL-Net experiences a marginal performance drop during image reductions, its effectiveness remains notably superior. On the other hand, PSCC-Net and Mantra-Net stand out for their substantial tolerance to image scaling, showcasing their consistent performance in the face of such adjustments.
Robustness against image sharpening: Image sharpening is a common technique used in the post-processing of tampered images. It can improve the integration between the tampered region and the original image, enhance the overall visual quality, and make tampering more difficult to detect. Figure 11d shows that most methods are insensitive to image sharpening. The method introduced in this paper exhibits superiority compared to competing approaches.
(6) Failure case analysis: IFL is a complex task in which failures are common, and this study examines such instances carefully. Take, for example, the first row of Figure 12, which depicts a case of tampering that blends seamlessly into the surrounding environment, effectively concealing any alterations and presenting a considerable challenge to existing detection algorithms. In the second row, the detection error may stem from the proximity of semantically similar objects: when a smaller tampered object lies adjacent to a larger, analogous one, the model may focus on the dominant features of the larger object and overlook subtle alterations. The error in the third row is attributable to the diminutive size of the tampered object, which remains a persistent hurdle for current detection models.

5. Conclusions

This paper introduces a novel contrastive learning and edge-reconstruction-driven complementary learning network (CECL-Net) to address the image forgery localization problem. CECL-Net enhances the understanding of manipulated images by leveraging foreground and background features. An improved pixel-wise contrastive supervision strategy and a novel edge-guided feature-reconstruction module are proposed to drive complementary learning. More crucially, we introduced an innovative dense fusion strategy and an edge extractor to assist in complementary learning. CECL-Net, tested on two benchmark datasets, one AI-manipulated dataset, and two real challenge datasets, consistently achieved state-of-the-art results in terms of cross-dataset generalization, individual dataset performance, and robustness against post-processing attacks. We hope that the proposed CECL-Net will catalyze progress in IFL, as well as in other detection and segmentation endeavors.

Author Contributions

Conceptualization, G.D., K.C. and L.H.; data curation, L.C. and D.A.; investigation, G.D. and Z.W.; methodology, G.D. and K.C.; writing—original draft, G.D.; writing—review and editing, K.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author (the data are not publicly available due to privacy restraints).

Conflicts of Interest

Authors Gaoyuan Dai, Kai Chen, Linjie Huang, Longru Chen, Dongping An, and Zhe Wang were employed by China Telecom Stocks Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Ding, H.; Chen, L.; Tao, Q.; Fu, Z.; Dong, L.; Cui, X. DCU-Net: A dual-channel U-shaped network for image splicing forgery detection. Neural Comput. Appl. 2023, 35, 5015–5031. [Google Scholar] [CrossRef] [PubMed]
  2. Wei, Y.; Ma, J.; Wang, Z.; Xiao, B.; Zheng, W. Image splicing forgery detection by combining synthetic adversarial networks and hybrid dense U-net based on multiple spaces. Int. J. Intell. Syst. 2022, 37, 8291–8308. [Google Scholar] [CrossRef]
  3. Xiao, B.; Wei, Y.; Bi, X.; Li, W.; Ma, J. Image splicing forgery detection combining coarse to refined convolutional neural network and adaptive clustering. Inf. Sci. 2020, 511, 172–191. [Google Scholar] [CrossRef]
  4. Zhang, Y.; Zhu, G.; Wu, L.; Kwong, S.; Zhang, H.; Zhou, Y. Multi-task SE-network for image splicing localization. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 4828–4840. [Google Scholar] [CrossRef]
  5. Chen, B.; Tan, W.; Coatrieux, G.; Zheng, Y.; Shi, Y.-Q. A serial image copy-move forgery localization scheme with source/target distinguishment. IEEE Trans. Multimed. 2020, 23, 3506–3517. [Google Scholar] [CrossRef]
  6. Xiong, L.; Xu, J.; Yang, C.-N.; Zhang, X. CMCF-Net: An End-to-End Context Multiscale Cross-Fusion Network for Robust Copy-Move Forgery Detection. IEEE Trans. Multimed. 2023, 26, 6090–6101. [Google Scholar] [CrossRef]
  7. Weng, S.; Zhu, T.; Zhang, T.; Zhang, C. UCM-Net: A U-Net-like tampered-region-related framework for copy-move forgery detection. IEEE Trans. Multimed. 2023, 26, 750–763. [Google Scholar] [CrossRef]
  8. Zhu, Y.; Chen, C.; Yan, G.; Guo, Y.; Dong, Y. AR-Net: Adaptive attention and residual refinement network for copy-move forgery detection. IEEE Trans. Ind. Inform. 2020, 16, 6714–6723. [Google Scholar] [CrossRef]
  9. Wu, Y.; Abd-Almageed, W.; Natarajan, P. Busternet: Detecting copy-move image forgery with source/target localization. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 168–184. [Google Scholar]
  10. Yang, F.; Li, J.; Lu, W.; Weng, J. Copy-move forgery detection based on hybrid features. Eng. Appl. Artif. Intell. 2017, 59, 73–83. [Google Scholar] [CrossRef]
  11. Zhu, X.; Qian, Y.; Zhao, X.; Sun, B.; Sun, Y. A deep learning approach to patch-based image inpainting forensics. Signal Process. Image Commun. 2018, 67, 90–99. [Google Scholar] [CrossRef]
  12. Zhang, Y.; Ding, F.; Kwong, S.; Zhu, G. Feature pyramid network for diffusion-based image inpainting detection. Inf. Sci. 2021, 572, 29–42. [Google Scholar] [CrossRef]
  13. Zhu, X.; Lu, J.; Ren, H.; Wang, H.; Sun, B. A transformer–CNN for deep image inpainting forensics. Vis. Comput. 2022, 39, 4721–4735. [Google Scholar] [CrossRef]
  14. Yadav, A.; Vishwakarma, D.K. AW-MSA: Adaptively weighted multi-scale attentional features for DeepFake detection. Eng. Appl. Artif. Intell. 2024, 127, 107443. [Google Scholar] [CrossRef]
  15. Tolosana, R.; Romero-Tapiador, S.; Vera-Rodriguez, R.; Gonzalez-Sosa, E.; Fierrez, J. DeepFakes detection across generations: Analysis of facial regions, fusion, and performance evaluation. Eng. Appl. Artif. Intell. 2022, 110, 104673. [Google Scholar] [CrossRef]
  16. Zhou, T.; Wang, W.; Liang, Z.; Shen, J. Face forensics in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5778–5788. [Google Scholar]
  17. Zhou, P.; Han, X.; Morariu, V.I.; Davis, L.S. Learning rich features for image manipulation detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1053–1061. [Google Scholar]
  18. Wu, Y.; AbdAlmageed, W.; Natarajan, P. Mantra-net: Manipulation tracing network for detection and localization of image forgeries with anomalous features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9543–9552. [Google Scholar]
  19. Li, F.; Pei, Z.; Zhang, X.; Qin, C. Image Manipulation Localization Using Multi-Scale Feature Fusion and Adaptive Edge Supervision. IEEE Trans. Multimed. 2022, 25, 7851–7866. [Google Scholar] [CrossRef]
  20. Xia, X.; Su, L.C.; Wang, S.P.; Li, X.Y. DMFF-Net: Double-stream multilevel feature fusion network for image forgery localization. Eng. Appl. Artif. Intell. 2024, 127, 107200. [Google Scholar] [CrossRef]
  21. Dong, C.; Chen, X.; Hu, R.; Cao, J.; Li, X. Mvss-net: Multi-view multi-scale supervised networks for image manipulation detection. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3539–3553. [Google Scholar] [CrossRef]
  22. Liu, X.; Liu, Y.; Chen, J.; Liu, X. PSCC-Net: Progressive spatio-channel correlation network for image manipulation detection and localization. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 7505–7517. [Google Scholar] [CrossRef]
  23. Niloy, F.F.; Bhaumik, K.K.; Woo, S.S. CFL-Net: Image forgery localization using contrastive learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 4642–4651. [Google Scholar]
  24. Hou, S.; Liu, X.; Wang, Z. Dualnet: Learn complementary features for image recognition. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 502–510. [Google Scholar]
  25. Li, Y.; Chen, X.; Zhu, Z.; Xie, L.; Huang, G.; Du, D.; Wang, X. Attention-guided unified network for panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7026–7035. [Google Scholar]
  26. Xu, D.; Shen, X.; Lyu, Y. UP-Net: Uncertainty-supervised Parallel Network for Image Manipulation Localization. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 6390–6403. [Google Scholar] [CrossRef]
  27. Zhang, Y.; Xiang, T.; Hospedales, T.M.; Lu, H. Deep mutual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4320–4328. [Google Scholar]
  28. Xu, D.; Shen, X.; Lyu, Y.; Du, X.; Feng, F. MC-Net: Learning mutually-complementary features for image manipulation localization. Int. J. Intell. Syst. 2022, 37, 3072–3089. [Google Scholar] [CrossRef]
  29. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 9729–9738. [Google Scholar]
  30. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 1597–1607. [Google Scholar]
  31. Wang, W.; Zhou, T.; Yu, F.; Dai, J.; Konukoglu, E.; Van Gool, L. Exploring cross-image pixel contrast for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 7303–7313. [Google Scholar]
  32. Hu, H.; Cui, J.; Wang, L. Region-aware contrastive learning for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 16291–16301. [Google Scholar]
  33. Sun, K.; Yao, T.; Chen, S.; Ding, S.; Li, J.; Ji, R. Dual contrastive learning for general face forgery detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; pp. 2316–2324. [Google Scholar]
  34. Zhao, J.-X.; Cao, Y.; Fan, D.-P.; Cheng, M.-M.; Li, X.-Y.; Zhang, L. Contrast prior and fluid pyramid integration for RGBD salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 3927–3936. [Google Scholar]
  35. Lin, X.; Wang, S.; Deng, J.; Fu, Y.; Bai, X.; Chen, X.; Qu, X.; Tang, W. Image manipulation detection by multiple tampering traces and edge artifact enhancement. Pattern Recognit. 2023, 133, 109026. [Google Scholar] [CrossRef]
  36. Guo, X.; Liu, X.; Ren, Z.; Grosz, S.; Masi, I.; Liu, X. Hierarchical fine-grained image forgery detection and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 3155–3165. [Google Scholar]
  37. Shi, Z.; Chen, H.; Zhang, D. Transformer-auxiliary neural networks for image manipulation localization by operator inductions. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 4907–4920. [Google Scholar] [CrossRef]
  38. Menon, V.; Uddin, L.Q. Saliency, switching, attention and control: A network model of insula function. Brain Struct. Funct. 2010, 214, 655–667. [Google Scholar] [CrossRef] [PubMed]
  39. Luck, S.J.; Chelazzi, L.; Hillyard, S.A.; Desimone, R. Neural mechanisms of spatial selective attention in areas V1, V2, and V4 of macaque visual cortex. J. Neurophysiol. 1997, 77, 24–42. [Google Scholar] [CrossRef] [PubMed]
  40. Stevens, M.; Merilaita, S. Animal camouflage: Current issues and new perspectives. Philos. Trans. R. Soc. B Biol. Sci. 2009, 364, 423–427. [Google Scholar] [CrossRef]
  41. Wang, W.; Xie, E.; Li, X.; Fan, D.-P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 568–578. [Google Scholar]
  42. Gao, S.-H.; Cheng, M.-M.; Zhao, K.; Zhang, X.-Y.; Yang, M.-H.; Torr, P. Res2net: A new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 652–662. [Google Scholar] [CrossRef]
  43. Dai, Y.; Gieseke, F.; Oehmcke, S.; Wu, Y.; Barnard, K. Attentional feature fusion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 3560–3569. [Google Scholar]
  44. Huynh, T.; Kornblith, S.; Walter, M.R.; Maire, M.; Khademi, M. Boosting contrastive self-supervised learning with false negative cancellation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 2785–2795. [Google Scholar]
  45. Wei, Q.; Li, X.; Yu, W.; Zhang, X.; Zhang, Y.; Hu, B.; Mo, B.; Gong, D.; Chen, N.; Ding, D. Learn to segment retinal lesions and beyond. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 7403–7410. [Google Scholar]
  46. Dong, J.; Wang, W.; Tan, T. Casia image tampering detection evaluation database. In Proceedings of the 2013 IEEE China Summit and International Conference on Signal and Information Processing, Beijing, China, 6–10 July 2013; pp. 422–426. [Google Scholar]
  47. Nist: Nimble Media Forensics Challenge Datasets. 2016. Available online: https://www.nist.gov/itl/iad/mig (accessed on 28 September 2024).
  48. Ren, R.; Hao, Q.; Niu, S.; Xiong, K.; Zhang, J.; Wang, M. MFI-Net: Multi-feature Fusion Identification Networks for Artificial Intelligence Manipulation. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 1266–1280. [Google Scholar] [CrossRef]
  49. Novozamsky, A.; Mahdian, B.; Saic, S. IMD2020: A large-scale annotated dataset tailored for detecting manipulated images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision Workshops, Snowmass Village, CO, USA, 12–15 March 2018; pp. 71–80. [Google Scholar]
  50. Huh, M.; Liu, A.; Owens, A.; Efros, A.A. Fighting fake news: Image splice detection via learned self-consistency. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 101–117. [Google Scholar]
  51. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  52. Sun, Y.; Ni, R.; Zhao, Y. ET: Edge-enhanced transformer for image splicing detection. IEEE Signal Process. Lett. 2022, 29, 1232–1236. [Google Scholar] [CrossRef]
Figure 1. Three challenging images for image forgery localization (IFL). From left to right: (a) input images, (b) results of CECL-Net without background supervision, (c) results of CECL-Net, and (d) ground truths.
Figure 2. Overview of contrastive learning and edge-reconstruction-driven complementary learning network (CECL-Net): Pyramid Vision Transformer (PVT) for feature extraction (Section 3.2); dense fusion (DF) for multi-scale feature fusion (Section 3.3); edge extractor (EE) for edge-artifact extraction (Section 3.4); edge-guided feature reconstruction (EGFR) for foreground and background feature reconstruction (Section 3.5).
Figure 3. Illustration of dense fusion (DF). DF consists of two components (Section 3.3): a multi-scale feature extractor (MFE) and the dense interaction of multi-layer features. The first stage focuses on multi-scale feature extraction, while the second stage performs the dense interaction and fusion of multi-layer features.
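Dense fusion, as described in the caption above, mixes multi-scale features through attention-driven interaction. Purely as a hedged illustration of attention-based fusion of two feature maps (not the DF module defined in Section 3.3), the PyTorch sketch below re-weights each input with a channel-attention gate computed from the other and sums the results; the class name, channel count, and reduction ratio are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class MutualAttentionFusion(nn.Module):
    """Hypothetical fusion of two feature maps via mutual channel attention:
    each input is re-weighted by a gate computed from the other, then summed.
    A sketch of the general idea only, not the paper's DF module."""
    def __init__(self, channels=64, reduction=4):
        super().__init__()
        def gate():
            return nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(channels, channels // reduction, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1),
                nn.Sigmoid(),
            )
        self.gate_a, self.gate_b = gate(), gate()

    def forward(self, feat_a, feat_b):
        # each branch is modulated by attention derived from the other branch
        return feat_a * self.gate_b(feat_b) + feat_b * self.gate_a(feat_a)

# usage with illustrative shapes:
# fused = MutualAttentionFusion()(torch.randn(1, 64, 52, 52), torch.randn(1, 64, 52, 52))
```

Squeeze-and-excitation-style gates are used here only because they are a lightweight, common way to realize mutual attention between two branches.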
Figure 4. Illustration of edge extractor (EE). EE uses a multi-scale strategy to obtain accurate edge artifacts (Section 3.4).
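The exact EE architecture is specified in Section 3.4. As a rough sketch of what a multi-scale edge-artifact head can look like, the snippet below runs parallel dilated convolutions over a backbone feature map and fuses them into a single-channel edge probability map; the channel sizes, dilation rates, and class name are hypothetical.

```python
import torch
import torch.nn as nn

class MultiScaleEdgeHead(nn.Module):
    """Hypothetical multi-scale edge head: parallel dilated conv branches
    fused into a one-channel edge-artifact map (a sketch, not the paper's EE)."""
    def __init__(self, in_ch=256, mid_ch=64, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, mid_ch, 3, padding=d, dilation=d),
                nn.BatchNorm2d(mid_ch),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        ])
        self.fuse = nn.Conv2d(mid_ch * len(dilations), 1, 1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]   # same spatial size, different receptive fields
        edge_logits = self.fuse(torch.cat(feats, dim=1))
        return torch.sigmoid(edge_logits)                  # edge-artifact probability map

# usage: edge = MultiScaleEdgeHead()(torch.randn(1, 256, 104, 104))
```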
Figure 5. Illustration of edge-guided feature reconstruction (EGFR). EGFR uses edge artifacts extracted by EE to guide the reconstruction of complementary foreground and background features (Section 3.5).
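The essence of the caption above is that an edge-artifact map is used to split a shared feature into complementary foreground and background parts. A minimal sketch of that idea, assuming simple multiplicative gating rather than the paper's exact reconstruction procedure, could look as follows.

```python
import torch

def complementary_split(features, edge_map):
    """Sketch of edge-guided complementary feature reconstruction:
    an edge probability map weights foreground features, and its
    complement (1 - edge_map) weights background features.
    `features`: (B, C, H, W); `edge_map`: (B, 1, H, W) in [0, 1]."""
    fg_feat = features * edge_map          # regions near tampering edges
    bg_feat = features * (1.0 - edge_map)  # complementary background regions
    return fg_feat, bg_feat

# usage with illustrative shapes:
# fg, bg = complementary_split(torch.randn(2, 64, 104, 104), torch.rand(2, 1, 104, 104))
```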
Figure 6. Illustration of pixel-wise contrastive supervision (PCS). PCS enhances the contrast between tampered and authentic regions while keeping the feature distribution within each region uniform (Section 3.5).
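Pixel-wise contrastive supervision pulls embeddings of same-class pixels together and pushes embeddings of different classes apart. The sketch below implements a generic supervised pixel-contrast loss of that kind; the temperature, the number of sampled pixels, and the masking details are assumptions and may differ from the PCS loss actually used in CECL-Net.

```python
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(embeddings, labels, temperature=0.1, max_pixels=512):
    """Generic supervised pixel-contrast loss: pixels with the same label
    (tampered vs. authentic) are attracted, different labels repelled.
    `embeddings`: (B, D, H, W) pixel features; `labels`: (B, H, W) in {0, 1}."""
    B, D, H, W = embeddings.shape
    emb = F.normalize(embeddings, dim=1).permute(0, 2, 3, 1).reshape(-1, D)
    lab = labels.reshape(-1)

    # subsample pixels so the (N x N) similarity matrix stays small
    idx = torch.randperm(emb.shape[0], device=emb.device)[:max_pixels]
    emb, lab = emb[idx], lab[idx]

    sim = emb @ emb.t() / temperature                       # pairwise similarities
    pos_mask = (lab[:, None] == lab[None, :]).float()
    self_mask = torch.eye(len(lab), device=emb.device)
    pos_mask = pos_mask - self_mask                         # exclude self-pairs

    exp_sim = torch.exp(sim) * (1 - self_mask)
    log_prob = sim - torch.log(exp_sim.sum(dim=1, keepdim=True) + 1e-8)
    mean_log_prob_pos = (pos_mask * log_prob).sum(1) / pos_mask.sum(1).clamp(min=1)
    return -mean_log_prob_pos.mean()
```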
Figure 7. Qualitative results of different IFL methods. We present the qualitative results of seven popular methods on five challenging scenes selected from different datasets (Section 4.2.(2)).
Figure 8. Qualitative results of different schemes. Compared with the full CECL-Net, removing any module causes a performance degradation. Further details can be found in Section 4.2.(3).
Figure 9. A visualization of the effectiveness of different edge-extraction strategies. In_MVSS stands for the edge-extraction method in MVSS-Net, and In_ET refers to the edge-extraction method in ET-Net.
Figure 10. A visualization of the effectiveness of different contrastive supervision strategies. In_CFL stands for the contrastive supervision method in CFL-Net.
Figure 11. The results of the robustness analysis of the proposed method and the other seven SOTA methods on the NIST dataset. (a–d) showcase different robustness tests: (a) compares the resilience of various methods against JPEG compression, (b) examines their resistance to Gaussian filtering, (c) evaluates performance under image scaling, and (d) assesses sensitivity to image sharpening.
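The four degradations evaluated in Figure 11 can be reproduced with standard image tooling. The Pillow-based sketch below is one possible way to generate such perturbed test images; the JPEG quality factor, blur radius, and scaling factor are chosen arbitrarily here and are not the settings used in the paper.

```python
from PIL import Image, ImageFilter

def degrade(img: Image.Image, mode: str, out_path: str) -> None:
    """Apply one of the robustness perturbations from Figure 11.
    Parameter values (quality=70, radius=1.5, half-size rescale) are illustrative."""
    if mode == "jpeg":
        img.convert("RGB").save(out_path, "JPEG", quality=70)   # JPEG compression
        return
    if mode == "gaussian":
        img = img.filter(ImageFilter.GaussianBlur(radius=1.5))  # Gaussian filtering
    elif mode == "scale":
        w, h = img.size
        img = img.resize((w // 2, h // 2)).resize((w, h))        # down- then up-scaling
    elif mode == "sharpen":
        img = img.filter(ImageFilter.SHARPEN)                    # image sharpening
    img.save(out_path)

# usage: degrade(Image.open("sample.png"), "gaussian", "sample_blur.png")
```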
Figure 12. Failure cases. Data source: IMD.
Table 1. Train set and test set. The numbers in this table indicate the count of tampered images.
Split     | CASIA       | NIST | IPM  | IMD  | WILD
Train set | 5123 (v2.0) | 404  | 6000 | 1610 | 161
Test set  | 921 (v1.0)  | 160  | 1000 | 400  | 40
Table 2. Details of the overall implementation.
Framework:               PyTorch
Hardware:                NVIDIA GeForce RTX 4090
Optimizer:               Adam
Learning rate:           1 × 10^-4
Batch size:              8
Epochs:                  100
Image size:              416 × 416
Foreground loss weight:  δ = 0.1
Background loss weight:  λ = 0.3
Edge loss weight:        μ = 0.6
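The loss weights in Table 2 imply a weighted sum of foreground, background, and edge supervision, optimized with Adam at a learning rate of 1 × 10^-4. The self-contained sketch below mirrors those hyper-parameters; binary cross-entropy is used only as a generic stand-in, since the actual loss terms (including the contrastive PCS term) are defined in Section 3.

```python
import torch
import torch.nn.functional as F

# Hyper-parameters copied from Table 2; everything else here is a placeholder.
LR, BATCH_SIZE, EPOCHS, IMG_SIZE = 1e-4, 8, 100, 416
DELTA, LAMBDA, MU = 0.1, 0.3, 0.6   # foreground / background / edge loss weights

def total_loss(fg_pred, bg_pred, edge_pred, fg_gt, bg_gt, edge_gt):
    """Weighted sum of the three supervision terms (BCE is used here only as
    a generic stand-in for the paper's actual loss functions)."""
    return (DELTA * F.binary_cross_entropy(fg_pred, fg_gt)
            + LAMBDA * F.binary_cross_entropy(bg_pred, bg_gt)
            + MU * F.binary_cross_entropy(edge_pred, edge_gt))

# minimal smoke test with random maps at the training resolution
maps = [torch.rand(BATCH_SIZE, 1, IMG_SIZE, IMG_SIZE) for _ in range(6)]
print(total_loss(*maps))
```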
Table 3. Quantitative results of CECL-Net compared with seven state-of-the-art methods under five widely used benchmark datasets. The best results are in red. The second-best results are in bold.
Method                      | Backbone   | CASIA F1/IoU/AUC  | NIST F1/IoU/AUC   | IPM F1/IoU/AUC    | IMD F1/IoU/AUC    | WILD F1/IoU/AUC
Mantra-Net, CVPR 19 [18]    | –          | 0.396/0.331/0.701 | 0.751/0.647/0.798 | 0.426/0.340/0.725 | 0.282/0.201/0.533 | 0.362/0.259/0.657
SE-Net, TCSVT 21 [4]        | –          | 0.434/0.376/0.756 | 0.766/0.684/0.879 | 0.443/0.361/0.709 | 0.308/0.249/0.601 | 0.390/0.286/0.665
PSCC-Net, TCSVT 22 [22]     | HRNet      | 0.543/0.464/0.834 | 0.807/0.710/0.899 | 0.571/0.469/0.757 | 0.416/0.323/0.667 | 0.463/0.351/0.700
HDU-Net, IJIS 22 [2]        | –          | 0.405/0.361/0.738 | 0.814/0.713/0.904 | 0.543/0.430/0.738 | 0.382/0.311/0.651 | 0.455/0.330/0.695
MVSS-Net++, TPAMI 23 [21]   | ResNet-50  | 0.539/0.456/0.824 | 0.835/0.749/0.923 | 0.618/0.546/0.826 | 0.391/0.357/0.660 | 0.477/0.358/0.712
CFL-Net, WACV 23 [23]       | ResNet-50  | 0.648/0.582/0.860 | 0.862/0.778/0.933 | 0.603/0.531/0.832 | 0.412/0.347/0.717 | 0.560/0.441/0.738
MFI-Net, TCSVT 23 [48]      | Res2Net-50 | 0.602/0.549/0.836 | 0.876/0.799/0.929 | 0.657/0.571/0.823 | 0.451/0.391/0.725 | 0.529/0.421/0.725
Ours                        | Res2Net-50 | 0.628/0.562/0.851 | 0.907/0.869/0.953 | 0.711/0.640/0.857 | 0.446/0.384/0.732 | 0.597/0.465/0.759
Ours                        | PVT        | 0.710/0.643/0.875 | 0.921/0.892/0.959 | 0.801/0.716/0.895 | 0.571/0.502/0.778 | 0.647/0.545/0.767
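The pixel-level F1, IoU, and AUC reported in Table 3 are standard localization metrics. The sketch below shows one common way to compute them from a predicted probability map and a binary ground-truth mask; the 0.5 binarization threshold and per-image evaluation are assumptions rather than the authors' stated protocol.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def pixel_metrics(prob_map, gt_mask, threshold=0.5):
    """Common pixel-level F1 / IoU / AUC for forgery localization.
    `prob_map`: float array in [0, 1]; `gt_mask`: binary array of the same shape.
    The 0.5 threshold is an assumption, not the paper's stated setting."""
    pred = (prob_map >= threshold).astype(np.uint8)
    gt = gt_mask.astype(np.uint8)
    tp = np.sum((pred == 1) & (gt == 1))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    f1 = 2 * tp / (2 * tp + fp + fn + 1e-8)
    iou = tp / (tp + fp + fn + 1e-8)
    auc = roc_auc_score(gt.ravel(), prob_map.ravel())  # requires both classes present
    return f1, iou, auc

# usage: f1, iou, auc = pixel_metrics(np.random.rand(416, 416), np.random.rand(416, 416) > 0.5)
```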
Table 4. Configurations of the ablation study. The best results are in bold. √ represents the use of the corresponding module.
Scheme | EE | DF | EGFR | BS | PCS | CASIA F1/IoU/AUC  | NIST F1/IoU/AUC   | IPM F1/IoU/AUC    | IMD F1/IoU/AUC    | WILD F1/IoU/AUC
A      |    |    |      |    |     | 0.697/0.641/0.855 | 0.901/0.851/0.944 | 0.789/0.721/0.906 | 0.543/0.479/0.770 | 0.633/0.534/0.779
B      |    |    |      |    |     | 0.693/0.634/0.857 | 0.907/0.866/0.936 | 0.795/0.729/0.900 | 0.595/0.523/0.796 | 0.579/0.471/0.737
Ablation study of the proposed complementary learning strategy:
C      |    |    |      |    |     | 0.682/0.626/0.849 | 0.910/0.863/0.947 | 0.782/0.712/0.889 | 0.593/0.522/0.792 | 0.575/0.456/0.734
D      |    |    |      |    |     | 0.697/0.640/0.851 | 0.926/0.895/0.959 | 0.786/0.718/0.895 | 0.586/0.511/0.789 | 0.566/0.451/0.732
E      |    |    |      |    |     | 0.693/0.639/0.850 | 0.899/0.851/0.941 | 0.788/0.716/0.893 | 0.587/0.510/0.784 | 0.622/0.517/0.768
F      |    |    |      |    |     | 0.699/0.643/0.853 | 0.902/0.857/0.950 | 0.784/0.710/0.895 | 0.547/0.474/0.779 | 0.636/0.539/0.775
G      |    |    |      |    |     | 0.710/0.643/0.875 | 0.921/0.892/0.959 | 0.801/0.716/0.895 | 0.571/0.502/0.778 | 0.647/0.545/0.767
Table 5. Quantitative results of different edge-extraction strategies. The best results are in bold.
Strategy | NIST F1/IoU/AUC   | IMD F1/IoU/AUC
In_MVSS  | 0.908/0.864/0.948 | 0.559/0.487/0.760
In_ET    | 0.913/0.871/0.950 | 0.568/0.492/0.764
Ours     | 0.921/0.892/0.959 | 0.571/0.502/0.778
Table 6. Quantitative results of different contrastive supervision strategies. The best results are in bold.
Strategy | NIST F1/IoU/AUC   | IMD F1/IoU/AUC
In_CFL   | 0.902/0.874/0.941 | 0.556/0.493/0.762
Ours     | 0.921/0.892/0.959 | 0.571/0.502/0.778
Table 7. Training time and inference time compared with seven state-of-the-art methods. “ms” represents milliseconds, and “min” denotes minutes.
Method          | Training Time (min) | Inference Time (ms)
MVSS-Net++ [21] | 5.75                | 18.83
CFL-Net [23]    | 7.54                | 23.92
Ours            | 6.33                | 20.17
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
