URRL-IMVC: Unified and Robust Representation Learning for Incomplete Multi-View Clustering

Ge Teng (ORCID 0000-0002-1331-9868), Zhejiang University, Hangzhou, China
Ting Mao (ORCID 0009-0001-9531-6328), Alibaba Cloud, Hangzhou, China
Chen Shen (ORCID 0000-0002-7534-0830), Alibaba Cloud, Hangzhou, China
Xiang Tian (ORCID 0000-0003-0735-8454), Zhejiang University, Hangzhou, China; Zhejiang University Embedded System Engineering Research Center, Ministry of Education of China, Hangzhou, China
Xuesong Liu (ORCID 0000-0001-8549-0368), Zhejiang University, Hangzhou, China; Zhejiang University Embedded System Engineering Research Center, Ministry of Education of China, Hangzhou, China
Yaowu Chen (ORCID 0000-0001-7266-1535), Zhejiang University, Hangzhou, China; Zhejiang University Embedded System Engineering Research Center, Ministry of Education of China, Hangzhou, China
Jieping Ye (ORCID 0000-0001-8662-5818), Alibaba Cloud, Hangzhou, China
(2024)
Abstract.

Incomplete multi-view clustering (IMVC) aims to cluster multi-view data that are only partially available. This poses two main challenges: effectively leveraging multi-view information and mitigating the impact of missing views. Prevailing solutions employ cross-view contrastive learning and missing view recovery techniques. However, they either neglect valuable complementary information by focusing only on consensus between views or provide unreliable recovered views due to the absence of supervision. To address these limitations, we propose a novel Unified and Robust Representation Learning for Incomplete Multi-View Clustering (URRL-IMVC). URRL-IMVC directly learns a unified embedding that is robust to view missing conditions by integrating information from multiple views and neighboring samples. Firstly, to overcome the limitations of cross-view contrastive learning, URRL-IMVC incorporates an attention-based auto-encoder framework to fuse multi-view information and generate unified embeddings. Secondly, URRL-IMVC directly enhances the robustness of the unified embedding against view-missing conditions through KNN imputation and data augmentation techniques, eliminating the need for explicit missing view recovery. Finally, incremental improvements are introduced to further enhance the overall performance, such as the Clustering Module and the customization of the Encoder. We extensively evaluate the proposed URRL-IMVC framework on various benchmark datasets, demonstrating its state-of-the-art performance. Furthermore, comprehensive ablation studies are performed to validate the effectiveness of our design.

Deep Learning; Representation Learning; Self-supervised Learning; Multi-view Learning; Incomplete Multi-view Clustering
Published in: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’24), August 25–29, 2024, Barcelona, Spain. Copyright: ACM licensed. DOI: 10.1145/3637528.3671887. ISBN: 979-8-4007-0490-1/24/08.
CCS Concepts: Computing methodologies → Artificial intelligence; Computing methodologies → Learning latent representations; Computing methodologies → Cluster analysis.

1. Introduction

Multi-view data (Fu et al., 2020) is commonly collected and utilized in various domains, making multi-view clustering (MVC) a crucial tool for analyzing such data and uncovering its underlying structures (Chao et al., 2021; Chen et al., 2022). Previous research has proposed several approaches (Xu et al., 2022, 2021) achieving promising performance by exploiting consensus or complementary information between views. However, in real-world applications, some views may be partially unavailable due to sensor malfunctions or other practical reasons. Existing MVC methods heavily rely on complete views to learn a comprehensive representation for clustering, making them inadequate under such conditions. To address this issue, Incomplete Multi-view Clustering (IMVC) methods have been introduced to reduce the impact of missing views (Wen et al., 2023). Various IMVC approaches have been proposed, including matrix decomposition (Li et al., 2014), kernel-based (Liu et al., 2017), and graph-based (Gao et al., 2016) methods. With the superior feature representation ability demonstrated by deep learning, some IMVC methods have integrated deep learning techniques, known as Deep Incomplete Multi-view Clustering (DIMVC) methods, which we will mainly discuss below. The key challenges in the IMVC task revolve around two problems: i) effectively utilizing multi-view information, and ii) mitigating the impact of missing views. Previous DIMVC works (Wang et al., 2018; Lin et al., 2021, 2023; Jin et al., 2023; Liu et al., 2023) have employed two mainstream strategies to address these problems: 1) cross-view contrastive learning, and 2) missing view recovery. However, these strategies have inherent drawbacks.

A general framework for cross-view contrastive learning, which originates from MVC approaches, is illustrated in Fig 1a. In this framework, Deep Neural Network (DNN) auto-encoders are employed to extract embeddings for each view. The embeddings are then aligned using a contrastive loss, which minimizes the distance between embeddings of the same sample across different views while maximizing the distance to other samples (Jin et al., 2023; Lin et al., 2021; Yang et al., 2023). However, this framework primarily focuses on extracting consensus information in multi-view data, overlooking the valuable complementary information present. Additionally, the efficiency of the pair-wise contrastive strategy suffers as the number of views increases, and its effectiveness diminishes because less information overlaps between views (see Table 5 for experimental analysis). Theoretical analysis by Trosten et al. (2023) supports these observations, highlighting that contrastive alignment can reduce the number of separable clusters in the representation space, with this effect worsening as the number of views increases.

Figure 1. A comparison between our learning framework and the commonly used cross-view contrastive learning and missing view recovery frameworks: (a) cross-view contrastive learning framework; (b) missing view recovery framework; (c) our unified and robust learning framework. The key difference lies in how the unified embedding for clustering is obtained. Our design (1c) directly fuses multi-view information and utilizes KNN imputation and data augmentation to obtain a unified and robust embedding under view-missing conditions, avoiding the drawbacks of (1a) and (1b).

The missing view recovery framework, as depicted in Fig 1b, is commonly adopted in IMVC approaches. Typically, a DNN is employed to recover the missing view, either in the data or latent space. Subsequently, MVC methods or another view fusion network are utilized for clustering based on the recovered views. However, the reliability of the recovered views is a concern since the recovery ability of DNNs relies on unsupervised training. Moreover, in some instances, for example (Liu et al., 2023), missing views are recovered from a fused embedding in the first stage and subsequently used to generate another fused embedding for clustering in the second stage, introducing unnecessary complexity and inefficiency to the pipeline. We propose that a well-designed recovery-free method can achieve comparable performance to recovery-based methods while offering the advantages of simplicity and reduced computational overhead.

To address the aforementioned challenges, we propose a Unified and Robust Representation Learning framework for Incomplete Multi-View Clustering (URRL-IMVC). Our framework, depicted in Fig 1c, is designed to be cross-view contrastive learning-free and missing view recovery-free. First, to overcome the limitations of cross-view contrastive learning, we propose a new framework that fuses multi-view information into a unified embedding instead of contrasting each view’s information. We achieve this by designing an attention-based auto-encoder network, which captures both consensus and complementary information and intelligently fuses them. Moreover, it is naturally scalable to different numbers of views. Second, to tackle the issue of missing views, we aim to directly enhance the robustness of the unified embedding against view-missing conditions without explicitly recovering the missing views. We introduce two strategies to achieve this robustness. 1) We treat view missing as a form of noise and draw inspiration from successful applications of denoising and masked auto-encoders (Vincent et al., 2008; He et al., 2022). Our proposed approach randomly drops out existing views as a form of data augmentation to simulate the view missing condition. By reconstructing denoised input data from the unified embedding and imposing constraints between the augmented and un-augmented embeddings, we enhance the robustness of the unified representation. 2) As the old saying goes, “One cannot make bricks without straw”, it is hard to learn to reconstruct a dropped-out view directly. We introduce k-nearest neighbors (KNN) as additional inputs, with a cross-view imputation strategy to fill in the missing or dropped-out views, providing valuable hints for reconstruction. 
We want to highlight that while previous methods have focused on either fusing multi-view information (Wang et al., 2021; Lin et al., 2022) or incorporating neighborhood information (Nguyen et al., 2021; Wang et al., 2019; Yang et al., 2020; Tu et al., 2021) for clustering, our approach represents one of the initial endeavors to fuse both aspects. Finally, we conduct experiments based on this framework and make incremental improvements to enhance clustering performance and stability. Some of the key enhancements include the customization of the Transformer-based Encoder to filter out noise and emphasize critical information, and the introduction of the Clustering Module to learn clustering-friendly representations.

To summarize, our main contributions are:

  • Unified: We propose a unified representation learning framework that efficiently fuses both multi-view and neighborhood information, allowing for better capturing of consensus and complementary information while avoiding the limitations of cross-view contrastive learning.

  • Robust: We propose novel strategies, including KNN imputation and data augmentation, to directly learn a robust representation capable of handling view-missing conditions without explicit missing view recovery.

  • Improvements: Multiple incremental improvements are introduced for better clustering performance and stability, including the extra Clustering Module and the customization of the Transformer-based Encoder.

  • Experiments: Through comprehensive experiments on diverse benchmark datasets, we demonstrate the state-of-the-art performance of our unified representation learning framework. Thorough ablation studies are also conducted to provide valuable insights for future research in this field.

2. Related works

Deep neural networks (DNNs) have shown good performance in learning feature representations, which is beneficial for the IMVC task. Various IMVC approaches have integrated DNNs into their frameworks, denoted as DIMVC approaches. In terms of network architecture, DIMVC approaches can be divided into four categories. (1) Auto-encoder-based approaches (Lin et al., 2022; Jin et al., 2023; Lin et al., 2021, 2023). These approaches utilize auto-encoders to extract high-level features of each view, which are usually combined with contrastive learning or cross-view prediction to handle the incompleteness problem. (2) Generative network-based approaches. For the IMVC task, an intuitive solution is to complete the missing views with generative models, transforming it into an MVC task. Adversarial learning (Goodfellow et al., 2014) is commonly adopted by generative IMVC approaches, including AIMVC (Xu et al., 2019), PMVC-CG (Wang et al., 2018), and GP-MVC (Wang et al., 2021), to improve data distribution learning in the context of IMVC. (3) Graph Neural Network-based (GNN-based) approaches (Wang et al., 2022, 2018). These approaches aim to learn consensus representations from the structure information contained in the graphs constructed for each view. (4) Transformer (Vaswani et al., 2017) or attention-based approaches. The Transformer network has gained attention in recent years due to its successful application in various domains. Its architecture, along with its Multi-head Attention mechanism, has been particularly effective in capturing complex relationships. In the field of DIMVC, RecFormer (Liu et al., 2023) proposed a Transformer auto-encoder with a mask to recover missing views, while MCAC (Zhang and Zhu, 2023) and IMVC-PBI (Li et al., 2023) incorporated attention mechanisms into their frameworks. In this paper, we leverage an auto-encoder architecture based on the Transformer framework to address the challenges of the IMVC task.

3. The Proposed Method

Notations. An incomplete multi-view dataset with $N$ samples and $V$ views is denoted as $X=\{X^{(1)},X^{(2)},\cdots,X^{(V)}\},\ X^{(v)}\in\mathbb{R}^{N\times d_{v}}$, where $d_{v}$ denotes the dimension of the $v$-th view. The view missing condition can be described by a binary missing indicator matrix $M\in\{0,1\}^{N\times V}$, where $M_{ij}=0$ indicates the $j$-th view of the $i$-th sample is missing and $M_{ij}=1$ just the opposite. An extra restriction is imposed: $\sum_{j}M_{ij}\geq 1$, ensuring that at least one view is available for each sample, which is essential for the clustering task.
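The notation can be made concrete with a small NumPy sketch. It is only an illustration under stated assumptions: the helper `random_missing_mask`, the missing rate, the toy dimensions, and the choice to zero out missing entries are all hypothetical and are not taken from the paper's experimental protocol. The sketch builds one array per view and an indicator matrix in which every row keeps at least one available view.

```python
import numpy as np

def random_missing_mask(n_samples, n_views, missing_rate, seed=0):
    """Hypothetical helper: draw a binary indicator matrix M
    (1 = view available, 0 = view missing) and repair any all-zero
    row so that every sample keeps at least one view."""
    rng = np.random.default_rng(seed)
    M = (rng.random((n_samples, n_views)) >= missing_rate).astype(np.int64)
    empty = np.flatnonzero(M.sum(axis=1) == 0)
    # Restore one randomly chosen view for samples that lost all views.
    M[empty, rng.integers(0, n_views, size=empty.size)] = 1
    return M

N, dims = 6, [4, 3, 5]            # N samples; V = 3 views with dimensions d_v
V = len(dims)
rng = np.random.default_rng(1)
X = [rng.normal(size=(N, d)) for d in dims]      # X^(v) has shape (N, d_v)
M = random_missing_mask(N, V, missing_rate=0.5)  # M has shape (N, V)

# Assumption: missing entries are zero-filled so they carry no information.
X_incomplete = [X[v] * M[:, [v]] for v in range(V)]
```

Repairing empty rows after sampling is one simple way to satisfy the constraint $\sum_{j}M_{ij}\geq 1$; rejection sampling would serve equally well.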

3.1. Framework

Figure 2. The overall architecture of URRL-IMVC. During training, the input data is augmented to simulate view-missing conditions, and KNN Imputation provides hints for missing views, forming an input batch with both neighbor and view dimensions. This batch is fed into the auto-encoder network, consisting of the Encoder (including the Neighbor Dimensional Encoder and View Dimensional Encoder), the Decoder, and the Clustering Module. The Encoders fuse information from the neighbor and view dimensions to generate a unified embedding. The Decoder reconstructs the augmented input, and the Clustering Module produces clustering results. Additionally, an un-augmented embedding is obtained by passing the original input data through the shared Encoders. Three loss functions, including Reconstruction loss, Robustness loss, and Clustering loss, enhance robustness against view-missing conditions and encourage learning clustering-friendly embeddings.

Unlike many prior approaches in the field of MVC that employ view-specific auto-encoders for each view, we propose a novel framework using a unified auto-encoder that effectively fuses multi-view data. The network architecture, depicted in Fig 2, consists of three key modules: the Encoder $f$, the Decoder $g$, and the Clustering Module $h$. To provide a formal description, the framework operates as follows. Given an incomplete multi-view data sample $\bm{x}=\{\bm{x}^{(1)},\bm{x}^{(2)},\cdots,\bm{x}^{(V)}\},\ \bm{x}^{(v)}\in\mathbb{R}^{d_{v}}$ from dataset $X$ with its missing indicator vector $\bm{m}\in\{0,1\}^{V}$, we apply KNN Imputation and Data Augmentation (KIDA), as described in section 3.2, to obtain the input for the auto-encoder network,

\begin{equation}
\begin{split}
&\bar{\bm{x}},\bar{\bm{x}}^{\prime},\bar{\bm{m}},\bar{\bm{m}}^{\prime}=KIDA(\bm{x},\bm{m},X,M)\\
&\bar{\bm{x}}^{(v)},\bar{\bm{x}}^{\prime(v)}\in\mathbb{R}^{k\times d_{v}};\ \bar{\bm{m}},\bar{\bm{m}}^{\prime}\in\{0,1\}^{k\times V}
\end{split}
\tag{1}
\end{equation}

where $\bar{\bm{x}}$ and $\bar{\bm{m}}$ are the data and mask after KNN Imputation, $\bar{\bm{x}}^{\prime}$ and $\bar{\bm{m}}^{\prime}$ are the augmented versions of $\bar{\bm{x}}$ and $\bar{\bm{m}}$, and $k$ is the number of neighbors in KNN. Note that although KNN Imputation is widely applied in prior IMVC works, it is mainly used there for recovering missing views, which differs from our usage as a pre-processing step. Next, these inputs are fed into the Encoder network to obtain the augmented and un-augmented embeddings, denoted as $\bm{z}^{\prime}$ and $\bm{z}$ respectively,

\begin{equation}
\bm{z}=f(\bar{\bm{x}},\bar{\bm{m}};\bm{\theta}_{E}),\ \bm{z}^{\prime}=f(\bar{\bm{x}}^{\prime},\bar{\bm{m}}^{\prime};\bm{\theta}_{E});\ \bm{z},\bm{z}^{\prime}\in\mathbb{R}^{d_{e}}
\tag{2}
\end{equation}

where $\bm{\theta}_{E}$ represents the Encoder’s parameters and $d_{e}$ is the dimension of the embedding. Then, the Decoder maps the augmented embedding back to the data space to reconstruct the data sample,

\begin{equation}
\hat{\bm{x}}^{\prime}=g(\bm{z}^{\prime};\bm{\theta}_{D}),\ \hat{\bm{x}}^{\prime(v)}\in\mathbb{R}^{d_{v}}
\tag{3}
\end{equation}

where $\bm{\theta}_{D}$ represents the parameters of the Decoder. Simultaneously, clustering is performed using the un-augmented embedding,

\begin{equation}
\bm{c}=h(\bm{z};\bm{\theta}_{C}),\ \bm{c}\in[0,1]^{d_{c}}
\tag{4}
\end{equation}

where $\bm{c}$ is the clustering result and represents the probabilities of the data sample belonging to each of the $d_{c}$ cluster centers. During training, the loss function defined in equation 19 is computed to optimize the parameters $\bm{\theta}_{E},\bm{\theta}_{D},\bm{\theta}_{C}$. During testing, $\bm{c}$ is regarded as the final clustering result.
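The data flow of equations 2 to 4 can be traced with a shape-level sketch. Everything inside the modules is a stand-in chosen for brevity, not the paper's design: `f` is a masked mean of linear projections rather than the Transformer-based Encoder, `g` is a single linear map per view, `h` is a softmax over negative squared distances to hypothetical centers, and the "augmentation" simply masks one view instead of running KIDA. The only point illustrated is which embedding feeds which module.

```python
import numpy as np

rng = np.random.default_rng(0)
k, dims, d_e, d_c = 3, [4, 5], 8, 2   # toy sizes (assumptions)
V = len(dims)

# Stand-in parameters; the paper's modules are Transformer-based.
theta_E = [rng.normal(scale=0.1, size=(d, d_e)) for d in dims]
theta_D = [rng.normal(scale=0.1, size=(d_e, d)) for d in dims]
theta_C = rng.normal(size=(d_c, d_e))  # hypothetical cluster centers

def f(x_bar, m_bar):
    """Stand-in Encoder: masked mean over neighbors and views, z in R^{d_e}."""
    feats = np.stack([x_bar[v] @ theta_E[v] for v in range(V)], axis=1)  # (k, V, d_e)
    w = m_bar[..., None]                                                 # (k, V, 1)
    return (feats * w).sum(axis=(0, 1)) / max(w.sum(), 1)

def g(z):
    """Stand-in Decoder: one reconstruction per view, each in R^{d_v}."""
    return [z @ theta_D[v] for v in range(V)]

def h(z):
    """Stand-in Clustering Module: soft assignment over d_c centers."""
    logits = -((theta_C - z) ** 2).sum(axis=1)
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Inputs shaped like the outputs of equation 1 (mask value 1 = usable entry, an assumption).
x_bar = [rng.normal(size=(k, d)) for d in dims]
m_bar = np.ones((k, V), dtype=int)
m_bar_aug = m_bar.copy()
m_bar_aug[:, 1] = 0                                   # crude stand-in for augmentation
x_bar_aug = [x_bar[v] * m_bar_aug[:, [v]] for v in range(V)]

z = f(x_bar, m_bar)              # un-augmented embedding (equation 2)
z_aug = f(x_bar_aug, m_bar_aug)  # augmented embedding (equation 2)
x_hat = g(z_aug)                 # reconstruction from the augmented embedding (equation 3)
c = h(z)                         # clustering from the un-augmented embedding (equation 4)
```

Note the asymmetry: the Decoder consumes the augmented embedding, while the Clustering Module consumes the un-augmented one, and both embeddings come from the same Encoder parameters.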

In the following sections, we will introduce the Encoder module, including its two submodules: the Neighbor Dimensional Encoder (NDE) and the View Dimensional Encoder (VDE), the Decoder module, and the Clustering Module respectively.

3.1.1. Neighbor Dimensional Encoder

KNN Imputation (Section 3.2) provides additional information for missing views, but the retrieved nearest neighbors may contain noise and be unreliable. To address this issue, we propose the Neighbor Dimensional Encoder (NDE), which is a series of customized Transformer Encoders (Vaswani et al., 2017), with each one dedicated to a view to fuse its KNN input and filter out noise, formulated as:

\begin{equation}
\bm{x}_{\scriptscriptstyle NDE}=\{\bm{x}_{\scriptscriptstyle NDE}^{(1)},\bm{x}_{\scriptscriptstyle NDE}^{(2)},\cdots,\bm{x}_{\scriptscriptstyle NDE}^{(V)}\},\ \bm{x}_{\scriptscriptstyle NDE}^{(v)}\in\mathbb{R}^{d_{v}}
\tag{5}
\end{equation}
\begin{equation}
\bm{x}_{\scriptscriptstyle NDE}^{(v)}=f_{\scriptscriptstyle NDE}^{(v)}(CDPE(\bar{\bm{x}}^{(v)}),\bar{\bm{m}};\bm{\theta}_{\scriptscriptstyle NDE}^{(v)})_{[0,:]}
\tag{6}
\end{equation}

The $v$-th Transformer Encoder, corresponding to the $v$-th view, is represented by $f_{\scriptscriptstyle NDE}^{(v)}(\cdot\ ;\bm{\theta}_{\scriptscriptstyle NDE}^{(v)})$ in equation 6, where $\bm{\theta}_{\scriptscriptstyle NDE}^{(v)}$ denotes its parameters. The input KNN sequence from the $v$-th view, $\bar{\bm{x}}^{(v)}$, is first processed to add the Cosine Distance-based Positional Encoding (CDPE), and is then passed through the Transformer Encoder with the KNN mask $\bar{\bm{m}}$. Finally, only the first vector of the output sequence is chosen as the output, denoted as $[0,:]$.

Below we introduce the two key customizations of the Transformer Encoders in NDE: the CDPE and the output choice.

Cosine Distance-based Positional Encoding (CDPE)

The order or distance of the KNN instances contains vital information regarding the reliability of the inputs, with farther neighbors noisier and less reliable. To capture this information for the permutation invariant Transformer structure, we introduce Positional Encoding (PE) to provide this extra KNN order information. We explored various positional encoding (PE) designs considering the data sources and their combination with data. Among these configurations, concatenating cosine distance-based (inspired by Nguyen et al. (2021)) or learnable PE with the input yielded the best results. For better interpretability, Cosine Distance-based Positional Encoding (CDPE) is chosen as our final design. The CDPE can be explained as,

\begin{equation}
CDPE(\bar{\bm{x}}^{(v)})=\bar{\bm{x}}^{(v)}\oplus d(\bar{\bm{x}}^{(v)}),\ d(\bar{\bm{x}}^{(v)})\in\mathbb{R}^{k\times k}
\tag{7}
\end{equation}

in which $d(\cdot)$ is the function that calculates the pair-wise distances of $k$ vectors and returns a $k\times k$ distance matrix, and $\oplus$ stands for matrix concatenation. Given two input vectors $\bm{x}_{1}$ and $\bm{x}_{2}$, the pair-wise cosine distance is formulated as,

\begin{equation}
d_{cos}(\bm{x}_{1},\bm{x}_{2})=1-\frac{\bm{x}_{1}\cdot\bm{x}_{2}}{||\bm{x}_{1}||\cdot||\bm{x}_{2}||}
\tag{8}
\end{equation}

We conjecture that CDPE is the most suitable for two reasons. First, it contains the KNN distance information rather than simply providing order information. Second, its value range is 0-2, which is more stable than other distance functions, e.g., Euclidean distance.
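As a minimal NumPy sketch of equations 7 and 8, the functions below compute the pair-wise cosine distance matrix of a KNN sequence and append it to the input. Two details are assumptions rather than statements from the paper: the concatenation is taken along the feature axis (the only axis on which a $k\times d_{v}$ matrix and a $k\times k$ matrix agree), and a small `eps` guards against zero-norm vectors. The function names are illustrative.

```python
import numpy as np

def pairwise_cosine_distance(x, eps=1e-12):
    """x: (k, d) array of k vectors. Returns the (k, k) matrix of
    1 - cosine similarity, following equation 8."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    x_unit = x / np.maximum(norms, eps)  # eps avoids division by zero (assumption)
    return 1.0 - x_unit @ x_unit.T

def cdpe(x_bar_v):
    """Concatenate a KNN sequence of shape (k, d_v) with its (k, k)
    distance matrix along the feature axis, giving shape (k, d_v + k)."""
    return np.concatenate([x_bar_v, pairwise_cosine_distance(x_bar_v)], axis=1)

k, d_v = 4, 6
x_bar_v = np.random.default_rng(0).normal(size=(k, d_v))
D = pairwise_cosine_distance(x_bar_v)
encoded = cdpe(x_bar_v)   # each row now carries its distances to all k neighbors
```

Because cosine similarity lies in $[-1, 1]$, every entry of `D` falls in $[0, 2]$ regardless of the feature scale, which is the stability property noted above.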

Output choice

(Figure 3a) Generally, for fusing information with a Transformer Encoder, an additional token like [CLS] can be added (Devlin et al., 2019). However, in our unsupervised task, adding such a meaningless token can introduce noise and lead to performance degradation. Instead, we adopt the first vector of the output sequence. This design not only avoids extra noise but also introduces a bias on the first input. The first input is always the most reliable sample, i.e., the center sample for an existing view or the nearest neighbor for a missing view. By introducing this bias, important information is emphasized while fusing KNN information.

3.1.2. View Dimensional Encoder

The View Dimensional Encoder (VDE) is designed to fuse view representations and obtain a unified embedding. As depicted in Figure 2, it consists of two parts: first, a Feed-Forward Network (FFN) that maps the representations of different dimensions to the same latent space, and then a Transformer Encoder for fusion. The FFN consists of three fully connected (FC) layers and omits normalization and dropout layers, which can be detrimental to the stability of training. The FFN of VDE can be formulated as:

(9) ${\bm{x}}_{\scriptscriptstyle VDE-F}={\bm{x}}_{\scriptscriptstyle VDE-F}^{(1)}\oplus{\bm{x}}_{\scriptscriptstyle VDE-F}^{(2)}\oplus\cdots\oplus{\bm{x}}_{\scriptscriptstyle VDE-F}^{(V)},\ {\bm{x}}_{\scriptscriptstyle VDE-F}\in\mathbb{R}^{V\times d_{e}}$
(10) ${\bm{x}}_{\scriptscriptstyle VDE-F}^{(v)}=\sigma(\sigma({\bm{x}}_{\scriptscriptstyle NDE}^{(v)}\bm{W}_{1}^{(v)}+\bm{b}_{1}^{(v)})\bm{W}_{2}^{(v)}+\bm{b}_{2}^{(v)})\bm{W}_{3}^{(v)}+\bm{b}_{3}^{(v)}$

In the equations, $\oplus$ represents the concatenation operation, $\sigma$ is the activation function, and $\bm{W}$ and $\bm{b}$ are the weight matrix and bias vector of an FC layer, respectively.
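As a rough illustration of equations 9 and 10, the sketch below runs one three-layer FFN per view and stacks the results into a $V\times d_{e}$ array. It is not the authors' implementation: ReLU standing in for $\sigma$, the hidden width, the random initialization, and all function names are assumptions made for this example.

```python
import random

def relu(vec):
    return [max(0.0, a) for a in vec]

def linear(x, weight, bias):
    # x: length d_in; weight: d_in rows of length d_out; bias: length d_out.
    return [sum(x[i] * weight[i][j] for i in range(len(x))) + bias[j]
            for j in range(len(bias))]

def init_ffn(d_in, d_hidden, d_out, rng):
    # Three FC layers; the small uniform init is an arbitrary demo choice.
    sizes = [(d_in, d_hidden), (d_hidden, d_hidden), (d_hidden, d_out)]
    return [([[rng.uniform(-0.1, 0.1) for _ in range(o)] for _ in range(i)],
             [0.0] * o) for i, o in sizes]

def view_ffn(x, layers):
    # Equation 10: activation after layers 1 and 2, none after layer 3.
    (w1, b1), (w2, b2), (w3, b3) = layers
    h = relu(linear(x, w1, b1))
    h = relu(linear(h, w2, b2))
    return linear(h, w3, b3)

def vde_ffn(view_reprs, ffns):
    # Equation 9: one row per view, all rows sharing the width d_e.
    return [view_ffn(x, layers) for x, layers in zip(view_reprs, ffns)]

rng = random.Random(0)
view_dims = [5, 3, 8]   # hypothetical per-view input dimensions
d_e = 4                 # hypothetical shared latent dimension
ffns = [init_ffn(d, 6, d_e, rng) for d in view_dims]
views = [[rng.gauss(0.0, 1.0) for _ in range(d)] for d in view_dims]
x_vde_f = vde_ffn(views, ffns)
print(len(x_vde_f), [len(row) for row in x_vde_f])
```

Each view keeps its own weights, so inputs of different widths end up as rows of equal width that a Transformer Encoder can consume as one sequence.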

The Transformer Encoder part of the VDE can be formulated as,

(11) ${\bm{z}}=\sum_{v=1}^{V}f_{\scriptscriptstyle VDE-T}({\bm{x}}_{\scriptscriptstyle VDE-F},TAM(\bar{\bm{m}},{\bm{m}});{\bm{\theta}}_{\scriptscriptstyle VDE-T})/V$

in which $f_{\scriptscriptstyle VDE-T}(\cdot\ ;{\bm{\theta}}_{\scriptscriptstyle VDE-T})$ is the VDE Transformer Encoder structure. The view representations ${\bm{x}}_{\scriptscriptstyle VDE-F}$ are passed through the Transformer Encoder along with the generated Three-level Adaptive Mask (TAM) $TAM(\bar{\bm{m}},{\bm{m}})$. The output sequence is averaged for fusion. The Transformer Encoder in VDE is also customized, but differently from that in NDE. The key difference is that views are permutation invariant, i.e., changing the order of views should yield the same output, whereas the KNN sequence is ordered. Based on this difference, two key customizations of the VDE Transformer are designed:

Three-level Adaptive Masking (TAM)

We employ a masking mechanism in VDE to emphasize the reliability of the inputs, instead of the positional encoding used in NDE, in order to maintain permutation invariance. The self-attention in the Transformer is formulated as,

(12) $Attention(Q,K,V,M_{A})=softmax(\frac{QK^{T}}{\sqrt{d_{k}}}+M_{A})V$

where $M_{A}$ is the applied mask, with negative infinity for masked positions and 0 for unmasked ones. The input view representations can be roughly divided into 3 categories based on data completeness: (1) complete, (2) missing view with KNN imputation, and (3) missing view without imputation. Therefore, instead of the original binary mask, we design a Three-level Adaptive Mask (TAM) for the 3 categories. The mask values range from completely unmasked for category (1) to fully masked for category (3), with an intermediate level for category (2), formulated as,

(13) $M_{A}=TAM(\bar{\bm{m}},{\bm{m}})=\left\{\begin{aligned} 0,&\ {\bm{m}}_{j}=1\\ \gamma,&\ \sum_{i=1}^{k}\bar{\bm{m}}_{ij}>0\ \&\ {\bm{m}}_{j}=0\\ -\infty,&\ \sum_{i=1}^{k}\bar{\bm{m}}_{ij}=0\ \&\ {\bm{m}}_{j}=0\\ \end{aligned}\right.$

in which ${\bm{m}}$ and $\bar{\bm{m}}$ are the original missing indicator and the missing indicator matrix generated by KNN imputation, respectively. $\gamma$ is a negative hyperparameter that controls the emphasizing intensity.
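To make the three levels concrete, the following sketch builds the mask of equation 13 for one sample and applies it additively to a row of attention scores, as in equation 12. It assumes the mask value of view $j$ is added to every score whose key is view $j$; that broadcasting choice, the helper names, and the example value of $\gamma$ are assumptions for illustration, not details confirmed by the text.

```python
import math

NEG_INF = float("-inf")

def tam(m_bar, m, gamma):
    """m_bar: k x V matrix of 0/1 from KNN imputation; m: length-V 0/1 vector."""
    k = len(m_bar)
    mask = []
    for j, present in enumerate(m):
        if present == 1:
            mask.append(0.0)        # complete view: unmasked
        elif sum(m_bar[i][j] for i in range(k)) > 0:
            mask.append(gamma)      # missing but imputed: down-weighted
        else:
            mask.append(NEG_INF)    # missing and not imputed: fully masked
    return mask

def masked_softmax(scores, mask):
    shifted = [s + a for s, a in zip(scores, mask)]
    top = max(shifted)
    if top == NEG_INF:
        raise ValueError("every position is fully masked")
    exps = [math.exp(s - top) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

m = [1, 0, 0]                    # only view 0 is observed
m_bar = [[1, 1, 0], [1, 0, 0]]   # k = 2: view 1 received one KNN sample
mask = tam(m_bar, m, gamma=-2.0)
weights = masked_softmax([0.0, 0.0, 0.0], mask)
print(mask, weights)
```

With equal raw scores, the observed view receives the largest attention weight, the imputed view a smaller but nonzero weight, and the view without imputation exactly zero.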

Output choice

(Figure 3b) To ensure permutation invariance and avoid bias towards any view, the embedding is generated by averaging all Transformer output vectors. The output choices of NDE and VDE are depicted in Figure 3 for intuitive understanding.
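The difference between the two output choices can be checked on a toy sequence. The sketch below assumes that reordering the views simply reorders the Transformer outputs (which holds for self-attention without positional encoding); the vectors and function names are made up for the example.

```python
def mean_pool(outputs):
    # VDE-style choice: average all output vectors.
    n = len(outputs)
    return [sum(vec[d] for vec in outputs) / n for d in range(len(outputs[0]))]

def first_pool(outputs):
    # NDE-style choice: keep only the first output vector.
    return list(outputs[0])

outputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
shuffled = [outputs[2], outputs[0], outputs[1]]   # same views, new order

print(mean_pool(outputs), mean_pool(shuffled))
print(first_pool(outputs), first_pool(shuffled))
```

Averaging returns the same vector for both orderings, while taking the first vector depends on which input comes first, which is the bias NDE wants and VDE avoids.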

(a) Neighbor Dimensional Encoder
(b) View Dimensional Encoder
Figure 3. An intuitive visualization of the output choice of the Neighbor Dimensional Encoder (NDE) and View Dimensional Encoder (VDE). In NDE, the first vector of the output sequence is chosen to provide a bias on the most reliable input. In VDE, the outputs are averaged to provide an unbiased representation of all views.

The effectiveness of this design is validated by ablation studies in the Appendix.

3.1.3. Decoder

The Decoder in our model is designed as a compact 4-layer FFN, to reconstruct the input from the unified embedding. Similar to the FFN in VDE, we removed its normalization and dropout layer for better stability. Through our experiments, we have observed that a deep and complex Decoder does not necessarily improve the clustering performance and may even have negative effects in certain cases. One possible explanation for this phenomenon is that a shallow and simple Decoder serves as a regularization technique on the embedding space, and prevents it from collapsing. This regularization effect is similar to the Locality-preserving Constraint proposed by Huang et al. (2014), which helps preserve the local structure of the data. The process of Decoder is formulated as,

(14) $\hat{\bm{x}}=\{\hat{\bm{x}}^{(1)},\hat{\bm{x}}^{(2)},\cdots,\hat{\bm{x}}^{(V)}\},\ \hat{\bm{x}}^{(v)}\in\mathbb{R}^{d_{v}}$
(15) $\hat{\bm{x}}^{(v)}=g^{(v)}({\bm{z}};{\bm{\theta}}_{D}^{(v)})$

3.1.4. Clustering Module

The auto-encoder we have designed extracts robust representations and captures the inherent structures of data. However, these inherent structures may not necessarily follow a cluster-oriented distribution. To enhance the clustering performance, we introduce a Clustering Module inspired by DEC (Xie et al., 2016). Below we describe its procedures. First, after auto-encoder pretraining, a traditional clustering method is adopted to initialize the cluster centers ${\bm{\theta}}_{C}\in\mathbb{R}^{d_{c}\times d_{e}}$ from the embedding ${\bm{z}}$. Then, during each iteration of joint training, a similarity matrix ${\bm{q}}\in[0,1]^{N\times d_{c}}$ is generated between ${\bm{\theta}}_{C}$ and the embedding ${\bm{z}}$ with Student's t-distribution, in which $q_{ij}$ represents the probability that sample $i$ belongs to cluster $j$, and ${\bm{c}}\in\mathbb{R}^{d_{c}}$ defined in equation 4 is a row of ${\bm{q}}$. With the similarity matrix ${\bm{q}}$, we calculate the target distribution ${\bm{p}}$ as,

(16) $p_{ij}=\frac{q_{ij}^{2}/f_{j}}{\sum_{j=1}^{d_{c}}(q_{ij}^{2}/f_{j})}$

in which $f_{j}=\sum_{i=0}^{N-1}q_{ij}$ is the soft cluster size. The training target is the KL-divergence between ${\bm{p}}$ and ${\bm{q}}$ (equation 22), which is defined in Section 3.3.1. Finally, after joint training, the clustering result is obtained by taking the maximum probability in each row of ${\bm{q}}$. The detailed training procedures are described in Section 3.3.2.
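The Clustering Module steps can be sketched in a few lines. The soft assignment below uses the Student's t kernel from DEC with one degree of freedom ($\alpha=1$), which is the DEC default and only assumed to match this paper; the target distribution follows equation 16. The toy embeddings, the centers, and the function names are hypothetical.

```python
def soft_assign(z, centers, alpha=1.0):
    # q_ij proportional to (1 + ||z_i - c_j||^2 / alpha)^(-(alpha + 1) / 2).
    q = []
    for z_i in z:
        row = []
        for c_j in centers:
            d2 = sum((a - b) ** 2 for a, b in zip(z_i, c_j))
            row.append((1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(row)
        q.append([r / total for r in row])
    return q

def target_distribution(q):
    # Equation 16: square q, divide by the soft cluster size f_j, renormalize.
    n_clusters = len(q[0])
    f = [sum(row[j] for row in q) for j in range(n_clusters)]
    p = []
    for row in q:
        w = [row[j] ** 2 / f[j] for j in range(n_clusters)]
        total = sum(w)
        p.append([x / total for x in w])
    return p

def hard_labels(q):
    # Final clustering result: index of the largest entry in each row of q.
    return [max(range(len(row)), key=row.__getitem__) for row in q]

z = [[0.0, 0.0], [0.2, 0.0], [4.0, 4.0], [4.2, 4.0]]
centers = [[0.0, 0.0], [4.0, 4.0]]
q = soft_assign(z, centers)
p = target_distribution(q)
print(hard_labels(q))
```

In this toy case the target ${\bm{p}}$ is sharper than ${\bm{q}}$ for confidently assigned samples, which is what lets high-confidence samples drive the clustering loss.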

3.2. Data Augmentation and Imputation

In this section, we introduce Data Augmentation and KNN Imputation ($KIDA(\cdot)$ in equation 1), two key data processing strategies of our approach for handling incomplete multi-view data.

3.2.1. KNN Imputation

The KNN search is conducted separately in each view, taking into account the incomplete condition. For existing views, the KNN is directly obtained, while for missing views, existing views from the same sample are used for searching KNN. Specifically, for a missing view ${\bm{x}}_{i}^{(v)}$ of sample ${\bm{x}}_{i}$, we first find all existing views ${\bm{x}}_{i}^{(b)}$ of the same sample. Then we iterate through the KNN samples ${\bm{x}}_{j}^{(b)}$ of these existing views to check if ${\bm{x}}_{j}^{(v)}$ exists. If it does, we append it to the KNN list of the missing view. Finally, we select the top $k$ samples from the KNN list as the imputation for the missing view. If the length of the KNN list is less than $k$, the remaining positions are filled with zeros. Along with the KNN imputation $\bar{\bm{x}}$, a missing indicator matrix $\bar{\bm{m}}\in\mathbb{R}^{k\times V}$ is generated, where 1 represents a position filled with a KNN sample and 0 represents a position filled with zeros. The detailed procedures are listed in Algorithm 1, Appendix A.
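The missing-view branch of this procedure can be sketched as follows. This is not Algorithm 1 from the appendix: the brute-force neighbor search, restricting each per-view search to $k$ neighbors, the order in which existing views are visited, the duplicate check, and taking the first $k$ entries as the "top $k$" are all assumptions made to keep the example small.

```python
def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def knn_in_view(X, M, i, b, k):
    # Nearest samples to sample i in view b, among samples where view b exists.
    cands = [j for j in range(len(X)) if j != i and M[j][b] == 1]
    cands.sort(key=lambda j: sq_dist(X[i][b], X[j][b]))
    return cands[:k]

def impute_missing_view(X, M, i, v, k, d_v):
    knn_list = []
    for b in range(len(M[i])):
        if M[i][b] != 1:
            continue                      # only existing views of sample i
        for j in knn_in_view(X, M, i, b, k):
            if M[j][v] == 1 and j not in knn_list:
                knn_list.append(j)        # neighbor j has view v available
    chosen = knn_list[:k]
    pad = k - len(chosen)
    x_bar = [list(X[j][v]) for j in chosen] + [[0.0] * d_v for _ in range(pad)]
    m_bar = [1] * len(chosen) + [0] * pad  # 1 = KNN sample, 0 = zero filling
    return x_bar, m_bar

# Four samples, two views; None marks a missing view.
X = [
    [[0.0], None],
    [[0.1], [10.0]],
    [[0.3], None],
    [[5.0], [50.0]],
]
M = [[1, 0], [1, 1], [1, 0], [1, 1]]
x_bar, m_bar = impute_missing_view(X, M, i=0, v=1, k=2, d_v=1)
print(x_bar, m_bar)
```

Here sample 0 lacks view 1. Its two nearest neighbors in view 0 are samples 1 and 2, but only sample 1 has view 1, so one slot is filled with that sample and the other with zeros.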

3.2.2. Data Augmentation

Our framework is inspired by the denoising auto-encoder, which learns robust representations by introducing noise during training. Three types of noise are designed: Gaussian noise, random dropout, and view dropout. Gaussian noise helps prevent overfitting by introducing variability in the input data. Random dropout, functioning as a regularization technique, encourages the model to learn more robust features by forcing it to rely on different subsets of the input data. View dropout is a noise specifically designed for the IMVC task. It randomly drops out one or more views from the input data during training, encouraging the model to learn representations that are more robust to view-missing conditions.

Below are the formulations of the three kinds of augmentation we use. By combining equations 17 and 18, the KIDA operation in equation 1 is obtained.

(17) $\bar{\bm{x}}^{\prime},\bar{\bm{m}}^{\prime}=KNNI({\bm{x}},{\bm{m}}\odot{\bm{m}}_{\scriptscriptstyle VD},X,M),\ P({\bm{m}}_{\scriptscriptstyle VD}=0)=\phi_{1}$
(18) $\bar{\bm{x}}^{\prime}=(\bar{\bm{x}}^{\prime}+\phi_{2}{\bm{n}})\odot{\bm{m}}_{\scriptscriptstyle RD},\ {\bm{n}}\sim\mathcal{N}(0,1),\ P({\bm{m}}_{\scriptscriptstyle RD}=0)=\phi_{3}$

The view dropout augmentation is shown in equation 17, where random views are dropped out with a probability of $\phi_{1}$ before the KNN Imputation $KNNI(\cdot)$ (Algorithm 1) step. The dropout mask is applied with element-wise multiplication, denoted by $\odot$. The masked views are regarded as missing views in both the KNN Imputation and the auto-encoder network. After the KNN Imputation, in equation 18, Gaussian noise is added to the input data to introduce variability, with its intensity controlled by the hyperparameter $\phi_{2}$. After that, random values in the input are set to zero with a probability of $\phi_{3}$, which is the random dropout augmentation. Finally, the augmented input $\bar{\bm{x}}^{\prime}$ along with the corresponding missing indicator matrix $\bar{\bm{m}}^{\prime}$ can be used for training.
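A minimal sketch of the three augmentations, in the order given by equations 17 and 18, is shown below. The KNN Imputation step that sits between them is left out, the input is a flat list of values for brevity, and the function names are hypothetical. Note that this simple version can drop every view of a sample; whether the original code guards against that is not stated in the text.

```python
import random

def view_dropout(m, phi1, rng):
    # Equation 17: draw m_VD with P(m_VD = 0) = phi1, then mask m element-wise.
    m_vd = [0 if rng.random() < phi1 else 1 for _ in m]
    return [a * b for a, b in zip(m, m_vd)]

def noise_and_random_dropout(x_bar, phi2, phi3, rng):
    # Equation 18: add phi2-scaled Gaussian noise, then zero entries w.p. phi3.
    out = []
    for value in x_bar:
        noisy = value + phi2 * rng.gauss(0.0, 1.0)
        keep = 0.0 if rng.random() < phi3 else 1.0
        out.append(noisy * keep)
    return out

rng = random.Random(0)
m = [1, 0, 1]                                    # view 1 is already missing
m_aug = view_dropout(m, phi1=0.3, rng=rng)       # fed to KNN imputation next
x_aug = noise_and_random_dropout([1.0, -2.0, 0.5], phi2=0.1, phi3=0.2, rng=rng)
print(m_aug, x_aug)
```

A view that is already missing stays missing after view dropout, because the dropout mask is multiplied with the original indicator rather than replacing it.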

3.3. Training Strategy and Loss Function

3.3.1. Loss Function

In the training process, we utilize a combination of 3 loss functions formulated as follows:

(19) $L({\bm{x}},\hat{{\bm{x}}}^{\prime},{\bm{z}},{\bm{z}}^{\prime},{\bm{c}})=L_{rec}({\bm{x}},\hat{{\bm{x}}}^{\prime})+\lambda_{1}L_{aug}({\bm{z}},{\bm{z}}^{\prime})+\lambda_{2}L_{clu}({\bm{c}})$

$L_{rec}$ corresponds to the reconstruction loss of the auto-encoder for learning meaningful latent representations, formulated as:

(20) $L_{rec}({\bm{x}},\hat{\bm{x}}^{\prime})=\sum_{v=1}^{V}(||\hat{\bm{x}}^{\prime(v)}-{\bm{x}}^{(v)}||^{2}\odot{\bm{m}}_{v})$

The missing indicator ${\bm{m}}_{v}$ is multiplied element-wise so that only the mean square errors of existing views are calculated. Note that during training, the network output is $\hat{\bm{x}}^{\prime}$ because the input is the augmented input $\bar{\bm{x}}^{\prime}$ (equations 2 and 3). The network therefore learns to reconstruct dropped-out views from KNN hints and cross-view correlation, and thus learns to predict the information of missing views implicitly.
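The masking in equation 20 amounts to skipping views that were never observed. The sketch below computes the loss for a single sample as a sum of squared errors per view; the data layout (a list of per-view vectors) and the function name are assumptions for illustration.

```python
def reconstruction_loss(x, x_hat, m):
    # Sum over views of ||x_hat^(v) - x^(v)||^2, counted only where m_v = 1.
    total = 0.0
    for x_v, x_hat_v, m_v in zip(x, x_hat, m):
        if m_v == 1:
            total += sum((a - b) ** 2 for a, b in zip(x_hat_v, x_v))
    return total

x = [[1.0, 2.0], [3.0]]        # view 1 is a placeholder for a missing view
x_hat = [[1.0, 4.0], [100.0]]  # decoder outputs for both views
m = [1, 0]                     # original missing indicator
print(reconstruction_loss(x, x_hat, m))
```

Because the original indicator ${\bm{m}}$ is used rather than the augmented one, a view removed by view dropout still contributes to the loss, which is what pushes the network to reconstruct it.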

$L_{aug}$ is the embedding robustness loss, which encourages the learned representations to stay consistent when augmentations are applied and thereby promotes their robustness, formulated as:

(21) $L_{aug}({\bm{z}},{\bm{z}}^{\prime})=-\log\frac{e^{-||{\bm{z}}^{\prime}_{i}-{\bm{z}}_{i}||}}{\sum_{j=1}^{B}e^{-||{\bm{z}}^{\prime}_{i}-{\bm{z}}_{j}||}}$

Though our embedding robustness target is equivalent to minimizing the distance between ${\bm{z}}^{\prime}$ and ${\bm{z}}$ (equation 2), we design $L_{aug}$ based on cross-entropy loss within the training mini-batch. $B$ in the equation represents the training batch size. This design simultaneously minimizes the distance between ${\bm{z}}^{\prime}_{i}$ and ${\bm{z}}_{i}$ and maximizes the distances between ${\bm{z}}^{\prime}_{i}$ and the embeddings of other samples ${\bm{z}}_{j},j\neq i$, preventing the embedding space from collapsing.
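Equation 21 is a softmax cross-entropy over negative Euclidean distances, where the correct "class" for ${\bm{z}}^{\prime}_{i}$ is its own clean embedding ${\bm{z}}_{i}$. The sketch below evaluates it with a log-sum-exp for numerical stability. Averaging the per-sample terms over the batch is an assumption, since the equation is written for a single index $i$.

```python
import math

def aug_loss(z, z_aug):
    total = 0.0
    for i, z_aug_i in enumerate(z_aug):
        # Logits are negative distances from z'_i to every clean embedding z_j.
        logits = [-math.dist(z_aug_i, z_j) for z_j in z]
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_sum_exp - logits[i]   # equals -log softmax(logits)[i]
    return total / len(z_aug)

z = [[0.0, 0.0], [10.0, 0.0]]
aligned = aug_loss(z, [[0.0, 0.0], [10.0, 0.0]])   # z' matches z
swapped = aug_loss(z, [[10.0, 0.0], [0.0, 0.0]])   # z' lands on the wrong sample
print(aligned, swapped)
```

The loss is close to zero when each augmented embedding stays near its own clean embedding and far from the others, and it grows to roughly the inter-sample distance when the augmented embedding drifts onto a different sample.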

$L_{clu}$ is the DEC-based clustering loss (Xie et al., 2016) for the Clustering Module (Section 3.1.4), which optimizes embeddings for clustering with gradients from high-confidence samples.

(22) $L_{clu}({\bm{c}})=KL({\bm{p}}^{\prime}||{\bm{q}}^{\prime})=\sum_{i=1}^{B}\sum_{j=1}^{d_{c}}p^{\prime}_{ij}\log\frac{p^{\prime}_{ij}}{q^{\prime}_{ij}}$

It is formulated as the KL-divergence between distribution ${\bm{p}}^{\prime}$ and ${\bm{q}}^{\prime}$ computed from augmented input ${\bm{x}}^{\prime}$ during training.
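For completeness, equation 22 can be evaluated directly on a mini-batch of assignment rows. The values below are invented for the example, and terms with a zero target probability are treated as contributing zero, which is the usual convention rather than something stated in the text.

```python
import math

def kl_divergence(p, q):
    # Sum over samples i and clusters j of p_ij * log(p_ij / q_ij).
    total = 0.0
    for p_row, q_row in zip(p, q):
        for p_ij, q_ij in zip(p_row, q_row):
            if p_ij > 0.0:
                total += p_ij * math.log(p_ij / q_ij)
    return total

q_batch = [[0.7, 0.3], [0.4, 0.6]]   # soft assignments for B = 2 samples
p_batch = [[0.9, 0.1], [0.2, 0.8]]   # sharper targets for the same samples
print(kl_divergence(p_batch, q_batch))
```

The loss is zero only when the soft assignments already equal the sharpened targets, so minimizing it moves each row of ${\bm{q}}^{\prime}$ toward a more confident assignment.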

Hyperparameters $\lambda_{1}$ and $\lambda_{2}$ control the balance between different loss components. By jointly minimizing the three loss terms, our network can learn representations that are both informative and clustering-friendly.

3.3.2. Training Strategy

The training process is divided into two stages. In the first stage, the auto-encoder is pre-trained using $L_{rec}$ and $L_{aug}$, focusing on learning robust representations. Once the pre-training is complete, the Clustering Module is initialized. In the second stage, $L_{clu}$ is added for joint training to learn a clustering-friendly representation.

The training process is controlled by 2 hyperparameters: $E_{p}$, the number of pre-training epochs, and $E_{j}$, the number of joint training epochs. For a detailed description of the training process, please refer to Algorithm 2 in Appendix A.
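The two-stage schedule can be summarized as a function of the epoch index. This sketch is not Algorithm 2: it only shows which loss terms of equation 19 are active in each stage, under the assumption that $\lambda_{1}$ weights $L_{aug}$ in both stages. The function names and the example numbers are hypothetical.

```python
def stage_for_epoch(epoch, e_p, e_j):
    # Epochs [0, e_p) are pre-training; epochs [e_p, e_p + e_j) are joint training.
    if epoch < e_p:
        return "pretrain"
    if epoch < e_p + e_j:
        return "joint"
    raise ValueError("epoch is beyond the training schedule")

def total_loss(l_rec, l_aug, l_clu, lam1, lam2, stage):
    # Equation 19, with the clustering term switched on only in joint training.
    loss = l_rec + lam1 * l_aug
    if stage == "joint":
        loss += lam2 * l_clu
    return loss

schedule = [stage_for_epoch(e, e_p=2, e_j=3) for e in range(5)]
print(schedule)
print(total_loss(1.0, 2.0, 4.0, lam1=0.5, lam2=0.25, stage="joint"))
```

The cluster centers would be initialized once, at the boundary between the last "pretrain" epoch and the first "joint" epoch.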

4. Experiments

Please refer to Appendix B.1 for the hyperparameter settings and design details in our experiments.

Table 1. Statistics of the 6 datasets used in our experiments.
Name | Views | Clusters | Samples | Dimensions | Type
Handwritten (Duin, 2023) | 6 | 10 | 2000 | 240/76/216/47/64/6 | Image
Caltech101-7 (Cai et al., 2013) | 5 | 7 | 1400 | 40/254/1984/512/928 | Image
ALOI_Deep (Liu et al., 2023) | 3 | 100 | 10800 | 2048/4096/2048 | Image
Scene15 (Fei-Fei and Perona, 2005; Cai et al., 2013) | 2 | 15 | 4485 | 20/59 | Image
BDGP (Cai et al., 2012; Tang and Liu, 2022) | 2 | 5 | 2500 | 1750/79 | Image/Text
Reuters (Amini et al., 2009; Yang et al., 2023) | 2 | 6 | 18758 | 10/10 | Text
Table 2. Comparison of our method with state-of-the-art approaches on 6 benchmark datasets. The results are averaged on missing rates mr={0,0.25,0.5,0.75}subscript𝑚𝑟00.250.50.75m_{r}=\{0,0.25,0.5,0.75\}italic_m start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT = { 0 , 0.25 , 0.5 , 0.75 }. The best result is highlighted in bold while the sub-optimal is underlined.
Datasets Handwritten Caltech101-7 ALOI_Deep
Metrics Acc(%) NMI(%) ARI(%) Acc(%) NMI(%) ARI(%) Acc(%) NMI(%) ARI(%)
Completer (Lin et al., 2021) 52.19±5.14plus-or-minus52.195.1452.19\pm 5.1452.19 ± 5.14 54.67±3.60plus-or-minus54.673.6054.67\pm 3.6054.67 ± 3.60 28.77±4.72plus-or-minus28.774.7228.77\pm 4.7228.77 ± 4.72 62.89±8.02plus-or-minus62.898.0262.89\pm 8.0262.89 ± 8.02 57.75±6.50plus-or-minus57.756.5057.75\pm 6.5057.75 ± 6.50 41.05±9.33plus-or-minus41.059.3341.05\pm 9.3341.05 ± 9.33 44.53±2.53plus-or-minus44.532.5344.53\pm 2.5344.53 ± 2.53 75.45±1.18plus-or-minus75.451.1875.45\pm 1.1875.45 ± 1.18 26.47±2.10plus-or-minus26.472.1026.47\pm 2.1026.47 ± 2.10
DSIMVC (Tang and Liu, 2022) 75.76±4.04plus-or-minus75.764.0475.76\pm 4.0475.76 ± 4.04 71.32±2.77plus-or-minus71.322.7771.32\pm 2.7771.32 ± 2.77 63.17±4.11plus-or-minus63.174.1163.17\pm 4.1163.17 ± 4.11 70.06±3.95plus-or-minus70.063.9570.06\pm 3.9570.06 ± 3.95 59.56±2.58plus-or-minus59.562.5859.56\pm 2.5859.56 ± 2.58 52.06±3.52plus-or-minus52.063.5252.06\pm 3.5252.06 ± 3.52 72.58±2.24plus-or-minus72.582.2472.58\pm 2.2472.58 ± 2.24 91.21±0.67plus-or-minus91.210.6791.21\pm 0.6791.21 ± 0.67 70.15±2.10plus-or-minus70.152.1070.15\pm 2.1070.15 ± 2.10
SURE (Yang et al., 2023) 66.46±6.81plus-or-minus66.466.8166.46\pm 6.8166.46 ± 6.81 61.74±4.59plus-or-minus61.744.5961.74\pm 4.5961.74 ± 4.59 50.37±6.38plus-or-minus50.376.3850.37\pm 6.3850.37 ± 6.38 66.97±5.94plus-or-minus66.975.9466.97\pm 5.9466.97 ± 5.94 54.37±4.92plus-or-minus54.374.9254.37\pm 4.9254.37 ± 4.92 46.86±6.71plus-or-minus46.866.7146.86\pm 6.7146.86 ± 6.71 50.08±5.96plus-or-minus50.085.9650.08\pm 5.9650.08 ± 5.96 86.39±1.83plus-or-minus86.391.8386.39\pm 1.8386.39 ± 1.83 40.66±7.56plus-or-minus40.667.5640.66\pm 7.5640.66 ± 7.56
DCP (Lin et al., 2023) 59.80±6.32plus-or-minus59.806.3259.80\pm 6.3259.80 ± 6.32 62.73±3.92plus-or-minus62.733.9262.73\pm 3.9262.73 ± 3.92 45.40±8.21plus-or-minus45.408.2145.40\pm 8.2145.40 ± 8.21 55.26±10.75plus-or-minus55.2610.7555.26\pm 10.7555.26 ± 10.75 54.44±9.59plus-or-minus54.449.5954.44\pm 9.5954.44 ± 9.59 40.75±13.03plus-or-minus40.7513.0340.75\pm 13.0340.75 ± 13.03 56.89±5.34plus-or-minus56.895.3456.89\pm 5.3456.89 ± 5.34 86.82±2.01plus-or-minus86.822.0186.82\pm 2.0186.82 ± 2.01 50.66±9.00plus-or-minus50.669.0050.66\pm 9.0050.66 ± 9.00
CPSPAN (Jin et al., 2023) 84.19±5.43plus-or-minus84.195.4384.19\pm 5.4384.19 ± 5.43 81.49±2.65plus-or-minus81.492.6581.49\pm 2.6581.49 ± 2.65 76.06±4.79plus-or-minus76.064.7976.06\pm 4.7976.06 ± 4.79 82.07±4.41plus-or-minus82.074.4182.07\pm 4.4182.07 ± 4.41 73.04±4.19plus-or-minus73.044.1973.04\pm 4.1973.04 ± 4.19 68.55±5.73plus-or-minus68.555.7368.55\pm 5.7368.55 ± 5.73 73.01±4.27plus-or-minus73.014.2773.01\pm 4.2773.01 ± 4.27 92.40±1.07plus-or-minus92.401.0792.40\pm 1.0792.40 ± 1.07 71.79±3.78plus-or-minus71.793.7871.79\pm 3.7871.79 ± 3.78
RecFormer (Liu et al., 2023) 90.46±1.33plus-or-minus90.461.3390.46\pm 1.3390.46 ± 1.33 82.67 ±plus-or-minus\pm± 0.99 80.12±1.81plus-or-minus80.121.8180.12\pm 1.8180.12 ± 1.81 71.07±2.53plus-or-minus71.072.5371.07\pm 2.5371.07 ± 2.53 63.87±2.49plus-or-minus63.872.4963.87\pm 2.4963.87 ± 2.49 56.77±3.09plus-or-minus56.773.0956.77\pm 3.0956.77 ± 3.09 86.57±1.54plus-or-minus86.571.5486.57\pm 1.5486.57 ± 1.54 96.84±0.33plus-or-minus96.840.3396.84\pm 0.3396.84 ± 0.33 86.01±1.55plus-or-minus86.011.5586.01\pm 1.5586.01 ± 1.55
URRL-IMVC (ours) 93.86 ±plus-or-minus\pm± 2.51 89.90 ±plus-or-minus\pm± 1.38 88.93 ±plus-or-minus\pm± 2.58 93.19 ±plus-or-minus\pm± 1.54 86.94 ±plus-or-minus\pm± 1.46 86.45 ±plus-or-minus\pm± 1.98 91.48 ±plus-or-minus\pm± 2.21 97.50 ±plus-or-minus\pm± 1.04 90.91 ±plus-or-minus\pm± 2.70
Datasets | Scene15 | Scene15 | Scene15 | BDGP | BDGP | BDGP | Reuters | Reuters | Reuters
Metrics | Acc(%) | NMI(%) | ARI(%) | Acc(%) | NMI(%) | ARI(%) | Acc(%) | NMI(%) | ARI(%)
Completer (Lin et al., 2021) | 38.39 ± 1.96 | 42.09 ± 1.54 | 22.86 ± 1.78 | 54.91 ± 5.99 | 46.89 ± 4.62 | 22.20 ± 6.30 | 38.68 ± 4.02 | 22.04 ± 4.44 | 8.26 ± 4.12
DSIMVC (Tang and Liu, 2022) | 31.63 ± 1.22 | 35.50 ± 0.74 | 17.48 ± 0.67 | 94.63 ± 1.53 | 85.62 ± 2.29 | 87.53 ± 2.74 | 44.07 ± 2.91 | 33.27 ± 2.10 | 23.69 ± 2.20
SURE (Yang et al., 2023) | 37.83 ± 1.83 | 37.62 ± 0.80 | 21.03 ± 0.93 | 60.48 ± 9.91 | 40.41 ± 9.37 | 34.68 ± 10.60 | 46.68 ± 3.63 | 26.26 ± 3.32 | 20.57 ± 2.16
DCP (Lin et al., 2023) | 38.28 ± 1.63 | 41.69 ± 1.23 | 22.22 ± 1.70 | 50.98 ± 5.90 | 44.50 ± 6.50 | 18.67 ± 6.87 | 38.60 ± 3.29 | 21.79 ± 4.84 | 7.12 ± 3.80
CPSPAN (Jin et al., 2023) | 37.71 ± 2.33 | 41.38 ± 2.04 | 22.68 ± 1.84 | 76.93 ± 9.26 | 63.25 ± 7.62 | 59.91 ± 10.32 | 39.78 ± 2.02 | 14.55 ± 2.07 | 12.47 ± 1.54
RecFormer (Liu et al., 2023) | 33.37 ± 1.39 | 35.31 ± 0.94 | 17.45 ± 0.79 | 51.89 ± 2.92 | 40.46 ± 2.69 | 19.52 ± 2.33 | 41.43 ± 3.59 | 18.38 ± 2.25 | 15.91 ± 2.23
URRL-IMVC (ours) | 41.18 ± 1.77 | 41.87 ± 0.95 | 24.09 ± 1.09 | 89.15 ± 5.07 | 77.14 ± 6.25 | 77.64 ± 8.03 | 48.63 ± 2.64 | 28.94 ± 1.65 | 24.78 ± 2.59

4.1. Datasets and Metrics

Experiments were performed on six multi-view datasets that vary in the number of views and in modality to validate the effectiveness of our method. The dataset characteristics are summarized in Table 1. We report three widely used metrics: Clustering Accuracy (Acc), Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI). Each experiment is run 10 times, and we report the mean and the standard deviation (the value after ±). Details about our experiment and view-missing settings can be found in Appendix B.2.
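As an illustration of how these three metrics are commonly computed, the following minimal sketch uses SciPy and scikit-learn. It assumes the usual convention that Acc is obtained by matching predicted clusters to ground-truth classes with the Hungarian algorithm; the function names `clustering_accuracy` and `summarize_runs` are hypothetical, and the use of the population standard deviation is an assumption rather than something stated in the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score


def clustering_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one mapping of clusters to classes."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    true_ids, true_idx = np.unique(y_true, return_inverse=True)
    pred_ids, pred_idx = np.unique(y_pred, return_inverse=True)
    # Contingency table: rows are predicted clusters, columns are true classes.
    counts = np.zeros((pred_ids.size, true_ids.size), dtype=np.int64)
    np.add.at(counts, (pred_idx, true_idx), 1)
    # Hungarian matching on the negated table maximizes the matched samples.
    rows, cols = linear_sum_assignment(-counts)
    return counts[rows, cols].sum() / y_true.size


def summarize_runs(scores):
    """Format fractional scores from repeated runs as 'mean ± std' in percent.

    Assumption: population standard deviation (NumPy's default, ddof=0).
    """
    scores = np.asarray(scores, dtype=float) * 100.0
    return f"{scores.mean():.2f} ± {scores.std():.2f}"


# Toy labels: the prediction is a relabeling of the truth with one mistake.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 0]
acc = clustering_accuracy(y_true, y_pred)
nmi = normalized_mutual_info_score(y_true, y_pred)
ari = adjusted_rand_score(y_true, y_pred)
```

All three metrics are invariant to how cluster indices are named, which is why a matching step is needed only for Acc.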

4.2. Comparison with State-of-the-arts

We compare our approach with several state-of-the-art DIMVC approaches listed in Table 2. Further comparisons covering different numbers of views, traditional IMVC methods, model parameters, and computational costs can be found in Appendix B.3.

Comparison on different datasets

URRL-IMVC achieved state-of-the-art performance on the six benchmark datasets, as indicated in Table 2. It outperformed the other SOTA methods on every evaluation metric except on the BDGP dataset and for NMI on Scene15 and Reuters, where it ranks second. We attribute this clustering performance to our representation learning framework, which effectively captures the underlying data structure while remaining robust in the presence of missing views. URRL-IMVC was also more stable than the other SOTA methods, with a relatively lower standard deviation across 10 runs, thanks to the tailored components in the network that filter out noise. Notably, our approach excelled on datasets with more views, such as Handwritten and Caltech101-7. Together with the experimental results in Table 5, this shows that our framework successfully overcomes the drawbacks of cross-view contrastive learning.

Figure 4. Comparison with state-of-the-art approaches under different missing conditions on the Caltech101-7 dataset, shown in panels (a) to (c). The performance of each approach is plotted as a line chart. Our approach consistently outperforms other approaches under different missing rates.

Comparison with different missing rates

As depicted in Figure 4, URRL-IMVC consistently outperformed other approaches, achieving the best clustering performance regardless of the missing rate ($m_r$). Our approach was also more stable than the other methods, with accuracy decreasing gradually as the missing rate increased. In contrast, other approaches fluctuated more, rendering their results less predictable. Notably, DCP and Completer experienced a significant decline in performance when the missing rate reached 0.75, as they train their cross-view contrastive and recovery networks using only complete samples. With so few complete samples available, the insufficient training data led to unsatisfactory recovery outcomes and fragile representations for clustering. Our approach instead focuses on the robustness of the unified representation, which allows it to circumvent these limitations and achieve stable and high performance across varying missing rates.

4.3. Ablation Studies

Unless otherwise specified, the experiments were conducted on the Caltech101-7 dataset with the missing rate $m_r = 0.5$. In certain experiments, the Clustering Module was disabled to provide clearer observations of specific phenomena. Additional ablation studies regarding detailed designs and hyperparameters (e.g., output choice, $k$ in KNN Imputation, view dropout probability) can be found in Appendix B.4.

4.3.1. Ablation on Modules

Table 3. Ablation study on our designed modules. We begin with the baseline model, which is a simple Transformer-based auto-encoder. Then different combinations of modules are incorporated to evaluate their contributions. “KNNI”: KNN Imputation; “Aug”: data augmentation and robustness loss; “CDPE&TAM”: CDPE and TAM described in NDE and VDE; “CM”: clustering module and clustering loss.
KNNI Aug CDPE&TAM CM Acc(%) NMI(%) ARI(%)
76.93 ± 4.35 | 64.79 ± 2.18 | 56.52 ± 3.99
83.68 ± 2.79 | 72.76 ± 2.21 | 68.60 ± 4.48
83.96 ± 3.11 | 74.39 ± 2.24 | 70.41 ± 3.63
85.60 ± 6.52 | 78.10 ± 4.46 | 75.29 ± 7.14
84.77 ± 3.59 | 74.92 ± 2.54 | 71.76 ± 3.69
82.05 ± 3.71 | 70.74 ± 3.44 | 63.92 ± 6.82
89.95 ± 0.74 | 80.01 ± 0.96 | 79.55 ± 1.19
89.39 ± 3.12 | 81.74 ± 1.81 | 80.60 ± 2.76
85.46 ± 1.42 | 75.25 ± 2.02 | 71.69 ± 2.44
90.22 ± 5.20 | 83.21 ± 3.89 | 82.15 ± 5.67
91.90 ± 2.99 | 84.29 ± 2.06 | 83.76 ± 3.96
93.35 ± 0.37 | 86.50 ± 0.61 | 86.25 ± 0.64

In Table 3, we present the results of our ablation study on the main modules we designed. First, our unified auto-encoder framework sets a solid baseline. Second, our robustness strategies, KNN Imputation (KNNI) and Data Augmentation (Aug), significantly improve clustering performance and contribute approximately equally. The Clustering Module (CM) also plays a vital role on some datasets by learning clustering-friendly representations. However, applying it directly can result in unstable performance, as the DEC-based training is sensitive to initialization; our tailored components, i.e., CDPE&TAM, help stabilize the learning. To summarize, the ablation study on modules reveals that KNN Imputation, Augmentation, and the Clustering Module are the three key components for improving clustering performance, while CDPE&TAM is essential for stability.

4.3.2. Visualization

Figure 5. T-SNE visualization of the embeddings during the training process on the Caltech101-7 dataset. The iteration number and corresponding accuracy are recorded below each sub-figure: (a) raw data (38.43); (b) 200 iterations (73.43); (c) 1600 iterations (89.86); (d) 2200 iterations (87.86); (e) 2400 iterations (90.14); (f) 4400 iterations (94.14). The training process consists of 4400 iterations, with the Clustering Module initialized at 2200 iterations. The first half of iterations capture the inherent data structure and provide a good initialization for the Clustering Module.

Figure 5 presents a T-SNE visualization of the embeddings during one training process. Initially, in Figure 5a, the multi-view raw data are concatenated as embeddings, and the visualization appears disorganized. After 200 iterations of training (Figure 5b), inherent structures start to be captured, and the pre-training peak accuracy (89.86) occurs at 1600 iterations (Figure 5c). At 2200 iterations (Figure 5d), the Clustering Module is initialized, and joint training with the DEC-based clustering loss commences. The clusters become more compact after 200 iterations of joint training, as depicted in Figure 5e. Finally, at the end of training (Figure 5f), the clusters are very compact, numerous samples that were initially clustered incorrectly with low confidence have been corrected, and the accuracy reaches 94.14.

5. Conclusion

In this paper, we proposed URRL-IMVC, a novel unified and robust representation learning framework for the incomplete multi-view clustering task. By leveraging an attention-based auto-encoder framework, we successfully fuse the multi-view information into a unified embedding, offering a more comprehensive solution compared to potentially limiting cross-view contrastive learning. Through the utilization of KNN imputation and data augmentation strategies, we directly acquire robust embeddings that effectively handle the view-missing condition, eliminating the need for explicit missing view recovery and its associated computation and unreliability. Furthermore, incremental improvements, such as the Clustering Module and customization of the Encoder, enhance the clustering stability and performance, achieving state-of-the-art results. This improved robust and unified representation learning framework acts as a powerful tool for addressing the challenges of IMVC and provides valuable insights for future research in this domain.

Acknowledgements.
This work was supported by the Zhejiang Provincial Natural Science Foundation of China under Grant No. LDT23F01013F01, and by the Fundamental Research Funds for the Central Universities.

References

  • Amini et al. (2009) Massih R. Amini, Nicolas Usunier, and Cyril Goutte. 2009. Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization. In Advances in Neural Information Processing Systems, Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, and A. Culotta (Eds.), Vol. 22. Curran Associates, Inc.
  • Cai et al. (2013) Xiao Cai, Feiping Nie, and Heng Huang. 2013. Multi-view k-means clustering on big data. In Twenty-Third International Joint conference on artificial intelligence.
  • Cai et al. (2012) Xiao Cai, Hua Wang, Heng Huang, and Chris Ding. 2012. Joint stage recognition and anatomical annotation of drosophila gene expression patterns. Bioinformatics 28, 12 (2012), i16–i24.
  • Chao et al. (2021) Guoqing Chao, Shiliang Sun, and Jinbo Bi. 2021. A Survey on Multiview Clustering. IEEE Transactions on Artificial Intelligence 2, 2 (2021), 146–168. https://doi.org/10.1109/TAI.2021.3065894
  • Chen et al. (2022) Man-Sheng Chen, Jia-Qi Lin, Xiang-Long Li, Bao-Yu Liu, Chang-Dong Wang, Dong Huang, and Jian-Huang Lai. 2022. Representation learning in multi-view clustering: A literature review. Data Science and Engineering 7, 3 (2022), 225–241.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs.CL]
  • Duin (2023) Robert Duin. 2023. Multiple Features. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C5HC70.
  • Fei-Fei and Perona (2005) L. Fei-Fei and P. Perona. 2005. A Bayesian hierarchical model for learning natural scene categories. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), Vol. 2. 524–531 vol. 2. https://doi.org/10.1109/CVPR.2005.16
  • Fu et al. (2020) Lele Fu, Pengfei Lin, Athanasios V Vasilakos, and Shiping Wang. 2020. An overview of recent multi-view clustering. Neurocomputing 402 (2020), 148–161.
  • Gao et al. (2016) Hang Gao, Yuxing Peng, and Songlei Jian. 2016. Incomplete Multi-view Clustering. In Intelligent Information Processing VIII, Zhongzhi Shi, Sunil Vadera, and Gang Li (Eds.). Springer International Publishing, Cham, 245–255.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger (Eds.), Vol. 27. Curran Associates, Inc.
  • He et al. (2022) Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. 2022. Masked Autoencoders Are Scalable Vision Learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 16000–16009.
  • He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  • Huang et al. (2014) Peihao Huang, Yan Huang, Wei Wang, and Liang Wang. 2014. Deep Embedding Network for Clustering. In 2014 22nd International Conference on Pattern Recognition. 1532–1537. https://doi.org/10.1109/ICPR.2014.272
  • Jin et al. (2023) Jiaqi Jin, Siwei Wang, Zhibin Dong, Xinwang Liu, and En Zhu. 2023. Deep Incomplete Multi-View Clustering With Cross-View Partial Sample and Prototype Alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 11600–11609.
  • Li et al. (2023) Haobin Li, Yunfan Li, Mouxing Yang, Peng Hu, Dezhong Peng, and Xi Peng. 2023. Incomplete Multi-view Clustering via Prototype-based Imputation. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-23, Edith Elkind (Ed.). International Joint Conferences on Artificial Intelligence Organization, 3911–3919. https://doi.org/10.24963/ijcai.2023/435 Main Track.
  • Li et al. (2014) Shao-Yuan Li, Yuan Jiang, and Zhi-Hua Zhou. 2014. Partial Multi-View Clustering. Proceedings of the AAAI Conference on Artificial Intelligence 28, 1 (Jun. 2014). https://doi.org/10.1609/aaai.v28i1.8973
  • Lin et al. (2022) Fangfei Lin, Bing Bai, Kun Bai, Yazhou Ren, Peng Zhao, and Zenglin Xu. 2022. Contrastive multi-view hyperbolic hierarchical clustering. arXiv preprint arXiv:2205.02618 (2022).
  • Lin et al. (2023) Yijie Lin, Yuanbiao Gou, Xiaotian Liu, Jinfeng Bai, Jiancheng Lv, and Xi Peng. 2023. Dual Contrastive Prediction for Incomplete Multi-View Representation Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 4 (2023), 4447–4461. https://doi.org/10.1109/TPAMI.2022.3197238
  • Lin et al. (2021) Yijie Lin, Yuanbiao Gou, Zitao Liu, Boyun Li, Jiancheng Lv, and Xi Peng. 2021. COMPLETER: Incomplete Multi-View Clustering via Contrastive Prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 11174–11183.
  • Liu et al. (2023) Chengliang Liu, Jie Wen, Zhihao Wu, Xiaoling Luo, Chao Huang, and Yong Xu. 2023. Information Recovery-Driven Deep Incomplete Multiview Clustering Network. IEEE Transactions on Neural Networks and Learning Systems (2023), 1–11. https://doi.org/10.1109/TNNLS.2023.3286918
  • Liu et al. (2017) Xinwang Liu, Miaomiao Li, Lei Wang, Yong Dou, Jianping Yin, and En Zhu. 2017. Multiple Kernel k-Means with Incomplete Kernels. Proceedings of the AAAI Conference on Artificial Intelligence 31, 1 (Feb. 2017). https://doi.org/10.1609/aaai.v31i1.10893
  • Nguyen et al. (2021) Xuan-Bac Nguyen, Duc Toan Bui, Chi Nhan Duong, Tien D. Bui, and Khoa Luu. 2021. Clusformer: A Transformer Based Clustering Approach to Unsupervised Large-Scale Face and Visual Landmark Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 10847–10856.
  • Tang and Liu (2022) Huayi Tang and Yong Liu. 2022. Deep Safe Incomplete Multi-view Clustering: Theorem and Algorithm. In Proceedings of the 39th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 162), Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (Eds.). PMLR, 21090–21110.
  • Trosten et al. (2023) Daniel J. Trosten, Sigurd Løkse, Robert Jenssen, and Michael C. Kampffmeyer. 2023. On the Effects of Self-Supervision and Contrastive Alignment in Deep Multi-View Clustering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 23976–23985.
  • Tu et al. (2021) Wenxuan Tu, Sihang Zhou, Xinwang Liu, Xifeng Guo, Zhiping Cai, En Zhu, and Jieren Cheng. 2021. Deep Fusion Clustering Network. Proceedings of the AAAI Conference on Artificial Intelligence 35, 11 (May 2021), 9978–9987. https://doi.org/10.1609/aaai.v35i11.17198
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc.
  • Vincent et al. (2008) Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and Composing Robust Features with Denoising Autoencoders. In Proceedings of the 25th International Conference on Machine Learning (Helsinki, Finland) (ICML ’08). Association for Computing Machinery, New York, NY, USA, 1096–1103. https://doi.org/10.1145/1390156.1390294
  • Wang et al. (2018) Qianqian Wang, Zhengming Ding, Zhiqiang Tao, Quanxue Gao, and Yun Fu. 2018. Partial Multi-view Clustering via Consistent GAN. In 2018 IEEE International Conference on Data Mining (ICDM). 1290–1295. https://doi.org/10.1109/ICDM.2018.00174
  • Wang et al. (2021) Qianqian Wang, Zhengming Ding, Zhiqiang Tao, Quanxue Gao, and Yun Fu. 2021. Generative Partial Multi-View Clustering With Adaptive Fusion and Cycle Consistency. IEEE Transactions on Image Processing 30 (2021), 1771–1783. https://doi.org/10.1109/TIP.2020.3048626
  • Wang et al. (2022) Yiming Wang, Dongxia Chang, Zhiqiang Fu, Jie Wen, and Yao Zhao. 2022. Graph Contrastive Partial Multi-View Clustering. IEEE Transactions on Multimedia (2022), 1–12. https://doi.org/10.1109/TMM.2022.3210376
  • Wang et al. (2019) Zhongdao Wang, Liang Zheng, Yali Li, and Shengjin Wang. 2019. Linkage Based Face Clustering via Graph Convolution Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • Wen et al. (2023) Jie Wen, Zheng Zhang, Lunke Fei, Bob Zhang, Yong Xu, Zhao Zhang, and Jinxing Li. 2023. A Survey on Incomplete Multiview Clustering. IEEE Transactions on Systems, Man, and Cybernetics: Systems 53, 2 (2023), 1136–1149. https://doi.org/10.1109/TSMC.2022.3192635
  • Xie et al. (2016) Junyuan Xie, Ross Girshick, and Ali Farhadi. 2016. Unsupervised Deep Embedding for Clustering Analysis. In Proceedings of The 33rd International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 48), Maria Florina Balcan and Kilian Q. Weinberger (Eds.). PMLR, New York, New York, USA, 478–487.
  • Xu et al. (2019) Cai Xu, Ziyu Guan, Wei Zhao, Hongchang Wu, Yunfei Niu, and Beilei Ling. 2019. Adversarial Incomplete Multi-view Clustering. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19. International Joint Conferences on Artificial Intelligence Organization, 3933–3939. https://doi.org/10.24963/ijcai.2019/546
  • Xu et al. (2021) Jie Xu, Yazhou Ren, Huayi Tang, Xiaorong Pu, Xiaofeng Zhu, Ming Zeng, and Lifang He. 2021. Multi-VAE: Learning Disentangled View-Common and View-Peculiar Visual Representations for Multi-View Clustering. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 9234–9243.
  • Xu et al. (2022) Jie Xu, Huayi Tang, Yazhou Ren, Liang Peng, Xiaofeng Zhu, and Lifang He. 2022. Multi-Level Feature Learning for Contrastive Multi-View Clustering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 16051–16060.
  • Yang et al. (2020) Lei Yang, Dapeng Chen, Xiaohang Zhan, Rui Zhao, Chen Change Loy, and Dahua Lin. 2020. Learning to Cluster Faces via Confidence and Connectivity Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • Yang et al. (2023) Mouxing Yang, Yunfan Li, Peng Hu, Jinfeng Bai, Jiancheng Lv, and Xi Peng. 2023. Robust Multi-View Clustering With Incomplete Information. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 1 (2023), 1055–1069. https://doi.org/10.1109/TPAMI.2022.3155499
  • Zhang and Zhu (2023) Yanhao Zhang and Changming Zhu. 2023. Incomplete multi-view clustering via attention-based contrast learning. International Journal of Machine Learning and Cybernetics (2023), 1–17.

Appendix A Method Appendix

The detailed KNN Imputation algorithm is described in Algorithm 1, and the training procedure is described in Algorithm 2.

Algorithm 1 Procedure of KNN Imputation

Input: Target $\bm{x}_i^{(v)}$, which is the $v$th view of the $i$th sample, dataset $X$, missing indicator matrix $M$, hyperparameter $k$.

1:  if $M_{iv} = 1$ then
2:     # The view exists
3:     Return $\bm{x}_i^{(v)}$'s KNN
4:  else
5:     # The view is missing
6:     $a = 0$, create empty KNN list
7:     while $a < k$ do
8:        # Traverse $k$ neighbors
9:        $b = 1$
10:        while $b \le V$ do
11:           # Traverse all views
12:           if $b = v$ or $M_{ib} = 0$ then
13:              # $b$th view of target sample $\bm{x}_i$ is missing
14:              pass
15:           else
16:              # $b$th view of target sample $\bm{x}_i$ exists
17:              Find $a$th nearest neighbor of $\bm{x}_i^{(b)}$, denoted as $\bm{x}_j^{(b)}$
18:              if $M_{jv} = 0$ then
19:                 # $v$th view of neighbor $\bm{x}_j$ is missing
20:                 pass
21:              else
22:                 # $v$th view of neighbor $\bm{x}_j$ exists
23:                 Add $\bm{x}_j^{(v)}$ to KNN list
24:              end if
25:           end if
26:           $b = b + 1$
27:        end while
28:        $a = a + 1$
29:     end while
30:     if KNN list length $< k$ then
31:        Pad KNN list with zeros to length $k$
32:     else
33:        Choose top $k$ from KNN list
34:     end if
35:     Return KNN list
36:  end if

Output: KNN Imputation $\bar{\bm{x}}_i^{(v)}$
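
The procedure in Algorithm 1 can be sketched in NumPy as follows. This is an illustrative reading rather than the authors' implementation: it assumes Euclidean distance, searches neighbors only among samples for which the reference view exists, excludes the target sample from its own neighbor list, and, like the pseudocode, does not remove a neighbor that is reached through more than one view. The names `ranked_neighbors` and `knn_impute` are hypothetical, and indices are 0-based.

```python
import numpy as np


def ranked_neighbors(X_b, available, i):
    """Indices of samples having this view, sorted by distance to sample i."""
    candidates = np.flatnonzero(available)
    candidates = candidates[candidates != i]
    dists = np.linalg.norm(X_b[candidates] - X_b[i], axis=1)
    return candidates[np.argsort(dists, kind="stable")]


def knn_impute(X, M, i, v, k):
    """Return a (k, d_v) array of neighbor features for view v of sample i.

    X: list of V arrays, X[b] of shape (N, d_b); M: (N, V) array, 1 = view exists.
    """
    d_v = X[v].shape[1]
    if M[i, v] == 1:
        # The view exists: take its own nearest neighbors within view v.
        order = ranked_neighbors(X[v], M[:, v] == 1, i)[:k]
        knn = [X[v][j] for j in order]
    else:
        # The view is missing: rank neighbors in every other existing view.
        rankings = {
            b: ranked_neighbors(X[b], M[:, b] == 1, i)
            for b in range(len(X))
            if b != v and M[i, b] == 1
        }
        knn = []
        for a in range(k):  # traverse the a-th nearest neighbor
            for order in rankings.values():  # traverse all usable views
                if a < order.size and M[order[a], v] == 1:
                    knn.append(X[v][order[a]])
        knn = knn[:k]  # choose top k
    # Pad with zeros when fewer than k neighbors were collected.
    out = np.zeros((k, d_v))
    if knn:
        out[: len(knn)] = np.stack(knn)
    return out


# Toy data: 4 samples, 2 views; view 1 is missing for samples 0 and 3.
X = [
    np.array([[0.0], [1.0], [2.0], [10.0]]),
    np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [10.0, 10.0]]),
]
M = np.array([[1, 0], [1, 1], [1, 1], [1, 0]])
imputed = knn_impute(X, M, i=0, v=1, k=3)
```

In the toy call, sample 0 borrows view 1 from its two nearest neighbors in view 0 (samples 1 and 2), the third neighbor lacks view 1, and the remaining slot is zero-padded.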

Algorithm 2 Training process of URRL-IMVC

Input: Dataset $X$, missing indicator matrix $M$, hyperparameters.

1:  Initialize model parameters $\bm{\theta}_E$, $\bm{\theta}_D$, and $\bm{\theta}_C$. Pre-compute KNN-search results. epoch = 0, iteration per epoch = $I_e$
2:  while $epoch < E_p$ do
3:     # Stage 1: Pre-training
4:     iteration = 0
5:     while $iteration < I_e$ do
6:        Pre-process: KNN Imputation and Data Augmentation by Equations 17 and 18, and obtain processed data $\bar{\bm{x}}$, $\bar{\bm{x}}'$ and processed mask $\bar{\bm{m}}$, $\bar{\bm{m}}'$.
7:        Forward: network forward by Equations 2 and 3, and obtain embedding $\bm{z}'$, $\bm{z}$ and reconstruction $\hat{\bm{x}}'$
8:        Loss: Compute loss by Equation 19, in which $L_{clu}$ is ignored, i.e., $\lambda_2 = 0$.
9:        Backward: Loss backward and update model parameters $\bm{\theta}_E$ and $\bm{\theta}_D$.
10:        iteration = iteration + 1
11:     end while
12:     epoch = epoch + 1
13:  end while
14:  Initialize cluster centers $\bm{\theta}_C$.
15:  while $epoch < E_p + E_j$ do
16:     # Stage 2: Joint Training
17:     iteration = 0
18:     while $iteration < I_e$ do
19:        Pre-process: KNN Imputation and Data Augmentation by Equations 17 and 18, and obtain processed data $\bar{\bm{x}}$, $\bar{\bm{x}}'$ and processed mask $\bar{\bm{m}}$, $\bar{\bm{m}}'$.
20:        Forward: network forward by Equations 2 and 3, and obtain embedding $\bm{z}'$, $\bm{z}$, reconstruction $\hat{\bm{x}}'$, and clustering result $\bm{c}'$
21:        Loss: Compute loss by Equation 19.
22:        Backward: Loss backward and update model parameters $\bm{\theta}_E$, $\bm{\theta}_D$, and $\bm{\theta}_C$.
23:        iteration = iteration + 1
24:     end while
25:     epoch = epoch + 1
26:  end while

Output: Model parameters $\bm{\theta}_E$, $\bm{\theta}_D$, $\bm{\theta}_C$, final clustering result $\bm{c}$
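
The two-stage control flow of Algorithm 2 can be sketched as below. This is only a schematic under stated assumptions: the callables `train_step` and `init_cluster_centers` are hypothetical placeholders for the pre-process/forward/loss/backward steps and for the cluster-center initialization, and the default `lambda2=0.1` is the joint-training weight reported in Appendix B.1.

```python
def train_two_stage(E_p, E_j, I_e, train_step, init_cluster_centers, lambda2=0.1):
    """Run E_p pre-training epochs followed by E_j joint-training epochs.

    train_step(lambda2, update_centers) is a hypothetical callable that performs
    one iteration; init_cluster_centers() is a hypothetical callable as well.
    """
    epoch = 0
    # Stage 1: pre-training, clustering loss ignored (lambda2 = 0).
    while epoch < E_p:
        for _ in range(I_e):
            train_step(lambda2=0.0, update_centers=False)
        epoch += 1
    # Cluster centers are initialized once, after pre-training.
    init_cluster_centers()
    # Stage 2: joint training with the clustering loss enabled.
    while epoch < E_p + E_j:
        for _ in range(I_e):
            train_step(lambda2=lambda2, update_centers=True)
        epoch += 1
    return epoch


# Record the schedule instead of training a real model.
log = []
final_epoch = train_two_stage(
    E_p=2,
    E_j=1,
    I_e=3,
    train_step=lambda lambda2, update_centers: log.append((lambda2, update_centers)),
    init_cluster_centers=lambda: log.append("init"),
)
```

Setting `E_j=0` reproduces the case where the second stage is skipped: the loop body of stage 2 never runs.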

Appendix B Experiments Appendix

B.1. Implementation Details

We set most of the hyperparameters empirically with grid search, and the same setting is used for all experiments unless otherwise mentioned. $\gamma$ in Equation 13 is set to -10. The hyperparameter $k$ in KNN Imputation is set to 4. $\phi_2$ and $\phi_3$, the data augmentation hyperparameters in Equation 18, are fixed at 0.05, while $\phi_1$ in Equation 17, which controls the view dropout probability, grows with the actual missing rate of the dataset and is defined as

(23) $\phi_{1} = \epsilon + (1-\epsilon) \times \left(1 - \frac{\sum_{i=0}^{N}\sum_{j=0}^{V} M_{ij}}{N \times V}\right)^{2}$

in which we set $\epsilon = 0.15$. The loss weight hyperparameters $\lambda_{1}$ and $\lambda_{2}$ are set to 0.001 and 0.1 respectively. The embedding dimension $d_{e}$ is set to 256. The batch size $B$ is set to 64 for both training and testing, and the learning rate is fixed at 3e-4 throughout training. A small weight decay of 4e-5 is used to reduce over-fitting. The training epoch parameter $E_{p}$ (Section 3.3.2) is set to 100, 100, 15, and 50 respectively for the four datasets in Table 1 to maintain roughly the same number of training iterations. As for $E_{j}$, we found that training with the DEC-based loss can diverge on some datasets (ALOI_Deep and Scene15 in this paper), possibly due to imbalanced cluster sizes. For these datasets, we simply skip the second stage's joint training, i.e., $E_{j} = 0$, while for the other datasets $E_{j} = E_{p}$.

PReLU (He et al., 2015) is used as the activation function in the VDE and the Decoder. Dropout is not used in any module of our network. Agglomerative clustering with “ward” linkage is used to initialize cluster centers in the Clustering Module.
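As a rough illustration of equation 23, the following sketch computes $\phi_{1}$ from a missing indicator matrix. It assumes that $M_{ij}=1$ marks an available view and $M_{ij}=0$ a missing one (the convention under which a complete dataset yields $\phi_{1}=\epsilon$), and that the double sum runs over all $N \times V$ entries. The function name `view_dropout_probability` and the toy matrix are hypothetical and not taken from any released code.

```python
import numpy as np


def view_dropout_probability(mask, eps=0.15):
    """Sketch of equation 23: map the observed missing ratio to phi_1.

    mask: (N, V) array; 1 = view available, 0 = view missing (assumed convention).
    eps:  lower bound of phi_1, reached when no view is missing.
    """
    mask = np.asarray(mask, dtype=float)
    # Fraction of missing entries, i.e. 1 - sum(M) / (N * V).
    missing_ratio = 1.0 - mask.mean()
    return eps + (1.0 - eps) * missing_ratio ** 2


# Hypothetical toy case: 10 samples, 5 views, 3 views missing in every sample.
toy_mask = np.ones((10, 5))
toy_mask[:, :3] = 0
phi_complete = view_dropout_probability(np.ones((10, 5)))
phi_toy = view_dropout_probability(toy_mask)
```

For a complete dataset the sketch returns $\epsilon = 0.15$; for the toy matrix, where 60% of the entries are missing, it returns approximately 0.456. The quadratic term keeps $\phi_{1}$ close to $\epsilon$ at low missing ratios and lets it grow faster as more views are missing.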

Table 4. The ablation study on the view dropout augmentation probability $\phi_{1}$ from equation 17. We use grid search to determine the best value range of $\phi_{1}$ under different missing rates $m_{r}$ and design a mapping function from the actual missing rate to the desired $\phi_{1}$. Generally speaking, a larger missing rate requires a larger view dropout probability for augmentation. The $\phi_{1}$ value from the designed mapping function, equation 23, is listed in the last column of the table.
| Parameter | $\phi_{1}=0$ | $\phi_{1}=0.15$ | $\phi_{1}=0.3$ | $\phi_{1}=0.45$ | $\phi_{1}=0.6$ | Equation 23 |
| --- | --- | --- | --- | --- | --- | --- |
| $m_{r}=0.00$ | $89.30 \pm 1.86$ | $89.50 \pm 1.77$ | $88.91 \pm 1.89$ | $\mathbf{89.71 \pm 1.19}$ | $86.94 \pm 2.12$ | $\phi_{1}=0.15$ |
| $m_{r}=0.25$ | $86.15 \pm 2.42$ | $88.12 \pm 1.90$ | $\mathbf{88.56 \pm 1.55}$ | $86.51 \pm 3.33$ | $83.57 \pm 4.74$ | $\phi_{1}=0.17$ |
| $m_{r}=0.50$ | $83.35 \pm 1.94$ | $\mathbf{86.12 \pm 1.57}$ | $85.61 \pm 2.08$ | $83.25 \pm 3.86$ | $82.25 \pm 4.82$ | $\phi_{1}=0.23$ |
| $m_{r}=0.75$ | $81.41 \pm 2.79$ | $81.78 \pm 3.86$ | $\mathbf{82.89 \pm 4.51}$ | $80.82 \pm 3.58$ | $77.17 \pm 5.63$ | $\phi_{1}=0.32$ |
| $m_{r}=1.00$ | $76.31 \pm 3.00$ | $77.53 \pm 3.63$ | $77.02 \pm 3.91$ | $\mathbf{80.21 \pm 2.94}$ | $79.19 \pm 3.67$ | $\phi_{1}=0.46$ |

Table 5. Comparison between our approach and a cross-view contrastive learning-based approach (CPSPAN) on the Caltech101-7 dataset with different numbers of views. CPSPAN achieves its best results with 4 views, whereas our approach achieves its best results with 5 views.
| Views | CPSPAN (Jin et al., 2023) Acc(%) | CPSPAN NMI(%) | CPSPAN ARI(%) | URRL-IMVC (ours) Acc(%) | URRL-IMVC NMI(%) | URRL-IMVC ARI(%) |
| --- | --- | --- | --- | --- | --- | --- |
| 2 | $50.88 \pm 1.87$ | $45.27 \pm 2.50$ | $35.79 \pm 2.25$ | $58.36 \pm 3.01$ | $47.16 \pm 2.50$ | $39.40 \pm 2.71$ |
| 3 | $73.17 \pm 4.27$ | $61.40 \pm 4.29$ | $55.37 \pm 5.46$ | $77.60 \pm 0.88$ | $67.61 \pm 0.98$ | $63.97 \pm 1.33$ |
| 4 | $84.89 \pm 2.15$ | $75.37 \pm 2.45$ | $71.79 \pm 3.26$ | $91.73 \pm 0.47$ | $83.57 \pm 0.68$ | $83.26 \pm 0.76$ |
| 5 | $77.62 \pm 4.74$ | $69.70 \pm 4.04$ | $63.23 \pm 5.51$ | $92.95 \pm 2.60$ | $86.29 \pm 1.76$ | $86.02 \pm 2.91$ |

B.2. Datasets and Experiments Setting

Our chosen datasets vary in views (2–6), clusters (7–100), samples (1400–18758), modalities (image/text), and feature types (deep/hand-crafted), providing a comprehensive evaluation of approaches. Two parameters, the missing number $m_{n}$ and the missing rate $m_{r}$, are defined to control the missing conditions. We first select $N \times m_{r}$ samples as incomplete samples, then randomly select $m_{n}$ views of each incomplete sample as missing views. We fix $m_{n}$ and vary $m_{r}$ in our experiments; $m_{n}$ is fixed at 4, 3, 2, 1, 1, 1 for the 6 datasets respectively. Importantly, within the same set of experiments, we kept the input data and the missing indicator matrix identical across different methods to ensure fair comparisons. For the comparison with state-of-the-art methods in Table 2, we reproduce the results with their published code. Several prior works are difficult to adapt to different numbers of views, which could hinder real applications. We randomly select views when the dataset has more views than the model requires.
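The missing-view protocol described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `make_missing_mask`, the rounding of $N \times m_{r}$ to the nearest integer, the uniform sampling, and the 1 = available / 0 = missing encoding are all choices made here for the example.

```python
import numpy as np


def make_missing_mask(n_samples, n_views, missing_rate, missing_number, seed=0):
    """Build an (N, V) indicator matrix: 1 = view available, 0 = view missing.

    missing_rate   (m_r): fraction of samples that become incomplete.
    missing_number (m_n): number of views removed from each incomplete sample.
    """
    if not 0 <= missing_number < n_views:
        raise ValueError("each sample must keep at least one view")
    rng = np.random.default_rng(seed)
    mask = np.ones((n_samples, n_views), dtype=np.int64)
    # Step 1: pick N * m_r samples to be incomplete (rounding is an assumption).
    n_incomplete = int(round(n_samples * missing_rate))
    incomplete = rng.choice(n_samples, size=n_incomplete, replace=False)
    # Step 2: for each incomplete sample, drop m_n randomly chosen views.
    for i in incomplete:
        dropped = rng.choice(n_views, size=missing_number, replace=False)
        mask[i, dropped] = 0
    return mask


# Hypothetical setting: 1000 samples, 5 views, m_r = 0.5, m_n = 3.
mask = make_missing_mask(n_samples=1000, n_views=5, missing_rate=0.5, missing_number=3)
views_per_sample = mask.sum(axis=1)
```

Fixing the seed and reusing the resulting matrix for every compared method mirrors the requirement that the missing indicator matrix stays identical within a set of experiments.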

B.3. Comparison with State-of-the-art Methods

B.3.1. Comparison with a different number of views

As mentioned in the introduction, the effectiveness of the cross-view contrastive learning strategy diminishes as the information shared between views decreases. As Table 5 shows, adding more views may harm the clustering performance of the cross-view contrastive learning-based approach, which supports this point and is consistent with the theoretical analysis in (Trosten et al., 2023). In contrast, our approach consistently benefits from more views in the dataset, overcoming this drawback.

B.4. Ablation Studies

Table 6. Ablation test on the output choice of VDE and NDE. "Mean" represents using the average of all output vectors from the Transformer as output. $1^{st}$ represents using the first output vector, $2^{nd}$ represents the second output vector, and so on. "Concat+Linear" represents first concatenating the output vectors and then using a linear layer to map the new vector to the desired dimension.

| Output/Module | NDE | VDE |
| --- | --- | --- |
| $1^{st}$ | $93.36 \pm 0.89$ | $89.73 \pm 3.26$ |
| $2^{nd}$ | $91.87 \pm 1.01$ | $88.75 \pm 3.56$ |
| $3^{rd}$ | $90.47 \pm 3.15$ | $87.23 \pm 3.67$ |
| $4^{th}$ | $89.36 \pm 4.33$ | $89.94 \pm 4.15$ |
| $5^{th}$ | - | $87.14 \pm 3.88$ |
| Mean | $90.09 \pm 3.61$ | $93.36 \pm 0.98$ |
| Concat+Linear | $83.23 \pm 7.16$ | $91.13 \pm 3.38$ |

B.4.1. Ablation on Output Choice

We conducted the ablation test in Table 6 to find the best output choice for both the Neighbor Dimensional Encoder (NDE, Section 3.1.1) and the View Dimensional Encoder (VDE, Section 3.1.2). For the NDE, choosing the first vector of the Transformer output sequence significantly outperforms the other choices, and performance degrades steadily as later output vectors are used. This is consistent with our view that the NDE needs a bias toward the most confident input (the center sample or the nearest neighbor), while farther neighbors contain more noise that harms the final performance. For the VDE the situation is different: using the average of all output vectors outperforms the other choices, which is consistent with our view that the VDE needs to be unbiased. Concatenation followed by a linear layer does not perform well in either Encoder, possibly due to a lack of supervision.
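The two output choices that perform best can be sketched on a plain array standing in for a Transformer output sequence. This is only an illustration: the helper `select_output`, the `(sequence_length, d)` layout, and the toy values are assumptions made here, and the Transformer itself is not modeled.

```python
import numpy as np


def select_output(seq, mode):
    """Reduce a (sequence_length, d) output sequence to a single d-dim vector.

    mode="first": keep the first output vector (biased choice, best for the NDE).
    mode="mean":  average all output vectors (unbiased choice, best for the VDE).
    """
    seq = np.asarray(seq, dtype=float)
    if mode == "first":
        return seq[0]
    if mode == "mean":
        return seq.mean(axis=0)
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical sequence of 4 output vectors with dimension 3.
toy_seq = np.arange(12, dtype=float).reshape(4, 3)
nde_style = select_output(toy_seq, "first")
vde_style = select_output(toy_seq, "mean")
```

Taking the first vector ties the output to one privileged position in the input sequence, whereas averaging treats every position equally, which matches the biased versus unbiased distinction drawn above.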

B.4.2. Ablation on View Dropout Probability $\phi_{1}$

We conducted a grid search to determine the best value range of the view dropout augmentation probability $\phi_{1}$ under different missing rates $m_{r}$, and the results are shown in Table 4. For the view-complete condition, all $\phi_{1} < 0.6$ perform similarly, while for view-incomplete conditions the desired $\phi_{1}$ increases with the missing rate $m_{r}$. Based on this observation, we designed the mapping function in equation 23 to follow this ascending trend, and its value is listed in the last column of the table.

Table 7. Ablation test on the hyperparameter $k$ in KNN imputation. The result is unimodal, peaking at $k=4$. Larger $k$ values tend to provide more stable results (smaller standard deviation).

| $k=1$ | $k=2$ | $k=4$ | $k=8$ | $k=16$ |
| --- | --- | --- | --- | --- |
| $83.84 \pm 3.38$ | $84.84 \pm 2.82$ | $\mathbf{87.31 \pm 2.01}$ | $87.00 \pm 2.46$ | $86.01 \pm 1.28$ |

B.4.3. Ablation on hyperparameter $k$ for KNN

We conduct an ablation study on $k$ in KNN Imputation (Section 3.2.1) to examine its effect (Table 7). A large improvement can be observed when comparing $k=4$ with $k=1$. However, the performance starts to drop for $k>4$, which we infer is caused by the noise introduced by farther neighbors. On the other hand, a larger $k$ also seems to benefit the stability of clustering.