DM-MIMO: Diffusion Models for Robust Semantic Communications over MIMO Channels

Yiheng Duan, Tong Wu, Zhiyong Chen and Meixia Tao
Cooperative Medianet Innovation Center, Shanghai Jiao Tong University, Shanghai, China
Emails: {duanyiheng, wu_tong, zhiyongchen, mxtao}@sjtu.edu.cn
This work is supported by the NSF of China under grants 62125108 and 62222111.
Abstract

This paper investigates robust semantic communications over multiple-input multiple-output (MIMO) fading channels. Current semantic communications over MIMO channels mainly focus on channel-adaptive encoding and decoding, leaving the signal distribution largely unexplored. To leverage the potential of the signal distribution in signal space denoising, we develop a diffusion model over MIMO channels (DM-MIMO), a plug-in module at the receiver side used in conjunction with singular value decomposition (SVD) based precoding and equalization. Specifically, because the effective noise power varies significantly over distinct sub-channels, we determine the effective sampling steps accordingly and devise a joint sampling algorithm. Utilizing a three-stage training algorithm, DM-MIMO learns the distribution of the encoded signal, which enables noise elimination over all sub-channels. Experimental results demonstrate that DM-MIMO effectively reduces the mean square error (MSE) of the equalized signal and that the DM-MIMO semantic communication system (DM-MIMO-JSCC) outperforms the JSCC-based semantic communication system in image reconstruction.

Index Terms:
Semantic communications, multiple-input multiple-output (MIMO), diffusion models (DMs).

I Introduction

Recently, semantic communications have attracted extensive attention thanks to their great potential in improving transmission efficiency. By leveraging the rapid advancements in deep learning, semantic communications can adeptly extract and transmit meaningful semantic information through neural network (NN) based joint source-channel coding (JSCC), and have demonstrated superiority over traditional bit communications in various types of source transmissions [1, 2, 3]. Thus far, semantic communications are regarded as a highly promising technique for 6G wireless communication networks and beyond [4].

Despite the great potential of semantic communications, most existing works primarily focus on single-input single-output (SISO) channels. It is thus of great importance to investigate semantic communications over multiple-input multiple-output (MIMO) channels, given that MIMO has played a leading role in boosting channel capacity and transmission reliability since 3G wireless communications. The main distinction between SISO and MIMO channels in semantic communications lies in how to allocate semantic information over sub-channels with the aid of channel state information (CSI). To this end, in [5], the proposed DeepJSCC-MIMO adopts SVD-based precoding and equalization, constructing a channel-condition-based heatmap as an additional input for encoding and decoding. In [6], in addition to SVD-based precoding and equalization, channel and feature attention (CFA) modules are embedded in the JSCC encoder and decoder to adapt to MIMO channel conditions. However, beyond such channel-adaptive encoding and decoding, signal denoising also has the potential to further enhance the performance of semantic communications under MIMO channels, which requires further investigation.

As an advanced type of generative model, diffusion models (DMs) have not only achieved great success in image generation [7] but also shown advances in image [8] and audio [9] restoration. By introducing information decay to the source data through noise addition in the forward diffusion process, DMs are trained to learn this decay with NNs. With noise of different power captured in different steps, DMs are capable not only of sample generation but also of signal denoising. For example, channel denoising diffusion models (CDDM) were recently proposed to mitigate the impact of channel noise in SISO channels with an adaptive forward diffusion process and the corresponding reverse sampling process [10].

Figure 1: Architecture of DM-MIMO-JSCC.

Inspired by the above, we develop a diffusion model over MIMO channels (DM-MIMO) as a plug-in module at the receiver, eliminating noise and enhancing signal quality through learning the signal distribution, thus further improving the performance of semantic communication systems over MIMO channels. Through SVD-based precoding and equalization, MIMO channels are decomposed into parallel sub-channels, each with a different effective noise power. Existing DMs only consider a fixed noise power in each sampling step, thereby failing to adapt to varying channel conditions across sub-channels. Given the effective noise power over different sub-channels, DM-MIMO employs different effective sampling steps correspondingly. Based on these effective sampling steps, in order to maintain the correct distribution properties of the input of each sampling step, DM-MIMO applies a joint sampling algorithm, adjusting the equalized signal through either noise addition or the reverse sampling process. Moreover, employing a three-stage training algorithm, DM-MIMO is able to learn the distribution of the encoded signal, which enhances the performance of signal denoising and reduces power fluctuations. Utilizing these training and sampling algorithms, the proposed DM-MIMO enhances the robustness of the semantic communication system across a wide range of channel noise power. Additionally, as a plug-in module, DM-MIMO is independent of the structure of JSCC, allowing for flexible implementation in semantic communication systems.

We evaluate the performance of DM-MIMO through extensive experiments. With DM-MIMO, a significant reduction in the mean square error (MSE) between the decoder input signal and the encoded signal is achieved. This reduction indicates an enhancement in signal quality, which in turn improves image recovery. As a result, the DM-MIMO semantic communication system (DM-MIMO-JSCC) outperforms the existing JSCC-based semantic communication system in terms of peak signal-to-noise ratio (PSNR).

II Preliminary of DM and System Overview

II-A Preliminary of DM

DMs achieve data generation through a progressive denoising procedure. With the forward diffusion process gradually corrupting data by adding noise, the reverse sampling process of DMs performs the opposite procedure, producing samples sharing the same distribution as the source data. Specifically, for a given data $\mathbf{X}_{0}$ with distribution $q(\mathbf{X}_{0})$, the forward diffusion process is derived as a Markov chain, generating a sequence of random variables $\mathbf{X}_{1},\mathbf{X}_{2},\cdots,\mathbf{X}_{T}$ modeled by

$$q\left(\mathbf{X}_{1:T}|\mathbf{X}_{0}\right)\triangleq\prod_{t=1}^{T}q\left(\mathbf{X}_{t}|\mathbf{X}_{t-1}\right), \tag{1}$$

where $T$ is the number of diffusion steps, and $q(\mathbf{X}_{t}|\mathbf{X}_{t-1})$ denotes the conditional distribution of step $t$, formulated as

$$q(\mathbf{X}_{t}|\mathbf{X}_{t-1})\triangleq\mathcal{N}\left(\mathbf{X}_{t};\sqrt{\alpha_{t}}\mathbf{X}_{t-1},(1-\alpha_{t})\mathbf{I}\right). \tag{2}$$

Here, $\alpha_{t}\in(0,1)$ is the noise schedule chosen ahead of model training. The reverse sampling process starts by sampling pure Gaussian noise $\mathbf{X}_{T}$ consisting of independent and identically distributed (i.i.d.) elements with distribution $\mathcal{N}(0,1)$, and then gradually generates the target data by

$$p_{\boldsymbol{\theta}}(\mathbf{X}_{0})=p(\mathbf{X}_{T})\prod^{T}_{t=1}p_{\boldsymbol{\theta}}(\mathbf{X}_{t-1}|\mathbf{X}_{t}), \tag{3}$$

where $\boldsymbol{\theta}$ are the learnable parameters of DMs, and

$$p_{\boldsymbol{\theta}}(\mathbf{X}_{t-1}|\mathbf{X}_{t})=\mathcal{N}\left(\mathbf{X}_{t-1};\mu_{\boldsymbol{\theta}}(\mathbf{X}_{t},t),\Sigma_{\boldsymbol{\theta}}(\mathbf{X}_{t},t)\right). \tag{4}$$

Here, $\mu_{\boldsymbol{\theta}}$ and $\Sigma_{\boldsymbol{\theta}}$ denote the mean and variance of the conditional distribution $p_{\boldsymbol{\theta}}(\mathbf{X}_{t-1}|\mathbf{X}_{t})$ predicted by DMs. As such, DMs are capable of noise elimination. We can first simulate different noise addition processes by selecting different diffusion steps, and then adopt the corresponding sampling steps to recover the source samples. Leveraging the denoising capability of DMs, we propose a plug-in denoising module, DM-MIMO, detailed in Section III.
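As a rough numerical check of the forward process in (1) and (2), the sketch below runs the Markov chain on a single real scalar and compares the empirical mean and variance of the final state with $\sqrt{\bar{\alpha}_{T}}\,x_{0}$ and $1-\bar{\alpha}_{T}$, where $\bar{\alpha}_{T}=\prod_{t=1}^{T}\alpha_{t}$ is the cumulative product of the schedule. The linear schedule, its endpoints, and all function names are illustrative assumptions and are not taken from this paper.

```python
import math
import random


def linear_alpha_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Return [alpha_1, ..., alpha_T] with alpha_t = 1 - beta_t.

    The linear beta schedule and its endpoints are assumptions for illustration.
    """
    denom = max(T - 1, 1)
    return [1.0 - (beta_start + (beta_end - beta_start) * t / denom) for t in range(T)]


def forward_chain(x0, alphas, rng):
    """Apply Eq. (2) step by step: x_t = sqrt(alpha_t) x_{t-1} + sqrt(1 - alpha_t) eps."""
    x = x0
    for a in alphas:
        x = math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
    return x


def alpha_bar(alphas):
    """Cumulative product of the schedule over all supplied steps."""
    return math.prod(alphas)


if __name__ == "__main__":
    rng = random.Random(0)
    alphas = linear_alpha_schedule(100)
    ab = alpha_bar(alphas)
    x0 = 2.0
    samples = [forward_chain(x0, alphas, rng) for _ in range(2000)]
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    print("empirical mean:", mean, "closed form:", math.sqrt(ab) * x0)
    print("empirical var :", var, "closed form:", 1.0 - ab)
```

The agreement between the step-by-step chain and the closed-form moments is what allows a chosen diffusion step to stand in for a given amount of added noise.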

II-B System overview

Consider a semantic communication system performing image transmission over MIMO channels. For simplicity, we assume both the transmitter and receiver are equipped with $M$ antennas. The input image signal is represented by $\mathbf{S}\in\mathbb{R}^{h\times w\times 3}$, where $h$ and $w$ denote the height and width of the input image, respectively, and 3 is the number of color channels. The image is transmitted over $k$ channel uses, and the channel bandwidth ratio (CBR) is defined as $\text{CBR}=k/n$, where $n=3hw$.

As illustrated in Fig. 1, at the transmitter, the input image is first passed through a JSCC encoding function $f_{\boldsymbol{\phi}}$ parameterized by $\boldsymbol{\phi}$ to produce the encoded signal $\mathbf{Z}=[\mathbf{z}_{1},\cdots,\mathbf{z}_{M}]^{T}\in\mathbb{C}^{M\times k}$, expressed as $\mathbf{Z}=f_{\boldsymbol{\phi}}(\mathbf{S})$. The encoded signal $\mathbf{Z}$ is then mapped into the channel input signal $\mathbf{W}\in\mathbb{C}^{M\times k}$ via MIMO precoding. The power constraint $P_{s}$ is given by $\frac{1}{k}\left\|\mathbf{W}\right\|^{2}_{F}\leq P_{s}$.

We consider Rayleigh block fading MIMO channels. Let $\mathbf{H}\in\mathbb{C}^{M\times M}$ be the channel matrix, which remains unchanged over $k$ channel uses. The entries of $\mathbf{H}$ are i.i.d. random variables following the complex normal distribution $\mathcal{CN}(0,1)$. Then, the output signal of the MIMO channel can be written as

$$\mathbf{Y}=\mathbf{H}\mathbf{W}+\mathbf{N}, \tag{5}$$

where $\mathbf{N}\in\mathbb{C}^{M\times k}$ is the additive noise term that consists of i.i.d. elements with distribution $\mathcal{CN}(0,\sigma^{2})$, in which $\sigma^{2}$ denotes the channel noise power.

As CSI is accessible to both the transmitter and receiver, we adopt SVD-based precoding and equalization to leverage spatial multiplexing and mitigate inter-channel interference. The channel matrix $\mathbf{H}$ can be decomposed as

$$\mathbf{H}=\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{H}, \tag{6}$$

where $\mathbf{U}=[\mathbf{u}_{1},\cdots,\mathbf{u}_{M}]\in\mathbb{C}^{M\times M}$ and $\mathbf{V}\in\mathbb{C}^{M\times M}$ are unitary matrices, and $\boldsymbol{\Sigma}={\rm diag}[\lambda_{1},\cdots,\lambda_{M}]\in\mathbb{C}^{M\times M}$ is a diagonal matrix with singular values $\lambda_{1}\geq\cdots\geq\lambda_{M}$. Applying $\mathbf{V}$ as the precoder at the transmitter, we have $\mathbf{W}=\mathbf{V}\mathbf{Z}$, thus (5) becomes

$$\mathbf{Y}=\mathbf{H}\mathbf{V}\mathbf{Z}+\mathbf{N}=\mathbf{U}\boldsymbol{\Sigma}\mathbf{Z}+\mathbf{N}. \tag{7}$$

With the aid of the SVD-based equalizer, the equalized signal $\mathbf{Y}'=\boldsymbol{\Sigma}^{\dagger}\mathbf{U}^{H}\mathbf{Y}$ at the receiver can be expressed as

$$\mathbf{Y}'=\mathbf{Z}+\mathbf{N}', \tag{8}$$

where $\mathbf{N}'=\boldsymbol{\Sigma}^{\dagger}\mathbf{U}^{H}\mathbf{N}=[\mathbf{n}'_{1},\cdots,\mathbf{n}'_{M}]^{T}$, and $\boldsymbol{\Sigma}^{\dagger}={\rm diag}[\frac{1}{\lambda_{1}},\cdots,\frac{1}{\lambda_{M}}]$. Consequently, after employing SVD-based precoding and equalization, the MIMO channels are decomposed into $M$ parallel sub-channels. For sub-channel $i$, the effective noise power is $\sigma_{i}^{2}=\frac{\sigma^{2}}{\lambda_{i}^{2}}$.

In order to remove noise from the equalized signal $\mathbf{Y}'$, we introduce a DM-MIMO module $g_{\boldsymbol{\theta}}(\cdot)$ with parameters $\boldsymbol{\theta}$, represented as $\hat{\mathbf{Z}}=g_{\boldsymbol{\theta}}(\mathbf{Y}')$, detailed in Section III. Finally, the JSCC decoding function with parameters $\boldsymbol{\varphi}$ takes the denoised signal $\hat{\mathbf{Z}}\in\mathbb{C}^{M\times k}$ as input, reconstructing the input image as $\hat{\mathbf{S}}=f_{\boldsymbol{\varphi}}(\hat{\mathbf{Z}})$.
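The per-sub-channel model in (8) can be illustrated with a small simulation. The sketch below works directly in the rotated basis: because $\mathbf{U}$ is unitary, $\mathbf{U}^{H}\mathbf{N}$ has the same $\mathcal{CN}(0,\sigma^{2})$ statistics as $\mathbf{N}$, so row $i$ of $\mathbf{N}'$ is drawn as $\mathcal{CN}(0,\sigma^{2})$ noise scaled by $1/\lambda_{i}$. The singular values, the noise power, and the function names are hypothetical choices for illustration only.

```python
import math
import random


def complex_gaussian(variance, rng):
    """Draw one CN(0, variance) sample (real and imaginary parts each N(0, variance / 2))."""
    s = math.sqrt(variance / 2.0)
    return complex(rng.gauss(0.0, s), rng.gauss(0.0, s))


def effective_noise_powers(singular_values, noise_power):
    """sigma_i^2 = sigma^2 / lambda_i^2 for each sub-channel."""
    return [noise_power / (lam ** 2) for lam in singular_values]


def equalized_subchannels(Z, singular_values, noise_power, rng):
    """Simulate Y' = Z + N' row by row, as in Eq. (8).

    Assumption: the noise is generated after the unitary rotation U^H,
    which leaves i.i.d. CN(0, sigma^2) noise statistically unchanged.
    """
    Y = []
    for z_row, lam in zip(Z, singular_values):
        Y.append([z + complex_gaussian(noise_power, rng) / lam for z in z_row])
    return Y


if __name__ == "__main__":
    rng = random.Random(0)
    lams = [2.0, 0.5]          # hypothetical singular values, lambda_1 >= lambda_2
    sigma_sq = 0.1             # hypothetical channel noise power
    Z = [[0j] * 2000 for _ in lams]  # all-zero signal, so Y' contains only N'
    Y = equalized_subchannels(Z, lams, sigma_sq, rng)
    for i, row in enumerate(Y):
        measured = sum(abs(v) ** 2 for v in row) / len(row)
        print("sub-channel", i + 1, "measured:", measured,
              "predicted:", effective_noise_powers(lams, sigma_sq)[i])
```

With these example values the weaker sub-channel sees sixteen times the effective noise power of the stronger one, which is the kind of imbalance the denoiser has to cope with.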

III Design of DM-MIMO

To enhance the robustness of semantic communications, we propose DM-MIMO, a diffusion model eliminating noise over the equalized signal. Due to the high variations of effective noise power over sub-channels, we derive noise-power-aware effective sampling steps and devise a joint sampling algorithm.

III-A Analysis of sub-channel conditions

Considering block fading, disparities in channel conditions occur over distinct sub-channels. Moreover, under varying channel conditions, the JSCC encoder and decoder jointly learn semantic feature allocation across sub-channels along with semantic feature extraction. The fluctuating channel conditions and the different significance of semantics across parallel sub-channels are crucial factors to be considered in the design of the denoising module.

As $\mathbf{U}$ is a unitary matrix, the noise of different sub-channels in (8) follows an independent Gaussian distribution. Moreover, the effective noise power $\sigma_{i}^{2}$ in different sub-channels varies significantly due to its dependence on the channel singular value $\lambda_{i}$. To illustrate the difference in effective noise power over sub-channels, we perform a Monte Carlo experiment with $1\times 10^{7}$ samples of $2\times 2$ MIMO Rayleigh fading channels. As shown in Fig. 2, the gap between the expectation of $\lambda_{i}^{2}$ in the two sub-channels reaches $10.37$ dB, presenting a challenge in joint denoising design.
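A scaled-down version of this Monte Carlo experiment can be sketched without a linear-algebra library, since for a $2\times 2$ matrix the squared singular values follow from $\lambda_{1}^{2}+\lambda_{2}^{2}=\|\mathbf{H}\|_{F}^{2}$ and $\lambda_{1}^{2}\lambda_{2}^{2}=|\det\mathbf{H}|^{2}$. The text does not state how the gap is averaged, so the sketch reports two readings: the ratio of the linear means (analytically $10\log_{10}(3.5/0.5)\approx 8.45$ dB for this channel model) and the mean of the per-realization gap in dB, which should come out near the reported value. The sample count, seed, and function names are assumptions.

```python
import math
import random


def cn01(rng):
    """Draw one CN(0, 1) sample."""
    s = math.sqrt(0.5)
    return complex(rng.gauss(0.0, s), rng.gauss(0.0, s))


def squared_singular_values_2x2(rng):
    """Return (lambda_1^2, lambda_2^2) with lambda_1 >= lambda_2 for a random 2x2 channel."""
    a, b, c, d = (cn01(rng) for _ in range(4))
    trace = abs(a) ** 2 + abs(b) ** 2 + abs(c) ** 2 + abs(d) ** 2  # ||H||_F^2
    det_sq = abs(a * d - b * c) ** 2                               # |det H|^2
    disc = math.sqrt(max(trace * trace - 4.0 * det_sq, 0.0))
    large = (trace + disc) / 2.0
    small = det_sq / large  # numerically safer than (trace - disc) / 2
    return large, small


def subchannel_gap_stats(num_samples, seed=0):
    """Estimate E[lambda_i^2] and two versions of the gap in dB."""
    rng = random.Random(seed)
    sum_large = sum_small = sum_gap_db = 0.0
    for _ in range(num_samples):
        large, small = squared_singular_values_2x2(rng)
        sum_large += large
        sum_small += small
        sum_gap_db += 10.0 * math.log10(large / small)
    mean_large = sum_large / num_samples
    mean_small = sum_small / num_samples
    return {
        "mean_lambda1_sq": mean_large,
        "mean_lambda2_sq": mean_small,
        "gap_of_means_db": 10.0 * math.log10(mean_large / mean_small),
        "mean_gap_db": sum_gap_db / num_samples,
    }


if __name__ == "__main__":
    for name, value in subchannel_gap_stats(20000).items():
        print(name, value)
```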

With a significant gap in effective noise power between different sub-channels, the naive method of applying one sampling step for all sub-channels fails to eliminate noise effectively. Therefore, an individual effective sampling step needs to be employed for each sub-channel according to its effective noise power. Another naive method to remove noise is to employ a separate DM module for each sub-channel. This, however, fails to capture the joint distribution of the encoded signals over different sub-channels. Hence, based on the effective sampling steps, we propose a joint sampling algorithm, which is detailed in the following sub-sections.

Figure 2: Probability density of $\lambda_{i}$ $(i\in\{1,2\})$ in $2\times 2$ MIMO.

III-B DM-MIMO

We denote the output of each sub-channel as $\mathbf{y}'_{i}$, which is the $i$-th row of $\mathbf{Y}'$. To adapt to different channel matrices $\mathbf{H}$, for sub-channel $i$, we employ normalization with factor $\sigma_{i}$ as

$$\bar{\mathbf{y}}'_{i}=\frac{1}{\sqrt{1+\sigma_{i}^{2}}}\mathbf{y}'_{i}=\frac{1}{\sqrt{1+\sigma_{i}^{2}}}\mathbf{z}_{i}+\frac{\sigma_{i}}{\sqrt{1+\sigma_{i}^{2}}}\mathbf{u}^{H}_{i}\frac{\mathbf{N}}{\sigma}. \tag{9}$$

We define $\mathbf{x}_{i,0}=\mathbf{z}_{i}$, $\mathbf{X}_{0}=[\mathbf{x}_{1,0},\cdots,\mathbf{x}_{M,0}]^{T}$ and design a forward diffusion process inspired by [7]

$$\mathbf{x}_{i,t}=\sqrt{\bar{\alpha}_{t}}\mathbf{x}_{i,0}+\sqrt{1-\bar{\alpha}_{t}}\boldsymbol{\epsilon}_{i}, \tag{10}$$

where $\bar{\alpha}_{t}=\prod_{l=1}^{t}\alpha_{l}$, $t\in\{1,2,\cdots,T\}$ and $\boldsymbol{\epsilon}=[\boldsymbol{\epsilon}_{1},\cdots,\boldsymbol{\epsilon}_{M}]^{T}$ denotes the additive noise term consisting of i.i.d. elements following $\mathcal{CN}(0,1)$. To handle signals over sub-channels with different effective noise power, DM-MIMO generates signals sharing the same distribution as $\bar{\mathbf{y}}'_{i}$. Specifically, DM-MIMO utilizes the forward diffusion process with additive noise power matching the effective noise power, enabling the corresponding sampling process for denoising. Therefore, the effective sampling step $m_{i}$ is given by

$$m_{i}=\mathop{\mathrm{argmin}}\limits_{m\in\{1,2,\cdots,T\}}\left|\sigma_{i}^{2}-\frac{1-\bar{\alpha}_{m}}{\bar{\alpha}_{m}}\right|. \tag{11}$$

By employing tailored effective sampling steps, DM-MIMO is able to handle different signals over different sub-channels in one sampling process, utilizing the inter-sub-channel signal correlations. Moreover, considering the channel noise power varying in a wide range, additive channel noise $\mathbf{N}'$ introduces dramatic power fluctuations to the equalized signal $\mathbf{Y}'$. DM-MIMO reduces the power fluctuations by learning the distribution of the encoded signal $\mathbf{Z}$ and eliminating noise through the reverse sampling process, thus enhancing the robustness of the receiver.
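The step selection in (11) amounts to a nearest-match search over the schedule: comparing (9) with (10), the normalized signal has the form of a diffusion state when $\bar{\alpha}_{m}\approx 1/(1+\sigma_{i}^{2})$, which is equivalent to $\sigma_{i}^{2}\approx(1-\bar{\alpha}_{m})/\bar{\alpha}_{m}$. The following minimal sketch implements that search together with the normalization in (9). The linear schedule, $T=1000$, the example noise powers, and the function names are assumptions made for illustration, not settings reported in this paper.

```python
import math


def alpha_bar_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative products [abar_1, ..., abar_T] for an assumed linear beta schedule."""
    abars, prod = [], 1.0
    for t in range(T):
        beta = beta_start + (beta_end - beta_start) * t / max(T - 1, 1)
        prod *= 1.0 - beta
        abars.append(prod)
    return abars


def effective_sampling_step(sigma_i_sq, abars):
    """Eq. (11): 1-based step m minimizing |sigma_i^2 - (1 - abar_m) / abar_m|."""
    best_m, best_gap = 1, float("inf")
    for m, abar in enumerate(abars, start=1):
        gap = abs(sigma_i_sq - (1.0 - abar) / abar)
        if gap < best_gap:
            best_m, best_gap = m, gap
    return best_m


def normalize(y_row, sigma_i_sq):
    """Eq. (9): scale the equalized row by 1 / sqrt(1 + sigma_i^2)."""
    scale = 1.0 / math.sqrt(1.0 + sigma_i_sq)
    return [scale * y for y in y_row]


if __name__ == "__main__":
    abars = alpha_bar_schedule(1000)
    for sigma_i_sq in (0.025, 0.4):  # hypothetical effective noise powers
        m = effective_sampling_step(sigma_i_sq, abars)
        print("sigma_i^2 =", sigma_i_sq, "-> m_i =", m,
              "abar_m =", abars[m - 1], "target =", 1.0 / (1.0 + sigma_i_sq))
```

Because $(1-\bar{\alpha}_{m})/\bar{\alpha}_{m}$ grows monotonically with $m$, a noisier sub-channel is always assigned a later step; a binary search would also work, but the linear scan keeps the correspondence with (11) explicit.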

\begin{align}
L={}&\mathbb{E}\left[-\log p_{\boldsymbol{\theta}}(\mathbf{X}_{0}|\boldsymbol{\Sigma})\right] \tag{12}\\
\leq{}&\mathbb{E}_{q}\left[-\log\left(\frac{p_{\boldsymbol{\theta}}(\mathbf{X}_{0:T},\mathbf{Y}'|\boldsymbol{\Sigma})}{q(\mathbf{X}_{1:T},\mathbf{Y}'|\mathbf{X}_{0},\boldsymbol{\Sigma})}\right)\right] \tag{13}\\
={}&\mathbb{E}_{q}\Bigg[\underbrace{D_{KL}\left(q\left(\mathbf{Y}'|\mathbf{X}_{0},\boldsymbol{\Sigma}\right)||p\left(\mathbf{Y}'|\boldsymbol{\Sigma}\right)\right)}_{L_{\mathbf{Y}'}}-\underbrace{\log p_{\boldsymbol{\theta}}\left(\mathbf{X}_{0}|\mathbf{X}_{1},\boldsymbol{\Sigma}\right)}_{L_{0}}+\underbrace{D_{KL}\left(q\left(\mathbf{X}_{T}|\mathbf{Y}',\mathbf{X}_{0},\boldsymbol{\Sigma}\right)||p\left(\mathbf{X}_{T}|\mathbf{Y}',\boldsymbol{\Sigma}\right)\right)}_{L_{T}} \notag\\
&+\sum^{T}_{t=1}\underbrace{D_{KL}\left(q\left(\mathbf{X}_{t-1}|\mathbf{X}_{t},\mathbf{X}_{0},\boldsymbol{\Sigma}\right)||p_{\boldsymbol{\theta}}\left(\mathbf{X}_{t-1}|\mathbf{X}_{t},\boldsymbol{\Sigma}\right)\right)}_{L_{t-1}}\Bigg]. \tag{14}
\end{align}

 

In the training process, as $\mathbf{x}_{i,m_{i}}$ and $\mathbf{y}'_{i}$ share the same distribution, DM-MIMO can be trained on the forward diffusion process of $\mathbf{Z}$ instead of $\mathbf{Y}'$. Aiming to recover $\mathbf{Z}$ through learning its distribution, the loss function $L$ is defined by the variational bound on the negative log-likelihood of $\mathbf{X}_{0}$, as given in (12). Among the components of the loss function, $L_{\mathbf{Y}'}$ and $L_{T}$ can be ignored during training as they do not depend on $\boldsymbol{\theta}$.
Therefore, we focus on $L_{0}$ and $L_{t-1}$, which reveal the training goal of approximating the distribution $q(\mathbf{X}_{t-1}|\mathbf{X}_{t},\mathbf{X}_{0},\boldsymbol{\Sigma})$ with $p_{\boldsymbol{\theta}}(\mathbf{X}_{t-1}|\mathbf{X}_{t},\boldsymbol{\Sigma})$. After re-parameterization and re-weighting, the loss function $L_{t-1}$ can be simplified as

$\mathbb{E}_{\mathbf{X}_{0},\boldsymbol{\epsilon},\boldsymbol{\Sigma}}\left(\left\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\mathbf{X}_{t},\boldsymbol{\Sigma},t)\right\|_{2}^{2}\right).$ (15)

Finally, to optimize (15) for all $t\in\{1,\cdots,T\}$, the loss function of DM-MIMO can be expressed as

$L_{DM-MIMO}(\boldsymbol{\theta})=\mathbb{E}_{\mathbf{X}_{0},\boldsymbol{\epsilon},\boldsymbol{\Sigma},t}\left(\left\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\mathbf{X}_{t},\boldsymbol{\Sigma},t)\right\|_{2}^{2}\right).$ (16)

The training process of DM-MIMO is detailed in Algorithm 1. With different effective sampling steps chosen according to different effective noise power $\sigma_{i}^{2}$, DM-MIMO demonstrates the ability of denoising under diverse channel conditions while utilizing fixed parameters.

Algorithm 1 Training Algorithm of DM-MIMO

Input: Encoded signal set, diffusion steps $T$ and noise schedule parameter $\bar{\alpha}_{t}$ for $t\in\{1,\cdots,T\}$;
Output: Trained DM-MIMO model parameters $\boldsymbol{\theta}$;

1: while the stop condition is not met do
2:     Randomly sample $\mathbf{Z}$ from the encoded signal set
3:     Randomly sample $t$ from $Uniform(\{1,\cdots,T\})$
4:     Randomly sample $\mathbf{H}$
5:     for all $i=1$ to $M$ do
6:         Randomly sample $\boldsymbol{\epsilon}_{i}$ from $\mathcal{N}\left(0,\mathbf{I}_{2k}\right)$
7:     end for
8:     $\boldsymbol{\epsilon}=[\boldsymbol{\epsilon}_{1},\cdots,\boldsymbol{\epsilon}_{M}]$
9:     Generate sample $\mathbf{X}_{t}=\sqrt{\bar{\alpha}_{t}}\mathbf{Z}+\sqrt{1-\bar{\alpha}_{t}}\boldsymbol{\epsilon}$
10:     Take gradient descent step: $\nabla_{\boldsymbol{\theta}}\left(\left\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_{\boldsymbol{\theta}}\left(\mathbf{X}_{t},\boldsymbol{\Sigma},t\right)\right\|_{2}^{2}\right)$
11: end while
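As a rough illustration of lines 5 to 10 of Algorithm 1, the following Python sketch generates one noisy training sample and evaluates the squared-error term of (16) for it. It is a minimal sketch under stated assumptions: the signal is a flat list of real values, the value of $\bar{\alpha}_{t}$ is arbitrary, the conditioning on $\boldsymbol{\Sigma}$ is omitted, and `oracle_predictor` is a hypothetical stand-in for $\boldsymbol{\epsilon}_{\boldsymbol{\theta}}$ that cheats by using the true $\mathbf{Z}$. It is not the U-Net used in the paper.

```python
import math
import random


def forward_diffuse(z, alpha_bar_t, rng):
    """Line 9 of Algorithm 1: X_t = sqrt(abar_t) * Z + sqrt(1 - abar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in z]
    x_t = [math.sqrt(alpha_bar_t) * zi + math.sqrt(1.0 - alpha_bar_t) * ei
           for zi, ei in zip(z, eps)]
    return x_t, eps


def noise_prediction_loss(eps, eps_hat):
    """Squared L2 norm ||eps - eps_hat||_2^2 from (16), for a single sample."""
    return sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat))


def oracle_predictor(x_t, z, alpha_bar_t):
    """Hypothetical stand-in for eps_theta: inverts the forward step using the true Z."""
    return [(xi - math.sqrt(alpha_bar_t) * zi) / math.sqrt(1.0 - alpha_bar_t)
            for xi, zi in zip(x_t, z)]


rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(8)]  # toy encoded signal (assumption)
alpha_bar_t = 0.7                            # illustrative value only
x_t, eps = forward_diffuse(z, alpha_bar_t, rng)

# A perfect noise predictor drives the loss to (numerically) zero,
# while a predictor that always outputs zero does not.
loss_oracle = noise_prediction_loss(eps, oracle_predictor(x_t, z, alpha_bar_t))
loss_zero = noise_prediction_loss(eps, [0.0] * len(eps))
```

In an actual training loop the predictor would be a neural network and the loss would be averaged over a batch before the gradient step.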

III-C Sampling Algorithm of DM-MIMO

Inspired by [11], we design a joint sampling algorithm for DM-MIMO. Specifically, in the $t$-th sampling step of DM-MIMO, in order to guide the denoising process of equalized signals with high effective noise power, we add noise to the equalized signals with low effective noise power and send them to the reverse sampling step. In short, we employ either noise addition or the reverse sampling process for each sub-channel based on the value of its effective sampling step $m_{i}$.

Figure 3: The $t$-th sampling step of the DM-MIMO joint sampling algorithm.

We begin the sampling process from sampling step $t=\max\{m_{1},\cdots,m_{M}\}$. As illustrated in Fig. 3, if $m_{i}\leq(t-1)$, to keep the correct properties of the distribution of $\mathbf{X}_{t-1}$, we derive $\mathbf{x}_{i,t-1}$ by adding noise to $\bar{\mathbf{y}}'_{i}$, given by

$\mathbf{x}_{i,t-1}=\sqrt{\frac{\bar{\alpha}_{t-1}}{\bar{\alpha}_{m_{i}}}}\bar{\mathbf{y}}'_{i}+\sqrt{1-\frac{\bar{\alpha}_{t-1}}{\bar{\alpha}_{m_{i}}}}\boldsymbol{\epsilon}_{i}.$ (17)
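The role of the two square-root factors in (17) can be checked with a short calculation. The sketch below assumes that $\bar{\mathbf{y}}'_{i}$ is distributed like $\mathbf{x}_{i,m_{i}}$, that is, it carries the clean signal with coefficient $\sqrt{\bar{\alpha}_{m_{i}}}$ plus independent Gaussian noise of variance $1-\bar{\alpha}_{m_{i}}$. The function name and the numerical values are illustrative only.

```python
import math


def renoise_coefficients(signal_coef, noise_var, abar_target, abar_m):
    """Propagate (signal coefficient, noise variance) through the update (17).

    The fresh noise eps_i is independent of the existing noise, so variances add.
    """
    ratio = abar_target / abar_m
    new_signal_coef = math.sqrt(ratio) * signal_coef
    new_noise_var = ratio * noise_var + (1.0 - ratio)
    return new_signal_coef, new_noise_var


# Illustrative values with abar_prev <= abar_m, as holds whenever m_i <= t - 1.
abar_m, abar_prev = 0.9, 0.6
sig, var = renoise_coefficients(math.sqrt(abar_m), 1.0 - abar_m, abar_prev, abar_m)
# sig is sqrt(abar_prev) and var is 1 - abar_prev (up to rounding),
# i.e. the marginal statistics of x_{i,t-1} in the forward diffusion process.
```

Under this assumption, the re-noised sub-channel has the same marginal statistics as a forward-diffused sample at step $t-1$, which is what allows it to be concatenated with the other sub-channels as input to the network.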

On the other hand, if $m_{i}>(t-1)$, assuming that $\mathbf{x}_{i,t}$ and $\mathbf{x}_{i,0}$ are available, we can derive the sampling process of $\mathbf{x}_{i,t-1}$ as

$\mathbf{x}_{i,t-1}=\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_{i,0}+\sqrt{1-\bar{\alpha}_{t-1}}\frac{\mathbf{x}_{i,t}-\sqrt{\bar{\alpha}_{t}}\mathbf{x}_{i,0}}{\sqrt{1-\bar{\alpha}_{t}}},$ (18)

where $\mathbf{x}_{i,0}$ can be acquired by re-writing (10) as

$\mathbf{x}_{i,0}=\frac{1}{\sqrt{\bar{\alpha}_{t}}}\left(\mathbf{x}_{i,t}-\sqrt{1-\bar{\alpha}_{t}}\boldsymbol{\epsilon}_{i}\right).$ (19)

In the reverse sampling process, with only $\mathbf{x}_{i,t}$ and $\boldsymbol{\epsilon}_{\boldsymbol{\theta},i}(\mathbf{X}_{t},\boldsymbol{\Sigma},t)$ known to the receiver, $\boldsymbol{\epsilon}_{i}$ is replaced with $\boldsymbol{\epsilon}_{\boldsymbol{\theta},i}(\mathbf{X}_{t},\boldsymbol{\Sigma},t)$. Therefore, the sampling process can be expressed as

$\mathbf{x}_{i,t-1}=\sqrt{\bar{\alpha}_{t-1}}\left(\frac{1}{\sqrt{\bar{\alpha}_{t}}}\left(\mathbf{x}_{i,t}-\sqrt{1-\bar{\alpha}_{t}}\boldsymbol{\epsilon}_{\boldsymbol{\theta},i}\left(\mathbf{X}_{t},\boldsymbol{\Sigma},t\right)\right)\right)+\sqrt{1-\bar{\alpha}_{t-1}}\boldsymbol{\epsilon}_{\boldsymbol{\theta},i}\left(\mathbf{X}_{t},\boldsymbol{\Sigma},t\right).$ (20)

As for the last sampling step $t=1$, DM-MIMO only predicts $\mathbf{x}_{i,0}$ with $\mathbf{x}_{i,1}$, given by

$\mathbf{x}_{i,0}=\frac{1}{\sqrt{\bar{\alpha}_{1}}}\left(\mathbf{x}_{i,1}-\sqrt{1-\bar{\alpha}_{1}}\boldsymbol{\epsilon}_{\boldsymbol{\theta},i}\left(\mathbf{X}_{1},\boldsymbol{\Sigma},1\right)\right).$ (21)
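The updates (19) to (21) are deterministic once the noise estimate is given. The sketch below implements them for a single sub-channel and checks a sanity property: if the noise estimate happens to equal the true noise, (20) reproduces the forward-diffused sample at step $t-1$ built from the same noise, and (21) returns the clean signal. The function names and numerical values are illustrative assumptions, and the true noise is used in place of the network output purely for the check.

```python
import math


def predict_x0(x_t, eps_hat, abar_t):
    """Equations (19)/(21): estimate the clean signal from x_t and a noise estimate."""
    return [(x - math.sqrt(1.0 - abar_t) * e) / math.sqrt(abar_t)
            for x, e in zip(x_t, eps_hat)]


def reverse_step(x_t, eps_hat, abar_t, abar_prev):
    """Equation (20): move from step t to step t-1 using the noise estimate."""
    x0_hat = predict_x0(x_t, eps_hat, abar_t)
    return [math.sqrt(abar_prev) * x0 + math.sqrt(1.0 - abar_prev) * e
            for x0, e in zip(x0_hat, eps_hat)]


# Toy values (assumptions): a clean signal, a fixed noise draw, two schedule values.
x0 = [0.5, -1.0, 2.0]
eps = [0.3, -0.2, 1.1]
abar_t, abar_prev = 0.5, 0.8

x_t = [math.sqrt(abar_t) * a + math.sqrt(1.0 - abar_t) * e for a, e in zip(x0, eps)]
x_prev = reverse_step(x_t, eps, abar_t, abar_prev)
expected_prev = [math.sqrt(abar_prev) * a + math.sqrt(1.0 - abar_prev) * e
                 for a, e in zip(x0, eps)]
x0_recovered = predict_x0(x_t, eps, abar_t)
```

With a learned predictor the noise estimate is only approximate, so the quality of the recovered signal depends on how well the network has learned the distribution of the encoded signal.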

As outlined in Algorithm 2, for sub-channel $i$, by comparing $m_{i}$ with the sampling step $t$, the joint sampling algorithm either adds noise or performs the reverse sampling process, thus leveraging all the semantic information while addressing different effective noise power over different sub-channels.

Algorithm 2 Sampling Algorithm of DM-MIMO

Input: Equalized signal $\mathbf{Y}'$, channel matrix $\mathbf{H}$, channel noise power $\sigma^{2}$;
Output: Denoised signal $\hat{\mathbf{Z}}$;

1: $[\mathbf{y}'_{1},\cdots,\mathbf{y}'_{M}]=\mathbf{Y}'$
2: for all $i=1$ to $M$ do
3:     Calculate $\lambda_{i}$ and $\sigma_{i}$ based on $\mathbf{H}$ and $\sigma^{2}$
4:     $\bar{\mathbf{y}}'_{i}=\frac{1}{\sqrt{1+\sigma_{i}^{2}}}\mathbf{y}'_{i}$
5:     Calculate $m_{i}$ according to sub-channel $\sigma_{i}$
6: end for
7: $m_{max}=\max\{m_{1},\cdots,m_{M}\}$
8: $t=m_{max}$
9: for all $i=1$ to $M$ do
10:     Randomly sample $\boldsymbol{\epsilon}_{i}$ from $\mathcal{N}\left(0,\mathbf{I}_{2k}\right)$
11:     $\mathbf{x}_{i,t}=\sqrt{\frac{\bar{\alpha}_{t}}{\bar{\alpha}_{m_{i}}}}\bar{\mathbf{y}}'_{i}+\sqrt{1-\frac{\bar{\alpha}_{t}}{\bar{\alpha}_{m_{i}}}}\boldsymbol{\epsilon}_{i}$
12: end for
13: for $t=m_{max},m_{max}-1,\cdots,2$ do
14:     $\mathbf{X}_{t}=[\mathbf{x}_{1,t},\cdots,\mathbf{x}_{M,t}]$
15:     for all $i=1$ to $M$ do
16:         if $m_{i}\leq t-1$ then
17:             Randomly sample $\boldsymbol{\epsilon}_{i}$ from $\mathcal{N}\left(0,\mathbf{I}_{2k}\right)$
18:             $\mathbf{x}_{i,t-1}=\sqrt{\frac{\bar{\alpha}_{t-1}}{\bar{\alpha}_{m_{i}}}}\bar{\mathbf{y}}'_{i}+\sqrt{1-\frac{\bar{\alpha}_{t-1}}{\bar{\alpha}_{m_{i}}}}\boldsymbol{\epsilon}_{i}$
19:         else
20:             $\hat{\boldsymbol{\epsilon}}_{i}=\boldsymbol{\epsilon}_{\boldsymbol{\theta},i}\left(\mathbf{X}_{t},\boldsymbol{\Sigma},t\right)$
21:             $\mathbf{x}_{i,t-1}=\sqrt{\bar{\alpha}_{t-1}}\left(\frac{\mathbf{x}_{i,t}-\sqrt{1-\bar{\alpha}_{t}}\hat{\boldsymbol{\epsilon}}_{i}}{\sqrt{\bar{\alpha}_{t}}}\right)+\sqrt{1-\bar{\alpha}_{t-1}}\hat{\boldsymbol{\epsilon}}_{i}$
22:         end if
23:     end for
24: end for
25: $t=1$
26: $\hat{\boldsymbol{\epsilon}}=\boldsymbol{\epsilon}_{\boldsymbol{\theta}}\left(\mathbf{X}_{1},\boldsymbol{\Sigma},t\right)$
27: $\hat{\mathbf{Z}}=\frac{\mathbf{X}_{1}-\sqrt{1-\bar{\alpha}_{1}}\hat{\boldsymbol{\epsilon}}}{\sqrt{\bar{\alpha}_{1}}}$
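The control flow of Algorithm 2 (lines 7 to 27) can be sketched in plain Python as follows. This is a minimal sketch under several assumptions: the scaled signals $\bar{\mathbf{y}}'_{i}$ and the steps $m_{i}\geq 1$ are already computed, each sub-channel is a flat list of real values, the noise predictor is passed in as a hypothetical callable `eps_model(x, t)` that returns one noise estimate per sub-channel, and the conditioning on $\boldsymbol{\Sigma}$ is left out. The toy schedule and the oracle predictor exist only to exercise the code.

```python
import math
import random


def renoise(y_bar, abar_to, abar_from, rng):
    """Lines 11 and 18: bring a low-noise sub-channel up to the current noise level."""
    ratio = abar_to / abar_from
    return [math.sqrt(ratio) * y + math.sqrt(1.0 - ratio) * rng.gauss(0.0, 1.0)
            for y in y_bar]


def joint_sampling(y_bar, m, abar, eps_model, rng):
    """Sketch of Algorithm 2. abar[t] holds the schedule for t >= 1; abar[0] is unused."""
    num_sub = len(y_bar)
    t_max = max(m)
    x = [renoise(y_bar[i], abar[t_max], abar[m[i]], rng) for i in range(num_sub)]
    for t in range(t_max, 1, -1):
        eps_hat = eps_model(x, t)  # one noise estimate per sub-channel
        nxt = []
        for i in range(num_sub):
            if m[i] <= t - 1:      # noise addition branch, as in (17)
                nxt.append(renoise(y_bar[i], abar[t - 1], abar[m[i]], rng))
            else:                  # reverse sampling branch, as in (20)
                nxt.append([
                    math.sqrt(abar[t - 1]) * (xv - math.sqrt(1.0 - abar[t]) * ev)
                    / math.sqrt(abar[t]) + math.sqrt(1.0 - abar[t - 1]) * ev
                    for xv, ev in zip(x[i], eps_hat[i])])
        x = nxt
    eps_hat = eps_model(x, 1)      # final step, as in (21)
    return [[(xv - math.sqrt(1.0 - abar[1]) * ev) / math.sqrt(abar[1])
             for xv, ev in zip(x[i], eps_hat[i])] for i in range(num_sub)]


# Toy setup (all values are assumptions for illustration).
rng = random.Random(1)
abar = [1.0, 0.95, 0.85, 0.7, 0.5, 0.3]
z_true = [[1.0, -0.5], [0.25, 2.0]]
m = [2, 5]
y_bar = [[math.sqrt(abar[m[i]]) * v + math.sqrt(1.0 - abar[m[i]]) * rng.gauss(0.0, 1.0)
          for v in z_true[i]] for i in range(2)]


def oracle(x, t):
    """Hypothetical perfect predictor that knows z_true; stands in for the trained network."""
    return [[(xv - math.sqrt(abar[t]) * zv) / math.sqrt(1.0 - abar[t])
             for xv, zv in zip(x[i], z_true[i])] for i in range(len(x))]


z_hat = joint_sampling(y_bar, m, abar, oracle, rng)
```

With the oracle predictor the output matches `z_true` up to rounding, which only confirms that the branches are wired consistently; it says nothing about the accuracy of a trained model.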

III-D Training algorithm of semantic communication system

Since DM-MIMO is trained to learn the distribution of the encoded signal $\mathbf{Z}$, a three-stage training algorithm is proposed. In the first stage, the JSCC encoder and decoder are jointly trained to minimize the reconstruction distortion. With MSE adopted as the performance metric, the loss function of the first training stage can be written as

$L_{s1}\left(\boldsymbol{\phi},\boldsymbol{\varphi}\right)=\mathbb{E}_{\mathbf{S}\sim p_{\mathbf{S}}}\mathbb{E}_{\mathbf{Y}\sim p_{\mathbf{Y}|\mathbf{S}}}\left\|\mathbf{S}-\hat{\mathbf{S}}\right\|_{F}^{2}.$ (22)

With the well-trained and fixed parameters of the semantic encoder, DM-MIMO is trained in the second stage using Algorithm 1. Specifically, DM-MIMO adopts the whole encoded signal $\mathbf{Z}$ as input, leveraging all the semantic information received. Benefiting from the different effective sampling steps and the joint sampling algorithm, DM-MIMO is capable of noise elimination over different sub-channels.

In the third stage, the JSCC decoder is retrained to adapt to the JSCC encoder and the DM-MIMO. Though only the parameters of the JSCC decoder are trained, the entire system operates under real MIMO channels. The loss function is

$L_{s3}\left(\boldsymbol{\varphi}\right)=\mathbb{E}_{\mathbf{Y}'\sim p_{\mathbf{Y}'|\mathbf{S}}}\left\|\mathbf{S}-\hat{\mathbf{S}}\right\|_{F}^{2}.$ (23)

To investigate the robustness of semantic communication systems, we consider a universal transmission scenario with random transmit power. Therefore, to enhance the robustness of the communication system, the channel noise power is randomly chosen from a given range during both the first and the third training stages.

IV Experimental results

In this section, a series of experiments are conducted to evaluate the performance of DM-MIMO-JSCC.

IV-A Experiment Setup

We consider the DIV2K dataset in the experiments. This dataset contains 1000 diverse 2K images from a wide range of real-world scenes, 800 of which are used for training, 100 for validation and the remaining 100 for testing. We randomly crop the images into $256\times 256$ patches in the training process. We consider block-fading MIMO channels with antenna number $M=2$ and channel SNRs ranging from 0 dB to 20 dB, where the channel SNR is defined as

$SNR=10\log_{10}\frac{\mathbb{E}_{\mathbf{H},\mathbf{W}}\left[\left\|\mathbf{H}\mathbf{W}\right\|_{F}^{2}\right]}{\mathbb{E}_{\mathbf{N}}\left\|\mathbf{N}\right\|_{F}^{2}}=10\log_{10}\frac{P_{s}}{\sigma^{2}}.$ (24)
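For reference, (24) can be inverted to obtain the channel noise power for a given SNR. The sketch below assumes a normalized signal power $P_{s}=1$ by default, which is an assumption made here for illustration and is not stated in this section.

```python
import math


def noise_power_from_snr_db(snr_db, signal_power=1.0):
    """Invert (24): sigma^2 = P_s / 10^(SNR / 10)."""
    return signal_power / (10.0 ** (snr_db / 10.0))


def snr_db_from_powers(signal_power, noise_power):
    """Equation (24): SNR in dB from the signal power P_s and noise power sigma^2."""
    return 10.0 * math.log10(signal_power / noise_power)


sigma2_at_0db = noise_power_from_snr_db(0.0)    # equal signal and noise power
sigma2_at_10db = noise_power_from_snr_db(10.0)  # noise power ten times smaller
sigma2_at_20db = noise_power_from_snr_db(20.0)  # noise power a hundred times smaller
```

Over the tested range of 0 dB to 20 dB, the noise power therefore spans two orders of magnitude relative to the signal power.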

We adopt the latest Swin-Transformer-based JSCC [12] for the JSCC in our DM-MIMO-JSCC. For simplicity, channel adaptation is not considered in the JSCC encoder and decoder. The proposed DM-MIMO is built on the U-Net architecture [13]. We choose hyper-parameter $T=1000$ and employ a noise schedule $\alpha_{t}$ that linearly decreases from $\alpha_{1}=0.9999$ to $\alpha_{T}=0.98$. DM-MIMO is trained with an Adam optimizer for 800 epochs, which adopts a cosine warm-up learning rate schedule with an initial learning rate of $1\times 10^{-4}$. Besides, the JSCC is trained for 800 epochs during the first training stage, with a learning rate of $1\times 10^{-4}$. The number of retraining epochs in the third training stage is set to 20. We implement DM-MIMO-JSCC on one NVIDIA A40 GPU using PyTorch. For fair comparison, the same Swin-Transformer-based JSCC but without DM-MIMO is considered as a benchmark. It is trained with the same setup as the first training stage of DM-MIMO-JSCC.
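The noise schedule described above can be tabulated directly. The sketch below builds the linear $\alpha_{t}$ schedule with $T=1000$ and its cumulative product $\bar{\alpha}_{t}$. The helper `effective_step` is an assumption: it picks the step whose $\bar{\alpha}_{t}$ is closest to $1/(1+\sigma_{i}^{2})$, which is consistent with the scaling $\bar{\mathbf{y}}'_{i}=\mathbf{y}'_{i}/\sqrt{1+\sigma_{i}^{2}}$ in Algorithm 2 but may differ from the exact rule used to define $m_{i}$.

```python
def linear_alpha_schedule(T=1000, alpha_1=0.9999, alpha_T=0.98):
    """alpha_t decreasing linearly from alpha_1 (t = 1) to alpha_T (t = T)."""
    return [alpha_1 + (alpha_T - alpha_1) * (t - 1) / (T - 1) for t in range(1, T + 1)]


def cumulative_alpha_bar(alphas):
    """abar[t] = alpha_1 * ... * alpha_t; abar[0] = 1.0 is a placeholder so indices match t."""
    abar = [1.0]
    for a in alphas:
        abar.append(abar[-1] * a)
    return abar


def effective_step(sigma_i_sq, abar):
    """Assumed rule: the step t in {1, ..., T} whose abar[t] is closest to 1 / (1 + sigma_i^2)."""
    target = 1.0 / (1.0 + sigma_i_sq)
    return min(range(1, len(abar)), key=lambda t: abs(abar[t] - target))


alphas = linear_alpha_schedule()
abar = cumulative_alpha_bar(alphas)
m_low_noise = effective_step(0.1, abar)   # sub-channel with low effective noise power
m_high_noise = effective_step(1.0, abar)  # sub-channel with high effective noise power
```

Under this assumed rule, a sub-channel with larger effective noise power is mapped to a later diffusion step, so it undergoes more reverse sampling steps than a cleaner sub-channel.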

Figure 4: MSE performance versus SNR under $2\times 2$ MIMO channel. CBR is set to $1/128$.
Figure 5: PSNR performance versus SNR under $2\times 2$ MIMO channel. CBR is set to $1/128$.

IV-B MSE performance

We first evaluate the effectiveness of DM-MIMO by comparing the MSE between the denoised signal and the encoded signal in DM-MIMO-JSCC with the MSE between the equalized signal and the encoded signal in universal JSCC without DM-MIMO. ${\rm MSE}_{avg}$ denotes the average MSE across all sub-channels, while ${\rm MSE}_{i}$ denotes the average MSE of the $i$-th sub-channel. As shown in Fig. 4, with DM-MIMO adopted, both ${\rm MSE}_{1}$ and ${\rm MSE}_{2}$ decrease over the SNR regime of $[0,20]$ dB. The lower the channel SNR, the higher the MSE gain achieved by DM-MIMO on both sub-channels. The MSE gains on ${\rm MSE}_{1}$ and ${\rm MSE}_{2}$ exceed $0.233$ dB and $2.482$ dB, respectively, throughout the SNR regime of $[0,20]$ dB. As such, our proposed DM-MIMO, with effective sampling step adaptation and the joint sampling algorithm, demonstrates effectiveness in noise elimination and signal quality enhancement.
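To make the per-sub-channel comparison concrete, the following sketch computes an MSE for each sub-channel and expresses the improvement in dB. It assumes the gain is defined as 10 log10 of the ratio between the MSE without and with denoising, which this section does not state explicitly. The signals are real-valued toy numbers chosen for illustration, not results from the paper.

```python
import math


def mse(reference, estimate):
    """Mean square error between two equal-length real-valued sequences."""
    return sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)


def gain_db(mse_baseline, mse_denoised):
    """MSE reduction in dB (assumed definition); positive when denoising helps."""
    return 10.0 * math.log10(mse_baseline / mse_denoised)


# Toy values, not taken from the paper: one row per sub-channel.
encoded = [[0.5, -1.0, 0.25, 1.5], [1.0, 0.0, -0.5, 0.75]]
equalized = [[0.6, -1.1, 0.35, 1.4], [1.4, -0.4, -0.9, 1.15]]
denoised = [[0.55, -1.05, 0.3, 1.45], [1.1, -0.1, -0.6, 0.85]]

mse_equalized = [mse(x, y) for x, y in zip(encoded, equalized)]
mse_denoised = [mse(x, y) for x, y in zip(encoded, denoised)]
gains = [gain_db(b, d) for b, d in zip(mse_equalized, mse_denoised)]

# With equally long sub-channels, the average MSE is the mean of the per-sub-channel values.
mse_avg_equalized = sum(mse_equalized) / len(mse_equalized)
mse_avg_denoised = sum(mse_denoised) / len(mse_denoised)

if __name__ == "__main__":
    for i, g in enumerate(gains, start=1):
        print(f"sub-channel {i}: MSE gain = {g:.3f} dB")
```

In this toy setup the second sub-channel starts with the larger error and therefore leaves more room for improvement, mirroring the qualitative trend reported for the weaker sub-channel.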

IV-C PSNR performance

Fig. 5 shows the PSNR performance versus the channel SNR. Our DM-MIMO-JSCC outperforms universal JSCC without DM-MIMO in the SNR regime from $0$ dB to $20$ dB. The higher the channel SNR, the higher the PSNR gain achieved by DM-MIMO-JSCC. Specifically, DM-MIMO-JSCC achieves a PSNR gain of $0.488$ dB at SNR = $20$ dB. For comparison, the JSCC scheme in which the training SNR matches the testing SNR is also plotted in Fig. 5. DM-MIMO-JSCC achieves performance comparable to this matched-SNR scheme in the SNR regime of $[0,15]$ dB. Moreover, we compare the reconstructed samples of different methods under a channel SNR of $20$ dB. As shown in Fig. 6, the samples reconstructed by DM-MIMO-JSCC show better visual quality than those reconstructed by universal JSCC without DM-MIMO: the first sample shows sharper edges and the second shows clearer fur detail. In summary, by eliminating the noise in the equalized signal, DM-MIMO enhances the robustness of the semantic communication system and achieves better performance in image recovery over various channel conditions.
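For readers less familiar with the metric, a minimal sketch of the standard PSNR definition follows. It assumes 8-bit pixel values with a peak of 255 and flattened pixel sequences; the helper `mse_ratio_from_gain` is a hypothetical name that converts a PSNR gain in dB into the corresponding ratio of image MSEs.

```python
import math


def psnr_db(original, reconstructed, max_value=255.0):
    """PSNR in dB between two equal-length pixel sequences (standard definition)."""
    err = sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)
    if err == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / err)


def mse_ratio_from_gain(gain_db):
    """Ratio MSE_new / MSE_old implied by a PSNR gain of gain_db at a fixed peak value."""
    return 10.0 ** (-gain_db / 10.0)


if __name__ == "__main__":
    print(psnr_db([10, 20, 30, 40], [12, 18, 33, 39]))
    print(mse_ratio_from_gain(0.488))
```

Under this definition, the reported PSNR gain of 0.488 dB at SNR = 20 dB corresponds to a reconstruction MSE that is roughly 10.6% lower.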

Figure 6: Examples of visualization results under a $2\times 2$ MIMO channel with a channel SNR of $20$ dB.

IV-D Complexity Analysis

We analyze the computational complexity of the proposed DM-MIMO in terms of multiply-accumulate operations (MACs). Table I shows the number of MACs for one sampling step of DM-MIMO at different CBRs. The number of MACs of DM-MIMO grows in proportion to the square of the CBR. This is expected, as the number of channels in the U-Net adopted by DM-MIMO depends directly on the length of the received signal. Meanwhile, the number of MACs of the JSCC increases only slightly as the CBR increases. As such, our DM-MIMO is better suited to the low-CBR regime in terms of computational complexity.
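The quadratic scaling can be checked directly against the DM-MIMO column of Table I. The sketch below takes the smallest CBR as the reference point, predicts the remaining entries under a pure square law, and estimates the scaling exponent from the two extreme rows; the variable names are illustrative.

```python
import math

# (CBR, DM-MIMO MACs in G per sampling step), copied from Table I.
table = [(0.0026, 6.762), (0.0039, 15.215), (0.0078, 60.858), (0.0104, 108.193)]

base_cbr, base_macs = table[0]

# Predict each entry assuming MACs scale with the square of the CBR.
predictions = [base_macs * (cbr / base_cbr) ** 2 for cbr, _ in table]
relative_errors = [abs(p - macs) / macs for p, (_, macs) in zip(predictions, table)]

# Empirical exponent from the first and last rows on a log-log scale.
exponent = math.log(table[-1][1] / base_macs) / math.log(table[-1][0] / base_cbr)

if __name__ == "__main__":
    for (cbr, macs), p, e in zip(table, predictions, relative_errors):
        print(f"CBR = {cbr:.4f}: measured {macs:8.3f} G, square-law {p:8.3f} G, rel. error {e:.1e}")
    print(f"estimated exponent: {exponent:.4f}")
```

The square-law predictions agree with the tabulated values to within rounding, which supports the stated proportionality.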

V Conclusion

In this paper, we propose a plug-in channel denoising module named DM-MIMO, which aims to enhance the robustness of semantic communication systems over MIMO channels. By learning the distribution of the encoded signal, DM-MIMO eliminates noise and reduces power fluctuations of the decoder input signal, thereby enhancing the robustness of the semantic communication system. To address the diversity of sub-channel conditions, we determine the effective sampling steps for each sub-channel accordingly, and devise a joint sampling algorithm that leverages all the received semantic information while managing the variations in effective noise power. Experimental results demonstrate that DM-MIMO-JSCC outperforms JSCC without DM-MIMO in image recovery.

TABLE I: MACs of DM-MIMO and JSCC with different CBRs.
            CBR            DM-MIMO (MACs)            JSCC (MACs)
           0.0026            6.762 G                  32.723 G
           0.0039            15.215 G                 32.726 G
           0.0078            60.858 G                 32.736 G
           0.0104            108.193 G                32.742 G

References

  • [1] J. Xu, T.-Y. Tung, B. Ai, W. Chen, Y. Sun, and D. Gündüz, “Deep joint source-channel coding for semantic communications,” IEEE Communications Magazine, vol. 61, no. 11, pp. 42–48, 2023.
  • [2] H. Xie, Z. Qin, G. Y. Li, and B.-H. Juang, “Deep learning enabled semantic communication systems,” IEEE Transactions on Signal Processing, vol. 69, pp. 2663–2675, 2021.
  • [3] S. Wang, J. Dai, Z. Liang, K. Niu, Z. Si, C. Dong, X. Qin, and P. Zhang, “Wireless deep video semantic transmission,” IEEE Journal on Selected Areas in Communications, vol. 41, no. 1, pp. 214–229, 2023.
  • [4] P. Zhang, W. Xu, H. Gao, K. Niu, X. Xu, X. Qin, C. Yuan, Z. Qin, H. Zhao, J. Wei, and F. Zhang, “Toward wisdom-evolutionary and primitive-concise 6G: A new paradigm of semantic communication networks,” Engineering, vol. 8, pp. 60–73, 2022.
  • [5] H. Wu, Y. Shao, C. Bian, K. Mikolajczyk, and D. Gündüz, “Deep joint source-channel coding for adaptive image transmission over MIMO channels,” arXiv:2309.00470, 2023.
  • [6] G. Zhang, Q. Hu, Y. Cai, and G. Yu, “SCAN: Semantic communication with adaptive channel feedback,” IEEE Transactions on Cognitive Communications and Networking, pp. 1–1, 2024.
  • [7] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” International Conference on Learning Representations, 2021.
  • [8] S. F. Yilmaz, X. Niu, B. Bai, W. Han, L. Deng, and D. Gündüz, “High perceptual quality wireless image delivery with denoising diffusion models,” arXiv:2309.15889, 2023.
  • [9] E. Grassucci, C. Marinoni, A. Rodriguez, and D. Comminiello, “Diffusion models for audio semantic communication,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 13136–13140.
  • [10] T. Wu, Z. Chen, D. He, L. Qian, Y. Xu, M. Tao, and W. Zhang, “CDDM: Channel denoising diffusion models for wireless semantic communications,” IEEE Transactions on Wireless Communications, pp. 1–1, 2024.
  • [11] A. Lugmayr, M. Danelljan, A. Romero, F. Yu, R. Timofte, and L. Van Gool, “RePaint: Inpainting using denoising diffusion probabilistic models,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 11461–11471.
  • [12] K. Yang, S. Wang, J. Dai, K. Tan, K. Niu, and P. Zhang, “WITT: A wireless image transmission transformer for semantic communications,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
  • [13] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, 2015, pp. 234–241.