DM-MIMO: Diffusion Models for Robust Semantic Communications over MIMO Channels

Yiheng Duan, Tong Wu, Zhiyong Chen and Meixia Tao
Cooperative Medianet Innovation Center, Shanghai Jiao Tong University, Shanghai, China
Emails: {duanyiheng, wu_tong, zhiyongchen, mxtao}@sjtu.edu.cn

This work is supported by the NSF of China under grants 62125108 and 62222111.

Abstract

This paper investigates robust semantic communications over multiple-input multiple-output (MIMO) fading channels. Current semantic communications over MIMO channels mainly focus on channel adaptive encoding and decoding, and leave the signal distribution largely unexplored. To leverage the potential of signal distribution in signal space denoising, we develop a diffusion model over MIMO channels (DM-MIMO), a plug-in module at the receiver side in conjunction with singular value decomposition (SVD) based precoding and equalization. Specifically, due to the significant variations in effective noise power over distinct sub-channels, we determine the effective sampling steps accordingly and devise a joint sampling algorithm. Utilizing a three-stage training algorithm, DM-MIMO learns the distribution of the encoded signal, which enables noise elimination over all sub-channels. Experimental results demonstrate that DM-MIMO effectively reduces the mean square error (MSE) of the equalized signal and that the DM-MIMO semantic communication system (DM-MIMO-JSCC) outperforms the JSCC-based semantic communication system in image reconstruction.

Index Terms:

Semantic communications, multiple-input multiple-output (MIMO), diffusion models (DMs).

I Introduction

Recently, semantic communications have attracted extensive attention thanks to their great potential in improving transmission efficiency. By leveraging the rapid advancements in deep learning, semantic communications can adeptly extract and transmit meaningful semantic information through neural network (NN) based joint source-channel coding (JSCC), and have demonstrated superiority over traditional bit communications in various types of source transmissions [1, 2, 3]. Thus far, semantic communications are regarded as a highly promising technique for 6G wireless communication networks and beyond [4].

Despite the great potential of semantic communications, most existing works focus on single-input single-output (SISO) channels. It is thus important to investigate semantic communications over multiple-input multiple-output (MIMO) channels, given that MIMO has played a leading role in boosting channel capacity and transmission reliability since 3G wireless communications. The main distinction between SISO and MIMO channels in semantic communications lies in how to allocate semantic information over sub-channels with the aid of channel state information (CSI). To this end, in [5], the proposed DeepJSCC-MIMO adopts SVD-based precoding and equalization, constructing a channel-condition-based heatmap as an additional input for encoding and decoding. In [6], in addition to SVD-based precoding and equalization, channel and feature attention (CFA) modules are embedded in the JSCC encoder and decoder to adapt to MIMO channel conditions. However, beyond such channel adaptive encoding and decoding, signal denoising also has the potential to further enhance the performance of semantic communications over MIMO channels, and this remains to be investigated.

As an advanced type of generative model, diffusion models (DMs) have not only achieved great success in image generation [7] but also shown advances in image [8] and audio [9] restoration. DMs introduce information decay to the source data by adding noise in the forward diffusion process, and are trained to learn this decay with NNs. Since noise of different power is captured at different steps, DMs are capable not only of sample generation but also of signal denoising. For example, channel denoising diffusion models (CDDM) were recently proposed to mitigate the impact of channel noise in SISO channels with an adaptive forward diffusion and the corresponding reverse sampling process [10].

Figure 1: Architecture of DM-MIMO-JSCC.

Inspired by the above, we develop a diffusion model over MIMO channels (DM-MIMO) as a plug-in module at the receiver, eliminating noise and enhancing signal quality through learning the signal distribution, thus further improving the performance of semantic communication systems over MIMO channels. Through SVD-based precoding and equalization, MIMO channels are decomposed into parallel sub-channels, each with a different effective noise power. Existing DMs only consider a fixed noise power in each sampling step, thereby failing to adapt to varying channel conditions across sub-channels. Given the effective noise power over different sub-channels, DM-MIMO employs different effective sampling steps correspondingly. Based on these effective sampling steps, in order to maintain the correct distribution properties of the input of each sampling step, DM-MIMO applies a joint sampling algorithm, adjusting the equalized signal through either noise addition or the reverse sampling process. Moreover, employing a three-stage training algorithm, DM-MIMO is able to learn the distribution of the encoded signal, which enhances the performance of signal denoising and reduces power fluctuations. Utilizing these training and sampling algorithms, the proposed DM-MIMO enhances the robustness of the semantic communication system across a wide range of channel noise power. Additionally, as a plug-in module, DM-MIMO is independent of the structure of JSCC, allowing for flexible implementation in semantic communication systems.

We evaluate the performance of DM-MIMO through extensive experiments. With DM-MIMO, a significant reduction of the mean square error (MSE) between the decoder input signal and the encoded signal is achieved. This reduction indicates an enhancement in signal quality, which in turn improves image recovery. As a result, the DM-MIMO semantic communication system (DM-MIMO-JSCC) outperforms the existing JSCC-based semantic communication system in terms of peak signal-to-noise ratio (PSNR).

II Preliminary of DM and System Overview

II-A Preliminary of DM

DMs achieve data generation through a progressive denoising procedure. With the forward diffusion process gradually corrupting data by adding noise, the reverse sampling process of DMs performs the opposite procedure, producing samples sharing the same distribution as the source data. Specifically, for a given data $\mathbf{X}_{0}$ with distribution $q(\mathbf{X}_{0})$ , the forward diffusion process is derived as a Markov chain, generating a sequence of random variables $\mathbf{X}_{1},\mathbf{X}_{2},\cdots,\mathbf{X}_{T}$ modeled by

q\left(\mathbf{X}_{1:T}|\mathbf{X}_{0}\right)\triangleq\prod_{t=1}^{T}q\left(\mathbf{X}_{t}|\mathbf{X}_{t-1}\right),

(1)

where $T$ is the number of diffusion steps, and $q(\mathbf{X}_{t}|\mathbf{X}_{t-1})$ denotes the conditional distribution of step $t$ , formulated as

q(\mathbf{X}_{t}|\mathbf{X}_{t-1})\triangleq\mathcal{N}\left(\mathbf{X}_{t};\sqrt{\alpha_{t}}\mathbf{X}_{t-1},(1-\alpha_{t})\mathbf{I}\right).

(2)

Here, $\alpha_{t}\in(0,1)$ is the noise schedule chosen ahead of model training. The reverse sampling process starts by sampling pure Gaussian noise $\mathbf{X}_{T}$, consisting of independent and identically distributed (i.i.d.) elements with distribution $\mathcal{N}(0,1)$, and then gradually generates the target data by

p_{{\boldsymbol{\theta}}}(\mathbf{X}_{0})=p(\mathbf{X}_{T})\prod^{T}_{t=1}p_{{\boldsymbol{\theta}}}(\mathbf{X}_{t-1}|\mathbf{X}_{t}),

(3)

where ${\boldsymbol{\theta}}$ are the learnable parameters of DMs, and

p_{\boldsymbol{\theta}}(\mathbf{X}_{t-1}|\mathbf{X}_{t})=\mathcal{N}\left(\mathbf{X}_{t-1};\mu_{{\boldsymbol{\theta}}}(\mathbf{X}_{t},t),\Sigma_{{\boldsymbol{\theta}}}(\mathbf{X}_{t},t)\right).

(4)

Here, $\mu_{\boldsymbol{\theta}}$ and $\Sigma_{\boldsymbol{\theta}}$ denote the mean and variance of the conditional distribution $p_{\boldsymbol{\theta}}(\mathbf{X}_{t-1}|\mathbf{X}_{t})$ predicted by DMs. As such, DMs are capable of noise elimination: we can first simulate different noise addition processes by selecting different diffusion steps, and then adopt the corresponding sampling steps to recover the source samples. Leveraging the denoising capability of DMs, we propose a plug-in denoising module DM-MIMO, detailed in Section III.
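As a minimal numerical sketch of the forward process in (1) and (2), the snippet below iterates the one-step transition and compares the result with the closed form $\mathbf{X}_{T}=\sqrt{\bar{\alpha}_{T}}\mathbf{X}_{0}+\sqrt{1-\bar{\alpha}_{T}}{\boldsymbol{\epsilon}}$, where $\bar{\alpha}_{T}$ is the cumulative product of the schedule (the quantity used later in (10)). The schedule values, the scalar data point and the function name `forward_chain` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def forward_chain(x0, alphas, rng):
    """Apply the one-step Gaussian transition of (2) for t = 1, ..., T."""
    x = x0.copy()
    for a in alphas:
        x = np.sqrt(a) * x + np.sqrt(1.0 - a) * rng.standard_normal(x.shape)
    return x

rng = np.random.default_rng(0)
alphas = np.linspace(0.999, 0.95, 50)   # illustrative schedule (assumption)
alpha_bar_T = np.prod(alphas)           # cumulative product of the schedule

x0 = np.full(200_000, 2.0)              # many copies of one scalar data point
xT = forward_chain(x0, alphas, rng)

# Deviation of the empirical mean and variance from the closed-form values.
mean_err = abs(xT.mean() - np.sqrt(alpha_bar_T) * 2.0)
var_err = abs(xT.var() - (1.0 - alpha_bar_T))
print(mean_err, var_err)
```

Both deviations shrink as the number of copies grows, which is why a noisy observation with a known signal-to-noise ratio can be treated as an intermediate state of this chain.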

II-B System overview

Consider a semantic communication system performing image transmission over MIMO channels. For simplicity, we assume both the transmitter and receiver are equipped with $M$ antennas. The input image signal is represented by $\mathbf{S}\in\mathbb{R}^{h\times w\times 3}$ , where $h$ and $w$ denote the height and width of the input image respectively, while 3 is the quantity of color channels. The image is transmitted over $k$ channel uses, and the channel bandwidth ratio (CBR) is defined as $\text{CBR}=k/n$ , where $n=3hw$ .

As illustrated in Fig. 1, at the transmitter, the input image first adopts a JSCC encoding function $f_{{\boldsymbol{\phi}}}$ parameterized by ${\boldsymbol{\phi}}$ to output the encoded signal $\mathbf{Z}=[\mathbf{z}_{1},\cdots,\mathbf{z}_{M}]^{T}\in\mathbb{C}^{M\times k}$ , expressed as $\mathbf{Z}=f_{{\boldsymbol{\phi}}}(\mathbf{S})$ . The encoded signal $\mathbf{Z}$ is then mapped into the channel input signal $\mathbf{W}\in\mathbb{C}^{M\times k}$ via MIMO precoding. The power constraint $P_{s}$ is given by $\frac{1}{k}\left\|\mathbf{W}\right\|^{2}_{F}\leq P_{s}$ .

We consider Rayleigh block fading MIMO channels. Let $\mathbf{H}\in\mathbb{C}^{M\times M}$ be the channel matrix, which remains unchanged over $k$ channel uses. The entries of $\mathbf{H}$ are i.i.d. random variables following the complex normal distribution $\mathcal{CN}(0,1)$. Then, the output signal of the MIMO channel can be written as

\mathbf{Y}=\mathbf{H}\mathbf{W}+\mathbf{N},

(5)

where $\mathbf{N}\in\mathbb{C}^{M\times k}$ is the additive noise term that consists of i.i.d. elements with distribution $\mathcal{CN}(0,\sigma^{2})$ , in which $\sigma^{2}$ denotes the channel noise power.

As CSI is accessible to both the transmitter and receiver, we adopt SVD-based precoding and equalization to leverage spatial multiplexing and mitigate inter-channel interference. The channel matrix $\mathbf{H}$ can be decomposed as

\mathbf{H}=\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{H},

(6)

where $\mathbf{U}=[\mathbf{u}_{1},\cdots,\mathbf{u}_{M}]\in\mathbb{C}^{M\times M}$ and $\mathbf{V}\in\mathbb{C}^{M\times M}$ are unitary matrices, and $\boldsymbol{\Sigma}={\rm diag}[\lambda_{1},\cdots,\lambda_{M}]\in\mathbb{C}^{M\times M}$ is a diagonal matrix with singular values $\lambda_{1}\geq\cdots\geq\lambda_{M}$. Applying $\mathbf{V}$ as the precoder at the transmitter, we have $\mathbf{W}=\mathbf{V}\mathbf{Z}$, thus (5) becomes

\mathbf{Y}=\mathbf{H}\mathbf{V}\mathbf{Z}+\mathbf{N}=\mathbf{U}\boldsymbol{\Sigma}\mathbf{Z}+\mathbf{N}.

(7)

With the aid of an SVD-based equalizer, the equalized signal $\mathbf{Y}^{\prime}=\boldsymbol{\Sigma}^{\dagger}\mathbf{U}^{H}\mathbf{Y}$ at the receiver can be expressed as

\mathbf{Y}^{\prime}=\mathbf{Z}+\mathbf{N}^{\prime},

(8)

where $\mathbf{N}^{\prime}=\boldsymbol{\Sigma}^{\dagger}\mathbf{U}^{H}\mathbf{N}=[\mathbf{n}^{\prime}_{1},\cdots,\mathbf{n}^{\prime}_{M}]^{T}$, and $\boldsymbol{\Sigma}^{\dagger}={\rm diag}[\frac{1}{\lambda_{1}},\cdots,\frac{1}{\lambda_{M}}]$. Consequently, after employing SVD-based precoding and equalization, the MIMO channels are decomposed into $M$ parallel sub-channels. For sub-channel $i$, the effective noise power is $\sigma_{i}^{2}=\frac{\sigma^{2}}{\lambda_{i}^{2}}$. In order to remove noise from the equalized signal $\mathbf{Y}^{\prime}$, we introduce a DM-MIMO module $g_{{\boldsymbol{\theta}}}(\cdot)$ with parameters ${\boldsymbol{\theta}}$, represented as $\hat{\mathbf{Z}}=g_{{\boldsymbol{\theta}}}(\mathbf{Y}^{\prime})$, detailed in Section III. Finally, the JSCC decoding function $f_{{\boldsymbol{\varphi}}}$ with parameters ${\boldsymbol{\varphi}}$ takes the denoised signal $\hat{\mathbf{Z}}\in\mathbb{C}^{M\times k}$ as input, reconstructing the input image as $\hat{\mathbf{S}}=f_{{\boldsymbol{\varphi}}}(\hat{\mathbf{Z}})$.
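The chain (5) to (8) can be checked numerically. The following sketch, under the assumption that a random complex Gaussian matrix stands in for the JSCC encoder output $\mathbf{Z}$, applies SVD-based precoding and equalization to one channel realization and compares the measured per-sub-channel noise power with $\sigma^{2}/\lambda_{i}^{2}$. The helper `crandn` and the values of $M$, $k$ and $\sigma^{2}$ are illustrative choices.

```python
import numpy as np

def crandn(rng, *shape):
    """i.i.d. CN(0, 1) samples (real and imaginary parts each N(0, 1/2))."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2.0)

rng = np.random.default_rng(1)
M, k, sigma2 = 2, 4096, 0.1                  # illustrative sizes and noise power

H = crandn(rng, M, M)                        # Rayleigh block-fading channel
Z = crandn(rng, M, k)                        # stand-in for the encoded signal
N = np.sqrt(sigma2) * crandn(rng, M, k)      # channel noise, CN(0, sigma2)

U, lam, Vh = np.linalg.svd(H)                # H = U @ diag(lam) @ Vh, lam descending
V = Vh.conj().T

W = V @ Z                                    # precoding, W = V Z
Y = H @ W + N                                # channel output, as in (5)
Y_eq = np.diag(1.0 / lam) @ U.conj().T @ Y   # equalization, as in (8)

measured = np.mean(np.abs(Y_eq - Z) ** 2, axis=1)   # empirical noise power per row
predicted = sigma2 / lam ** 2                       # effective noise power
print(measured, predicted)
```

Because $\mathbf{V}$ is unitary, the precoder leaves the transmit power unchanged, so all of the imbalance between sub-channels appears as unequal effective noise power at the receiver.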

III Design of DM-MIMO

To enhance the robustness of semantic communications, we propose DM-MIMO, a diffusion model eliminating noise over the equalized signal. Due to the high variations of effective noise power over sub-channels, we derive noise-power-aware effective sampling steps and devise a joint sampling algorithm.

III-A Analysis of sub-channel conditions

Considering block fading, disparities in channel conditions occur over distinct sub-channels. Moreover, under varying channel conditions, the JSCC encoder and decoder jointly learn semantic feature allocation across sub-channels along with semantic feature extraction. The fluctuating channel conditions and the different significance of semantics across parallel sub-channels are crucial factors to be considered in the design of the denoising module.

As $\mathbf{U}$ is a unitary matrix, the noise terms of different sub-channels in (8) are independent and Gaussian distributed. Moreover, the effective noise power $\sigma_{i}^{2}$ in different sub-channels varies significantly due to its dependence on the channel singular value $\lambda_{i}$. To illustrate the difference in effective noise power over sub-channels, we perform a Monte Carlo experiment with $1\times 10^{7}$ samples of $2\times 2$ MIMO Rayleigh fading channels. As shown in Fig. 2, the gap between the expectations of $\lambda_{i}^{2}$ in the two sub-channels reaches $10.37$ dB, presenting a challenge in joint denoising design.

With a significant gap in effective noise power between different sub-channels, the naive method of applying one sampling step for all sub-channels fails to eliminate noise effectively. Therefore, an individual effective sampling step needs to be employed for each sub-channel according to its effective noise power. Another naive method to remove noise is to employ a separate DM module for each sub-channel. This, however, fails to capture the joint distribution of the encoded signals over different sub-channels. Hence, based on the effective sampling steps, we propose a joint sampling algorithm, which is detailed in the following sub-sections.
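A Monte Carlo experiment of this kind can be sketched in a few lines. The paper does not state whether the expectation is taken over $\lambda_{i}^{2}$ itself or over its value in dB, so the sketch below reports both as an assumption-laden illustration, with far fewer samples than the $1\times 10^{7}$ used in the paper. For i.i.d. $\mathcal{CN}(0,1)$ entries, the dB-domain average gives a gap of roughly $10.4$ dB, while the ratio of linear means gives roughly $8.5$ dB.

```python
import numpy as np

rng = np.random.default_rng(0)
num = 200_000                                   # fewer samples than in the paper

# Batch of 2x2 Rayleigh channels with i.i.d. CN(0, 1) entries.
H = (rng.standard_normal((num, 2, 2))
     + 1j * rng.standard_normal((num, 2, 2))) / np.sqrt(2.0)

# Squared singular values, shape (num, 2), sorted in descending order per channel.
lam2 = np.linalg.svd(H, compute_uv=False) ** 2

# Gap between the linear-domain means, expressed in dB.
gap_linear_db = 10 * np.log10(lam2[:, 0].mean() / lam2[:, 1].mean())

# Gap between the dB-domain means.
gap_db_domain = (np.mean(10 * np.log10(lam2[:, 0]))
                 - np.mean(10 * np.log10(lam2[:, 1])))

print(f"linear-mean gap: {gap_linear_db:.2f} dB, dB-mean gap: {gap_db_domain:.2f} dB")
```

Either way, the two sub-channels differ by close to an order of magnitude in gain, which is the motivation for assigning them different sampling steps.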

III-B DM-MIMO

We denote the output of each sub-channel as $\mathbf{y}^{\prime}_{i}$ , which is the $i$ -th row of $\mathbf{Y}^{\prime}$ . To adapt to different channel matrices $\mathbf{H}$ , for sub-channel $i$ , we employ normalization with factor ${\sigma}_{i}$ as

\bar{\mathbf{y}}^{\prime}_{i}=\frac{1}{\sqrt{1+\sigma_{i}^{2}}}\mathbf{y}^{\prime}_{i}=\frac{1}{\sqrt{1+\sigma_{i}^{2}}}\mathbf{z}_{i}+\frac{\sigma_{i}}{\sqrt{1+\sigma_{i}^{2}}}\mathbf{u}^{H}_{i}\frac{\mathbf{N}}{\sigma}.

(9)

We define $\mathbf{x}_{i,0}=\mathbf{z}_{i}$, $\mathbf{X}_{0}=[\mathbf{x}_{1,0},\cdots,\mathbf{x}_{M,0}]^{T}$ and design a forward diffusion process inspired by [7]

\mathbf{x}_{i,t}=\sqrt{\bar{\alpha}_{t}}\mathbf{x}_{i,0}+\sqrt{1-\bar{\alpha}_{t}}{\boldsymbol{\epsilon}}_{i},

(10)

where $\bar{\alpha}_{t}=\prod_{l=1}^{t}\alpha_{l}$, $t\in\{1,2,\cdots,T\}$ and ${\boldsymbol{\epsilon}}=[{\boldsymbol{\epsilon}}_{1},\cdots,{\boldsymbol{\epsilon}}_{M}]^{T}$ denotes the additive noise term consisting of i.i.d. elements following $\mathcal{CN}(0,1)$. To handle signals over sub-channels with different effective noise power, DM-MIMO generates signals sharing the same distribution as $\bar{\mathbf{y}}^{\prime}_{i}$. Specifically, DM-MIMO utilizes the forward diffusion process with additive noise power matching the effective noise power, enabling the corresponding sampling process for denoising. Therefore, the effective sampling step $m_{i}$ is given by

m_{i}=\mathop{\mathrm{argmin}}\limits_{m\in\{1,2,\cdots,T\}}\left|\sigma_{i}^{2}-\frac{1-\bar{\alpha}_{m}}{\bar{\alpha}_{m}}\right|.

(11)

By employing tailored effective sampling steps, DM-MIMO is able to handle different signals over different sub-channels in one sampling process, utilizing the inter-sub-channel signal correlations. Moreover, considering the channel noise power varying in a wide range, additive channel noise $\mathbf{N}^{\prime}$ introduces dramatic power fluctuations to the equalized signal $\mathbf{Y}^{\prime}$ . DM-MIMO reduces the power fluctuations by learning the distribution of the encoded signal $\mathbf{Z}$ and eliminating noise through the reverse sampling process, thus enhancing the robustness of the receiver.
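The rule in (11) follows from matching (9) with (10): the normalized signal has signal weight $1/\sqrt{1+\sigma_{i}^{2}}$, which equals $\sqrt{\bar{\alpha}_{m}}$ exactly when $\sigma_{i}^{2}=(1-\bar{\alpha}_{m})/\bar{\alpha}_{m}$. A minimal sketch of this lookup is given below, using the linear schedule reported in Section IV-A. The function name `effective_steps`, the noise power and the example singular values are assumptions made for illustration.

```python
import numpy as np

def effective_steps(sigma2_eff, alpha_bar):
    """Return the 1-indexed steps m_i of (11) for an array of effective noise powers."""
    ratio = (1.0 - alpha_bar) / alpha_bar            # noise-to-signal ratio at each step
    idx = np.argmin(np.abs(sigma2_eff[:, None] - ratio[None, :]), axis=1)
    return idx + 1                                   # steps are counted from 1

T = 1000
alphas = np.linspace(0.9999, 0.98, T)                # linear schedule of Section IV-A
alpha_bar = np.cumprod(alphas)
ratio = (1.0 - alpha_bar) / alpha_bar

sigma2 = 0.1                                         # channel noise power (assumed value)
lam = np.array([1.8, 0.5])                           # example singular values (assumed)
sigma2_eff = sigma2 / lam ** 2                       # effective noise power per sub-channel

m = effective_steps(sigma2_eff, alpha_bar)
print(m, ratio[m - 1], sigma2_eff)
```

Since the ratio $(1-\bar{\alpha}_{m})/\bar{\alpha}_{m}$ increases monotonically with $m$, the weaker sub-channel is always assigned the larger effective sampling step.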

L=\mathbb{E}\left[-\log{p_{{\boldsymbol{\theta}}}(\mathbf{X}_{0}|\boldsymbol{\Sigma})}\right]

(12)

\leq\mathbb{E}_{q}\left[-\log{\left(\frac{p_{{\boldsymbol{\theta}}}(\mathbf{X}_{0:T},\mathbf{Y}^{\prime}|\boldsymbol{\Sigma})}{q(\mathbf{X}_{1:T},\mathbf{Y}^{\prime}|\mathbf{X}_{0},\boldsymbol{\Sigma})}\right)}\right]

(13)

=\mathbb{E}_{q}\Bigg[\underbrace{D_{KL}\left(q\left(\mathbf{Y}^{\prime}|\mathbf{X}_{0},\boldsymbol{\Sigma}\right)\|p\left(\mathbf{Y}^{\prime}|\boldsymbol{\Sigma}\right)\right)}_{L_{\mathbf{Y}^{\prime}}}-\underbrace{\log{p_{{\boldsymbol{\theta}}}\left(\mathbf{X}_{0}|\mathbf{X}_{1},\boldsymbol{\Sigma}\right)}}_{L_{0}}+\underbrace{D_{KL}\left(q\left(\mathbf{X}_{T}|\mathbf{Y}^{\prime},\mathbf{X}_{0},\boldsymbol{\Sigma}\right)\|p\left(\mathbf{X}_{T}|\mathbf{Y}^{\prime},\boldsymbol{\Sigma}\right)\right)}_{L_{T}}+\sum^{T}_{t=1}\underbrace{D_{KL}\left(q\left(\mathbf{X}_{t-1}|\mathbf{X}_{t},\mathbf{X}_{0},\boldsymbol{\Sigma}\right)\|p_{{\boldsymbol{\theta}}}\left(\mathbf{X}_{t-1}|\mathbf{X}_{t},\boldsymbol{\Sigma}\right)\right)}_{L_{t-1}}\Bigg].

(14)

In the training process, as $\mathbf{x}_{i,m_{i}}$ and $\bar{\mathbf{y}}^{\prime}_{i}$ share the same distribution, DM-MIMO can be trained on the forward diffusion process of $\mathbf{Z}$ instead of $\mathbf{Y}^{\prime}$. Aiming to recover ${\mathbf{Z}}$ through learning its distribution, the loss function $L$ is defined by the variational bound on the negative log likelihood function of $\mathbf{X}_{0}$, as given in (12). Analyzing the components of the loss function, $L_{\mathbf{Y}^{\prime}}$ and $L_{T}$ can be ignored during training as they do not depend on ${{\boldsymbol{\theta}}}$. Therefore, we focus on $L_{0}$ and $L_{t-1}$, revealing the training goal of approximating the distribution $q(\mathbf{X}_{t-1}|\mathbf{X}_{t},\mathbf{X}_{0},\boldsymbol{\Sigma})$ with $p_{{\boldsymbol{\theta}}}(\mathbf{X}_{t-1}|\mathbf{X}_{t},\boldsymbol{\Sigma})$. After re-parameterization and re-weighting, the loss function $L_{t-1}$ can be simplified as

\mathbb{E}_{\mathbf{X}_{0},{\boldsymbol{\epsilon}},\boldsymbol{\Sigma}}\left(\left\|{\boldsymbol{\epsilon}}-{\boldsymbol{\epsilon}}_{{\boldsymbol{\theta}}}(\mathbf{X}_{t},\boldsymbol{\Sigma},t)\right\|_{2}^{2}\right).

(15)

Finally, to optimize (15) for all $t\in\{1,\cdots,T\}$, the loss function of DM-MIMO can be expressed as

L_{DM-MIMO}({{\boldsymbol{\theta}}})=\mathbb{E}_{\mathbf{X}_{0},{\boldsymbol{\epsilon}},\boldsymbol{\Sigma},t}\left(\left\|{\boldsymbol{\epsilon}}-{\boldsymbol{\epsilon}}_{{\boldsymbol{\theta}}}(\mathbf{X}_{t},\boldsymbol{\Sigma},t)\right\|_{2}^{2}\right).

(16)

The training process of DM-MIMO is detailed in Algorithm 1. With different effective sampling steps chosen according to different effective noise power $\sigma_{i}^{2}$ , DM-MIMO demonstrates the ability of denoising under diverse channel conditions while utilizing fixed parameters.

Algorithm 1 Training Algorithm of DM-MIMO

Input: Encoded signal set, diffusion steps $T$ and noise schedule parameter $\bar{\alpha}_{t}$ for $t\in\{1,\cdots,T\}$ ;
Output: Trained DM-MIMO model parameters ${\boldsymbol{\theta}}$ ;

1: while the stop condition is not met do
2:   Randomly sample $\mathbf{Z}$ from encoded signal set
3:   Randomly sample $t$ from $Uniform(\{1,\cdots,T\})$
4:   Randomly sample $\mathbf{H}$
5:   for all $i=1,\cdots,M$ do
6:     Randomly sample ${\boldsymbol{\epsilon}}_{i}$ from $\mathcal{N}\left(0,\mathbf{I}_{2k}\right)$
7:   end for
8:   ${\boldsymbol{\epsilon}}=[{\boldsymbol{\epsilon}}_{1},\cdots,{\boldsymbol{\epsilon}}_{M}]$
9:   Generate sample $\mathbf{X}_{t}=\sqrt{\bar{\alpha}_{t}}\mathbf{Z}+\sqrt{1-\bar{\alpha}_{t}}{\boldsymbol{\epsilon}}$
10:   Take gradient descent step: $\nabla_{{\boldsymbol{\theta}}}\left(\left\|{\boldsymbol{\epsilon}}-{\boldsymbol{\epsilon}}_{{\boldsymbol{\theta}}}\left(\mathbf{X}_{t},\boldsymbol{\Sigma},t\right)\right\|_{2}^{2}\right)$
11: end while
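The data-generation part of Algorithm 1 (steps 2 to 9) and the objective in (16) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' training code: the complex signal is represented by a real array of shape $(M, 2k)$, the helper names `make_training_pair` and `noise_prediction_loss` are hypothetical, and the gradient step on the network ${\boldsymbol{\epsilon}}_{{\boldsymbol{\theta}}}$ is left to whichever deep learning framework is used.

```python
import numpy as np

def make_training_pair(Z, alpha_bar, rng):
    """Draw t and eps, then build X_t as in step 9 of Algorithm 1.

    Z is a real array of shape (M, 2k): real and imaginary parts stacked.
    """
    T = len(alpha_bar)
    t = int(rng.integers(1, T + 1))          # uniform on {1, ..., T}
    eps = rng.standard_normal(Z.shape)       # one N(0, I_2k) draw per sub-channel
    ab = alpha_bar[t - 1]                    # alpha_bar is stored 0-indexed
    X_t = np.sqrt(ab) * Z + np.sqrt(1.0 - ab) * eps
    return X_t, eps, t

def noise_prediction_loss(eps, eps_pred):
    """Squared-error objective of (16) evaluated on a single sample."""
    return float(np.sum((eps - eps_pred) ** 2))

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(np.linspace(0.9999, 0.98, 1000))
Z = rng.standard_normal((2, 8))              # stand-in for one encoded signal
X_t, eps, t = make_training_pair(Z, alpha_bar, rng)

# A predictor that outputs all zeros gives a loss equal to the noise energy.
print(t, noise_prediction_loss(eps, np.zeros_like(eps)))
```

Note that the training pairs never involve the channel output: only $\mathbf{Z}$, the schedule and fresh Gaussian noise are needed, which is what allows one set of parameters to serve every sub-channel.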

III-C Sampling Algorithm of DM-MIMO

Inspired by [11], we design a joint sampling algorithm for DM-MIMO. Specifically, in the $t$-th sampling step of DM-MIMO, in order to guide the denoising process of equalized signals with high effective noise power, we add noise to the equalized signals with low effective noise power and send them to the reverse sampling step. In short, we employ either noise addition or the reverse sampling process for each sub-channel based on the value of its effective sampling step $m_{i}$.

We begin the sampling process from sampling step $t=\max\{m_{1},\cdots,m_{M}\}$ . As illustrated in Fig. 3, if $m_{i}\leq(t-1)$ , to keep the correct properties of the distribution of $\mathbf{X}_{t-1}$ , we derive $\mathbf{x}_{i,t-1}$ by adding noise to $\bar{\mathbf{y}}^{\prime}_{i}$ , given by

\mathbf{x}_{i,t-1}=\sqrt{\frac{\bar{\alpha}_{t-1}}{\bar{\alpha}_{m_{i}}}}\bar{\mathbf{y}}^{\prime}_{i}+\sqrt{1-\frac{\bar{\alpha}_{t-1}}{\bar{\alpha}_{m_{i}}}}{\boldsymbol{\epsilon}}_{i}.

(17)

On the other hand, if $m_{i}>(t-1)$, assuming that $\mathbf{x}_{i,t}$ and $\mathbf{x}_{i,0}$ are available, we can derive the sampling process of $\mathbf{x}_{i,t-1}$ as

\mathbf{x}_{i,t-1}=\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_{i,0}+\sqrt{1-\bar{\alpha}_{t-1}}\frac{\mathbf{x}_{i,t}-\sqrt{\bar{\alpha}_{t}}\mathbf{x}_{i,0}}{\sqrt{1-\bar{\alpha}_{t}}},

(18)

where $\mathbf{x}_{i,0}$ can be acquired by re-writing (10) as

\mathbf{x}_{i,0}=\frac{1}{\sqrt{\bar{\alpha}_{t}}}\left(\mathbf{x}_{i,t}-\sqrt{1-\bar{\alpha}_{t}}{\boldsymbol{\epsilon}}_{i}\right).

(19)

In the reverse sampling process, with only $\mathbf{x}_{i,t}$ and ${\boldsymbol{\epsilon}}_{{\boldsymbol{\theta}},i}(\mathbf{X}_{t},\boldsymbol{\Sigma},t)$ known to the receiver, ${\boldsymbol{\epsilon}}_{i}$ is replaced with ${\boldsymbol{\epsilon}}_{{\boldsymbol{\theta}},i}(\mathbf{X}_{t},\boldsymbol{\Sigma},t)$. Therefore, the sampling process can be expressed as

\mathbf{x}_{i,t-1}=\sqrt{\bar{\alpha}_{t-1}}\left(\frac{1}{\sqrt{\bar{\alpha}_{t}}}\left(\mathbf{x}_{i,t}-\sqrt{1-\bar{\alpha}_{t}}{\boldsymbol{\epsilon}}_{{\boldsymbol{\theta}},i}\left(\mathbf{X}_{t},\boldsymbol{\Sigma},t\right)\right)\right)+\sqrt{1-\bar{\alpha}_{t-1}}{\boldsymbol{\epsilon}}_{{\boldsymbol{\theta}},i}\left(\mathbf{X}_{t},\boldsymbol{\Sigma},t\right).

(20)

As for the last sampling step $t=1$, DM-MIMO only predicts $\mathbf{x}_{i,0}$ from $\mathbf{x}_{i,1}$, given by

\mathbf{x}_{i,0}=\frac{1}{\sqrt{\bar{\alpha}_{1}}}\left(\mathbf{x}_{i,1}-\sqrt{1-\bar{\alpha}_{1}}{\boldsymbol{\epsilon}}_{{\boldsymbol{\theta}},i}\left(\mathbf{X}_{1},\boldsymbol{\Sigma},1\right)\right).

(21)

As outlined in Algorithm 2, for sub-channel $i$ , by comparing $m_{i}$ with the sampling step $t$ , the joint sampling algorithm either adds noise or performs the reverse sampling process, thus leveraging all the semantic information while addressing different effective noise power over different sub-channels.

Algorithm 2 Sampling Algorithm of DM-MIMO

Input: Equalized signal $\mathbf{Y}^{\prime}$ , channel matrix $\mathbf{H}$ , channel noise power $\sigma^{2}$ ;
Output: Denoised signal $\hat{\mathbf{Z}}$ ;

1: $[\mathbf{y}^{\prime}_{1},\cdots,\mathbf{y}^{\prime}_{M}]=\mathbf{Y}^{\prime}$
2: for all $i=1,\cdots,M$ do
3:   Calculate $\lambda_{i}$ and $\sigma_{i}$ based on $\mathbf{H}$ and $\sigma^{2}$
4:   $\bar{\mathbf{y}}^{\prime}_{i}=\frac{1}{\sqrt{1+\sigma_{i}^{2}}}\mathbf{y}^{\prime}_{i}$
5:   Calculate $m_{i}$ according to sub-channel ${\sigma}_{i}$
6: end for
7: $m_{max}=\max\{m_{1},\cdots,m_{M}\}$
8: $t=m_{max}$
9: for all $i=1,\cdots,M$ do
10:   Randomly sample ${\boldsymbol{\epsilon}}_{i}$ from $\mathcal{N}\left(0,\mathbf{I}_{2k}\right)$
11:   $\mathbf{x}_{i,t}=\sqrt{\frac{\bar{\alpha}_{t}}{\bar{\alpha}_{m_{i}}}}\bar{\mathbf{y}}^{\prime}_{i}+\sqrt{1-\frac{\bar{\alpha}_{t}}{\bar{\alpha}_{m_{i}}}}{\boldsymbol{\epsilon}}_{i}$
12: end for
13: for $t=m_{max},m_{max}-1,\cdots,2$ do
14:   $\mathbf{X}_{t}=[\mathbf{x}_{1,t},\cdots,\mathbf{x}_{M,t}]$
15:   for all $i=1,\cdots,M$ do
16:     if $m_{i}\leq t-1$ then
17:       Randomly sample ${\boldsymbol{\epsilon}}_{i}$ from $\mathcal{N}\left(0,\mathbf{I}_{2k}\right)$
18:       $\mathbf{x}_{i,t-1}=\sqrt{\frac{\bar{\alpha}_{t-1}}{\bar{\alpha}_{m_{i}}}}\bar{\mathbf{y}}^{\prime}_{i}+\sqrt{1-\frac{\bar{\alpha}_{t-1}}{\bar{\alpha}_{m_{i}}}}{\boldsymbol{\epsilon}}_{i}$
19:     else
20:       $\hat{{\boldsymbol{\epsilon}}}_{i}={\boldsymbol{\epsilon}}_{{\boldsymbol{\theta}},i}\left(\mathbf{X}_{t},\boldsymbol{\Sigma},t\right)$
21:       $\mathbf{x}_{i,t-1}=\sqrt{\bar{\alpha}_{t-1}}\left(\frac{\mathbf{x}_{i,t}-\sqrt{1-\bar{\alpha}_{t}}\hat{{\boldsymbol{\epsilon}}}_{i}}{\sqrt{\bar{\alpha}_{t}}}\right)+\sqrt{1-\bar{\alpha}_{t-1}}\hat{{\boldsymbol{\epsilon}}}_{i}$
22:     end if
23:   end for
24: end for
25: $t=1$
26: $\hat{{\boldsymbol{\epsilon}}}={\boldsymbol{\epsilon}}_{{\boldsymbol{\theta}}}\left(\mathbf{X}_{1},\boldsymbol{\Sigma},t\right)$
27: $\hat{\mathbf{Z}}=\frac{\mathbf{X}_{1}-\sqrt{1-\bar{\alpha}_{1}}\hat{{\boldsymbol{\epsilon}}}}{\sqrt{\bar{\alpha}_{1}}}$
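The control flow of Algorithm 2 can be illustrated with a short sketch. It is written under several assumptions: the signal is real-valued with shape $(M, 2k)$, the noise predictor is passed in as a generic callable `eps_model(X_t, t)` with the conditioning on $\boldsymbol{\Sigma}$ omitted, and the names `joint_sample` and `renoise` are hypothetical. In the usage lines, an oracle predictor that knows $\mathbf{Z}$ replaces the trained network purely to check the bookkeeping. It says nothing about the accuracy of a learned model.

```python
import numpy as np

def joint_sample(y_bar, m, alpha_bar, eps_model, rng):
    """Sketch of Algorithm 2 for real-valued rows y_bar[i] of shape (2k,).

    m[i] is the 1-indexed effective step of sub-channel i and
    eps_model(X_t, t) is any callable returning a noise estimate shaped like X_t.
    """
    ab = lambda t: alpha_bar[t - 1]                  # alpha_bar is stored 0-indexed
    M = y_bar.shape[0]
    t_max = int(np.max(m))

    def renoise(i, t):
        # Forward-diffuse y_bar[i] from its own level m[i] up to level t >= m[i].
        r = ab(t) / ab(m[i])
        noise = rng.standard_normal(y_bar[i].shape)
        return np.sqrt(r) * y_bar[i] + np.sqrt(1.0 - r) * noise

    X = np.stack([renoise(i, t_max) for i in range(M)])
    for t in range(t_max, 1, -1):
        eps_hat = eps_model(X, t)                    # one joint prediction over all rows
        X_prev = np.empty_like(X)
        for i in range(M):
            if m[i] <= t - 1:
                X_prev[i] = renoise(i, t - 1)        # noise addition, as in (17)
            else:
                x0_hat = (X[i] - np.sqrt(1.0 - ab(t)) * eps_hat[i]) / np.sqrt(ab(t))
                X_prev[i] = (np.sqrt(ab(t - 1)) * x0_hat
                             + np.sqrt(1.0 - ab(t - 1)) * eps_hat[i])   # as in (20)
        X = X_prev
    eps_hat = eps_model(X, 1)
    return (X - np.sqrt(1.0 - ab(1)) * eps_hat) / np.sqrt(ab(1))        # as in (21)

# Usage with an oracle predictor (a test stand-in, not a trained network).
rng = np.random.default_rng(0)
alpha_bar = np.cumprod(np.linspace(0.9999, 0.98, 1000))
Z = rng.standard_normal((2, 16))
m = np.array([5, 40])
y_bar = np.stack([np.sqrt(alpha_bar[mi - 1]) * Z[i]
                  + np.sqrt(1.0 - alpha_bar[mi - 1]) * rng.standard_normal(16)
                  for i, mi in enumerate(m)])
oracle = lambda X, t: (X - np.sqrt(alpha_bar[t - 1]) * Z) / np.sqrt(1.0 - alpha_bar[t - 1])
Z_hat = joint_sample(y_bar, m, alpha_bar, oracle, rng)
```

With the oracle, the cleaner sub-channel is kept at the shared noise level by re-noising until $t=m_{i}$, after which both rows follow the deterministic reverse update and the output coincides with $\mathbf{Z}$ up to floating-point error.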

III-D Training algorithm of semantic communication system

Since DM-MIMO is trained to learn the distribution of the encoded signal $\mathbf{Z}$, a three-stage training algorithm is proposed. In the first stage, the JSCC encoder and decoder are jointly trained to minimize the reconstruction distortion. With MSE adopted as the performance metric, the loss function of the first training stage can be written as

L_{s1}\left({\boldsymbol{\phi}},{\boldsymbol{\varphi}}\right)=\mathbb{E}_{\mathbf{S}\sim p_{\mathbf{S}}}\mathbb{E}_{\mathbf{Y}\sim p_{\mathbf{Y}|\mathbf{S}}}\left\|\mathbf{S}-\hat{\mathbf{S}}\right\|^{2}_{F}.

(22)

With the well-trained and fixed parameters of the semantic encoder, DM-MIMO is trained in the second stage using Algorithm 1. Specifically, DM-MIMO adopts the whole encoded signal $\mathbf{Z}$ as input, leveraging all the semantic information received. Benefiting from the different effective sampling steps and the joint sampling algorithm, DM-MIMO is capable of noise elimination over different sub-channels.

In the third stage, the JSCC decoder is retrained to adapt to the JSCC encoder and the DM-MIMO. Though only the parameters of the JSCC decoder are trained, the entire system operates under real MIMO channels. The loss function is

L_{s3}\left({\boldsymbol{\varphi}}\right)=\mathbb{E}_{\mathbf{Y}^{\prime}\sim p_{\mathbf{Y}^{\prime}|\mathbf{S}}}\left\|\mathbf{S}-\hat{\mathbf{S}}\right\|^{2}_{F}.

(23)

To investigate the robustness of semantic communication systems, we consider a universal transmission scenario with random transmit power. Accordingly, to enhance the robustness of the communication system, the channel noise power is randomly chosen from a given range during both the first and the third training stages.

IV Experimental results

In this section, a series of experiments are conducted to evaluate the performance of DM-MIMO-JSCC.

IV-A Experiment Setup

We consider the DIV2K dataset in the experiments. This dataset contains $1000$ diverse 2K images from a wide range of real-world scenes, $800$ of which are used for training, $100$ for validation and the remaining $100$ for testing. We randomly crop the images into $256\times 256$ patches in the training process. We consider block-fading MIMO channels with antenna number $M=2$ and channel SNRs ranging from $0$ dB to $20$ dB, where the channel SNR is defined as

SNR=10\log_{10}\frac{\mathbb{E}_{\mathbf{H},\mathbf{W}}\left[\left\|\mathbf{H}\mathbf{W}\right\|_{F}^{2}\right]}{\mathbb{E}_{\mathbf{N}}\left\|\mathbf{N}\right\|_{F}^{2}}=10\log_{10}\frac{P_{s}}{\sigma^{2}}.

(24)

We adopt the latest Swin-Transformer-based JSCC [12] for the JSCC in our DM-MIMO-JSCC. For simplicity, channel adaptation is not considered in the JSCC encoder and decoder. The proposed DM-MIMO is built on the U-Net architecture [13]. We choose the hyper-parameter $T=1000$ and employ a noise schedule $\alpha_{t}$ that linearly decreases from $\alpha_{1}=0.9999$ to $\alpha_{T}=0.98$. DM-MIMO is trained with an Adam optimizer for $800$ epochs, using a cosine warm-up learning rate schedule with an initial learning rate of $1\times 10^{-4}$. In addition, the JSCC is trained for $800$ epochs during the first training stage, with a learning rate of $1\times 10^{-4}$. The number of retraining epochs in the third training stage is set to $20$. We implement DM-MIMO-JSCC on one NVIDIA A40 GPU using PyTorch. For fair comparison, the same Swin-Transformer-based JSCC but without DM-MIMO is considered as a benchmark. It is trained with the same setup as the first training stage of DM-MIMO-JSCC.

IV-B MSE performance

We first evaluate the effectiveness of DM-MIMO by comparing the MSE between the denoised signal and the encoded signal in DM-MIMO-JSCC with the MSE between the equalized signal and the encoded signal in universal JSCC without DM-MIMO. ${\rm MSE}_{avg}$ denotes the average MSE across all sub-channels, while ${\rm MSE}_{i}$ denotes the average MSE of the $i$-th sub-channel. As shown in Fig. 4, with DM-MIMO adopted, both ${\rm MSE}_{1}$ and ${\rm MSE}_{2}$ decrease over the SNR regime of $[0,20]$ dB. The lower the channel SNR, the higher the MSE gain achieved by DM-MIMO on both sub-channels. The gains in ${\rm MSE}_{1}$ and ${\rm MSE}_{2}$ exceed $0.233$ dB and $2.482$ dB, respectively, over the SNR regime of $[0,20]$ dB. As such, our proposed DM-MIMO, with effective sampling step adaptation and the joint sampling algorithm, demonstrates effectiveness in noise elimination and signal quality enhancement.

IV-C PSNR performance

Fig. 5 shows the PSNR performance versus the channel SNR. Our DM-MIMO-JSCC outperforms universal JSCC without DM-MIMO over the SNR regime from $0$ dB to $20$ dB. The higher the channel SNR, the higher the PSNR gain achieved by DM-MIMO-JSCC. Specifically, DM-MIMO-JSCC achieves a PSNR gain of $0.488$ dB at SNR = $20$ dB. For comparison, the JSCC scheme where the training SNR matches the testing SNR is also plotted in Fig. 5. It can be seen that DM-MIMO-JSCC achieves comparable performance in the SNR regime of $[0,15]$ dB. Moreover, we compare the reconstructed samples of different methods under a channel SNR of $20$ dB. As shown in Fig. 6, the samples reconstructed by DM-MIMO-JSCC show better visual quality than those reconstructed by universal JSCC without DM-MIMO: the first sample shows sharper edges and the second shows clearer fur detail. In short, by eliminating noise from the equalized signal, DM-MIMO enhances the robustness of the semantic communication system and achieves better performance in image recovery over various channel conditions.

IV-D Complexity Analysis

We analyze the computational complexity of the proposed DM-MIMO in terms of multiply-accumulate operations (MACs). Table I shows the number of MACs for one step of sampling in DM-MIMO at different CBRs. It can be seen that the number of MACs of DM-MIMO increases proportionally with the square of the CBR. This is expected, as the number of channels in the U-Net adopted by DM-MIMO directly depends on the length of the received signal. Meanwhile, the number of MACs in JSCC only increases slightly when the CBR increases. As such, in terms of computational complexity, our DM-MIMO is better suited to the low-CBR region.
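The quadratic trend can be checked directly against the DM-MIMO column of Table I. The short sketch below extrapolates from the smallest CBR with a pure square law and compares the result with the tabulated values. It uses only the numbers in the table. The variable names are illustrative.

```python
cbr = [0.0026, 0.0039, 0.0078, 0.0104]
macs_g = [6.762, 15.215, 60.858, 108.193]    # DM-MIMO column of Table I, in G MACs

rel_err = []
for c, mac in zip(cbr[1:], macs_g[1:]):
    predicted = macs_g[0] * (c / cbr[0]) ** 2    # square-law extrapolation from CBR = 0.0026
    rel_err.append(abs(predicted - mac) / mac)
    print(f"CBR={c}: table {mac:.3f} G, quadratic prediction {predicted:.3f} G")
```

The tabulated values follow the square law to within rounding, whereas the JSCC column stays nearly constant over the same range of CBRs.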

V Conclusion

In this paper, we propose a plug-in channel denoising module named DM-MIMO, aiming at enhancing the robustness of semantic communication systems over MIMO channels. By learning the distribution of the encoded signal, DM-MIMO eliminates noise and reduces power fluctuations of the decoder input signal, thereby enhancing the robustness of the semantic communication system. To address the diversity of sub-channel conditions, we employ effective sampling steps correspondingly, and devise a joint sampling algorithm to leverage all the received semantic information while managing the variations in effective noise power. Experimental results demonstrate that DM-MIMO-JSCC outperforms JSCC without DM-MIMO in image recovery.

TABLE I: MACs of DM-MIMO and JSCC with different CBRs.

CBR	DM-MIMO	JSCC
0.0026	6.762 G	32.723 G
0.0039	15.215 G	32.726 G
0.0078	60.858 G	32.736 G
0.0104	108.193 G	32.742 G

References

[1] J. Xu, T.-Y. Tung, B. Ai, W. Chen, Y. Sun, and D. Gündüz, “Deep joint source-channel coding for semantic communications,” IEEE Communications Magazine, vol. 61, no. 11, pp. 42–48, 2023.
[2] H. Xie, Z. Qin, G. Y. Li, and B.-H. Juang, “Deep learning enabled semantic communication systems,” IEEE Transactions on Signal Processing, vol. 69, pp. 2663–2675, 2021.
[3] S. Wang, J. Dai, Z. Liang, K. Niu, Z. Si, C. Dong, X. Qin, and P. Zhang, “Wireless deep video semantic transmission,” IEEE Journal on Selected Areas in Communications, vol. 41, no. 1, pp. 214–229, 2023.
[4] P. Zhang, W. Xu, H. Gao, K. Niu, X. Xu, X. Qin, C. Yuan, Z. Qin, H. Zhao, J. Wei, and F. Zhang, “Toward wisdom-evolutionary and primitive-concise 6G: A new paradigm of semantic communication networks,” Engineering, vol. 8, pp. 60–73, 2022.
[5] H. Wu, Y. Shao, C. Bian, K. Mikolajczyk, and D. Gündüz, “Deep joint source-channel coding for adaptive image transmission over MIMO channels,” arXiv:2309.00470, 2023.
[6] G. Zhang, Q. Hu, Y. Cai, and G. Yu, “SCAN: Semantic communication with adaptive channel feedback,” IEEE Transactions on Cognitive Communications and Networking, pp. 1–1, 2024.
[7] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” International Conference on Learning Representations, 2021.
[8] S. F. Yilmaz, X. Niu, B. Bai, W. Han, L. Deng, and D. Gündüz, “High perceptual quality wireless image delivery with denoising diffusion models,” arXiv:2309.15889, 2023.
[9] E. Grassucci, C. Marinoni, A. Rodriguez, and D. Comminiello, “Diffusion models for audio semantic communication,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 13136–13140.
[10] T. Wu, Z. Chen, D. He, L. Qian, Y. Xu, M. Tao, and W. Zhang, “CDDM: Channel denoising diffusion models for wireless semantic communications,” IEEE Transactions on Wireless Communications, pp. 1–1, 2024.
[11] A. Lugmayr, M. Danelljan, A. Romero, F. Yu, R. Timofte, and L. Van Gool, “RePaint: Inpainting using denoising diffusion probabilistic models,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 11461–11471.
[12] K. Yang, S. Wang, J. Dai, K. Tan, K. Niu, and P. Zhang, “WITT: A wireless image transmission transformer for semantic communications,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
[13] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, 2015, pp. 234–241.