MASSFormer: Mobility-Aware Spectrum Sensing using Transformer-Driven Tiered Structure

Dimpal Janu, Sandeep Mandia, Kuldeep Singh, and Sandeep Kumar

D. Janu and K. Singh are with the Department of Electronics and Communication Engineering, Malaviya National Institute of Technology Jaipur, 302017, India (e-mail: [email protected]; [email protected]). S. Mandia is with Thapar Institute of Engineering & Technology, Patiala, 147004, India (e-mail: [email protected]). S. Kumar is with Central Research Lab, Bharat Electronics Ltd., Ghaziabad, 201010, India (e-mail: [email protected]).
Abstract

In this paper, we develop a novel mobility-aware transformer-driven tiered structure (MASSFormer) based cooperative spectrum sensing method that effectively models the spatio-temporal dynamics of user movements. Unlike existing methods, our method considers a dynamic scenario involving mobile primary users (PUs) and secondary users (SUs) and addresses the complexities introduced by user mobility. The transformer architecture utilizes an attention mechanism, enabling the proposed method to model the temporal dynamics of user mobility by capturing long-range dependencies within the input data. The proposed method first computes tokens from the sequence of covariance matrices (CMs) for each SU and processes them in parallel using the SU-transformer network to learn the spatio-temporal features at SU-level. Subsequently, the collaborative transformer network learns the group-level PU state from all SU-level feature representations. The attention-based sequence pooling method, applied after the transformer encoder, adjusts the contributions of all tokens. Predicting the PU states at both the SU-level and the group-level further improves detection performance. We conducted extensive simulations and compared the detection performance of different SS methods. The proposed method is also tested under imperfect reporting channel scenarios to show its robustness. The simulation results validate the efficacy of our method, demonstrating higher performance than existing methods in terms of detection probability, sensing error, and classification accuracy.

Index Terms:
Spectrum Sensing, Transformer, Self-attention, Mobility, Fading Channels.

I Introduction

Fifth generation (5G) mobile communication systems face new requirements and challenges in terms of ultra-large range, massive connectivity, and ultra-low latency due to the exponential growth of mobile data traffic and the large number of connected devices. The increasing use of wireless networks worldwide has caused a shortage of radio spectrum. Cognitive radio (CR) networks offer a potential solution by allowing secondary users (SUs) to opportunistically access the spectrum allocated to primary users (PUs) when it is not in use. Therefore, spectrum sensing (SS) has gained intense attention from academia in recent decades. Numerous SS approaches have been proposed, with energy detection (ED) being the most commonly used method due to its simple structure and low processing complexity [1].

Various deep/machine learning (DL/ML) methods have been used recently to enhance cooperative spectrum sensing (CSS) detection performance. In [2], several ML-based SS methods have been proposed. A detailed analysis of ML-based approaches applied to CSS has been provided in [3]. That survey examines the strengths and limitations of these ML-based approaches, focusing on aspects such as performance improvement, variety of features used, applicable scenarios, and complexity. ML-based approaches have certain limitations, including the need for manually crafted features during model training, as these features may not effectively capture the complexities of the real environment. More recently, DL-based methods have been applied to wireless communication with remarkable performance, and DL-based spectrum sensing has started to make a notable impact. Convolutional neural networks (CNNs) possess a robust ability to extract spatial features from data in matrix format. The covariance matrices (CMs) are regarded as a versatile statistical measure, encompassing correlation features of the sensing signals of SUs. Consequently, CNNs have been widely applied in SS to acquire correlation features from CMs. Liu et al. have designed CNN-based approaches in [4] and [5], where CMs were fed as input to derive the test statistics, and these two works achieved performance improvement in CSS. The SS methodology discussed in [6] integrates CNNs and the short-time Fourier transform (STFT) to predict PU states leveraging time-frequency domain information. Graph convolutional neural networks (GCNs) possess a robust capability to capture relationships among graph nodes by encoding the structural information of non-grid data. Additionally, they can concurrently incorporate graphs of variable sizes. Leveraging this distinctive ability of GCNs, an SS method has been proposed in [7] to solve the hidden node problem. In [8], the authors used a CNN to provide a suitable alternative to the likelihood ratio test (LRT) under various general noises, which include Middleton class A, isometric complex symmetric $\alpha$-stable, and isometric complex generalized Gaussian noise.

The methodologies mentioned above leverage sensing data produced during the current sensing duration for the prediction of PU states while refraining from incorporating sensing data from preceding durations. These methods utilize CNN architecture to learn the graphical features from CMs of sensed signals. Concurrently, researchers have also concentrated on capturing temporal information by leveraging historical data. An approach based on activity pattern aware spectrum sensing (APASS) has been developed in [9] in order to enhance the sensing performance even further. This method involves the simultaneous deployment of two parallel CNN structures, utilizing CMs generated from both current and past sensing data to learn both graphical and temporal features. A limited number of existing works in literature have applied the Long Short-Term Memory (LSTM) network, demonstrating its ability to learn temporal features through the utilization of historical sensing data across numerous sensing durations. In [10], an SS model that applies the LSTM network to capture temporal correlation features from sequential data of current and past sensing events has been proposed. Additionally, the model employed statistical insights of PU activity such as PU ON and OFF periods’ durations as well as duty cycle to improve the sensing performance. Further, an LSTM-based SS approach has been developed in [11] to learn temporal information from the CMs. Authors of [12] have developed a spectrum prediction method that utilizes the Taguchi method to optimize the hyperparameter of the LSTM network. Considering the single antenna SU scenarios, authors of [13] and [14] have developed an SS method employing a hybrid structure of CNN-LSTM to extract the spatio-temporal information from observations of single sensing intervals. 
A combination of a 1-D CNN and an LSTM network has been employed to learn the time and frequency domain features and detect the presence of the PU, specifically in low signal-to-noise ratio (SNR) environments [15]. In order to enhance the SS performance further, Xie et al. have introduced a CNN-LSTM method in [16]. Initially, the detector employs a CNN to capture spatial information from CMs derived from sensing signals. Subsequently, the features extracted from different sensing durations are fed into the LSTM network to determine the PU activity pattern. An SS model utilizing a CNN and an LSTM network has been proposed in [17], where a 1-D signal vector is fed to the CNN to learn graphical features, and the learned features are then provided as input to the LSTM network to capture the temporal features.

The methodologies discussed above have utilized an LSTM network or a hybrid CNN-LSTM network to learn temporal information for predicting the PU states. Some LSTM-based CSS methods learn temporal features from observations collected during one sensing duration. However, such features are comparatively less precise than those obtained by extracting temporal features across several sensing periods to find the activity pattern of the PU. The two activities, predicting PU states at individual SUs and at the fusion centre, occur simultaneously over time. However, the methods discussed above treat all sensing outcomes from participating SUs with equal importance, which can mislead the PU activity prediction by overstating the importance of irrelevant features. In general, these methods struggle to model long-range dependencies in the input sequence and to capture comprehensive spatio-temporal features among cooperating SUs. Moreover, all the above SS methods have assumed static SUs, and none of them has considered mobility scenarios. In real-world scenarios, however, PUs and SUs are mobile, and their mobility has a significant impact on detection performance. In dynamic wireless environments, accurately detecting PU activity patterns is challenging due to severe path loss and channel fading resulting from user mobility.

Based on the above observations, we consider a mobile scenario having multiple SUs equipped with multiple antennas and propose a MASSFormer based method to effectively model the spatio-temporal dynamics of users by utilizing historical sensing data and exploiting user mobility patterns. The transformer [18], an attention-based architecture, can address the challenges that LSTM and RNN networks struggle with, particularly in effectively modeling long-range dependencies in input sequences. Vision transformer (ViT) [19], a transformer based model, was first proposed for image classification and sparked considerable interest within the research community, leading to subsequent works and its extension to the video vision transformer (ViViT) in [20]. Our proposed MASSFormer method, inspired by ViViT, utilizes an SU-transformer network to learn relevant spatio-temporal features from the sensing outcomes of each SU over several sensing durations. Subsequently, a collaborative transformer network is utilized to model the SU-level spatio-temporal features of all participating SUs and learn group-level features to predict PU states. Attention across multiple levels of the transformer layers can capture more relevant features from the individual SUs in a progressive manner. In the considered scenario, the CNN-LSTM method [16] captures the temporal information from all SUs concurrently using a single LSTM, thereby neglecting temporal information at the individual SU-level. Similarly, the 3-D CNN method [21] fails to capture the spatio-temporal information at the individual SU-level.

We summarize the main contributions of the paper as follows:

  • We develop a novel MASSFormer method that effectively models the spatio-temporal dynamics of user movements. The developed method first uses an SU-transformer network which uses the transformer encoder with attention mechanism to learn the spatio-temporal features of each SU to predict the PU states at SU-level and a collaborative transformer network to model the learned representations from SUs to predict the PU states at group-level.

  • We adopt a practical system model having multiple SUs with multiple antennas. We address the challenges posed by the mobile users in real-world scenarios by accounting for varying levels of path loss and fading severity at each SU. We consider the impact of imperfect reporting channels between SUs and fusion centre to show the robustness of our approach.

  • With extensive simulations, we compared and analyzed the detection performance of the proposed method with state-of-the-art methods. We have validated that the proposed method outperforms the state-of-the-art methods in considered mobile scenarios in terms of detection probability, sensing error, and classification accuracy.

II System model

We consider a single PU with a single antenna and $S$ SUs with multiple antennas, distributed randomly in a predefined area of the CR network. In the considered mobile scenarios, users are assumed to move with random velocities, so that their locations change dynamically over time, and their movements are independent of each other. At the beginning of each frame, each SU conducts SS and gathers $N$ observations at each antenna during the $u$-th sensing interval. We formulate two hypotheses regarding the state of the PU, where $\mathcal{H}_1$ and $\mathcal{H}_0$ denote the active and inactive states, respectively. The signal $y_s^m(n)$ from the $m$-th antenna of the $s$-th SU, as received at the fusion centre, is represented as

$$y_s^m(n) = \begin{cases} h_{s,r}^m(n)\left(h_s^m(n)\,w(n) + \eta_s^m(n)\right) + \eta_c(n), & \mathcal{H}_1 \\ h_{s,r}^m(n)\,\eta_s^m(n) + \eta_c(n), & \mathcal{H}_0 \end{cases} \tag{1}$$

where $w(n)$ denotes the PU signal, $m = 1, 2, \ldots, M$, $s = 1, 2, \ldots, S$, and $n = 1, 2, 3, \ldots, N$. $\eta_s^m(n)$ and $h_s^m(n)$ denote the noise signal and the channel gain between the PU and the $m$-th antenna of the $s$-th SU, respectively. The noise signal at the fusion centre and the gain of the reporting channel from the $m$-th antenna of the $s$-th SU to the fusion centre are represented as $\eta_c(n)$ and $h_{s,r}^m(n)$, respectively. It is assumed that signals traveling through the sensing and reporting channels may undergo different amounts of fading, which is quantified by different fading parameter values. The channel gains of the sensing and reporting channels are assumed to follow a fading distribution over multiple sensing periods and to remain constant within a particular sensing period, since the sensing period is shorter than the channel coherence time.
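The signal model in equation (1) can be illustrated with a short simulation for one SU and one sensing period. This is only a sketch under assumptions that are not taken from the paper: the channel gains are drawn as complex Gaussian values (Rayleigh fading) as a stand-in for the unspecified fading distribution, the PU signal is unit-power QPSK, and the function name `received_signal` and its parameters are hypothetical.

```python
import numpy as np

def received_signal(pu_active, M=4, N=100, noise_var=1.0, rng=None):
    """Sketch of Eq. (1): samples received at the fusion centre from one SU.

    Assumptions (not from the paper): Rayleigh-faded channel gains,
    unit-power QPSK PU signal, circular complex Gaussian noise.
    Returns an (M, N) complex array, one row per antenna.
    """
    rng = np.random.default_rng() if rng is None else rng

    def cgauss(shape, var):
        # Circularly symmetric complex Gaussian samples with variance `var`.
        return np.sqrt(var / 2) * (rng.standard_normal(shape)
                                   + 1j * rng.standard_normal(shape))

    # Gains stay constant within the sensing period: one value per antenna.
    h_sense = cgauss((M, 1), 1.0)    # h_s^m, PU -> SU antenna
    h_report = cgauss((M, 1), 1.0)   # h_{s,r}^m, SU antenna -> fusion centre
    eta_s = cgauss((M, N), noise_var)  # noise at the SU antennas
    eta_c = cgauss((1, N), noise_var)  # noise at the fusion centre
    # Assumed PU waveform: random QPSK symbols with unit power.
    w = np.exp(1j * (np.pi / 4 + np.pi / 2 * rng.integers(0, 4, size=(1, N))))

    sensed = h_sense * w + eta_s if pu_active else eta_s
    return h_report * sensed + eta_c
```

Calling `received_signal(True)` and `received_signal(False)` yields samples under $\mathcal{H}_1$ and $\mathcal{H}_0$, respectively.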

II-A Mobility scenario

Mobility scenarios involve the movement of both PUs and SUs within the network environment. PU mobility affects the prediction of PU states because the observed PU activity patterns change with the PU's movements. The PU signals reaching the SUs encounter varying levels of path loss and different channel conditions due to the SUs' mobility. In this work, we consider the Random Waypoint (RW) mobility model [22] to generate the movement patterns, trajectories, speed, direction, and location information of PUs and SUs. According to the RW mobility model, the PU and SUs are initially distributed randomly in the simulation area. The users wait for a pause time by staying in one location, and once this time expires, they choose a random destination in the simulation area and a speed that is uniformly distributed in $[v_{min}, v_{max}]$. The users move to the new location at the selected speed and wait for a pause time after arrival. Fig. 1 shows the movement pattern of the PU and SUs within a defined simulation area. Initially, the position of a moving user, i.e., the PU or an SU, is represented as $(X, Y)$. After a time interval $\Delta t$, the position is updated using the following formulas:

$$X(t+\Delta t) = X(t) + v\,\Delta t\,\cos\theta(t) \tag{2}$$
$$Y(t+\Delta t) = Y(t) + v\,\Delta t\,\sin\theta(t) \tag{3}$$

where the speed $v$ is randomly selected from $[v_{min}, v_{max}]$ and $\theta(t)$ denotes the direction at time $t$. The position traces of the PU and SUs are calculated using equations (2) and (3).
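The RW procedure and the position updates in equations (2) and (3) can be sketched as follows. This is an illustrative sketch only: the function name `random_waypoint_trace`, the default area size, speed range, pause time, and time step are assumed values and are not taken from the paper.

```python
import math
import random

def random_waypoint_trace(steps, area=(100.0, 100.0), v_min=1.0, v_max=5.0,
                          pause=2.0, dt=1.0, rng=None):
    """Return a list of (X, Y) positions for one user under the RW model.

    All default parameter values are assumptions chosen for illustration.
    """
    rng = rng or random.Random()
    # Users start at a uniformly random location and pause first.
    x, y = rng.uniform(0, area[0]), rng.uniform(0, area[1])
    dest, speed, wait = None, 0.0, pause
    trace = [(x, y)]
    for _ in range(steps):
        if wait > 0:
            wait -= dt  # still pausing at the current location
        else:
            if dest is None:
                # Pause expired: pick a random destination and a uniform speed.
                dest = (rng.uniform(0, area[0]), rng.uniform(0, area[1]))
                speed = rng.uniform(v_min, v_max)
            dx, dy = dest[0] - x, dest[1] - y
            if math.hypot(dx, dy) <= speed * dt:
                # Destination reached within this step: stop and pause again.
                x, y = dest
                dest, wait = None, pause
            else:
                theta = math.atan2(dy, dx)           # direction theta(t)
                x += speed * dt * math.cos(theta)    # Eq. (2)
                y += speed * dt * math.sin(theta)    # Eq. (3)
        trace.append((x, y))
    return trace
```

Running the function once per user with independent random generators gives independent traces, matching the assumption that user movements are independent of each other.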

The instantaneous signal-to-noise ratio experienced by the $s$-th SU is denoted as $\gamma_s = \frac{|h_s|^2 P_t}{N_0 B_W} = \frac{Pr_s}{N_0 B_W}$. We assume that the transmit power of the PU is fixed to $P_t$ and that PU signals are transmitted through a channel of bandwidth $B_W$. The received PU power at SU $s$ at a distance $d_s(u)$ from the PU at time $u$ can be expressed in dB as

$$Pr_s(\mathrm{dB}) = P_t(\mathrm{dB}) - \{10\log_{10}(\beta) + 10\,\alpha\log_{10}(d_s(u))\} \tag{4}$$

where $\alpha$ and $\beta$ represent the path-loss exponent and the path-loss constant, respectively, and $N_0$ denotes the noise power spectral density.
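As a worked illustration of equation (4) and the SNR definition, the following sketch computes the received power and the resulting SNR in dB for a given PU-SU distance. The function names and the numerical values in the usage note are hypothetical and chosen only for illustration; consistent units (for example dBm and dBm/Hz) are assumed.

```python
import math

def received_power_db(pt_db, distance, alpha, beta):
    """Eq. (4): received PU power in dB at a given distance from the PU."""
    path_loss_db = 10 * math.log10(beta) + 10 * alpha * math.log10(distance)
    return pt_db - path_loss_db

def snr_db(pt_db, distance, n0_db_per_hz, bandwidth_hz, alpha, beta):
    """SNR in dB: received power minus noise power N0 * B_W (in dB)."""
    noise_db = n0_db_per_hz + 10 * math.log10(bandwidth_hz)
    return received_power_db(pt_db, distance, alpha, beta) - noise_db
```

For example, with an assumed transmit power of 30 dBm, $\alpha = 3$, $\beta = 1$, and a distance of 10 m, the path loss is $10 \cdot 3 \cdot \log_{10}(10) = 30$ dB, so the received power is 0 dBm. As the distance returned by the mobility model changes from one sensing period to the next, the SNR at each SU changes accordingly.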

Figure 1: Movement pattern of PUs and SUs using RW Mobility model
Figure 2: Architecture of MASSFormer

II-B Data Preprocessing

The proposed model requires a labeled training dataset collected over $U$ sensing durations. The raw sensing signal $\boldsymbol{Y}_u^s$ collected during a single sensing duration, with dimension $M \times N$, is too large to be used directly as input to the neural network. Most existing SS methods utilize CMs as test statistics. Therefore, the sensing signal is compressed into CMs to construct the training set. The signal matrix $\boldsymbol{Y}_u^s$, composed of the sensing signals sent from the $M$ antennas of SU $s$ to the fusion centre, can be expressed as

$$\boldsymbol{Y}_u^s = \begin{bmatrix} y_s^1(1) & y_s^1(2) & \cdots & y_s^1(N) \\ y_s^2(1) & y_s^2(2) & \cdots & y_s^2(N) \\ \vdots & \vdots & \ddots & \vdots \\ y_s^M(1) & y_s^M(2) & \cdots & y_s^M(N) \end{bmatrix} \tag{5}$$

Let us construct the CM of the signal matrix as defined below

$$\boldsymbol{R}_u^s = \frac{1}{N}\,\boldsymbol{Y}_u^s \left(\boldsymbol{Y}_u^s\right)^H \tag{6}$$
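Equation (6) maps directly to a few lines of NumPy. The sketch below assumes the signal matrix is stored as an $M \times N$ complex array with one row per antenna; the function name `sample_covariance` is hypothetical.

```python
import numpy as np

def sample_covariance(Y):
    """Eq. (6): R = (1/N) * Y * Y^H for an (M, N) signal matrix Y."""
    Y = np.asarray(Y)
    N = Y.shape[1]
    # .conj().T is the Hermitian (conjugate) transpose.
    return (Y @ Y.conj().T) / N
```

The result is an $M \times M$ Hermitian matrix, so the input size of the network depends only on the number of antennas $M$ and not on the number of samples $N$.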

Now, the training set is constructed by arranging the CMs as $\{(\boldsymbol{R}_1^s, b_1), (\boldsymbol{R}_2^s, b_2), (\boldsymbol{R}_3^s, b_3), \ldots, (\boldsymbol{R}_U^s, b_U)\}$, where $U$ denotes the total number of examples in the set. The class of the $u$-th example is denoted as $b_u \in \{0, 1\}$, where the hypotheses $\mathcal{H}_1$ and $\mathcal{H}_0$ are selected when $b_u = 1$ and $b_u = 0$, respectively. To predict the PU states at SU-level, we concurrently analyze the CMs of the sensed signals gathered at the $s$-th SU during numerous sensing intervals. The dataset for the $s$-th SU is written as

$$\Phi^s = \{(\boldsymbol{\phi}_1^s, b_\lambda), (\boldsymbol{\phi}_2^s, b_{2\lambda}), \ldots, (\boldsymbol{\phi}_u^s, b_{u\lambda}), \ldots, (\boldsymbol{\phi}_{U/\lambda}^s, b_U)\} \tag{7}$$

where $\boldsymbol{\phi}_u^s = [\boldsymbol{R}^s_{\lambda(u-1)+1}, \boldsymbol{R}^s_{\lambda(u-1)+2}, \ldots, \boldsymbol{R}^s_{\lambda(u-1)+\lambda}]$ and $\lambda$ denotes the length of the input's temporal sequence. The $u$-th example in the set of the $s$-th SU is denoted as $\boldsymbol{\phi}_u^s$, and the complete dataset for the $s$-th SU is represented as $\Phi^s$. To generate data for all SUs, we follow the same method as described in equations (5), (6), and (7). Since every participating SU sends its sensing data to the fusion centre, the complete dataset constructed by collecting samples from all SUs is defined as

$$\boldsymbol{\Phi} = \{(\boldsymbol{\phi}_1^1, \boldsymbol{\phi}_1^2, \ldots, \boldsymbol{\phi}_1^S, b_\lambda), (\boldsymbol{\phi}_2^1, \boldsymbol{\phi}_2^2, \ldots, \boldsymbol{\phi}_2^S, b_{2\lambda}), \ldots, (\boldsymbol{\phi}_u^1, \boldsymbol{\phi}_u^2, \ldots, \boldsymbol{\phi}_u^S, b_{u\lambda}), \ldots, (\boldsymbol{\phi}_{U/\lambda}^1, \boldsymbol{\phi}_{U/\lambda}^2, \ldots, \boldsymbol{\phi}_{U/\lambda}^S, b_U)\} \tag{8}$$
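The grouping of CMs into non-overlapping windows of length $\lambda$ in equations (7) and (8) can be sketched as below. The array layout, the function name `build_windows`, and the choice to drop trailing CMs when $U$ is not a multiple of $\lambda$ are assumptions made for illustration; the paper does not specify these details.

```python
import numpy as np

def build_windows(R, b, lam):
    """Group per-SU covariance matrices into windows of length lam.

    R : array of shape (S, U, M, M), the CMs R_u^s for every SU and duration.
    b : array of shape (U,), the PU state label of every sensing duration.
    Returns (phi, labels) with phi of shape (U // lam, S, lam, M, M) and
    labels of shape (U // lam,), where example u carries the label b_{u*lam}.
    """
    R, b = np.asarray(R), np.asarray(b)
    S, U, M, _ = R.shape
    n = U // lam  # number of complete windows (leftover CMs are dropped)
    # Window u (1-based) holds R_{lam(u-1)+1}, ..., R_{lam(u-1)+lam}, Eq. (7).
    phi = R[:, :n * lam].reshape(S, n, lam, M, M).transpose(1, 0, 2, 3, 4)
    # Label of the last sensing duration in each window (0-based u*lam - 1).
    labels = b[lam - 1:n * lam:lam]
    return phi, labels
```

Each entry `phi[u]` then corresponds to one example of equation (8), holding the windows of all $S$ SUs together with a single label.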

III MASSFormer based framework

Equation (8) denotes the complete dataset, where each sample in the set consists of data samples from all participating SUs. The data point $\boldsymbol{\phi}_1^s$ in equation (7) consists of a sequence of CMs computed for one SU and is represented as $\boldsymbol{\phi}_1^s \in \mathbb{R}^{\lambda \times M \times M}$. Since the transformer takes its input in tokenized format, the tokenization process is discussed in Section III-A.

III-A Tokenization

In this section, we discuss the tokenization process of ViViT, as we adopt a similar approach for tokenization in our methodology. A video sample is denoted as $X \in \mathbb{R}^{T \times H \times W \times C}$, where $T$, $H$, $W$, and $C$ refer to the temporal length, height, width, and depth of the input, respectively. The ViT-based model processes 2-D images to extract $N_t$ non-overlapping patches. In the context of videos, a sequence of spatio-temporal "tubes" $x_1, x_2, \ldots, x_{N_t}$, with $x \in \mathbb{R}^{t \times h \times w}$, is extracted from the input volume as described in ViViT [20].
A linear operator E𝐸Eitalic_E is applied to the tubes to linearly project them to 1111-D tokens zNt×d𝑧superscriptsubscript𝑁𝑡𝑑{z}\in\mathbb{R}^{N_{t}\times d}italic_z ∈ blackboard_R start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT × italic_d end_POSTSUPERSCRIPT, where Nt=ntnhnwsubscript𝑁𝑡subscript𝑛𝑡subscript𝑛subscript𝑛𝑤N_{t}=n_{t}n_{h}n_{w}italic_N start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = italic_n start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_n start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT italic_n start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT, nt=Tt,nh=Hhformulae-sequencesubscript𝑛𝑡𝑇𝑡subscript𝑛𝐻n_{t}=\lfloor\frac{T}{t}\rfloor,n_{h}=\lfloor\frac{H}{h}\rflooritalic_n start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = ⌊ divide start_ARG italic_T end_ARG start_ARG italic_t end_ARG ⌋ , italic_n start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT = ⌊ divide start_ARG italic_H end_ARG start_ARG italic_h end_ARG ⌋, and nw=Wwsubscript𝑛𝑤𝑊𝑤n_{w}=\lfloor\frac{W}{w}\rflooritalic_n start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT = ⌊ divide start_ARG italic_W end_ARG start_ARG italic_w end_ARG ⌋. To perform linear projection, 3333-D convolution is used with the kernel size (t,h,w)𝑡𝑤(t,h,w)( italic_t , italic_h , italic_w ) and the strides (t,h,w)𝑡𝑤(t,h,w)( italic_t , italic_h , italic_w ) in time, height, and width dimensions, respectively. In ViViT, a learnable token zclssubscript𝑧𝑐𝑙𝑠z_{cls}italic_z start_POSTSUBSCRIPT italic_c italic_l italic_s end_POSTSUBSCRIPT is prepended to the token sequence. However, to incorporate the information from all the tokens, we apply attention based sequence pooling operation to compute a 1-D feature vector from 2-D encoder output. 
As the self-attention mechanism in the transformer encoder is order agnostic, positional embedding PNt×d𝑃superscriptsubscript𝑁𝑡𝑑P\in\mathbb{R}^{N_{t}\times d}italic_P ∈ blackboard_R start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT × italic_d end_POSTSUPERSCRIPT is added to tokens to preserve the positional information. Therefore, we extract tokens from the sequence of CMs constructed for each of the SUs as depicted in Fig. 2. The sequence of tokens for s𝑠sitalic_s-th SU is denoted as

$$z_s = [E x_1, E x_2, E x_3, \ldots, E x_{N_t}] + P \tag{9}$$
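The token count implied by the floor expressions above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `token_grid` is hypothetical, and the example values are the input shape and patch size listed in Table I.

```python
def token_grid(T, H, W, t, h, w):
    """Return (n_t, n_h, n_w, N_t) for non-overlapping tubes of size (t, h, w)."""
    # Floor division mirrors the floor operators in the definitions of n_t, n_h, n_w.
    n_t, n_h, n_w = T // t, H // h, W // w
    return n_t, n_h, n_w, n_t * n_h * n_w


# Input volume (20, 16, 16) with tube size (20, 1, 1), as listed in Table I.
print(token_grid(20, 16, 16, 20, 1, 1))  # (1, 16, 16, 256)
```

With these settings each tube spans the whole temporal sequence at a single matrix position, so every SU contributes 256 tokens.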

III-B Transformer encoder

The SU-transformer network and the collaborative transformer network are each made up of several transformer encoder layers. An encoder layer consists of multi-head self-attention (MSA) followed by a multi-layer perceptron (MLP), each with a skip connection. The sequence of tokens $z$ corresponding to each SU is processed by the SU-transformer network, which consists of $L_1$ encoder layers. The encoder layers are applied sequentially, and each performs the following operations:

$$c^{l} = \mathrm{MSA}(\mathrm{LN}(z^{l-1})) + z^{l-1}, \tag{10}$$
$$z^{l} = \mathrm{MLP}(\mathrm{LN}(c^{l})) + c^{l} \tag{11}$$

where LN is layer normalization [23], MSA is multi-head self-attention [18], and the MLP [24] consists of two dense layers separated by a GeLU non-linearity. Self-attention aims to capture interactions among all $N_t$ token embeddings. The self-attention weights are computed as

$$Z_{att} = \text{Softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) \tag{12}$$

The output of the self-attention block is obtained by multiplying $Z_{att}$ with the value matrix $V$. Here $Q$, $K$, and $V$ denote the query, key, and value matrices, and $d_k$ is the key dimension. MSA allows the model to concurrently attend to information from various representation subspaces at different positions. The output of the MSA block is computed as:

$$\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_{h_d}) W^{O}, \tag{13}$$

where

$$\text{head}_i = \text{Softmax}\left(\frac{Q W^{Q_i} \left(K W^{K_i}\right)^{T}}{\sqrt{d_k/h_d}}\right) V W^{V_i} \tag{14}$$

where $W^{Q_i} \in \mathbb{R}^{d \times d_q}$, $W^{K_i} \in \mathbb{R}^{d \times d_k}$, $W^{V_i} \in \mathbb{R}^{d \times d_v}$, and $W^{O} \in \mathbb{R}^{h_d d_v \times d}$, with $d_v = d_k = d/h_d$. The index $i$ runs over the heads from $1$ to $h_d$, where $h_d$ is the total number of heads in the MSA block.
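To make the attention computation concrete, the following is a minimal single-head sketch in plain Python. It evaluates the softmax of the scaled query-key products as in equation (12) and then multiplies the result by the value matrix. The learned projections of equation (14) are omitted, and the helper names and toy token values are assumptions made purely for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def attention(Q, K, V):
    """Return (attention weights, weighted values) for one head."""
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))
    # Scale each score by sqrt(d_k), then normalize every row with softmax.
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return weights, matmul(weights, V)


# Three toy tokens of dimension 2; Q = K = V gives plain self-attention.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, out = attention(tokens, tokens, tokens)
```

Each row of `weights` sums to one, so every output token is a convex combination of the value vectors.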

III-C Architecture of MASSFormer

In this section, we provide a brief introduction to the proposed MASSFormer method. The main objective of the proposed architecture is to learn the collective PU states by selectively extracting the most relevant features from the sensing data of all cooperating SUs. The MASSFormer model is divided into two components, the SU-transformer network and the collaborative transformer network, which predict PU states by extracting spatio-temporal features at the individual SU level and at the group level, respectively. The architecture is depicted in Fig. 2. The SU-transformer network predicts the SU-level PU states by extracting spatio-temporal features from the sequence of tokens computed from the series of CMs of the $s$-th SU. The collaborative transformer network then models the output representations of all SUs to predict the PU states at the group level. Attention at the different levels of the transformer can thus progressively extract the contributing features from the participating SUs.

To predict the PU states, the tokens computed from the samples of each SU are fed to the SU-transformer network in separate parallel pipelines. The SU-transformer network consists of $L_1$ encoder layers, followed by sequence pooling, as described in Section III-D, and an MLP block to predict the PU states at the SU level. We concatenate the output feature representations $K_s$ of the SUs, $s = 1, \ldots, S$, as

$$G_t = K_1 \diamond K_2 \diamond \cdots \diamond K_s \diamond \cdots \diamond K_S \tag{15}$$

Further, a max pooling operation is performed over the features $K_s$ of all SUs to obtain the output representation $G_t$ given in equation (15). Unlike the typical max pooling in CNNs, which pools over spatial dimensions, this operation pools over the dimension that indexes the SUs and combines their predictions by selecting the maximum value at each position across the SU dimension. The output features $G_t$ are fed to the collaborative transformer network, which consists of $L_2$ transformer encoder layers followed by sequence pooling and an MLP block, to extract the group-level features from all the participating SUs. Finally, the output features are fed to a Softmax layer to predict the activity states of the PU.
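The pooling over the SU dimension can be sketched as follows. This is a toy illustration only: the feature values are made up, the vectors are far shorter than real feature representations, and the function name is hypothetical.

```python
def max_pool_over_sus(features):
    """Element-wise maximum across SUs.

    `features` is a list of S equal-length feature vectors, one per SU.
    """
    # zip(*features) groups the values that share a position across all SUs.
    return [max(values) for values in zip(*features)]


# Hypothetical features for S = 3 SUs.
k1 = [0.2, 0.9, 0.1]
k2 = [0.5, 0.3, 0.4]
k3 = [0.1, 0.6, 0.8]
pooled = max_pool_over_sus([k1, k2, k3])
print(pooled)  # [0.5, 0.9, 0.8]
```

The pooled vector has the same length as a single SU feature vector, regardless of how many SUs cooperate.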

III-D Sequence pooling

For the prediction of the class label, ViViT [20] forwards a learnable class token through the encoder layers and then to the MLP for classification. In contrast, we use sequence pooling, first proposed by Hassani et al. [25], to extract the representation. Sequence pooling is an attention-based method that assigns importance weights to the sequence of embeddings produced by the transformer encoder and thereby transforms the sequence into a single vector representation. The motivation is that information is dispersed across all tokens in the output sequence, so it must be aggregated by assigning a suitable weight to each token. Sequence pooling is a transformation $T: \mathbb{R}^{N_t \times d} \mapsto \mathbb{R}^{d}$. The encoder output $z \in \mathbb{R}^{N_t \times d}$ is passed to a linear layer $g(z) \in \mathbb{R}^{N_t \times 1}$, and a softmax activation is applied over the token sequence as follows

$$Weights = \mathrm{softmax}(g(z)^{T}) \in \mathbb{R}^{1 \times N_t} \tag{16}$$

Equation (16) thus yields an importance weight for each input token. These weights are then used to adjust the contributions of the tokens through the weighting operation

$$z_{seq} = Weights \times z \in \mathbb{R}^{1 \times d} \tag{17}$$

$z_{seq}$ is passed to the MLP block of the SU-transformer network and of the collaborative transformer network to detect the activity states of the PU at the SU level and at the group level, respectively.
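A minimal plain-Python sketch of equations (16) and (17) is given below. The linear layer $g$ is represented by a hypothetical weight vector `g_weights` and scalar bias `g_bias`; in the actual model these would be learned, and the toy values here are assumptions chosen so that the result is easy to verify by hand.

```python
import math


def sequence_pool(z, g_weights, g_bias=0.0):
    """Attention-based pooling of N_t tokens (rows of z) into one d-dim vector."""
    # g(z): one scalar score per token, from a linear layer with d inputs and 1 output.
    scores = [sum(w * v for w, v in zip(g_weights, token)) + g_bias for token in z]
    # Softmax over the token axis gives the importance weights of equation (16).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the tokens, i.e. the (1 x N_t) by (N_t x d) product of equation (17).
    d = len(z[0])
    z_seq = [sum(weights[n] * z[n][j] for n in range(len(z))) for j in range(d)]
    return weights, z_seq


z = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# With an all-zero linear layer every token gets the same weight,
# so the pooled vector reduces to the mean of the tokens.
weights, z_seq = sequence_pool(z, g_weights=[0.0, 0.0])
```

A non-zero `g_weights` shifts the weights toward the tokens with the largest scores, which is how the pooling emphasizes the more informative tokens.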

III-E Network training

The developed model is trained in two distinct stages. In the first stage, the SU-transformer network is trained end-to-end using a training dataset comprising sequences of CMs converted into tokens, together with their associated labels. This stage aims to extract spatio-temporal features at the SU level. After the SU-transformer network has been trained, the output features $K_s$ extracted for each SU during the first stage are concatenated, and a max pooling operation is applied to obtain the pooled feature representation, denoted as $P_t$. In the second stage, the pooled features from multiple sensing periods are fed into the collaborative transformer network, which comprises several encoder layers followed by sequence pooling and an MLP block.

The resulting output features are fed to the Softmax layer to predict the probabilities associated with hypotheses $\mathcal{H}_0$ and $\mathcal{H}_1$. The $u$-th example in the set, which contains data samples from each SU, is denoted by $\boldsymbol{\Phi}_u$, where $\boldsymbol{\Phi}_u = (\boldsymbol{\phi}_u^1, \boldsymbol{\phi}_u^2, \ldots, \boldsymbol{\phi}_u^S)$.
The normalized output of the developed model is represented as $[j_{\theta|\mathcal{H}_0}(\boldsymbol{\Phi}_u), j_{\theta|\mathcal{H}_1}(\boldsymbol{\Phi}_u)]$, where $j_{\theta|\mathcal{H}_0}(\boldsymbol{\Phi}_u) + j_{\theta|\mathcal{H}_1}(\boldsymbol{\Phi}_u) = 1$. For a new data point $\boldsymbol{\Phi}_u$, the trained model predicts the probability associated with each of the hypotheses $\mathcal{H}_0$ and $\mathcal{H}_1$, where $j_{\theta|\mathcal{H}_i}(\boldsymbol{\Phi}_u)$ denotes the class probability of hypothesis $\mathcal{H}_i$. Based on this, the main goal of training is to maximize the likelihood

$$\prod_{u=1}^{U/\lambda} \left(j_{\theta|\mathcal{H}_1}(\boldsymbol{\Phi}_u)\right)^{b_{u\lambda}} \left(j_{\theta|\mathcal{H}_0}(\boldsymbol{\Phi}_u)\right)^{1-b_{u\lambda}}. \tag{18}$$

This is equivalent to minimizing a cross-entropy cost function, defined as the negative logarithm of the likelihood function normalized by the number of training points:

$$L(\theta) = -\frac{1}{U/\lambda} \sum_{u=1}^{U/\lambda} \left[ b_{u\lambda} \log j_{\theta|\mathcal{H}_1}(\boldsymbol{\Phi}_u) + (1-b_{u\lambda}) \log j_{\theta|\mathcal{H}_0}(\boldsymbol{\Phi}_u) \right] \tag{19}$$
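The relation between the likelihood in equation (18) and the loss in equation (19) can be verified numerically. The sketch below uses made-up labels and model probabilities (both are assumptions for illustration) and checks that the cross-entropy equals the negative log-likelihood divided by the number of examples.

```python
import math


def likelihood(labels, p_h1):
    """Product form of equation (18); p_h1[u] is the predicted probability of H1."""
    prod = 1.0
    for b, p in zip(labels, p_h1):
        prod *= (p ** b) * ((1.0 - p) ** (1 - b))
    return prod


def cross_entropy(labels, p_h1):
    """Normalized negative log-likelihood, as in equation (19)."""
    total = 0.0
    for b, p in zip(labels, p_h1):
        # The probability of H0 is 1 - p because the two outputs sum to one.
        total += b * math.log(p) + (1 - b) * math.log(1.0 - p)
    return -total / len(labels)


# Hypothetical labels (1 = PU active) and model outputs for four examples.
labels = [1, 0, 1, 1]
probs = [0.9, 0.2, 0.7, 0.6]
loss = cross_entropy(labels, probs)
neg_log_lik = -math.log(likelihood(labels, probs)) / len(labels)
```

Because the logarithm is monotonic, parameters that minimize `loss` also maximize the likelihood.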

To achieve the maximum likelihood, the cross-entropy loss function is minimized to obtain the optimal parameters as

$$\theta^{*} = \arg\min_{\theta} L(\theta). \tag{20}$$

During training, the backpropagation (BP) algorithm is used to compute the gradients of the loss function, and the Adam optimizer is applied to update the model parameters and obtain a well-trained network. The outputs of the trained model are denoted as $j_{\theta^{*}|\mathcal{H}_0}(\boldsymbol{\Phi}_u)$ and $j_{\theta^{*}|\mathcal{H}_1}(\boldsymbol{\Phi}_u)$. According to the Neyman-Pearson (N-P) theorem, the optimal test statistic is the likelihood ratio, which is represented as

$$\beta_{MASSFormer}(\boldsymbol{\Phi}_u) = \frac{j_{\theta|\mathcal{H}_1}(\boldsymbol{\Phi}_u)}{j_{\theta|\mathcal{H}_0}(\boldsymbol{\Phi}_u)}. \tag{21}$$

Upon the arrival of a new data point $\boldsymbol{\Phi}_u$, the model output is used to determine the activity state of the PU. However, this approach does not provide control over the false alarm probability $p_{fa}$. Therefore, we calculate the detection threshold $\gamma$ using a Monte-Carlo method that ensures the desired false alarm probability $p_{fa}$ is achieved. We collect the sequences of CMs corresponding to hypothesis $\mathcal{H}_0$ to build a set $\boldsymbol{\Phi}_{\mathcal{H}_0} = \{\tilde{\boldsymbol{\Phi}}_1, \ldots, \tilde{\boldsymbol{\Phi}}_{\tilde{Q}}\}$, where the total number of examples in $\boldsymbol{\Phi}_{\mathcal{H}_0}$ is denoted by $\tilde{Q}$. We then feed these noise-only data samples to the MASSFormer and obtain the test statistics under $\mathcal{H}_0$, i.e., $\beta_{MASSFormer|\mathcal{H}_0}$.
We further arrange all these $\beta_{MASSFormer|\mathcal{H}_0}$ values in ascending order to construct a set $\tilde{\boldsymbol{\Phi}}_{MASSFormer|\mathcal{H}_0}$ such that, $\forall\, 1 \leq p \leq q \leq \tilde{Q}$,

$$\beta_{MASSFormer|\mathcal{H}_0}\left(\tilde{\boldsymbol{\Phi}}_p\right) \leq \beta_{MASSFormer|\mathcal{H}_0}\left(\tilde{\boldsymbol{\Phi}}_q\right) \tag{22}$$

Lastly, we compute the detection threshold under the constraint of $p_{fa}$ as

$$\gamma = \tilde{\boldsymbol{\Phi}}_{MASSFormer|\mathcal{H}_0}\left(\mathrm{round}\left(\tilde{Q}\left(1-p_{fa}\right)\right)\right). \tag{23}$$

The function $\mathrm{round}(\cdot)$ returns the nearest integer to a given number. For a test data point, a decision about the activity state of the PU is made as follows

$$\gamma \mathop{\lessgtr}\limits_{\mathcal{H}_0}^{\mathcal{H}_1} \beta_{MASSFormer}(\boldsymbol{\Phi}_u). \tag{24}$$
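The threshold selection of equations (22) to (24) can be sketched as below. The statistics are synthetic placeholders rather than model outputs, the function names are hypothetical, and the sorted set is treated as 1-indexed, which is an assumption about the indexing convention in equation (23).

```python
def likelihood_ratio(p_h1, p_h0):
    """Test statistic of equation (21) from the two softmax outputs."""
    return p_h1 / p_h0


def detection_threshold(h0_statistics, p_fa):
    """Pick gamma from test statistics collected under H0 (noise-only data)."""
    ordered = sorted(h0_statistics)  # ascending order, as in equation (22)
    q = len(ordered)
    # Python's round() rounds exact halves to the nearest even integer.
    position = round(q * (1.0 - p_fa))
    position = min(max(position, 1), q)  # keep the 1-based position in range
    return ordered[position - 1]


def decide(beta, gamma):
    """Return 1 (H1, PU active) if beta exceeds gamma, otherwise 0 (H0)."""
    return 1 if beta > gamma else 0


# Synthetic H0 statistics 1.0, 2.0, ..., 100.0 stand in for model outputs.
h0_stats = [float(i) for i in range(1, 101)]
gamma = detection_threshold(h0_stats, p_fa=0.1)
false_alarm_rate = sum(decide(b, gamma) for b in h0_stats) / len(h0_stats)
```

On this synthetic set the empirical false alarm rate equals the target value, which is the property the Monte-Carlo procedure is meant to provide.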

IV Numerical Results

IV-A Simulation environment and Hyperparameters

We consider a simulation area of $1000$ m $\times$ $1000$ m in which the users move with velocities randomly chosen in the range $[20, 25]$ m/sec, resulting in dynamic changes in their positions over time. The pause time is chosen as $1$ msec. This setup allows us to assess the feasibility of the proposed model under conditions of user mobility. We assume that the PU signals are independent and identically distributed (i.i.d.) Gaussian random vectors with zero mean and signal variance $\sigma_w^2$. The values $N = 100$, $S = 3$, $M = 15$, and temporal sequence length $\lambda = 20$ are chosen randomly. The PU signals are assumed to be sent on a channel with bandwidth $B_W = 10$ MHz. Furthermore, we assume that $P_t = 200$ mW, $\beta = 10^{3.453}$, and $\alpha = 3.8$, and that the noise power density is $N_0 = -150$ dBm/Hz. We assume that the noise signals $\eta_s^m(n)$ and $\eta_c(n)$ are i.i.d. Gaussian random vectors with zero mean and noise variance $\sigma_\eta^2$.
To address the challenges arising from user mobility in real-world scenarios, we assume that the PU signals reaching the fusion centre from multiple SUs may experience different path loss and fading severity. Consequently, we assume that the channel gains $h_s^m(n)$ and $h_{s,r}^m(n)$ follow Rayleigh distributions, and the fading severity is modeled by adjusting the fading parameter. The received signal power is calculated using equation (4), which incorporates the path loss between the PU and each SU. The hyperparameters of the proposed architecture are provided in Table I. The numbers of training and testing points used for training and evaluating the proposed models are $104{,}000$ and $15{,}000$, respectively. The number of epochs and the batch size are set to $100$ and $16$, respectively.
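As a rough sanity check on the noise level, the snippet below converts the noise power density into a total in-band noise power. It assumes that the noise variance equals $N_0$ integrated over the bandwidth $B_W$, which this section does not state explicitly; the function names are hypothetical.

```python
import math


def noise_power_dbm(n0_dbm_per_hz, bandwidth_hz):
    """Total noise power in dBm for a flat density over the given bandwidth."""
    return n0_dbm_per_hz + 10.0 * math.log10(bandwidth_hz)


def dbm_to_mw(p_dbm):
    """Convert a power level from dBm to milliwatts."""
    return 10.0 ** (p_dbm / 10.0)


# N_0 = -150 dBm/Hz over B_W = 10 MHz gives roughly -80 dBm of noise power.
p_noise_dbm = noise_power_dbm(-150.0, 10e6)
p_noise_mw = dbm_to_mw(p_noise_dbm)
```

Under the same assumption, raising $N_0$ to $-145$ dBm/Hz increases the noise power by 5 dB.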

TABLE I: HYPER-PARAMETERS SETTINGS
Parameter settings: input shape (20,16,16,3); patch/token size (20,1,1)

| Parameter description | SU-Transformer Network | Collaborative Transformer Network |
|---|---|---|
| 3D convolution | Kernel size: 24@(20,1,1); stride: patch size | - |
| Linear layer (projection of input) | | |
| Projection dimension | 24 | 24 |
| No. of heads | 4 | 4 |
| No. of encoder layers | 5 | 4 |
| Transformer units | [48,24] | [48,24] |
| MLP head units | [128,64] | [128,64] |
| Learning rate | 1e-5 | 1e-5 |
| Total parameters | 42667 | 31747 |
TABLE II: FLOPs for MASSFormer

| Layers description | FLOPs (SU-Transformer) | FLOPs (Collaborative Transformer) |
|---|---|---|
| Patch Embedding | 245,760 | 576 |
| MSA | 1,376,256 | 1,376,256 |
| MLP in Encoder | 1,179,648 | 1,179,648 |
| Layer Normalization | 24,576 | 24,576 |
| Total FLOPs for Encoder Layers | 5 × (1,376,256 + 1,179,648 + 24,576) | 4 × (1,376,256 + 1,179,648 + 24,576) |
| Sequence Pooling | 12,864 | 12,864 |
| MLP Head | 11,392 | 11,392 |
| Total FLOPs | 13,172,416 | 10,346,752 |

MASSFormer: Params: 74,414; FLOPs: 23,519,168
TABLE III: FLOPs for CNN-LSTM
Input: (20,16,16,3)

| Layers description | Parameters value | FLOPs |
|---|---|---|
| Conv. layer | Kernel: 32@(3,3) | 8,847,360 |
| Max-pooling layer | Kernel: (2,2); stride: (1,1) | - |
| Global avg. pooling | - | - |
| Dense layer 1 | Dimension: 128 | 163,840 |
| LSTM | Units: 32 | 819,200 |
| Dense layer 2 | Units: 2 | 128 |
| Total FLOPs and Params | Total Params: 25,794 | 9,830,400 |
TABLE IV: FLOPs for 3D CNN
Input: (20,16,16,3)

| Layers description | Parameters value | FLOPs |
|---|---|---|
| Conv. layer | Kernel: 32@(3,3,3) | 26,542,080 |
| Max-pooling layer | Kernel: (2,2,2); stride: (1,1,1) | - |
| Conv. layer | Kernel: 24@(3,3,3) | 177,292,800 |
| Max-pooling layer | Kernel: (2,2,2); stride: (1,1,1) | - |
| Global avg. pooling | - | - |
| Dense layer 1 | Dimension: 64 | 1,536 |
| Dense layer 2 | Units: 2 | 128 |
| Total FLOPs | Total Params: 25,114 | 203,836,544 |
Figure 3: ROC curve for various SS methods
Figure 4: Probability of detection vs noise power density ($p_{fa} = 0.09$)

IV-B Simulation Results

We conducted comprehensive simulations to illustrate the efficacy of the proposed MASSFormer method. We evaluated the detection performance in terms of the receiver operating characteristic (ROC) curve, the probability of detection ($P_d$) versus noise power density ($N_0$), and the sensing error versus noise power density. We compared the performance of MASSFormer with that of existing methods, namely CNN-LSTM [16], APASS [9], and 3D CNN [21]. To test its robustness, we evaluated the proposed method under various scenarios, including imperfect sensing and both perfect and imperfect reporting channels. To evaluate the detection performance, we calculated $P_d$ at detection thresholds corresponding to different values of $p_{fa}$. Fig. 3 shows the ROC curves of the proposed MASSFormer method and the existing methods at two noise power density values, $N_0 = -150$ dBm/Hz and $N_0 = -145$ dBm/Hz.

The results show that the proposed MASSFormer method outperforms the CNN-LSTM, APASS, and 3D CNN methods in terms of detection performance.

Figure 5: Sensing error vs noise power density
Figure 6: Classification accuracy vs noise power density

In accordance with the IEEE 802.22 standard, the maximum acceptable false alarm probability is $P_f \leq 0.1$. Therefore, Fig. 4 plots $P_d$ versus noise power density at a fixed false alarm probability of $P_f = 0.09$. From Fig. 3 and Fig. 4, we observe that the detection probability decreases as the noise power density increases. The sensing error, defined as the mean of the probability of missed detection and the probability of false alarm, is used as another metric to evaluate the performance of the proposed method.
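The metrics above can be estimated empirically from detector outputs. The sketch below assumes that a detector emits one scalar score per test sample (for instance, the predicted probability that the PU is active) and declares the PU present when the score exceeds a threshold; the threshold is picked from noise-only scores so that the empirical false alarm rate does not exceed the target. The function names and the toy integer scores are hypothetical and are not taken from the simulations.

```python
def threshold_for_pfa(h0_scores, target_pfa):
    """Pick a threshold from noise-only (H0) scores so that the empirical
    false-alarm rate does not exceed target_pfa."""
    ranked = sorted(h0_scores, reverse=True)
    k = int(target_pfa * len(ranked))  # number of H0 scores allowed above the threshold
    return ranked[k] if k < len(ranked) else float("-inf")

def rate_above(scores, threshold):
    """Fraction of scores strictly above the threshold."""
    return sum(s > threshold for s in scores) / len(scores)

def sensing_metrics(h0_scores, h1_scores, target_pfa=0.09):
    """Empirical P_fa, P_d, and sensing error at the threshold for target_pfa."""
    thr = threshold_for_pfa(h0_scores, target_pfa)
    p_fa = rate_above(h0_scores, thr)
    p_d = rate_above(h1_scores, thr)
    p_md = 1.0 - p_d  # probability of missed detection
    return {"threshold": thr, "p_fa": p_fa, "p_d": p_d,
            "sensing_error": 0.5 * (p_md + p_fa)}

# Toy scores: PU-present (H1) scores are shifted upwards relative to noise-only (H0).
h0 = list(range(100))
h1 = [s + 50 for s in range(100)]
metrics = sensing_metrics(h0, h1, target_pfa=0.09)
```

Sweeping `target_pfa` over a grid and recording the resulting `p_d` values gives the points of an empirical ROC curve, while repeating the computation at each noise power density gives curves of the kind shown in Fig. 4 and Fig. 5.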

Fig. 5 plots the sensing error versus noise power density for the different methods. We observe that the sensing error increases with increasing noise power density, reflecting inaccuracies in the individual SUs' sensing outcomes. The proposed method achieves a lower sensing error than the existing methods. Fig. 6 plots the classification accuracy versus noise power density for the different methods. The results show that the proposed MASSFormer method outperforms the existing methods in terms of both detection probability and classification accuracy.

V Computational Complexity Analysis

Although the proposed method achieves improved detection performance, the computational complexity of the detection methods should be analyzed for a fair comparison. Since neural networks can be trained offline, we focus on the computational complexity of inference after the model is deployed. The computational complexity of converting the sensing signal $\boldsymbol{Y}_u^s$ into the CM $\boldsymbol{R}_u^s$ is $O(M^2 N)$. Let $l$ and $L_{CNN}$ denote the index of the current convolutional layer and the total number of convolutional layers, respectively. The computational complexity of the convolutional layers is $O\left(\sum_{l=1}^{L_{CNN}} n_{f,l-1} \cdot n_{s,l}^2 \cdot n_{f,l} \cdot m_{s,l}^2\right)$, where $n_{f,l}$ and $n_{f,l-1}$ are the numbers of convolutional kernels in the $l$th and $(l-1)$th layers. The spatial sizes of the kernel in the current layer and of the resulting output feature map are denoted by $n_{s,l}$ and $m_{s,l}$, respectively. With the CNN stride set to $1$, we have $m_{s,1} = M$, and the total number of SUs becomes the number of input channels, i.e., $n_{f,0} = S$. The complexity of a single convolutional layer scales by the temporal length $\lambda$, since every sample in the dataset contains $\lambda$ CMs. The complexity of the LSTM network is $O\left(4\lambda n_l\left(n_{f,1} + n_l\right)\right)$, where $n_l$ is the dimension of each gate of the LSTM cell. Therefore, the complexity of the CNN-LSTM method, which consists of convolutional layers, max-pooling layers, a global average pooling layer, and two dense layers, is

$$O\left(\lambda M^2 n_{f,1} n_{s,1}^2 S + 4\lambda n_l\left(n_{f,1} + n_l\right) + n_{f,1} n_{fc1} + n_l n_{fc2}\right)$$

where $n_{fc1}$ and $n_{fc2}$ denote the dimensions of the first and second dense layers, respectively. The computational complexity of the 3D CNN is

$$O\left(\lambda M^2\left(n_{f,1} m_{s,1}^3 S + n_{f,1} n_{f,2} m_{s,2}^3\right) + n_{f,2} d_{fc1} + d_{fc1} d_{fc2}\right)$$

where $n_{f,1}$ and $n_{f,2}$ denote the numbers of filters in the first and second convolutional layers, respectively. The spatial sizes of the filters in the first and second layers are denoted by $m_{s,1}$ and $m_{s,2}$, respectively, and $d_{fc1}$ and $d_{fc2}$ denote the dimensions of the first and second dense layers, respectively.

The computational complexity of the proposed MASSFormer method is

$$O\left(L_1 h_{att1} M^2 \lambda d_{emb1} + L_1 M d_{fc1}\right) + O\left(L_2 h_{att2} M^2 \lambda d_{emb2} + L_2 M d_{fc2}\right)$$

where $L_1$ and $L_2$ denote the numbers of transformer layers, and $h_{att1}$ and $h_{att2}$ denote the numbers of attention heads, in the SU-transformer and collaborative transformer networks, respectively. Similarly, $d_{emb1}$ and $d_{emb2}$ denote the projection dimensions, and $d_{fc1}$ and $d_{fc2}$ denote the dimensions of the dense layer of the MLP block, in the SU-transformer and collaborative transformer networks, respectively.

The computational complexity of these methods is also computed in terms of multiply-accumulate operations (MACs) or floating-point operations (FLOPs), as provided in Tables II, III, and IV. The analysis shows that the proposed MASSFormer method requires fewer FLOPs than the 3D CNN method but more than the CNN-LSTM method. We also measured the time needed for data preparation and model inference. The preprocessing time was measured over $1000$ samples, where each sample comprises $\lambda = 20$ CMs. After averaging, the preprocessing time for one data sample is $0.099$ msec. For model inference, the CNN-LSTM, 3D CNN, and APASS detectors had inference times of $0.17$, $0.65$, and $0.87$ msec, respectively. The proposed MASSFormer method requires $2.46$ msec for inference. According to the IEEE 802.22 standard, SUs are required to evacuate the spectrum within $2$ seconds when a PU becomes active. Therefore, our model, which detects the PU state within $2.46$ msec, satisfies the real-time latency constraint.
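The totals in Table II can be reproduced directly from the per-layer counts, which also makes the comparison with the baselines and the latency margin easy to check. The short script below uses only numbers stated in Tables II to IV and in the timing paragraph above; the variable names are illustrative, and adding the preprocessing time to the inference time is an assumption about how the end-to-end latency is composed.

```python
# Per-layer FLOP counts copied from Table II.
MSA, ENCODER_MLP, LAYER_NORM = 1_376_256, 1_179_648, 24_576
SEQ_POOL, MLP_HEAD = 12_864, 11_392

def network_flops(patch_embedding, encoder_layers):
    """Total FLOPs of one transformer network, following the rows of Table II."""
    per_layer = MSA + ENCODER_MLP + LAYER_NORM
    return patch_embedding + encoder_layers * per_layer + SEQ_POOL + MLP_HEAD

su_flops = network_flops(patch_embedding=245_760, encoder_layers=5)
collab_flops = network_flops(patch_embedding=576, encoder_layers=4)
massformer_flops = su_flops + collab_flops

# Totals reported for the baselines in Tables III and IV.
CNN_LSTM_FLOPS, CNN_3D_FLOPS = 9_830_400, 203_836_544

# Assumed end-to-end latency: preprocessing plus inference, in msec,
# compared against the 2 s evacuation requirement.
latency_ms = 0.099 + 2.46
budget_ms = 2_000.0

print(su_flops, collab_flops, massformer_flops)
# prints: 13172416 10346752 23519168
print(CNN_LSTM_FLOPS < massformer_flops < CNN_3D_FLOPS, latency_ms < budget_ms)
# prints: True True
```

Under this assumption the per-sample latency is about 2.56 msec, roughly three orders of magnitude below the 2 s limit.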

VI Conclusion

In this work, we proposed the MASSFormer method to predict PU states in mobile scenarios. Because the prediction of the PU state at the SU level and at the group level occur simultaneously over time, we developed a model that uses an SU-transformer network to predict PU states at the SU level and a collaborative transformer network to predict PU states at the group level by modeling the spatio-temporal dynamics of the movements of all contributing SUs. The simulation results show that the MASSFormer method outperforms existing methods in terms of $P_d$, demonstrating superior sensing performance.

References

  • [1] S. Kumar, “Performance of ed based spectrum sensing over $\alpha$-$\eta$-$\mu$ fading channel,” Wireless Personal Communications, vol. 100, no. 4, pp. 1845–1857, 2018.
  • [2] K. M. Thilina, K. W. Choi, N. Saquib, and E. Hossain, “Machine learning techniques for cooperative spectrum sensing in cognitive radio networks,” IEEE Journal on selected areas in communications, vol. 31, no. 11, pp. 2209–2221, 2013.
  • [3] D. Janu, K. Singh, and S. Kumar, “Machine learning for cooperative spectrum sensing and sharing: A survey,” Transactions on Emerging Telecommunications Technologies, vol. 33, no. 1, p. e4352, 2022.
  • [4] C. Liu, X. Liu, and Y.-C. Liang, “Deep cnn for spectrum sensing in cognitive radio,” in ICC 2019 - 2019 IEEE International Conference on Communications (ICC), 2019, pp. 1–6.
  • [5] C. Liu, J. Wang, X. Liu, and Y.-C. Liang, “Deep cm-cnn for spectrum sensing in cognitive radio,” IEEE Journal on Selected Areas in Communications, vol. 37, no. 10, pp. 2306–2321, 2019.
  • [6] Z. Chen, Y.-Q. Xu, H. Wang, and D. Guo, “Deep stft-cnn for spectrum sensing in cognitive radio,” IEEE Communications Letters, vol. 25, no. 3, pp. 864–868, 2020.
  • [7] D. Janu, S. Kumar, and K. Singh, “A graph convolution network based adaptive cooperative spectrum sensing in cognitive radio network,” IEEE Transactions on Vehicular Technology, vol. 72, no. 2, pp. 2269–2279, 2022.
  • [8] A. Mehrabian, M. Sabbaghian, and H. Yanikomeroglu, “Cnn-based detector for spectrum sensing with general noise models,” IEEE Transactions on Wireless Communications, vol. 22, no. 2, pp. 1235–1249, 2022.
  • [9] J. Xie, C. Liu, Y.-C. Liang, and J. Fang, “Activity pattern aware spectrum sensing: A cnn-based deep learning approach,” IEEE Communications Letters, vol. 23, no. 6, pp. 1025–1028, 2019.
  • [10] B. Soni, D. K. Patel, and M. Lopez-Benitez, “Long short-term memory based spectrum sensing scheme for cognitive radio using primary activity statistics,” IEEE Access, vol. 8, pp. 97 437–97 451, 2020.
  • [11] W. Chen, H. Wu, and S. Ren, “Cm-lstm based spectrum sensing,” Sensors, vol. 22, no. 6, p. 2286, 2022.
  • [12] L. Yu, J. Chen, G. Ding, Y. Tu, J. Yang, and J. Sun, “Spectrum prediction based on taguchi method in deep learning with long short-term memory,” IEEE Access, vol. 6, pp. 45 923–45 933, 2018.
  • [13] J. Gao, X. Yi, C. Zhong, X. Chen, and Z. Zhang, “Deep learning for spectrum sensing,” IEEE Wireless Communications Letters, vol. 8, no. 6, pp. 1727–1730, 2019.
  • [14] K. Yang, Z. Huang, X. Wang, and X. Li, “A blind spectrum sensing method based on deep learning,” Sensors, vol. 19, no. 10, p. 2270, 2019.
  • [15] D. Ke, Z. Huang, X. Wang, and X. Li, “Blind detection techniques for non-cooperative communication signals based on deep learning,” IEEE Access, vol. 7, pp. 89 218–89 225, 2019.
  • [16] J. Xie, J. Fang, C. Liu, and X. Li, “Deep learning-based spectrum sensing in cognitive radio: A cnn-lstm approach,” IEEE Communications Letters, vol. 24, no. 10, pp. 2196–2200, 2020.
  • [17] S. Solanki, V. Dehalwar, and J. Choudhary, “Deep learning for spectrum sensing in cognitive radio,” Symmetry, vol. 13, no. 1, p. 147, 2021.
  • [18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [19] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [20] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, and C. Schmid, “Vivit: A video vision transformer,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 6836–6846.
  • [21] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 4489–4497.
  • [22] T. Camp, J. Boleng, and V. Davies, “A survey of mobility models for ad hoc network research,” Wireless communications and mobile computing, vol. 2, no. 5, pp. 483–502, 2002.
  • [23] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
  • [24] D. Hendrycks and K. Gimpel, “Gaussian error linear units (gelus),” arXiv preprint arXiv:1606.08415, 2016.
  • [25] A. Hassani, S. Walton, N. Shah, A. Abuduweili, J. Li, and H. Shi, “Escaping the big data paradigm with compact transformers,” arXiv preprint arXiv:2104.05704, 2021.
Dimpal Janu received her B.Tech. degree in Electronics and Communication Engineering from Govt. Mahila Engineering College, Ajmer, India in 2014, and M.Tech. degree in Electronics and Communication Engineering from Malaviya National Institute of Technology (MNIT), Jaipur, India in 2018. She is currently pursuing a Ph.D. degree from MNIT Jaipur. Her research interests include wireless communication, cognitive radio network, machine learning, and deep learning. She is a member of the IEEE student branch.
Sandeep Mandia received the MTech degree from National Institute of Technology, Silchar, India in 2014 and the PhD degree in computer vision from Malaviya National Institute of Technology Jaipur, India in 2023. He is currently serving as an Assistant Professor at Thapar Institute of Engineering and Technology, Patiala, India. His research interests are machine/deep learning applications in student engagement analysis, medical diagnostics, and beyond.
Kuldeep Singh received his MTech degree in Signal Processing from Delhi University in 2006 and a Ph.D. degree in Computer Vision from Delhi Technological University, India, in 2016. He is currently an Associate Professor with the Department of Electronics & Communication Engineering, Malaviya National Institute of Technology, Jaipur, India. Previously, he was a Senior Scientist with the Central Research Lab, Bharat Electronics Ltd., India. He also worked as a postdoctoral fellow at the University of Alberta, Canada from October 2017 to April 2018. His research interests include Computer Vision, Machine/Deep Learning, Biometrics, and Cyber Security. He is a reviewer of various IEEE transactions, Elsevier, and Springer journals.
Sandeep Kumar received his B. Tech. in electronics and communication from Kurukshetra University, India in 2004 and Master of Engineering in Electronics and Communication from Thapar University, Patiala, India in 2007. He received his Ph.D. from Delhi Technological University, Delhi, India in 2018. He is currently working as Member (Senior Research Staff) at Central Research Laboratory, Bharat Electronics Limited Ghaziabad, India. He has received various awards and certificates of appreciation for his research activities. His research interests include the study of wireless channels, performance modeling of fading channels, and cognitive radio networks. He is also serving as a reviewer for IEEE, Elsevier, and Springer journals.