
Seung-bin Kim, Chan-yeong Lim, Jungwoo Heo, Ju-ho Kim, Hyun-seo Shin, Kyo-Won Koo, Ha-Jin Yu

MR-RawNet: Speaker verification system with multiple temporal resolutions for variable duration utterances using raw waveforms

Abstract

In speaker verification systems, short utterances present a persistent challenge, leading to performance degradation primarily due to insufficient phonetic information to characterize the speakers. To overcome this obstacle, we propose a novel structure, MR-RawNet, designed to enhance the robustness of speaker verification systems against variable duration utterances using raw waveforms. MR-RawNet extracts time-frequency representations from raw waveforms via a multi-resolution feature extractor that optimally adjusts both temporal and spectral resolutions simultaneously. Furthermore, we apply a multi-resolution attention block that focuses on diverse and extensive temporal contexts, ensuring robustness against changes in utterance length. The experimental results, conducted on the VoxCeleb1 dataset, demonstrate that MR-RawNet exhibits superior performance in handling utterances of variable duration compared to other raw waveform-based systems.

keywords:
speaker verification, raw waveform, short duration, multi resolution

1 Introduction

Speaker verification (SV) is the task of verifying whether an anonymous speaker is the target speaker registered in the system. With the development of deep neural networks (DNNs), traditional machine learning-based SV systems have been largely replaced by DNN-based SV systems [1, 2, 3]. Despite the exceptional performance achieved by DNN-based SV systems, most systems are typically evaluated using long utterances [4, 5, 6]. However, in real environments, shorter utterances of 1 to 2 seconds are often encountered, and SV systems experience a performance degradation as the length of the utterance decreases. This degradation occurs because short utterances may not contain sufficient speaker-specific phonetic characteristics that can be obtained from speech [7, 8, 9]. Thus, SV systems should be constructed to yield similar performance across utterances of various lengths, particularly shorter ones.

To enhance robustness against variable lengths, we initially focused on the input features of the system. The inputs for SV systems can be traditional handcrafted features such as MFCCs or Mel-filter banks [2, 3, 4, 5], or raw waveforms, which have recently begun to be utilized [10, 11, 12, 13]. While handcrafted features can provide refined information through processing based on human knowledge, they may also suffer information loss compared to unprocessed raw waveforms [14, 15, 16]. Especially for SV systems that must work with limited information, as much information as possible should be available from the input. Therefore, to ensure robustness to utterances of various lengths, we use raw waveforms as input, which have high potential because no information is lost.

In traditional raw waveform-based systems, a one-dimensional input is converted into a two-dimensional feature map through a special module in the first layer, such as [17, 18]. The first-layer module extracts meaningful representations from the input, typically based on a fixed frame size. However, representations based on a constant frame size are unable to offer superior time and spectrum resolution simultaneously [19]. Consequently, these systems often prioritize spectral resolution, potentially degrading performance for utterances of various durations. To address this, we build on the multi-resolution encoder (MRE), which is designed to capture information at different temporal resolutions. The MRE, a module proposed by Han et al. [19], extracts multiple features at various temporal resolutions. We specifically tailored this module for use as the first-layer module of a raw waveform-based system, naming it the multi-resolution feature extractor (MRFE). Thus, the MRFE is able to consider both temporal and spectral resolution simultaneously, deriving features that maintain robustness across various lengths.

In addition, we propose a new bottleneck named the multi-resolution attention (MRA) block. The MRA combines the advantages of the Res2Dilated block [4] and an extended dynamic scaling policy (EDSP) [9]. The Res2Dilated block gradually builds up the temporal context using dilated convolutional layers and hierarchical residual connections [20]. The EDSP, an extension of Elastic [21], trains the network to operate dynamically based on the scale of the data. These two modules each focus on different aspects of the temporal context: the Res2Dilated block focuses on broader contexts, while the EDSP considers a variety of contexts. Meanwhile, various studies have explored different temporal resolutions to maximize the speaker information extracted from utterances and thereby gain robustness against utterance length [22, 23]. By merging these two modules, the MRA block can consider a wider and more diverse temporal context, thereby enhancing robustness to changes in utterance length.

Finally, we propose a novel structure called MR-RawNet, which is robust to variations in utterance length by utilizing both the MRFE and the MRA. Experimental results on the VoxCeleb1&2 datasets [24, 25] demonstrated that MR-RawNet showed superior performance for variable duration utterances among raw waveform-based systems.

2 Baseline

Figure 1: (a) Baseline structure. The kernel size of Conv1D is 1, and $p$ represents the max pooling size. (b) MR-RawNet structure. $k$, $d$, $N$, and $B$ denote the kernel size, dilation, the number of feature extractors, and the number of MRA blocks, respectively.

In a recent study on SV that utilized raw waveforms as input, RawNet3 [13] achieved remarkable performance. Nevertheless, we posit that there is still potential for enhancement in RawNet3. This section describes the architecture of RawNet3 and outlines potential areas for improvement.

RawNet3 is a modified structure of the ECAPA-TDNN [4], which is widely used in SV research. As shown in Figure 1-(a), RawNet3 is composed of a parameterized analysis filterbank (ParamFbank) [18] layer and a Res2Dilated block with $\alpha$-feature map scaling and max pooling (AFMS-Res2MP). When the one-dimensional raw waveform is fed into RawNet3, it is transformed into a two-dimensional feature map by the ParamFbank layer. The ParamFbank learns real-valued parameterized filterbanks as an extension of the SincNet layer [17]. The feature map produced by the ParamFbank layer is then hierarchically passed through three AFMS-Res2MP blocks to capture speaker information at various levels. The AFMS-Res2MP is a block that replaces the squeeze-excitation layer in the squeeze-excitation block [26] with $\alpha$-feature map scaling (AFMS) [12]. The extracted speaker information is finally processed into speaker embeddings using a convolution, attentive statistics pooling (ASP) [27] with channel- and context-dependent attention values, and a linear layer.

We suggest that there is room for improvement in both the ParamFbank module and the AFMS-Res2MP block within RawNet3. Utilizing information from different time scales could allow them to be robust to different lengths of input speech.

3 Proposed methods

Our proposed architecture enhances the existing RawNet3 to be robust against utterances of various lengths by applying both the MRFE and MRA methods. Figure 1-(b) illustrates the overall structure of our proposed system. MR-RawNet replaces the first convolutional module and the bottleneck blocks of the baseline with the MRFE and the MRA, respectively. The raw waveform of length $T$ is fed to the MRFE module to generate a time-frequency representation. The MRFE consists of $N$ feature extractors (FEs) and combines the outputs of the FEs into a feature map $o_1 \in \mathbb{R}^{F \times \frac{T}{S}}$, where $F$ and $S$ refer to the channel and stride size of the MRFE, respectively. The feature map $o_1$ is then fed into a convolutional layer, and its output feature $o_2 \in \mathbb{R}^{C \times \frac{T}{S}}$, where $C$ refers to the channel size of the MRA block, is passed to the bottleneck stages. Each bottleneck stage consists of $B$ MRA blocks, and the stage outputs $o_3, o_4, o_5 \in \mathbb{R}^{C \times \frac{T}{S}}$ are concatenated into a feature $o_6 \in \mathbb{R}^{3C \times \frac{T}{S}}$, similarly to the ECAPA-TDNN. The concatenated feature $o_6$ is input into a convolutional layer with a kernel size of 1, followed by a ReLU function. Then, the output $o_7 \in \mathbb{R}^{1536 \times \frac{T}{S}}$ is aggregated into an utterance-level feature $o_8 \in \mathbb{R}^{3072 \times 1}$, and the speaker embedding is obtained by feeding $o_8$ into a linear layer.
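The shape bookkeeping above can be traced with a small helper. This is only a sketch under stated assumptions: `mr_rawnet_shapes` is a hypothetical name, the stride $S = 160$ follows from $S = \frac{K_i \times M_i}{5}$ with $K_i \times M_i = 800$ (Sections 3.1 and 4.2), $C = 256$ is the channel size of the best ablation system, and $F_2 = 64$ is a purely illustrative value that the text does not specify. The factor of two between $o_7$ and $o_8$ is consistent with a pooling layer that concatenates a weighted mean and a weighted standard deviation.

```python
def mr_rawnet_shapes(num_samples, stride=160, f2=64, n_fe=4, c=256):
    """Return the (channels, frames) shape of o1..o8 for one utterance.

    f2 is an illustrative assumption; stride, n_fe and c follow the
    configuration reported in the paper.
    """
    frames = num_samples // stride      # T / S, ignoring padding effects
    return {
        "o1": (f2 * n_fe, frames),      # MRFE output, F = F2 * N
        "o2": (c, frames),              # after the input convolution
        "o3": (c, frames),              # bottleneck stage 1
        "o4": (c, frames),              # bottleneck stage 2
        "o5": (c, frames),              # bottleneck stage 3
        "o6": (3 * c, frames),          # channel-wise concatenation
        "o7": (1536, frames),           # kernel-size-1 convolution + ReLU
        "o8": (3072, 1),                # utterance-level feature after pooling
    }


# A 3-second utterance at an assumed 16 kHz sampling rate.
shapes = mr_rawnet_shapes(num_samples=48000)
for name, shape in shapes.items():
    print(name, shape)
```

Only the frame axis depends on the utterance length; the pooling step removes it, so utterances of any duration map to an embedding of the same size.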

3.1 Multi-resolution feature extractor

Figure 2: MRFE structure. $k$, $s$, $R$, and $X$ denote the kernel size, stride size, the number of repeats, and the number of convolutional blocks in each repeat, respectively.

When extracting a time-frequency representation from the raw waveform, longer windows yield superior spectral resolution at the cost of temporal resolution, and the opposite is true for shorter windows. The first-layer modules of existing raw waveform-based SV systems typically use windows of a fixed frame size. However, features derived from a fixed frame size cannot concurrently yield excellent time and frequency resolution. From this perspective, we designed the MRFE to combine multi-resolution information from the raw waveform at a low level and generate a time-frequency representation that is robust to different utterance lengths. The MRFE module compresses the information contained in the raw waveform at a low level. This is similar to extracting handcrafted features, but has the advantage of being able to extract features suited to a specific task through the data-driven learning of a DNN.
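The trade-off can be made concrete with a rough calculation. The sketch below assumes a 16 kHz sampling rate and uses the spacing of a discrete Fourier transform of the same length, $f_s / K$, as a proxy for the spectral resolution of a window of $K$ samples; the learned filterbanks in this paper are not Fourier filters, so the numbers are indicative only. The window lengths are the kernel sizes $K_i$ that result from $K_1 = 50$ being doubled three times.

```python
SAMPLE_RATE = 16000  # Hz; assumed, not stated explicitly in the text


def resolution(window_samples, sample_rate=SAMPLE_RATE):
    """Return (window duration in ms, DFT bin spacing in Hz) for a window."""
    duration_ms = 1000.0 * window_samples / sample_rate
    bin_spacing_hz = sample_rate / window_samples
    return duration_ms, bin_spacing_hz


for k in (50, 100, 200, 400):
    ms, hz = resolution(k)
    print(f"window={k:4d} samples  duration={ms:6.3f} ms  bin spacing={hz:6.1f} Hz")
```

Doubling the window halves the bin spacing (finer spectral detail) while doubling the time span that each frame summarizes (coarser temporal detail), which is why a single fixed window cannot serve both ends.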

Figure 2 illustrates the structure of the MRFE. In order to extract discriminative representations from the raw waveform, we used $N$ ParamFbank layers in parallel. The $i$-th ParamFbank layer ($1 \leq i \leq N$) has a kernel size of $K_i$ and a stride size of $\frac{2K_i}{5}$, and outputs a time-frequency feature $\in \mathbb{R}^{F_1 \times \frac{5T}{2K_i}}$. The feature is then fed to a convolutional layer with kernel size 1 ($1 \times 1$ Conv) for input to a temporal convolutional network (TCN). The TCN is a structure proposed to replace recurrent neural networks in various sequence modeling tasks, and is used in various fields to extract speech features [28, 29, 30]. The TCN consists of stacked dilated 1-D convolutional blocks, where the dilation factor increases exponentially to ensure a sufficiently large temporal context window. The 1-D convolutional block is composed of one depthwise convolution between two pointwise convolutions. The parametric rectified linear unit (PReLU) [31] and global layer normalization (gLN) [29] are used as the nonlinear activation function and the normalization technique in the blocks, respectively. We used $X$ convolutional blocks with dilation factors $1, 2, \cdots, 2^{X-1}$, repeated $R$ times.
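To see how quickly the exponentially growing dilations widen the temporal context, the receptive field of the stacked blocks can be computed directly. The sketch assumes that only the depthwise convolution has a kernel larger than 1 (pointwise convolutions do not widen the context) and uses a hypothetical kernel size of 3 with $X = 4$ and $R = 2$; none of these three values is given in the text.

```python
def tcn_receptive_field(num_blocks, num_repeats, kernel_size=3):
    """Receptive field (in frames) of R repeats of X dilated conv blocks.

    Block j of each repeat uses dilation 2**j, so one block adds
    (kernel_size - 1) * 2**j frames of context.
    """
    dilations = [2 ** j for j in range(num_blocks)] * num_repeats
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Hypothetical setting: X = 4 blocks per repeat, R = 2 repeats, kernel size 3.
print(tcn_receptive_field(num_blocks=4, num_repeats=2))  # prints 61
```

In closed form this is $1 + R\,(k-1)\,(2^{X}-1)$ frames, so adding one block per repeat roughly doubles the context, whereas adding a repeat only increases it linearly.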

The $i$-th TCN output $\in \mathbb{R}^{F_2 \times \frac{5T}{2K_i}}$ passes through a last convolutional layer, which processes it into a time-frequency representation $y_i \in \mathbb{R}^{F_2 \times \frac{T}{S}}$. The $i$-th last convolutional layer has a kernel size of $M_i$ and a stride size of $\frac{M_i}{2}$. We set the feature extractor (FE) stride size $S$ to $\frac{K_i \times M_i}{5}$ to ensure that the outputs of all FEs have the same temporal dimension. Additionally, the $i$-th TCN output is also added to the ($i$+1)-th FE using max pooling with a downsampling factor of 2. The kernel sizes $K_{i+1}$ and $M_{i+1}$ of the ($i$+1)-th FE are defined as $K_{i+1} = 2K_i$ and $M_{i+1} = \frac{M_i}{2}$, respectively. Finally, the MRFE outputs a time-frequency representation $\in \mathbb{R}^{F \times \frac{T}{S}}$ ($F = F_2 \times N$) by concatenating the FE outputs along the frequency axis.
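The kernel and stride schedule can be checked numerically. The sketch below uses $K_1 = 50$ and $M_1 = 16$ from Section 4.2 and $N = 4$ from the best ablation system; `mrfe_schedule` is a hypothetical helper, and frame counts ignore padding and edge effects. It confirms that every FE ends up with the same overall stride, and that consecutive TCN outputs differ in length by a factor of 2, which is what the max pooling between FEs compensates for.

```python
def mrfe_schedule(k1=50, m1=16, n_fe=4, num_samples=48000):
    """List kernel sizes, strides and approximate frame counts for each FE."""
    rows = []
    k, m = k1, m1
    for i in range(1, n_fe + 1):
        fbank_stride = 2 * k // 5            # stride of the i-th ParamFbank layer
        last_stride = m // 2                 # stride of the i-th last conv layer
        rows.append({
            "fe": i,
            "K": k,
            "M": m,
            "fbank_stride": fbank_stride,
            "tcn_frames": num_samples // fbank_stride,   # about 5T / (2 K_i)
            "last_stride": last_stride,
            "total_stride": fbank_stride * last_stride,  # K_i * M_i / 5
        })
        k, m = 2 * k, m // 2                 # K_{i+1} = 2 K_i, M_{i+1} = M_i / 2
    return rows


schedule = mrfe_schedule()
for row in schedule:
    print(row)
```

With these values the ParamFbank strides are 20, 40, 80 and 160 samples and the last-layer strides are 8, 4, 2 and 1, so each FE has an overall stride of 160 samples.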

3.2 Multi-resolution attention (MRA) block

Figure 3: MRA structure. DownSample and UpSample refer to average pooling and transposed convolution, respectively. The scale dimension of Res2Dilated Conv1D is 4.

More diverse temporal contexts may need to be considered to create a network that is robust to variation in the length of the data [9, 21]. In addition, network performance benefits from a wider temporal context according to the results of [4, 32]. Considering this, we created an MRA block based on the Res2Dilated block and the extended dynamic scaling policy (EDSP). The Res2Dilated block progressively accumulates temporal context and provides an expansive receptive field. The EDSP enables the network to operate dynamically depending on the scale of the data. These two modules focus on a wider and a more diverse temporal context, respectively. Therefore, the MRA block attends to both aspects of the temporal context to be robust to changes in utterance length.

Figure 3 illustrates the structure of the MRA block. The MRA block receives a feature $\in \mathbb{R}^{C \times \frac{T}{S}}$ as input, and extends the range of feature resolutions by adding low-resolution and high-resolution paths in parallel with the original path. The low-resolution branch uses a down-sampling function to lower the temporal resolution of the input, and the high-resolution branch uses an up-sampling function to increase the temporal resolution of the input. We used an average pooling layer with kernel size 2 and a 1-D transposed convolution layer as the down- and up-sampling functions, respectively. Then, the resolution-converted inputs are processed in each branch by a Res2Dilated block with $\alpha$-feature map scaling (AFMS-Res2Block) of the same structure. Applying a block of the same structure at different temporal resolutions means extracting features with receptive fields of different sizes. Indeed, the branch extension makes it possible to process features with various combinations of receptive fields, in contrast to a fixed single-scale branch. Thereafter, the AFMS-Res2Block outputs from the low- and high-resolution paths are converted back to the original temporal resolution $\frac{T}{S}$ through a sampling function.
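The resolution changes in the two extra branches can be illustrated on a toy single-channel sequence. The sketch below assumes average pooling for down-sampling and a stride-2 transposed convolution for up-sampling; the transposed-convolution kernel size of 2 and its fixed weights are illustrative assumptions (in a trained network the weights are learned), and the AFMS-Res2Block between the two resampling steps is omitted.

```python
def avg_pool1d(x, kernel=2):
    """Down-sample by averaging non-overlapping windows (stride = kernel)."""
    return [sum(x[i:i + kernel]) / kernel
            for i in range(0, len(x) - kernel + 1, kernel)]


def conv_transpose1d(x, weights, stride=2):
    """Up-sample with a single-channel 1-D transposed convolution."""
    out = [0.0] * ((len(x) - 1) * stride + len(weights))
    for i, value in enumerate(x):
        for j, w in enumerate(weights):
            out[i * stride + j] += value * w
    return out


x = [1.0, 3.0, 5.0, 7.0, 2.0, 4.0]   # toy feature with 6 frames
up_weights = [1.0, 1.0]              # hypothetical fixed weights

low = avg_pool1d(x)                            # low-resolution branch: 3 frames
low_back = conv_transpose1d(low, up_weights)   # restored to 6 frames
high = conv_transpose1d(x, up_weights)         # high-resolution branch: 12 frames
high_back = avg_pool1d(high)                   # restored to 6 frames

print(len(low), len(low_back), len(high), len(high_back))
```

Because the same block is applied to sequences with half, equal and double the frame rate, a kernel that spans a fixed number of frames covers roughly twice, once and half the original time span, respectively.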

We additionally used an attention-based gate module to concentrate on the informative components among the features output at different temporal resolutions. The gate focuses on enhancing the expressiveness of the multi-resolution paths by modelling channel-specific relationships. Let $h_t \in \mathbb{R}^{C \times \frac{T}{S}}$ ($1 \leq t \leq 3$) be the features output from the low-, high-, and original-resolution branches. Then, the gate module output $o \in \mathbb{R}^{C \times \frac{T}{S}}$ is calculated as follows:

$$o = \sum_{t=1}^{3} (\alpha_t \times h_t) \tag{1}$$

where $\alpha_t$ denotes an attention score for a feature output from the low-, high-, and original-resolution branches. This attention score $\alpha_t$ is calculated using two linear layers ($W_1, b_1$ and $W_2, b_2$) as follows:

$$\alpha_t = \frac{\exp(z_t)}{\sum_{i=1}^{3} \exp(z_i)}, \quad \alpha_t \in \mathbb{R}^{C \times 1} \tag{2}$$

$$z_t = W_2(\sigma(W_1 \rho(h_t) + b_1)) + b_2, \quad z_t \in \mathbb{R}^{C \times 1} \tag{3}$$

where $\rho(\cdot)$ and $\sigma(\cdot)$ denote an adaptive average pooling and an activation function followed by batch normalization, respectively. Finally, the MRA block output $\in \mathbb{R}^{C \times \frac{T}{S}}$ is obtained by adding the gate output $o$ and the residual, which is the input of the block.
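Equations (1) to (3) can be sketched in plain Python on nested lists. Several details here are assumptions rather than the authors' implementation: the activation is taken to be a ReLU, the batch normalization inside $\sigma(\cdot)$ is omitted, the hidden width of the first linear layer is a hypothetical 2, and the weights are random. The softmax subtracts the per-channel maximum for numerical stability, which leaves Eq. (2) unchanged.

```python
import math
import random


def matvec(weight, vec, bias):
    """Affine map W v + b for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def gate(branches, w1, b1, w2, b2):
    """Fuse branch features (each C x L nested lists) following Eqs. (1)-(3)."""
    scores = []
    for h in branches:
        pooled = [sum(channel) / len(channel) for channel in h]   # rho: mean over time
        hidden = [max(0.0, v) for v in matvec(w1, pooled, b1)]    # sigma: ReLU (assumed)
        scores.append(matvec(w2, hidden, b2))                     # z_t, one value per channel
    num_channels = len(branches[0])
    num_frames = len(branches[0][0])
    alphas = [[0.0] * num_channels for _ in branches]
    for c in range(num_channels):                                 # softmax over the branches
        peak = max(z[c] for z in scores)
        exps = [math.exp(z[c] - peak) for z in scores]
        total = sum(exps)
        for t, e in enumerate(exps):
            alphas[t][c] = e / total
    fused = [[sum(alphas[t][c] * branches[t][c][n] for t in range(len(branches)))
              for n in range(num_frames)]
             for c in range(num_channels)]
    return fused, alphas


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
C, HIDDEN, FRAMES = 4, 2, 5          # toy sizes, chosen only for illustration
branches = [random_matrix(C, FRAMES, rng) for _ in range(3)]
w1, b1 = random_matrix(HIDDEN, C, rng), [0.0] * HIDDEN
w2, b2 = random_matrix(C, HIDDEN, rng), [0.0] * C
fused, alphas = gate(branches, w1, b1, w2, b2)
```

Each channel receives its own set of three weights that sum to one, so the gate acts as a per-channel convex combination of the three branches; the residual addition described above would then be applied to `fused`.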

4 Experimental setup

Table 1: Performance comparison of recently proposed speaker verification systems for short utterances. (: our implementation)

| System | Input Feature | Loss Function | Data Augmentation | Full: EER(%) / MinDCF | 5s: EER(%) | 2s: EER(%) | 1s: EER(%) |
|---|---|---|---|---|---|---|---|
| MSEA-FPM [22] | MFB-64 | A-Softmax | - | 1.98 / 0.205 | 2.17 | 3.38 | 5.92 |
| ResNet34-ANF [33] | MFB-40 | Softmax+PN | - | 1.91 / 0.221 | 2.04 | 2.88 | 4.49 |
| ECAPA-TDNN [4] | MFB-80 | AAM-Softmax | MUSAN+RIR+SpecAug | 0.95 / 0.062 | 0.98 | 1.79 | 3.94 |
| RawNet2 [12] | Waveform | Softmax | - | 2.43 / 0.236 | 2.64 | 3.88 | 7.24 |
| RawNeXt [9] | Waveform | AAM-Softmax | MUSAN+RIR | 1.29 / 0.142 | 1.45 | 2.34 | 4.37 |
| FDN-W-Res2MP [34] | Waveform | AAM-Softmax | MUSAN+RIR | 1.42 / 0.093 | - | - | - |
| RawNet3 [13] | Waveform | AAM-Softmax | MUSAN+RIR+Mask+Speed | 0.89 / 0.066 | 0.90 | 1.81 | 4.35 |
| MR-RawNet | Waveform | AAM-Softmax | MUSAN+RIR+Speed | 0.83 / 0.063 | 0.99 | 1.61 | 3.47 |

4.1 Datasets

We utilized the VoxCeleb1&2 [24, 25] datasets to assess our proposed framework. VoxCeleb1 is divided into two subsets: a development set encompassing 148,642 samples from 1,211 speakers and an evaluation set comprising 4,874 samples from 40 speakers. The VoxCeleb2 development set consists of 1,092,009 utterances from 5,994 speakers. During the training phase, we used the development portions of both VoxCeleb1 and VoxCeleb2, whereas the VoxCeleb1 test set was employed for evaluation. Additionally, the VOiCES development set [35], with 15,904 utterances and 196 speakers, was used to conduct an out-of-domain evaluation. For data augmentation, we utilized the MUSAN corpus [36] and the RIR reverberation datasets [37]. The performance of the models was measured using the equal error rate (EER) and the minimum detection cost function (MinDCF) with $P_{Target} = 0.05$ and $C_{FalseAlarm} = C_{Miss} = 1$.
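For readers unfamiliar with the two metrics, the following sketch computes them from lists of trial scores by sweeping every candidate threshold. It is a simple reference computation, not the scoring tool used in the paper: the EER is approximated by the operating point where the miss and false-alarm rates are closest, and the detection cost is normalized by the cost of the best trivial system, which is a common convention but an assumption here. The scores are made up for illustration.

```python
def error_rates(target_scores, nontarget_scores):
    """Yield (threshold, miss rate, false-alarm rate) for each candidate threshold."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    thresholds.append(thresholds[-1] + 1.0)      # a threshold that rejects every trial
    for th in thresholds:
        p_miss = sum(s < th for s in target_scores) / len(target_scores)
        p_fa = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        yield th, p_miss, p_fa


def eer(target_scores, nontarget_scores):
    """Equal error rate: point where miss and false-alarm rates are closest."""
    _, p_miss, p_fa = min(error_rates(target_scores, nontarget_scores),
                          key=lambda r: abs(r[1] - r[2]))
    return (p_miss + p_fa) / 2


def min_dcf(target_scores, nontarget_scores, p_target=0.05, c_miss=1.0, c_fa=1.0):
    """Minimum detection cost over all thresholds, normalized (assumed convention)."""
    best = min(c_miss * pm * p_target + c_fa * pf * (1 - p_target)
               for _, pm, pf in error_rates(target_scores, nontarget_scores))
    return best / min(c_miss * p_target, c_fa * (1 - p_target))


# Made-up similarity scores: one target trial scores below one non-target trial.
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
print(f"EER = {100 * eer(targets, nontargets):.1f}%")
print(f"MinDCF = {min_dcf(targets, nontargets):.3f}")
```

With $P_{Target} = 0.05$, false alarms are weighted 19 times more heavily than misses, so the MinDCF threshold is usually stricter than the EER threshold.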

4.2 Configurations

We constructed each mini-batch from pre-emphasized raw waveforms cropped either to a random 3-second segment or to a random length between 1 and 3 seconds, each chosen with a probability of 50%. Evaluation utterances were cropped symmetrically around their center to measure performance at various lengths. The Adam optimizer [38] was employed with a weight decay of $5 \times 10^{-5}$, and the learning rate was scheduled between $5 \times 10^{-4}$ and $3 \times 10^{-6}$ with cosine annealing [39]. For speaker identification training, we utilized AAM-softmax [40] with a margin of 0.3 and a scale of 30. $K_1$ and $M_1$, the kernel sizes of the first extractor in the MRFE, were set to 50 and 16, respectively. We set $K_i \times M_i$ to 800, which is equivalent to using a window size of 50 ms and a hop size of 10 ms at a sampling rate of 16 kHz. Further details are available in the figures and in our code (https://github.com/kimho1wq/MR-RawNet).
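The cropping policy and learning-rate schedule described above can be sketched as follows. The function names are hypothetical and the released code may differ in detail; in particular, the sketch assumes a 16 kHz sampling rate, one length draw per mini-batch, and a single cosine cycle without warm restarts.

```python
import math
import random

SAMPLE_RATE = 16000  # Hz; assumed (800 samples then correspond to 50 ms)


def sample_train_length(rng, max_sec=3.0, min_sec=1.0):
    """Pick a crop length in samples: fixed 3 s or uniform in [1 s, 3 s], 50/50."""
    if rng.random() < 0.5:
        seconds = max_sec
    else:
        seconds = rng.uniform(min_sec, max_sec)
    return int(seconds * SAMPLE_RATE)


def random_crop(wave, length, rng):
    """Cut a training segment of `length` samples at a random position."""
    start = rng.randint(0, max(0, len(wave) - length))
    return wave[start:start + length]


def center_crop(wave, length):
    """Keep `length` samples around the center of an evaluation utterance."""
    if len(wave) <= length:
        return wave
    start = (len(wave) - length) // 2
    return wave[start:start + length]


def cosine_lr(step, total_steps, lr_max=5e-4, lr_min=3e-6):
    """Cosine annealing from lr_max at step 0 to lr_min at the last step."""
    progress = step / max(1, total_steps - 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))


rng = random.Random(0)
wave = [0.0] * (5 * SAMPLE_RATE)                 # placeholder 5-second waveform
segment = random_crop(wave, sample_train_length(rng), rng)
one_second = center_crop(wave, 1 * SAMPLE_RATE)  # 1-second evaluation condition
print(len(segment), len(one_second), cosine_lr(0, 100), cosine_lr(99, 100))
```

Mixing fixed 3-second crops with shorter random lengths exposes the network to the short conditions used at test time while still training on full-context segments half of the time.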

5 Results

Table 1 compares the performance of our proposed framework with recently proposed SV systems for short utterances across various lengths (1 s, 2 s, and 5 s). Because the RawNet3 study did not report performance across utterance lengths, we measured its short-utterance performance using the official RawNet3 model parameters. RawNet3 was merely comparable to other SV systems for short utterances, even though it exhibited superior performance for full-length utterances. The proposed MR-RawNet achieved the lowest EER for full-length, 2-second, and 1-second utterances, not only among raw waveform models but also among the other models, while its 5-second EER (0.99%) was slightly higher than that of RawNet3 (0.90%). Compared to RawNet3, MR-RawNet achieved a relative error reduction (RER) of 20.2% on 1-second utterances. These experimental results suggest that a framework capable of exploring information across various time scales can enhance robustness against variable lengths.

We conducted ablation experiments to validate the effects of the proposed methods, as shown in Table 2. These experiments were based on the RawNet3 structure, which served as our baseline system. Experiments #1, #2, #3, and #4 applied only the MRFE to the baseline, showing varying performance depending on the number of feature extractors ($N$) in the MRFE. Systems that utilized only one or two resolutions generally fell short of the baseline, whereas systems with three or more resolutions surpassed it. These outcomes suggest that incorporating information from various resolutions is beneficial for improving the system's effectiveness with short utterances. Through the results of #4, we confirmed that the model could become more resilient to variable lengths by appropriately processing and utilizing information at various resolutions through different blocks. Experiments #5 and #6 applied only the MRA to the baseline, varying the number of MRA channels $C$ and blocks $B$. To keep the number of parameters consistent, we decreased the number of channels as the number of MRA blocks increased. Both experiments improved on the baseline for 2-second and 1-second utterances, most notably experiment #6, which set $C$ and $B$ to 256 and 3, respectively. Experiments #7 and #8 combine experiment #4 with #5 and #6, respectively, and experiment #8 outperformed all other experiments, recording an RER of 15.6% for 1-second utterances against the baseline. From these results, we believe that MR-RawNet further improved the performance for variable duration utterances by encouraging the model to focus on complementary temporal context through the MRFE and MRA modules.

Table 3 provides a comparison of our system and the baseline model for out-of-domain utterances of various lengths. MR-RawNet was evaluated using System #8, which showed the best performance under the VoxCeleb1 test condition. Despite having fewer parameters than RawNet3, MR-RawNet matched RawNet3 on 5-second utterances and outperformed it on 2-second and 1-second utterances in the out-of-domain dataset. Specifically, MR-RawNet achieved a 13.4% relative improvement over RawNet3 for 1-second utterances, demonstrating its effectiveness for short utterances and the generalizability of the system.
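The relative error reductions quoted in this section can be reproduced from the 1-second EER values reported in Tables 1, 2 and 3. The helper name is hypothetical; the numbers are taken directly from the tables.

```python
def relative_error_reduction(baseline_eer, system_eer):
    """Relative error reduction in percent with respect to the baseline EER."""
    return 100.0 * (baseline_eer - system_eer) / baseline_eer


comparisons = {
    "VoxCeleb1 1s, RawNet3 vs MR-RawNet (Table 1)": (4.35, 3.47),
    "VoxCeleb1 1s, baseline #0 vs system #8 (Table 2)": (4.11, 3.47),
    "VOiCES 1s, RawNet3 vs MR-RawNet (Table 3)": (13.48, 11.67),
}
for name, (base, ours) in comparisons.items():
    print(f"{name}: {relative_error_reduction(base, ours):.1f}%")
# prints 20.2%, 15.6% and 13.4%, respectively
```

The 20.2% figure compares against the officially released RawNet3 model, whereas the 15.6% figure compares against the baseline system trained for the ablation study, which explains why the two differ for the same MR-RawNet result.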

Table 2: Results of ablation experiments. $N$, $C$, and $B$ denote the number of feature extractors in the MRFE, the channel size of the MRA block, and the number of MRA blocks, respectively. × marks a component that is not applied.

| System | $N$ | $C$ | $B$ | Full: EER(%) | 2s: EER(%) | 1s: EER(%) |
|---|---|---|---|---|---|---|
| #0-Baseline | × | × | × | 1.01 | 1.96 | 4.11 |
| #1-MRFE | 1 | × | × | 1.16 | 2.17 | 4.68 |
| #2-MRFE | 2 | × | × | 1.04 | 1.95 | 4.27 |
| #3-MRFE | 3 | × | × | 0.96 | 1.83 | 4.01 |
| #4-MRFE | 4 | × | × | 0.93 | 1.79 | 3.98 |
| #5-MRA | × | 384 | 1 | 1.03 | 1.93 | 4.08 |
| #6-MRA | × | 256 | 3 | 0.86 | 1.65 | 3.85 |
| #7-MR-RawNet | 4 | 384 | 1 | 0.92 | 1.77 | 3.69 |
| #8-MR-RawNet | 4 | 256 | 3 | 0.83 | 1.61 | 3.47 |

Table 3: Results of out-of-domain experiments on the VOiCES development set.

| System | # Params | 5s: EER(%) | 2s: EER(%) | 1s: EER(%) | Avg: EER(%) |
|---|---|---|---|---|---|
| RawNet3 | 16.3M | 2.52 | 6.98 | 13.48 | 7.66 |
| MR-RawNet | 15.5M | 2.52 | 5.96 | 11.67 | 6.72 |

6 Conclusion

We proposed MR-RawNet, a novel speaker verification system that is robust to utterances of various durations. Our system uses the MRFE and MRA to improve performance on variable duration utterances by focusing more on complementary temporal context. The results on VoxCeleb show that MR-RawNet outperformed other raw waveform-based systems, notably reducing the EER on 1-second test utterances by 20.2% relative to RawNet3. Although this paper focused on comparison with raw waveform-based models, future research could extend the comparison to models using a variety of input features.

7 Acknowledgement

This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2023-00263037, Robust deepfake audio detection development against adversarial attacks).

References

  • [1] N. Dehak, P. J. K. amd Reda Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2010.
  • [2] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in Proc. ICASSP.   IEEE, 2014, pp. 4052–4056.
  • [3] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust dnn embeddings for speaker recognition,” in Proc. ICASSP.   IEEE, 2018, pp. 5329–5333.
  • [4] B. Desplanques, J. Thienpondt, and K. Demuynck, “Ecapa-tdnn:emphasized channel attention, propagation and aggregation in tdnn based speaker verification,” in Proc. Interspeech, 2020, pp. 1–5.
  • [5] Y. Zhang, Z. Lv, H. Wu, S. Zhang, P. Hu, Z. Wu, H. yi Lee, and H. Meng, “Mfa-conformer: Multi-scale feature aggregation conformer for automatic speaker verification,” in Proc. Interspeech, 2022.
  • [6] H.-J. Heo, U.-H. Shin, R. Lee, Y. Lee, and H.-M. Park, “Next-tdnn: Modernizing multi-scale temporal convolution backbone for speaker verification,” arXiv preprint arXiv:2312.08603, 2023.
  • [7] S. bin Kim, J. weon Jung, H. jin Shim, J. ho Kim, and H.-J. Yu, “Segment aggregation for short utterances speaker verification using raw waveforms,” in Proc. Interspeech, 2020.
  • [8] A. Hajavi and A. Etemad, “A deep neural network for short-segment speaker recognition,” in Proc. Interspeech, 2019.
  • [9] J. ho Kim, H. jin Shim, J. Heo, and H.-J. Yu, “Rawnext: Speaker verification system for variable-duration utterances with deep layer aggregation and extended dynamic scaling policies,” in Proc. ICASSP, 2022.
  • [10] J. weon Jung, H. soo Heo, I. ho Yang, S. hyun Yoon, H. jin Shim, and H.-J. Yu, “D-vector based speaker verification system using raw waveform cnn,” in Proc. ANIT, 2017, pp. 126–131.
  • [11] T. N. Sainath, R. J. Weiss, A. Senior, K. W. Wilson, and O. Vinyals, “Learning the speech front-end with raw waveform cldnns,” in Proc. Interspeech, 2015.
  • [12] J. weon Jung, S. bin Kim, H. jin Shim, J. ho Kim, and H.-J. Yu, “Improved rawnet with feature map scaling for text-independent speaker verification using raw waveforms,” in Proc. Interspeech, 2020.
  • [13] J. weon Jung, Y. J. Kim, H.-S. Heo, B.-J. Lee, Y. Kwon, and J. S. Chung, “Pushing the limits of raw waveform speaker recognition,” in Proc. Interspeech, 2022, pp. 2228–2232.
  • [14] S. Schneider, A. Baevski, R. Collobert, and M. Auli, “wav2vec: Unsupervised pre-training for speech recognition,” arXiv preprint arXiv:1904.05862, 2019.
  • [15] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Proc. NeurIPS, vol. 33, 2020, pp. 12 449–12 460.
  • [16] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, pp. 3451–3460, 2021.
  • [17] M. Ravanelli and Y. Bengio, “Speaker recognition from raw waveform with sincnet,” in Proc. SLT, 2018, pp. 1021–1028.
  • [18] M. Pariente, S. Cornell, A. Deleforge, and E. Vincent, “Filterbank design for end-to-end speech separation,” in Proc. ICASSP, 2020.
  • [19] S. Han, Y. Ahn, K. Kang, and J. W. Shin, “Short-segment speaker verification using ecapa-tdnn with multi-resolution encoder,” in Proc. ICASSP, 2023.
  • [20] S.-H. Gao, M.-M. Cheng, K. Zhao, X.-Y. Zhang, M.-H. Yang, and P. Torr, “Res2net: A new multi-scale backbone architecture,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
  • [21] H. Wang, A. Kembhavi, A. Farhadi, A. L. Yuille, and M. Rastegari, “Elastic: Improving cnns with dynamic scaling policies,” in Proc. CVPR, 2019, pp. 2258–2267.
  • [22] Y. Jung, S. M. Kye, Y. Choi, M. Jung, and H. Kim, “Improving multi-scale aggregation using feature pyramid module for robust speaker verification of variable-duration utterances,” in Proc. Interspeech, 2020.
  • [23] T. Liu, R. K. Das, K. A. Lee, and H. Li, “Mfa: Tdnn with multi-scale frequency-channel attention for text-independent speaker verification with short utterances,” in Proc. ICASSP, 2022.
  • [24] A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large-scale speaker identification dataset,” in Proc. Interspeech, 2017, pp. 2616–2620.
  • [25] J. S. Chung, A. Nagrani, and A. Zisserman, “Voxceleb2: Deep speaker recognition,” arXiv preprint arXiv:1806.05622, 2018.
  • [26] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proc. CVPR, 2018, pp. 7132–7141.
  • [27] K. Okabe, T. Koshinaka, and K. Shinoda, “Attentive statistics pooling for deep speaker embedding,” in Proc. Interspeech, 2018, pp. 2252–2256.
  • [28] C. Lea, R. Vidal, A. Reiter, and G. D. Hager, “Temporal convolutional networks: A unified approach to action segmentation,” in Proc. ECCV, 2016.
  • [29] Y. Luo and N. Mesgarani, “Conv-tasnet: Surpassing ideal time-frequency magnitude masking for speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019.
  • [30] S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” arXiv preprint arXiv:1803.01271, 2018.
  • [31] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proc. ICCV, 2015, pp. 1026–1034.
  • [32] D. Snyder, D. Garcia-Romero, G. Sell, A. McCree, D. Povey, and S. Khudanpur, “Speaker recognition for multi-speaker conversations using x-vectors,” in Proc. ICASSP, 2019, pp. 5796–5800.
  • [33] S. M. Kye, J. S. Chung, and H. Kim, “Supervised attention for speaker recognition,” in Proc. SLT, 2021, pp. 286–293.
  • [34] J. Li, M.-W. Mak, N. Yan, and L. Wang, “Modeling suprasegmental information using finite difference network for end-to-end speaker verification,” in Proc. APSIPA ASC, 2023.
  • [35] C. Richey, M. A. Barrios, Z. Armstrong, C. Bartels, H. Franco, M. Graciarena, A. Lawson, M. K. Nandwana, A. Stauffer, J. van Hout, P. Gamble, J. Hetherly, C. Stephenson, and K. Ni, “Voices obscured in complex environmental settings (voices) corpus,” in Proc. Interspeech, 2018, pp. 1566–1570.
  • [36] D. Snyder, G. Chen, and D. Povey, “Musan: A music, speech, and noise corpus,” arXiv preprint arXiv:1510.08484, 2015.
  • [37] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in Proc. ICASSP, 2017, pp. 5220–5224.
  • [38] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. ICLR, 2015.
  • [39] I. Loshchilov and F. Hutter, “Sgdr: Stochastic gradient descent with warm restarts,” in Proc. ICLR, 2017.
  • [40] J. Deng, J. Guo, J. Yang, N. Xue, I. Kotsia, and S. Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” in Proc. CVPR, 2019, pp. 4690–4699.