Seung-bin Kim, Chan-yeong Lim, Jungwoo Heo, Ju-ho Kim, Hyun-seo Shin, Kyo-Won Koo, Ha-Jin Yu^†

MR-RawNet: Speaker verification system with multiple temporal resolutions for variable duration utterances using raw waveforms

Abstract

In speaker verification systems, the utilization of short utterances presents a persistent challenge, leading to performance degradation primarily due to insufficient phonetic information to characterize the speakers. To overcome this obstacle, we propose a novel structure, MR-RawNet, designed to enhance the robustness of speaker verification systems against variable duration utterances using raw waveforms. The MR-RawNet extracts time-frequency representations from raw waveforms via a multi-resolution feature extractor that optimally adjusts both temporal and spectral resolutions simultaneously. Furthermore, we apply a multi-resolution attention block that focuses on diverse and extensive temporal contexts, ensuring robustness against changes in utterance length. The experimental results, obtained on the VoxCeleb1 dataset, demonstrate that the MR-RawNet exhibits superior performance in handling utterances of variable duration compared to other raw waveform-based systems.

keywords:

speaker verification, raw waveform, short duration, multi resolution

1 Introduction

Speaker verification (SV) is the task of verifying whether an anonymous speaker is the target speaker registered in the system. With the development of deep neural networks (DNNs), traditional machine learning-based SV systems have been largely replaced by DNN-based SV systems [1, 2, 3]. Despite the exceptional performance achieved by DNN-based SV systems, most systems are typically evaluated using long utterances [4, 5, 6]. However, in real environments, shorter utterances of 1 to 2 seconds are often encountered, and SV systems experience a performance degradation as the length of the utterance decreases. This degradation occurs because short utterances may not contain sufficient speaker-specific phonetic characteristics that can be obtained from speech [7, 8, 9]. Thus, SV systems should be constructed to yield similar performance across utterances of various lengths, particularly shorter ones.

To enhance robustness against variable lengths, we initially focused on the input features of the system. The inputs for SV systems can be traditional handcrafted features such as MFCCs or Mel-filter banks [2, 3, 4, 5], or raw waveforms, which have recently begun to be utilized [10, 11, 12, 13]. While handcrafted features can provide refined information through processing based on human knowledge, they may also incur information loss compared to unprocessed raw waveforms [14, 15, 16]. Especially for SV systems that must work with limited information, as much information as possible should be available from the input. Therefore, we use raw waveforms, which have high potential due to the absence of information loss, as input to ensure robustness to utterances of various lengths.

In traditional raw waveform-based systems, a one-dimensional input is converted into a two-dimensional feature map through a special module in the first layer, such as [17, 18]. The first-layer module extracts meaningful representations from the input, typically based on a fixed frame size. However, representations based on a constant frame size are unable to offer superior time and spectrum resolution simultaneously [19]. Consequently, these systems often prioritize spectral resolution, potentially degrading performance for utterances of various durations. To address this, we adopt a multi-resolution encoder (MRE) designed to capture information at different temporal resolutions. The MRE, a module proposed by Han et al. [19], extracts multiple features at various temporal resolutions. We specifically tailored this module for use as the first-layer module of a raw waveform-based system, naming it the multi-resolution feature extractor (MRFE). Thus, the MRFE is able to consider both temporal and spectral resolution simultaneously, deriving features that maintain robustness across various lengths.

In addition, we propose a new bottleneck named the multi-resolution attention (MRA) block. The MRA combines the advantages of the Res2Dilated block [4] and an extended dynamic scaling policy (EDSP) [9]. The Res2Dilated block gradually builds up the temporal context using dilated convolutional layers and hierarchical residual connections [20]. The EDSP, an extension of Elastic [21], trains the network to operate dynamically based on the scale of the data. These two modules focus on different aspects of the temporal context: the Res2Dilated block focuses on broader contexts, while the EDSP considers a variety of contexts. Meanwhile, various studies have explored different temporal resolutions to maximize the speaker information extracted from utterances and thereby improve robustness against utterance length [22, 23]. By merging these two modules, the MRA block can consider a wider and more diverse temporal context, thereby enhancing robustness to changes in utterance length.

We finally propose a novel structure called MR-RawNet, which is robust to variations in utterance length by utilizing both the MRFE and the MRA. Experimental results on VoxCeleb1&2 datasets [24, 25] demonstrated that MR-RawNet showed superior performance for variable duration utterances in raw waveform-based systems.

2 Baseline

Figure 1: (a) Baseline structure. The kernel size of Conv1D is 1, and $p$ represents the max pooling size. (b) MR-RawNet structure. $k$, $d$, $N$, and $B$ denote the kernel size, the dilation, the number of feature extractors, and the number of MRA blocks, respectively.

In a recent study on SV that utilized raw waveforms as input, RawNet3 [13] achieved remarkable performance. Nevertheless, the authors posit that there is still potential for enhancement in RawNet3. This section delineates the architecture of RawNet3 and outlines potential areas for improvement.

RawNet3 is a modified structure of the ECAPA-TDNN [4], which is widely used in SV research. As shown in Figure 1-(a), RawNet3 is composed of a parameterized analysis filterbank (ParamFbank) [18] layer and a Res2Dilated block with $\alpha$-feature map scaling and max pooling (AFMS-Res2MP). When the one-dimensional raw waveform is fed into RawNet3, it is transformed into a two-dimensional feature map by the ParamFbank layer. The ParamFbank learns real-valued parameterized filterbanks as an extension of the SincNet layer [17]. The feature map, processed through the ParamFbank layer, is then hierarchically passed through three AFMS-Res2MP blocks to capture speaker information at various levels. The AFMS-Res2MP is a block that replaces the squeeze-excitation layer in the squeeze-excitation block [26] with the $\alpha$-feature map scaling (AFMS) [12]. The extracted speaker information is finally processed into speaker embeddings using convolution, attentive statistics pooling (ASP) [27] with channel- and context-dependent attention values, and a linear layer.

We suggest that there is room for improvement in both the ParamFbank module and the AFMS-Res2MP block within RawNet3. Utilizing information from different time scales could allow them to be robust to different lengths of input speech.

3 Proposed methods

Our proposed architecture enhances the existing RawNet3 to be robust against utterances of various lengths by applying both the MRFE and the MRA. Figure 1-(b) illustrates the overall structure of our proposed system. MR-RawNet replaces the first convolutional module and the bottleneck blocks of the baseline with the MRFE and MRA blocks, respectively. The raw waveform of length $T$ is fed to the MRFE module to generate a time-frequency representation. The MRFE consists of $N$ feature extractors (FEs) and combines the outputs of the FEs into a feature map $o_{1}\in\mathbb{R}^{F\times\frac{T}{S}}$, where $F$ and $S$ refer to the channel and stride size of the MRFE, respectively. The feature map $o_{1}$ is then fed into a convolutional layer, and its output $o_{2}\in\mathbb{R}^{C\times\frac{T}{S}}$, where $C$ refers to the channel size of the MRA block, is passed to the bottleneck stages. Each bottleneck stage consists of $B$ MRA blocks, and the stage outputs $o_{3},o_{4},o_{5}\in\mathbb{R}^{C\times\frac{T}{S}}$ are concatenated into a feature $o_{6}\in\mathbb{R}^{3C\times\frac{T}{S}}$, similarly to the ECAPA-TDNN. The concatenated feature $o_{6}$ is input into a convolutional layer with a kernel size of 1, followed by a ReLU function. Then, the output $o_{7}\in\mathbb{R}^{1536\times\frac{T}{S}}$ is aggregated into an utterance-level feature $o_{8}\in\mathbb{R}^{3072\times 1}$, and the speaker embedding is obtained by passing $o_{8}$ through a linear layer.
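The tensor shapes in this pipeline can be traced with a short sketch. It only does shape bookkeeping and is not the authors' implementation. The values of `T`, `F`, and `C` in the usage lines are illustrative assumptions, and the factor of two in $o_{8}$ assumes the aggregation concatenates a mean and a standard deviation per channel, as attentive statistics pooling does.

```python
def mr_rawnet_shapes(T, S, F, C, stages=3, pooled_channels=1536):
    """Trace (channels, frames) for o1..o8 in the pipeline described above."""
    frames = T // S
    shapes = {"o1": (F, frames)}               # MRFE output
    shapes["o2"] = (C, frames)                 # after the first convolutional layer
    for i in range(stages):                    # o3, o4, o5: one output per bottleneck stage
        shapes[f"o{3 + i}"] = (C, frames)
    shapes["o6"] = (stages * C, frames)        # channel-wise concatenation
    shapes["o7"] = (pooled_channels, frames)   # 1x1 convolution + ReLU
    # Assumption: pooling concatenates mean and standard deviation -> 2 x channels.
    shapes["o8"] = (2 * pooled_channels, 1)
    return shapes


# Illustrative values only: 3 s of 16 kHz audio, F and C chosen arbitrarily.
shapes = mr_rawnet_shapes(T=48000, S=160, F=256, C=256)
for name, shape in shapes.items():
    print(name, shape)
```

Under these assumptions, the 3072-dimensional $o_{8}$ is simply twice the 1536 channels of $o_{7}$.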

3.1 Multi-resolution feature extractor

When extracting a time-frequency representation from the raw waveform, longer windows yield superior spectral resolution at the cost of temporal resolution, and the opposite is true for shorter windows. The first-layer modules of existing raw waveform-based SV systems typically use windows of a fixed frame size. However, features derived from a fixed frame size cannot concurrently yield excellent time and frequency resolution. From this perspective, we designed the MRFE to combine multi-resolution information from the raw waveform at a low level, generating a time-frequency representation that is robust to different utterance lengths. The MRFE module compresses the information contained in the raw waveform at a low level. This is similar to extracting handcrafted features, but has the advantage of extracting features suited to a specific task through the data-driven training of a DNN.
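The window-length trade-off can be made concrete with a rough calculation. The sketch below assumes a 16 kHz sampling rate and uses the DFT rule of thumb that a window of $L$ samples spans $L/f_s$ seconds and separates frequencies about $f_s/L$ Hz apart. Learned filterbanks need not follow this exactly, so the numbers are indicative only.

```python
SAMPLE_RATE = 16000  # Hz; assumed sampling rate


def window_resolution(window_samples, sample_rate=SAMPLE_RATE):
    """Return (time span in ms, approximate frequency spacing in Hz) for a window."""
    duration_ms = 1000.0 * window_samples / sample_rate
    freq_spacing_hz = sample_rate / window_samples
    return duration_ms, freq_spacing_hz


# Window lengths doubling from 50 to 400 samples, chosen for illustration.
for k in (50, 100, 200, 400):
    ms, hz = window_resolution(k)
    print(f"window={k:4d} samples  span={ms:6.3f} ms  spacing={hz:6.1f} Hz")
```

Each doubling of the window halves the frequency spacing and doubles the time span, which is why a single fixed window cannot be fine in both domains.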

Figure 2 illustrates the structure of the MRFE. In order to extract discriminative representations from the raw waveform, we used $N$ ParamFbank layers in parallel. The $i$-th ParamFbank layer ($1\leq i\leq N$) has a kernel size of $K_{i}$ and a stride size of $\frac{2K_{i}}{5}$, and outputs a time-frequency feature $\in\mathbb{R}^{F_{1}\times\frac{5T}{2K_{i}}}$. The feature is then fed to a convolutional layer with kernel size 1 (1 $\times$ 1 Conv) for input to a temporal convolutional network (TCN). The TCN is a structure proposed to replace recurrent neural networks in various sequence modeling tasks, and is used in various fields to extract speech features [28, 29, 30]. The TCN consists of stacked dilated 1-D convolutional blocks, where the dilation factor increases exponentially to ensure a sufficiently large temporal context window. Each 1-D convolutional block is composed of one depthwise convolution between two pointwise convolutions. The parametric rectified linear unit (PReLU) [31] and global layer normalization (gLN) [29] are used as the nonlinear activation function and the normalization technique in the blocks, respectively. We used $X$ convolutional blocks with dilation factors $1,2,\cdots,2^{X-1}$, repeated $R$ times.
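The dilation schedule and the temporal context it produces can be sketched as follows. The depthwise kernel size of 3 is an assumption (the text above does not state it), and the values of `X` and `R` in the usage line are hypothetical.

```python
def tcn_dilations(X, R):
    """Dilation factors 1, 2, ..., 2^(X-1), repeated R times."""
    return [2 ** x for _ in range(R) for x in range(X)]


def tcn_receptive_field(X, R, kernel_size=3):
    """Receptive field in frames of the stacked dilated depthwise convolutions.

    Each convolution with dilation d widens the field by (kernel_size - 1) * d.
    kernel_size=3 is an assumed value, not taken from the paper.
    """
    return 1 + sum((kernel_size - 1) * d for d in tcn_dilations(X, R))


print(tcn_dilations(3, 2))
print(tcn_receptive_field(3, 2))
```

Because the dilations grow exponentially within each repeat, the context window grows roughly as $R\cdot 2^{X}$ while the number of layers grows only as $R\cdot X$.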

The $i$-th TCN output $\in\mathbb{R}^{F_{2}\times\frac{5T}{2K_{i}}}$ passes through a last convolutional layer, which processes it into a time-frequency representation $y_{i}\in\mathbb{R}^{F_{2}\times\frac{T}{S}}$. The $i$-th last convolutional layer has a kernel size of $M_{i}$ and a stride size of $\frac{M_{i}}{2}$. We set the feature extractor (FE) stride size $S$ to $\frac{K_{i}\times M_{i}}{5}$, the product of the two strides, to ensure that the outputs of all FEs have the same temporal dimension. Additionally, the $i$-th TCN output is also added to the $(i+1)$-th FE after max pooling with a downsampling factor of 2. The kernel sizes $K_{i+1}$ and $M_{i+1}$ of the $(i+1)$-th FE are defined as $K_{i+1}=2K_{i}$ and $M_{i+1}=\frac{M_{i}}{2}$, respectively. Finally, the MRFE outputs a time-frequency representation $\in\mathbb{R}^{F\times\frac{T}{S}}$ ($F=F_{2}\times N$) by concatenating the FE outputs along the frequency axis.
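The stride arithmetic can be checked numerically. This sketch tabulates the kernel and stride sizes for each FE from the rules above, using $K_{1}=50$ and $M_{1}=16$ as given in the configuration section. `N=4` is the largest setting in the ablation study, and integer division is an assumption that happens to be exact for these values.

```python
def mrfe_schedule(K1=50, M1=16, N=4):
    """Kernel and stride sizes of each feature extractor (FE) in the MRFE."""
    rows = []
    K, M = K1, M1
    for i in range(1, N + 1):
        first_stride = 2 * K // 5   # stride of the i-th ParamFbank layer
        last_stride = M // 2        # stride of the i-th last convolutional layer
        rows.append({
            "i": i, "K": K, "M": M,
            "first_stride": first_stride,
            "last_stride": last_stride,
            "S": first_stride * last_stride,  # overall FE stride, equal to K * M / 5
        })
        K, M = 2 * K, M // 2        # K_{i+1} = 2 K_i, M_{i+1} = M_i / 2
    return rows


for row in mrfe_schedule():
    print(row)
```

Doubling $K_{i}$ while halving $M_{i}$ keeps the product $K_{i}\times M_{i}$ constant, so every FE ends with the same overall stride and the outputs can be concatenated frame by frame.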

3.2 Multi-resolution attention (MRA) block

More diverse temporal contexts may need to be considered to create a network that is robust to length variation in the data [9, 21]. Network performance also benefits from a wider temporal context, according to the results of [4, 32]. Considering this, we created an MRA block based on the Res2Dilated block and the extended dynamic scaling policy (EDSP). The Res2Dilated block progressively accumulates temporal context and provides an expansive receptive field. The EDSP enables the network to operate dynamically depending on the scale of the data. These two modules thus contribute a wider and a more diverse temporal context, respectively. Therefore, the MRA block attends to a broader range of temporal contexts to be robust to changes in utterance length.

Figure 3 illustrates the structure of the MRA block. The MRA block receives a feature $\in\mathbb{R}^{C\times\frac{T}{S}}$ as input, and extends the feature resolution range by adding low-resolution and high-resolution paths in parallel with the original path. The low-resolution branch uses a down-sampling function to lower the temporal resolution of the input, and the high-resolution branch uses an up-sampling function to increase the temporal resolution of the input. We used an average pooling layer and a 1-D transposed convolution layer, each with kernel size 2, as the down- and up-sampling functions, respectively. Then, the resolution-converted inputs are processed in each branch through a Res2Dilated block with $\alpha$-feature map scaling (AFMS-Res2Block) of the same structure. Applying the same block structure at different temporal resolutions means extracting features with receptive fields of different sizes. Indeed, the branch extension provides the ability to process features with various combinations of receptive fields, compared to fixed single-scale branches. Thereafter, the AFMS-Res2Block outputs from the low- and high-resolution paths are converted back to the original temporal resolution $\frac{T}{S}$ through a sampling function.
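A minimal sketch of the three-branch layout on a single channel follows. It uses average pooling to halve the resolution and plain frame repetition as a stand-in for a learned up-sampling layer. The helper names and `identity_block`, a placeholder for the AFMS-Res2Block, are hypothetical.

```python
def avg_pool2(x):
    """Halve the temporal resolution by averaging non-overlapping frame pairs."""
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x) - 1, 2)]


def repeat2(x):
    """Double the temporal resolution by repeating each frame.

    Stand-in for a learned up-sampling layer such as a transposed convolution.
    """
    return [v for v in x for _ in range(2)]


def identity_block(x):
    """Placeholder for the AFMS-Res2Block; a real block would transform x."""
    return list(x)


def mra_branches(x, block=identity_block):
    """Run one channel through low-, original-, and high-resolution paths."""
    low = repeat2(block(avg_pool2(x)))    # down-sample, process, restore length
    orig = block(x)                       # process at the original resolution
    high = avg_pool2(block(repeat2(x)))   # up-sample, process, restore length
    return low, orig, high


low, orig, high = mra_branches([1.0, 3.0, 5.0, 7.0])
print(low, orig, high)
```

Because the same block sees the signal at half, full, and double resolution, its fixed kernel covers three different spans of the original time axis.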

We additionally used an attention-based gate module to concentrate on the informative components among the features output at different temporal resolutions. The gate focuses on enhancing the expressiveness of the multi-resolution paths by modelling channel-specific relationships. Let $h_{t}\in\mathbb{R}^{C\times\frac{T}{S}}$ be the feature output from the $t$-th of the low-, high-, and original-resolution branches ($1\leq t\leq 3$). Then, the gate module output $o\in\mathbb{R}^{C\times\frac{T}{S}}$ is calculated as follows:

$$o=\sum_{t=1}^{3}{(\alpha_{t}\times h_{t})}\tag{1}$$

where $\alpha_{t}$ denotes an attention score for a feature output from the low-, high-, and original-resolution branches. This attention score $\alpha_{t}$ is calculated using two linear layers ( $W_{1},b_{1}$ and $W_{2},b_{2}$ ) as follows:

$$\alpha_{t}=\frac{\exp(z_{t})}{\sum^{3}_{i=1}{\exp(z_{i})}},\quad\alpha_{t}\in\mathbb{R}^{C\times 1}\tag{2}$$

$$z_{t}=W_{2}(\sigma(W_{1}\rho(h_{t})+b_{1}))+b_{2},\quad z_{t}\in\mathbb{R}^{C\times 1}\tag{3}$$

where $\rho(\cdot)$ and $\sigma(\cdot)$ denote an adaptive average pooling and an activation function followed by batch normalization, respectively. Finally, the MRA block output $\in\mathbb{R}^{C\times\frac{T}{S}}$ is obtained by adding the gate output $o$ to the residual, which is the input of the block.
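Equations (1) to (3) can be sketched in plain Python as below. This is an illustration under stated assumptions rather than the authors' code: $\sigma$ is taken to be a ReLU with batch normalization omitted, the hidden size of the two linear layers is set equal to $C$, and the weights in the usage lines are hypothetical identity matrices.

```python
import math


def gate(branches, W1, b1, W2, b2):
    """Channel-wise attention over three branch outputs.

    branches: list of 3 features, each a list of C channels of frame values.
    Returns (gated output o, attention scores alphas[t][c]).
    """
    def linear(W, b, v):
        return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

    scores = []
    for h in branches:
        pooled = [sum(ch) / len(ch) for ch in h]                # rho: average over time
        hidden = [max(0.0, v) for v in linear(W1, b1, pooled)]  # sigma: ReLU assumed, no batch norm
        scores.append(linear(W2, b2, hidden))                   # z_t, Eq. (3)

    C, frames = len(branches[0]), len(branches[0][0])
    out = [[0.0] * frames for _ in range(C)]
    alphas = []
    for t, h in enumerate(branches):
        alpha = []
        for c in range(C):
            denom = sum(math.exp(z[c]) for z in scores)
            a = math.exp(scores[t][c]) / denom                  # softmax over branches, Eq. (2)
            alpha.append(a)
            for n in range(frames):
                out[c][n] += a * h[c][n]                        # weighted sum, Eq. (1)
        alphas.append(alpha)
    return out, alphas


# Hypothetical toy input: C = 2 channels, 2 frames per branch.
h_low = [[1.0, 1.0], [0.0, 2.0]]
h_orig = [[2.0, 2.0], [1.0, 1.0]]
h_high = [[3.0, 3.0], [2.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
out, alphas = gate([h_low, h_orig, h_high], identity, zeros, identity, zeros)
print(out)
print(alphas)
```

The softmax is taken across the three branches separately for every channel, so each channel can favour a different temporal resolution. The block output would then add the block input to `out` as the residual.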

4 Experimental setup

Table 1: Performance comparison of recently proposed speaker verification systems for short utterances. (^†: our implementation)

System	Input Feature	Loss Function	Data Augmentation	Full EER(%) / MinDCF	5s EER(%)	2s EER(%)	1s EER(%)
MSEA-FPM [22]	MFB-64	A-Softmax	-	1.98 / 0.205	2.17	3.38	5.92
ResNet34-ANF [33]	MFB-40	Softmax+PN	-	1.91 / 0.221	2.04	2.88	4.49
ECAPA-TDNN^† [4]	MFB-80	AAM-Softmax	MUSAN+RIR+SpecAug	0.95 / 0.062	0.98	1.79	3.94
RawNet2 [12]	Waveform	Softmax	-	2.43 / 0.236	2.64	3.88	7.24
RawNeXt [9]	Waveform	AAM-Softmax	MUSAN+RIR	1.29 / 0.142	1.45	2.34	4.37
FDN-W-Res2MP [34]	Waveform	AAM-Softmax	MUSAN+RIR	1.42 / 0.093	-	-	-
RawNet3 [13]	Waveform	AAM-Softmax	MUSAN+RIR+Mask+Speed	0.89 / 0.066	0.90	1.81	4.35
MR-RawNet	Waveform	AAM-Softmax	MUSAN+RIR+Speed	0.83 / 0.063	0.99	1.61	3.47

4.1 Datasets

We utilized the VoxCeleb1&2 [24, 25] datasets to assess our proposed framework. VoxCeleb1 is divided into two subsets: a development set encompassing 148,642 samples from 1,211 speakers and an evaluation set comprising 4,874 samples from 40 speakers. The VoxCeleb2 development set consists of 1,092,009 utterances from 5,994 speakers. During the training phase, we used the development portions of both VoxCeleb1 and VoxCeleb2, whereas the VoxCeleb1 test set was employed for the evaluation. Additionally, the VOiCES development set [35], with 15,904 utterances and 196 speakers, was used to conduct an out-of-domain evaluation. For data augmentation, we utilized the MUSAN corpus [36] and RIR reverberation datasets [37]. The performance of the models was measured using the equal error rate (EER) and the minimum detection cost function (MinDCF) with $P_{Target}=0.05$ and $C_{FalseAlarm}=C_{Miss}=1$.
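The two metrics can be computed from trial scores as in the following sketch. It sweeps every observed score as a threshold, which is simple rather than efficient. Normalising the cost by the best trivial system is common practice but is an assumption here, and the toy scores are invented for illustration.

```python
def error_rates(target_scores, nontarget_scores):
    """Return (threshold, miss rate, false-alarm rate) for each candidate threshold."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    thresholds.append(thresholds[-1] + 1.0)  # operating point that rejects every trial
    rates = []
    for th in thresholds:
        p_miss = sum(s < th for s in target_scores) / len(target_scores)
        p_fa = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        rates.append((th, p_miss, p_fa))
    return rates


def equal_error_rate(target_scores, nontarget_scores):
    """EER: the error rate where miss and false-alarm rates are closest."""
    _, p_miss, p_fa = min(error_rates(target_scores, nontarget_scores),
                          key=lambda r: abs(r[1] - r[2]))
    return (p_miss + p_fa) / 2.0


def min_dcf(target_scores, nontarget_scores, p_target=0.05, c_miss=1.0, c_fa=1.0):
    """Minimum detection cost over all thresholds, normalised (assumed convention)."""
    best = min(c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
               for _, p_miss, p_fa in error_rates(target_scores, nontarget_scores))
    return best / min(c_miss * p_target, c_fa * (1.0 - p_target))


# Invented toy scores: higher means "more likely the same speaker".
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.5, 0.3, 0.2]
print(equal_error_rate(targets, nontargets))
print(min_dcf(targets, nontargets))
```

With $P_{Target}=0.05$, false alarms are weighted 19 times more heavily than misses, so the MinDCF threshold is usually stricter than the EER threshold.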

4.2 Configurations

We constructed each mini-batch from pre-emphasized raw waveforms cropped at random either to a length of 3 seconds or to a random length between 1 and 3 seconds, each option chosen with fifty percent probability. Evaluation utterances were cropped symmetrically around the center to measure performance at various lengths. The Adam optimizer [38] is employed with a weight decay of $5\times10^{-5}$, and the learning rate is scheduled between $5\times10^{-4}$ and $3\times10^{-6}$ with cosine annealing [39]. For speaker identification training, we utilized AAM-softmax [40] with a margin of 0.3 and a scale of 30. $K_{1}$ and $M_{1}$, the kernel sizes of the first extractor in the MRFE, were set to 50 and 16, respectively. We set $K_{i}\times M_{i}$ to 800, which is equivalent to using a window size of 50 ms and a hop size of 10 ms. Further details are available in the figures and in our code (https://github.com/kimho1wq/MR-RawNet).
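The cropping policy can be sketched as follows, assuming 16 kHz audio held as a Python list of samples. The function names are hypothetical, and drawing one length per mini-batch is one reading of the description above, not a confirmed detail of the released code.

```python
import random

SAMPLE_RATE = 16000  # Hz; assumed sampling rate


def training_crop_length(rng, sample_rate=SAMPLE_RATE):
    """Crop length in samples: 3 s, or uniform in [1 s, 3 s], each with probability 0.5."""
    if rng.random() < 0.5:
        return 3 * sample_rate
    return rng.randint(1 * sample_rate, 3 * sample_rate)


def random_crop(wave, length, rng):
    """Take a random segment of `length` samples (whole wave if it is shorter)."""
    start = rng.randint(0, max(0, len(wave) - length))
    return wave[start:start + length]


def center_crop(wave, seconds, sample_rate=SAMPLE_RATE):
    """Keep `seconds` of audio centred on the middle of the utterance."""
    length = int(seconds * sample_rate)
    if len(wave) <= length:
        return wave
    start = (len(wave) - length) // 2
    return wave[start:start + length]


rng = random.Random(0)
wave = list(range(5 * SAMPLE_RATE))        # 5 s dummy utterance
train_segment = random_crop(wave, training_crop_length(rng), rng)
eval_segment = center_crop(wave, 1.0)      # 1 s evaluation condition
print(len(train_segment), len(eval_segment), eval_segment[0])
```

Center cropping makes the shortened evaluation trials deterministic, so results at 1, 2, and 5 seconds are comparable across systems.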

5 Results

Table 1 compares the performance of our proposed framework with recently proposed SV systems for short utterances across various lengths (1s, 2s, and 5s). Although the RawNet3 paper did not report performance across various utterance lengths, we measured its performance on short utterances using the official RawNet3 model parameters. RawNet3 exhibited superior performance for full-length utterances, yet its performance on short utterances was only comparable to that of other SV systems. The proposed MR-RawNet achieved the lowest EER among the compared systems for full-length, 2-second, and 1-second utterances, outperforming not only raw waveform models but also models using handcrafted features. Compared to RawNet3, MR-RawNet achieved a relative error reduction (RER) of 20.2% for 1-second utterances. These experimental results suggest that a framework capable of exploring information across various time scales can enhance robustness against variable lengths.
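The RER figure follows directly from the 1-second EERs in Table 1, as this small check shows. The function name is hypothetical.

```python
def relative_error_reduction(baseline_eer, proposed_eer):
    """Relative EER reduction of the proposed system over the baseline, in percent."""
    return 100.0 * (baseline_eer - proposed_eer) / baseline_eer


# 1-second EERs from Table 1: RawNet3 = 4.35, MR-RawNet = 3.47.
rer_1s = relative_error_reduction(4.35, 3.47)
print(round(rer_1s, 1))
```

Rounded to one decimal place, this matches the 20.2% reported above.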

We conducted ablation experiments to validate the effects of the proposed methods, as shown in Table 2. These experiments were performed based on the RawNet3 structure, which served as our baseline system. Experiments #1, #2, #3, and #4 applied only the MRFE to the baseline, showing varying performance depending on the number of feature extractors ($N$) in the MRFE. Systems that utilized only one or two resolutions fell short of the baseline, whereas systems with three or more resolutions surpassed it. These outcomes suggest that incorporating information from various resolutions is beneficial for improving the system's effectiveness with short utterances. Through the results of #4, we confirmed that the model could become more resilient to variable lengths by appropriately processing and utilizing information at various resolutions through different blocks. Experiments #5 and #6 apply the MRA to the baseline with different settings of the MRA channel size $C$ and number of blocks $B$. To keep the number of parameters consistent, we decreased the number of channels as the number of MRA blocks increased. Experiment #6, which set $C$ and $B$ to 256 and 3, respectively, clearly improved on the baseline, while experiment #5 was roughly on par with it. Experiments #7 and #8 combine experiment #4 with #5 and #6, respectively. Experiment #8 outperforms the other experiments, recording an RER of 15.6% for 1-second utterances against the baseline. From these results, we believe that MR-RawNet was able to further improve performance for variable duration utterances by encouraging the model to focus on complementary temporal context through the MRFE and MRA modules.

Table 3 provides a comparison of our system and the baseline model for out-of-domain utterances of various lengths. MR-RawNet was evaluated using System #8, which showed the best performance under the VoxCeleb1 test condition. Despite having fewer parameters than RawNet3, MR-RawNet matched RawNet3 for 5-second utterances and improved on it for 2-second and 1-second utterances in the out-of-domain dataset. Specifically, MR-RawNet presented a 13.4% relative EER improvement over RawNet3 for 1-second utterances, supporting its effectiveness for short utterances and the generalizability of the system.

Table 2: Results of ablation experiments. $N$, $C$, and $B$ denote the number of feature extractors in the MRFE, the channel size of the MRA block, and the number of MRA blocks, respectively.

System	$N$	$C$	$B$	Full EER(%)	2s EER(%)	1s EER(%)
#0-Baseline	$\times$	$\times$	$\times$	1.01	1.96	4.11
#1-MRFE	1	$\times$	$\times$	1.16	2.17	4.68
#2-MRFE	2	$\times$	$\times$	1.04	1.95	4.27
#3-MRFE	3	$\times$	$\times$	0.96	1.83	4.01
#4-MRFE	4	$\times$	$\times$	0.93	1.79	3.98
#5-MRA	$\times$	384	1	1.03	1.93	4.08
#6-MRA	$\times$	256	3	0.86	1.65	3.85
#7-MR-RawNet	4	384	1	0.92	1.77	3.69
#8-MR-RawNet	4	256	3	0.83	1.61	3.47

Table 3: Results of out-of-domain experiments on the VOiCES development set.

System	# Params	5s EER(%)	2s EER(%)	1s EER(%)	Avg EER(%)
RawNet3	16.3M	2.52	6.98	13.48	7.66
MR-RawNet	15.5M	2.52	5.96	11.67	6.72

6 Conclusion

We proposed MR-RawNet, a novel speaker verification system that is robust to utterances of various durations. Our system uses the MRFE and the MRA to improve performance on variable duration utterances by focusing more on complementary temporal context. The results on VoxCeleb show that MR-RawNet outperformed other raw waveform-based systems, notably achieving a relative EER reduction of 20.2% over RawNet3 on the 1-second test. Although this paper focused on comparison with raw waveform-based models, future research could conduct comparisons with models using a variety of input features.

7 Acknowledgement

This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2023-00263037, Robust deepfake audio detection development against adversarial attacks).

References

[1] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2010.
[2] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in Proc. ICASSP. IEEE, 2014, pp. 4052–4056.
[3] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust dnn embeddings for speaker recognition,” in Proc. ICASSP. IEEE, 2018, pp. 5329–5333.
[4] B. Desplanques, J. Thienpondt, and K. Demuynck, “Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification,” in Proc. Interspeech, 2020, pp. 1–5.
[5] Y. Zhang, Z. Lv, H. Wu, S. Zhang, P. Hu, Z. Wu, H. yi Lee, and H. Meng, “Mfa-conformer: Multi-scale feature aggregation conformer for automatic speaker verification,” in Proc. Interspeech, 2022.
[6] H.-J. Heo, U.-H. Shin, R. Lee, Y. Lee, and H.-M. Park, “Next-tdnn: Modernizing multi-scale temporal convolution backbone for speaker verification,” arXiv preprint arXiv:2312.08603, 2023.
[7] S. bin Kim, J. weon Jung, H. jin Shim, J. ho Kim, and H.-J. Yu, “Segment aggregation for short utterances speaker verification using raw waveforms,” in Proc. Interspeech, 2020.
[8] A. Hajavi and A. Etemad, “A deep neural network for short-segment speaker recognition,” in Proc. Interspeech, 2019.
[9] J. ho Kim, H. jin Shim, J. Heo, and H.-J. Yu, “Rawnext: Speaker verification system for variable-duration utterances with deep layer aggregation and extended dynamic scaling policies,” in Proc. ICASSP, 2022.
[10] J. weon Jung, H. soo Heo, I. ho Yang, S. hyun Yoon, H. jin Shim, and H.-J. Yu, “D-vector based speaker verification system using raw waveform cnn,” in Proc. ANIT, 2017, pp. 126–131.
[11] T. N. Sainath, R. J. Weiss, A. Senior, K. W. Wilson, and O. Vinyals, “Learning the speech front-end with raw waveform cldnns,” in Proc. Interspeech, 2015.
[12] J. weon Jung, S. bin Kim, H. jin Shim, J. ho Kim, and H.-J. Yu, “Improved rawnet with feature map scaling for text-independent speaker verification using raw waveforms,” in Proc. Interspeech, 2020.
[13] J. weon Jung, Y. J. Kim, H.-S. Heo, B.-J. Lee, Y. Kwon, and J. S. Chung, “Pushing the limits of raw waveform speaker recognition,” in Proc. Interspeech, 2022, pp. 2228–2232.
[14] S. Schneider, A. Baevski, R. Collobert, and M. Auli, “wav2vec: Unsupervised pre-training for speech recognition,” arXiv preprint arXiv:1904.05862, 2019.
[15] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Proc. NeurIPS, vol. 33, 2020, pp. 12 449–12 460.
[16] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM transactions on audio, speech, and language processing, pp. 3451–3460, 2021.
[17] M. Ravanelli and Y. Bengio, “Speaker recognition from raw waveform with sincnet,” in Proc. SLT, 2018, pp. 1021–1028.
[18] M. Pariente, S. Cornell, A. Deleforge, and E. Vincent, “Filterbank design for end-to-end speech separation,” in Proc. ICASSP, 2020.
[19] S. Han, Y. Ahn, K. Kang, and J. W. Shin, “Short-segment speaker verification using ecapa-tdnn with multi-resolution encoder,” in Proc. ICASSP, 2023.
[20] S.-H. Gao, M.-M. Cheng, K. Zhao, X.-Y. Zhang, M.-H. Yang, and P. Torr, “Res2net: A new multi-scale backbone architecture,” in Proc. TPAMI, 2019.
[21] H. Wang, A. Kembhavi, A. Farhadi, A. L. Yuille, and M. Rastegari, “Elastic: Improving cnns with dynamic scaling policies,” in Proc. CVPR, 2019, pp. 2258–2267.
[22] Y. Jung, S. M. Kye, Y. Choi, M. Jung, and H. Kim, “Improving multi-scale aggregation using feature pyramid module for robust speaker verification of variable-duration utterances,” in Proc. Interspeech, 2020.
[23] T. Liu, R. K. Das, K. A. Lee, and H. Li, “Mfa: Tdnn with multi-scale frequency-channel attention for text-independent speaker verification with short utterances,” in Proc. ICASSP, 2022.
[24] A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large-scale speaker identification dataset,” in Proc. Interspeech, 2017, pp. 2616–2620.
[25] J. S. Chung, A. Nagrani, and A. Zisserman, “Voxceleb2: Deep speaker recognition,” arXiv preprint arXiv:1806.05622, 2018.
[26] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proc. CVPR, 2018, pp. 7132–7141.
[27] K. Okabe, T. Koshinaka, and K. Shinoda, “Attentive statistics pooling for deep speaker embedding,” in Proc. Interspeech, 2018, pp. 2252–2256.
[28] C. Lea, R. Vidal, A. Reiter, and G. D. Hager, “Temporal convolutional networks: A unified approach to action segmentation,” in Proc. ECCV, 2016.
[29] Y. Luo and N. Mesgarani, “Conv-tasnet: Surpassing ideal time-frequency magnitude masking for speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019.
[30] S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” arXiv preprint arXiv:1803.01271, 2018.
[31] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proc. ICCV, 2015, pp. 1026–1034.
[32] D. Snyder, D. Garcia-Romero, G. Sell, A. McCree, D. Povey, and S. Khudanpur, “Speaker recognition for multi-speaker conversations using x-vectors,” in Proc. ICASSP, 2019, pp. 5796–5800.
[33] S. M. Kye, J. S. Chung, and H. Kim, “Supervised attention for speaker recognition,” in Proc. SLT, 2021, pp. 286–293.
[34] J. Li, M.-W. Mak, N. Yan, and L. Wang, “Modeling suprasegmental information using finite difference network for end-to-end speaker verification,” in Proc. APSIPA ASC, 2023.
[35] C. Richey, M. A. Barrios, Z. Armstrong, C. Bartels, H. Franco, M. Graciarena, A. Lawson, M. K. Nandwana, A. Stauffer, J. van Hout, P. Gamble, J. Hetherly, C. Stephenson, and K. Ni, “Voices obscured in complex environmental settings (voices) corpus,” in Proc. Interspeech, 2018, pp. 1566–1570.
[36] D. Snyder, G. Chen, and D. Povey, “Musan: A music, speech, and noise corpus,” arXiv preprint arXiv:1510.08484, 2015.
[37] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in Proc. ICASSP, 2017, pp. 5220–5224.
[38] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. ICLR, 2015.
[39] I. Loshchilov and F. Hutter, “Sgdr: Stochastic gradient descent with warm restarts,” in Proc. ICLR, 2017.
[40] J. Deng, J. Guo, J. Yang, N. Xue, I. Kotsia, and S. Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” in Proc. CVPR, 2019, pp. 4690–4699.