Darshan Prabhu¹, Yifan Peng², Preethi Jyothi¹, Shinji Watanabe²

Multi-Convformer: Extending Conformer with Multiple Convolution Kernels

Abstract

Convolutions have become essential in state-of-the-art end-to-end Automatic Speech Recognition (ASR) systems due to their efficient modeling of local context. Notably, their use in Conformers has led to superior performance compared to vanilla Transformer-based ASR systems. While components other than the convolution module in the Conformer have been re-examined, altering the convolution module itself has been far less explored. To this end, we introduce Multi-Convformer, which uses multiple convolution kernels within the convolution module of the Conformer in conjunction with gating. This improves the modeling of local dependencies at varying granularities. Our model rivals existing Conformer variants such as CgMLP and E-Branchformer in performance, while being more parameter efficient. We empirically compare our approach with Conformer and its variants across four different datasets and three different modeling paradigms and show up to 8% relative word error rate (WER) improvements.

Keywords: Automatic Speech Recognition, Conformer, Multiple Convolutions, CgMLP.

1 Introduction

In recent years, end-to-end automatic speech recognition (ASR) systems have become the preferred choice for obtaining state-of-the-art ASR performance. An integral component of such systems is the speech encoder [e2e_asr_survey1, e2e_asr_survey2], which uses attention via Transformers [attention] to map the input speech into high-level acoustic representations. However, Transformer-based ASR systems [speech_transformer2, speech_transformer3] often struggle to model local relationships effectively. To address this limitation, Gulati et al. [conformer] proposed the Conformer [conformer_impl] architecture, which combines multi-headed attention with convolutions [convolution, convolution2, convolution3]. Since the advent of Conformer models, the idea of using convolutions alongside attention to independently model both local and global relationships has been widely explored [branchformer, e_branchformer, zipformer, leformer].

Figure 1: Overview of our MultiConv encoder layer. It comprises a stacked architecture similar to Conformer, except the convolution block is replaced with a gated multi-kernel convolution block. We omit residual connections and layer norm for readability.

Despite these notable advancements, prior work [branchformer] has shown that the use of fixed-kernel convolutions within these models creates a bottleneck, forcing the model to repurpose some of its attention heads as local information extractors. This negatively impacts the performance of attention, whose primary purpose is to model global information. To address this limitation, we propose an enhancement to the Conformer architecture based on multiple convolutions, which we call Multi-Convformer. Additionally, we incorporate gating, a technique that has proven effective in encoder architectures [cgmlp, branchformer, e_branchformer]. By combining these approaches, we achieve significant improvements (up to 8% relative WER improvement) over the original Conformer architecture and perform on par with or better than its variants such as CgMLP [cgmlp] and E-Branchformer [e_branchformer]. The use of multiple convolutions to generate better local context has been widely adopted in image-related tasks [dynamic_convolution, dynamic_convolution2, dynamic_convolution3, dynamic_convolution4, dynamic_convolution5]. Such enhancements have also been used in speech emotion recognition [ser] and robust ASR [multi_octave, multi_stream]; unlike our work, the latter use multiple convolutions within fully convolutional or TDNN-style architectures.

In summary, our main contributions are as follows:

• We propose Multi-Convformer, a variant of Conformer that uses multiple convolution kernels instead of a fixed kernel convolution to capture local context more effectively.¹

• We show the effectiveness of our approach by comparing with Conformer and its variants on ASR and Spoken Language Understanding (SLU). We experiment with multiple datasets (Librispeech [librispeech], Tedlium2 [tedlium2], AISHELL [aishell] and SLURP [slurp]) and various modeling paradigms and obtain up to 8% relative WER improvement over Conformer. We also conduct several analyses and ablations to showcase the effectiveness and interpretability of our approach.

¹ Our code is available in the ESPnet toolkit.

2 Methodology

In this work, we experiment with the three most popular ASR architectures: the attention-based encoder-decoder model (AED) [jointctcatt], the encoder-only model (pure CTC) [peng2024owsmctc] and the RNN-Transducer model (RNN-T) [rnnt]. An essential component common to all these architectures is the encoder (Enc) module that maps an input sequence of speech features $X=\{\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{L}\mid\mathbf{x}_{i}\in\mathbb{R}^{c}\}$ to a (typically shorter) sequence of contextualized representations $H=\textsc{Enc}(X)=\{\mathbf{h}_{1},\mathbf{h}_{2},\ldots,\mathbf{h}_{T}\mid\mathbf{h}_{i}\in\mathbb{R}^{d}\}$. Thereafter, the manner in which $H$ is used to predict the final $M$-length ground-truth token sequence $Y=\{y_{1},y_{2},\ldots,y_{M}\mid y_{i}\in\mathbb{N}^{+}\}$ depends on the underlying architecture. AED uses an attention-based decoder and a CTC module to jointly learn a mapping from $H$ to $Y$. In contrast, pure CTC and RNN-T models are trained with the CTC [ctc] and RNN-T [rnnt] losses, respectively, to generate $Y$. Since our modifications are confined to the encoder, in subsequent sections we restrict our discussion to the composition of the Enc module.
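For reference, joint training of an AED model with an attention-based decoder and a CTC module is commonly written as an interpolation of the two objectives. The weight $\lambda$ and the loss symbols below are notation assumed here for illustration; this section does not state the value of $\lambda$ that was used.

$$
\mathcal{L}_{\text{AED}} = \lambda\,\mathcal{L}_{\text{CTC}}(Y \mid H) + (1-\lambda)\,\mathcal{L}_{\text{att}}(Y \mid H), \qquad 0 \le \lambda \le 1
$$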

2.1 Multi-Convformer Encoder

Figure 1 illustrates the overall architecture of a single Multi-Convformer encoder layer and Figure 2 gives an overview of how the outputs from multiple convolutions are merged together. The Multi-Convformer encoder layer consists of four blocks that are stacked together and interspersed with layer normalization [layernorm] and residual connections. The two position-wise feed-forward layers aid in refining the point-wise information, while the multi-head attention and convolution are responsible for incorporating contextual information. This stacked architecture has been widely used in prior work [conformer, branchformer, e_branchformer, squeezeformer]. We adopt this same stack, but replace the fixed single-kernel convolution block with a more expressive multi-kernel convolution module that will henceforth be referred to as MultiConv.
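The stacking described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the block order (feed-forward, attention, MultiConv, feed-forward) is borrowed from the Conformer, layer normalization and any residual scaling are left out, and `encoder_layer`, `add`, and the block arguments are hypothetical names rather than ESPnet APIs.

```python
def add(x, y):
    """Element-wise sum of two T x d sequences stored as lists of lists."""
    return [[a + b for a, b in zip(row_x, row_y)] for row_x, row_y in zip(x, y)]


def encoder_layer(x, ffn_in, attention, multiconv, ffn_out):
    """Apply four blocks in sequence, each wrapped in a residual connection.

    Assumptions: Conformer-style block order; layer normalization omitted.
    Each block is any callable mapping a T x d sequence to a T x d sequence.
    """
    for block in (ffn_in, attention, multiconv, ffn_out):
        x = add(x, block(x))  # residual connection around the block
    return x


# Toy check with identity blocks: each residual step doubles the input.
identity = lambda seq: seq
out = encoder_layer([[1.0, -2.0]], identity, identity, identity, identity)
```

With real blocks, only the third callable changes relative to a Conformer layer, which is the substitution the paragraph above describes.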

Method	Librispeech-100h (WER)					Tedlium2 (WER)			AISHELL (CER)
Method	# params	Dev Clean	Dev Other	Test Clean	Test Other	# params	Dev	Test	# params	Dev	Test
Attention Encoder Decoder (AED) models
Transformer [speech-transformer]	40.9M	8.02	20.14	8.39	20.34	33.8M	10.13	8.83	47.7M	5.16	5.53
Conformer [conformer]	37.4M	6.59	17.24	6.89	17.27	33.9M	9.30	7.65	54.2M	4.23	4.63
$\textsc{MultiConv}_{\texttt{sum}}$	37.0M	6.00	16.56 $\ddagger$	6.33	16.60 $\ddagger$	33.5M	7.99 $\ddagger$	7.34	54.2M	4.18	4.47
$\textsc{MultiConv}_{\texttt{weighted}}$	37.1M	6.17	17.03	6.48	17.36	33.6M	8.15	7.41	54.4M	4.40	4.67
$\textsc{MultiConv}_{\texttt{concat}}$	37.0M	6.23	17.19	6.41	17.15	33.5M	8.32	7.42	54.2M	4.16 $\dagger$	4.46 $\dagger$
$\textsc{MultiConv}_{\texttt{depth}}$	37.2M	5.87 $\ddagger$	16.63	6.18 $\ddagger$	17.00	33.7M	8.01	7.27 $\ddagger$	54.7M	4.18	4.46
Encoder Only (Pure CTC) models
Transformer [speech-transformer]	28.9M	12.32	27.49	12.88	28.26	27.7M	11.97	11.71	28.7M	6.51	7.02
Conformer [conformer]	25.4M	9.33	22.60	9.76	23.10	24.2M	9.65	8.80	39.9M	5.98	6.59
$\textsc{MultiConv}_{\texttt{depth}}$	25.1M	9.24	22.26 $\dagger$	9.47 $\dagger$	23.09	24.0M	8.81 $\dagger$	8.44 $\ddagger$	40.2M	5.67 $\ddagger$	6.01 $\ddagger$
RNN Transducer (RNN-T) models
Transformer [speech-transformer]	32.5M	9.18	22.40	9.43	22.95	28.7M	10.26	9.65	31.8M	5.69	6.21
Conformer [conformer]	28.9M	6.79	18.28	7.19	18.70	25.2M	8.37	7.96	43.0M	4.93	5.22
$\textsc{MultiConv}_{\texttt{depth}}$	28.7M	6.60	17.53 $\ddagger$	7.05	18.23 $\ddagger$	24.9M	8.20	7.67	43.4M	4.65 $\ddagger$	5.05

$$
\begin{aligned}
[Z_{l}, Z_{r}] &= [\hat{A}[:, :d^{\prime}],\ \texttt{LayerNorm}(\hat{A}[:, d^{\prime}:])] \\
[V_{1}, V_{2}, \ldots, V_{P}] &= [\texttt{Conv}_{k_{1}}(Z_{r}), \ldots, \texttt{Conv}_{k_{P}}(Z_{r})] \\
\tilde{Z}_{r} &= \textsc{Fusion}([V_{1}, V_{2}, \ldots, V_{P}]) \\
\hat{C} &= Z_{l} \odot \tilde{Z}_{r}
\end{aligned}
\tag{1}
$$
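A minimal pure-Python sketch of Equation (1) may help make the data flow concrete. Several choices here are assumptions, not details stated in this section: $\hat{A}$ is taken to have $2d^{\prime}$ columns, each $\texttt{Conv}_{k}$ is modeled as a depthwise 1-D convolution over time with zero "same" padding and odd kernel size, LayerNorm has no learned scale or bias, and Fusion is an element-wise sum. The function names are hypothetical.

```python
import math


def layer_norm(row, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance (no learned affine)."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]


def depthwise_conv1d(z, kernel):
    """Depthwise 1-D convolution over time with zero 'same' padding.

    z: T x d' sequence; kernel: k x d' taps (one filter per channel), k odd.
    """
    T, d, half = len(z), len(z[0]), len(kernel) // 2
    out = [[0.0] * d for _ in range(T)]
    for t in range(T):
        for j, taps in enumerate(kernel):
            s = t + j - half  # input frame seen by tap j
            if 0 <= s < T:
                for c in range(d):
                    out[t][c] += taps[c] * z[s][c]
    return out


def multiconv_gate(a_hat, kernels):
    """Equation (1), with Fusion assumed to be an element-wise sum."""
    d_half = len(a_hat[0]) // 2
    z_l = [row[:d_half] for row in a_hat]              # gating half
    z_r = [layer_norm(row[d_half:]) for row in a_hat]  # normalized half
    branches = [depthwise_conv1d(z_r, k) for k in kernels]  # V_1 .. V_P
    fused = [[sum(v[t][c] for v in branches) for c in range(d_half)]
             for t in range(len(a_hat))]
    return [[z_l[t][c] * fused[t][c] for c in range(d_half)]
            for t in range(len(a_hat))]


# Toy input: T = 3 frames, 2d' = 4 channels; kernel sizes 1 and 3.
a_hat = [[1.0, 2.0, 0.5, -0.5], [0.0, 1.0, 1.5, 2.5], [3.0, -1.0, -2.0, 2.0]]
kernels = [
    [[1.0, 1.0]],                          # k = 1
    [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]],  # k = 3, centre tap only
]
c_hat = multiconv_gate(a_hat, kernels)
```

Both toy kernels pass their input through unchanged, so the fused signal is simply twice the normalized right half, which makes the gating product easy to check by hand.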

$$
\begin{aligned}
\alpha^{s} &= \{\alpha^{s}_{1}, \ldots, \alpha^{s}_{P}\} = \texttt{Softmax}(\textsc{FFN}_{d^{\prime}\rightarrow P}(Z^{s}_{r})) \\
\tilde{Z}^{s}_{r} &= \alpha^{s}_{1}\cdot V^{s}_{1} + \alpha^{s}_{2}\cdot V^{s}_{2} + \ldots + \alpha^{s}_{P}\cdot V^{s}_{P} \\
\tilde{Z}_{r} &= [\tilde{Z}^{1}_{r}, \tilde{Z}^{2}_{r}, \ldots, \tilde{Z}^{T}_{r}]
\end{aligned}
$$
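The weighted fusion above can be sketched as follows. As an assumption for illustration, $\textsc{FFN}_{d^{\prime}\rightarrow P}$ is replaced by a single linear layer with hypothetical parameters `w` (P x d') and `b` (P); the actual depth and activation of that network are not given in this section.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_fusion(z_r, branches, w, b):
    """Per-frame convex combination of the P convolution outputs.

    z_r: T x d' sequence; branches: list of P sequences, each T x d'.
    w, b: a single linear layer d' -> P standing in for the FFN (assumption).
    """
    fused = []
    for s, frame in enumerate(z_r):
        logits = [sum(wi * x for wi, x in zip(w_p, frame)) + b_p
                  for w_p, b_p in zip(w, b)]
        alpha = softmax(logits)  # alpha^s, one weight per kernel
        fused.append([sum(alpha[p] * branches[p][s][c] for p in range(len(branches)))
                      for c in range(len(frame))])
    return fused


# With zero weights the softmax is uniform, so the result is the branch average.
z_r = [[0.5, -0.5]]
branches = [[[1.0, 2.0]], [[3.0, 4.0]]]
fused = weighted_fusion(z_r, branches, w=[[0.0, 0.0], [0.0, 0.0]], b=[0.0, 0.0])
```

Because the weights are recomputed for every frame $s$, different time steps can favor different kernel sizes.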

$$
\begin{aligned}
\tilde{Z}^{s}_{r} &= \texttt{Concat}(V^{s}_{1}, V^{s}_{2}, \ldots, V^{s}_{P}) \\
\tilde{Z}_{r} &= [\tilde{Z}^{1}_{r}, \tilde{Z}^{2}_{r}, \ldots, \tilde{Z}^{T}_{r}]
\end{aligned}
$$
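The concatenation variant amounts to stacking the $P$ branch outputs along the feature axis for each frame. The sketch below shows only that step; `concat_fusion` is a hypothetical name. Note that the result has $P \cdot d^{\prime}$ features per frame, so gating against $Z_{l}$ as in Equation (1) would presumably need a projection back to $d^{\prime}$, which this section does not specify and which is not modeled here.

```python
def concat_fusion(branches):
    """Concatenate P branch outputs frame by frame along the feature axis.

    branches: list of P sequences, each T x d'. Returns a T x (P * d') sequence.
    """
    num_frames = len(branches[0])
    return [[x for v in branches for x in v[s]] for s in range(num_frames)]


# Two branches, T = 2 frames, d' = 2 features each.
branches = [
    [[1.0, 2.0], [5.0, 6.0]],
    [[3.0, 4.0], [7.0, 8.0]],
]
z_tilde = concat_fusion(branches)
```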

Method	Librispeech-100h		Tedlium2	AISHELL
Method	Test Clean	Test Other	Test	Test
Attention Encoder Decoder (AED) models
CgConv	6.36	17.18	7.44	4.55
E-Branchformer [e_branchformer]	6.50	17.06	7.34	4.47
$\textsc{MultiConv}_{\texttt{depth}}$	6.18	17.00	7.27	4.46
Encoder Only (Pure CTC) models
CgConv	9.66	23.79	8.78	6.33
E-Branchformer [e_branchformer]	10.05	23.87	8.78	6.04
$\textsc{MultiConv}_{\texttt{depth}}$	9.47	23.09	8.44	6.01
RNN Transducer (RNN-T) models
CgConv	6.95	18.18	7.72	5.24
E-Branchformer [e_branchformer]	6.87	18.09	7.71	5.17
$\textsc{MultiConv}_{\texttt{depth}}$	7.05	18.23	7.67	5.05

Method	Params	Without LM		With LM
Method	Params	Test Clean	Test Other	Test Clean	Test Other
Conformer [conformer]	147.8M	2.16	4.74	1.84	3.95
Branchformer [branchformer]	146.7M	2.25	4.83	1.93	4.00
E-Branchformer [e_branchformer]	148.9M	2.14	4.55	1.97	4.26
$\textsc{MultiConv}_{\texttt{fusion}}$	147.4M	2.15	4.69	1.89	3.92

Method	Params	Intent Classification		Entity Recognition
Method	Params	Valid Acc.	Test Acc.	SLU-F1	Precision	Recall
Conformer [conformer]	109M	87.4	86.7	77.2	80.0	74.5
E-Branchformer [e_branchformer]	110M	88.3	87.4	78.5	80.7	76.5
$\textsc{MultiConv}_{\texttt{fusion}}$	108M	88.8	87.4	78.9	80.8	77.1
