Darshan Prabhu, Yifan Peng, Preethi Jyothi, Shinji Watanabe

Multi-Convformer: Extending Conformer with
Multiple Convolution Kernels

Abstract

Convolutions have become essential in state-of-the-art end-to-end Automatic Speech Recognition (ASR) systems due to their efficient modelling of local context. Notably, their use in Conformers has led to superior performance compared to vanilla Transformer-based ASR systems. While components other than the convolution module in the Conformer have been reexamined, altering the convolution module itself has been far less explored. To this end, we introduce Multi-Convformer, which uses multiple convolution kernels within the convolution module of the Conformer in conjunction with gating. This improves the modelling of local dependencies at varying granularities. Our model rivals existing Conformer variants such as CgMLP and E-Branchformer in performance, while being more parameter efficient. We empirically compare our approach with Conformer and its variants across four different datasets and three different modelling paradigms and show up to 8% relative word error rate (WER) improvements.

keywords:
Automatic Speech Recognition, Conformer, Multiple Convolutions, CgMLP.

1 Introduction

In recent years, end-to-end automatic speech recognition (ASR) systems have emerged as the preferred choice for obtaining state-of-the-art ASR performance. An integral component of such systems is the speech encoder [e2e_asr_survey1, e2e_asr_survey2], which uses attention via Transformers [attention] to map the input speech into high-level acoustic representations. However, Transformer-based ASR systems [speech_transformer2, speech_transformer3] often struggle to model local relationships effectively. To address this limitation, Gulati et al. [conformer] proposed the Conformer [conformer_impl] architecture, which combines multi-headed attention with convolutions [convolution, convolution2, convolution3]. With the advent of Conformer models, the idea of using convolutions alongside attention to independently model both local and global relationships has been widely explored [branchformer, e_branchformer, zipformer, leformer].

Figure 1: Overview of our MultiConv encoder layer. It comprises a stacked architecture similar to Conformer, except the convolution block is replaced with a gated multi-kernel convolution block. We omit residual connections and layer norm for readability.

Despite these notable advancements, prior work [branchformer] has shown that the use of fixed-kernel convolutions within these models creates a bottleneck, forcing the model to re-purpose some of its attention heads to function as local information extractors. This negatively impacts the performance of attention, whose primary purpose is to model global information. To address this limitation, we propose a multiple convolution-based enhancement to the Conformer architecture, which we call Multi-Convformer. Additionally, we incorporate gating, a technique that has proven to be effective in encoder architectures [cgmlp, branchformer, e_branchformer]. By combining these approaches, we achieve significant improvements (up to 8% relative WER improvement) over the original Conformer architecture and perform on par with or better than its variants such as CgMLP [cgmlp] and E-Branchformer [e_branchformer]. The use of multiple convolutions to generate better local context has been widely adopted in image-related tasks [dynamic_convolution, dynamic_convolution2, dynamic_convolution3, dynamic_convolution4, dynamic_convolution5]. Such enhancements have also been used in speech emotion recognition [ser] and robust ASR [multi_octave, multi_stream]; unlike our work, the latter use multiple convolutions within fully convolutional or TDNN-style architectures.

Figure 2: Overview of the four variants of the $\textsc{Fusion}()$ operation in our proposed Multi-kernel Convolutional Spatial Gating Unit (M-CSGU) block, showing how convolution kernels with $K=\{7,15,23,31\}$ are used in our MultiConv block. Briefly, the four architectures are as follows: (a) employs element-wise addition on the outputs of the convolutions to generate the final output. (b) builds upon (a) by learning weights that determine the importance of each convolution's output. (c) uses compression kernels to reduce the input to $1/4^{\text{th}}$ of its channels in each branch and then concatenates all branch outputs frame-wise to generate the output. (d) further enhances (c) by using an additional depthwise convolution at the end that takes neighboring frames into account when generating the final output.

In summary, our main contributions are as follows:

  • We propose Multi-Convformer, a variant of Conformer that uses multiple convolution kernels instead of a fixed kernel convolution to capture local context more effectively.\footnote{Our code is available in the ESPnet toolkit.}

  • We show the effectiveness of our approach by comparing with Conformer and its variants on ASR and Spoken Language Understanding (SLU). We experiment with multiple datasets (Librispeech [librispeech], Tedlium2 [tedlium2], AISHELL [aishell] and SLURP [slurp]) and various modelling paradigms and obtain up to 8% relative WER improvement over Conformer. We also conduct several analyses and ablations to showcase the effectiveness and interpretability of our approach.

2 Methodology

In this work, we experiment with the three most popular ASR architectures: the attention-based encoder-decoder model (AED) [jointctcatt], the encoder-only model (pure CTC) [peng2024owsmctc] and the RNN-Transducer model (RNN-T) [rnnt]. An essential component common to all these architectures is the encoder ($\textsc{Enc}$) module that maps an input sequence of speech features $X=\{\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_L \mid \mathbf{x}_i\in\mathbb{R}^{c}\}$ to a (typically shorter) sequence of contextualized representations $H=\textsc{Enc}(X)=\{\mathbf{h}_1,\mathbf{h}_2,\ldots,\mathbf{h}_T \mid \mathbf{h}_i\in\mathbb{R}^{d}\}$. Thereafter, the manner in which $H$ is trained to predict the final $M$-length ground-truth token sequence $Y=\{y_1,y_2,\ldots,y_M \mid y_i\in\mathbb{N}^{+}\}$ depends on the underlying architecture. AED uses an attention-based decoder and a CTC module to jointly learn a mapping from $H$ to $Y$. However, pure CTC and RNN-T are encoder-only models that employ CTC [ctc] and RNN-T [rnnt] losses, respectively, to generate $Y$. Since our modifications are constrained to the encoder, in subsequent sections, we restrict our discussion to the composition of the $\textsc{Enc}$ module.

2.1 Multi-Convformer Encoder

Figure 1 illustrates the overall architecture of a single Multi-Convformer encoder layer and Figure 2 gives an overview of how the outputs from multiple convolutions are merged together. The Multi-Convformer encoder layer consists of four blocks that are stacked together and interspersed with layer normalization [layernorm] and residual connections. The two position-wise feed-forward layers aid in refining the point-wise information, while the multi-head attention and convolution are responsible for incorporating contextual information. This stacked architecture has been widely used in prior work [conformer, branchformer, e_branchformer, squeezeformer]. We adopt this same stack, but replace the fixed single-kernel convolution block with a more expressive multi-kernel convolution module that will henceforth be referred to as MultiConv.

Table 1: Comparing the performance (CER or WER %) of our proposed system against Transformer and Conformer on three datasets. The three sections are: (1) AED: encoder-decoder models with joint CTC-attention loss, (2) Pure CTC: encoder-only models where only the CTC loss is employed and (3) RNN-T: RNN-based Transducer models. Shaded (green) cells denote the best CER / WER across all experiments. $\dagger$ and $\ddagger$ indicate statistically significant results compared to Conformer at $p<0.05$ and $p<0.001$ using the MAPSSWE test [mapsswe], respectively.
\begin{tabular}{l|ccccc|ccc|ccc}
\hline
 & \multicolumn{5}{c|}{Librispeech-100h (WER)} & \multicolumn{3}{c|}{Tedlium2 (WER)} & \multicolumn{3}{c}{AISHELL (CER)} \\
Method & \# params & Dev Clean & Dev Other & Test Clean & Test Other & \# params & Dev & Test & \# params & Dev & Test \\
\hline
\multicolumn{12}{c}{Attention Encoder Decoder (AED) models} \\
\hline
Transformer [speech-transformer] & 40.9M & 8.02 & 20.14 & 8.39 & 20.34 & 33.8M & 10.13 & 8.83 & 47.7M & 5.16 & 5.53 \\
Conformer [conformer] & 37.4M & 6.59 & 17.24 & 6.89 & 17.27 & 33.9M & 9.30 & 7.65 & 54.2M & 4.23 & 4.63 \\
\arrayrulecolor{black!50}\hline\arrayrulecolor{black}
$\textsc{MultiConv}_{\texttt{sum}}$ & 37.0M & 6.00 & \cellcolor{green!20} 16.56 $\ddagger$ & 6.33 & \cellcolor{green!20} 16.60 $\ddagger$ & 33.5M & \cellcolor{green!20} 7.99 $\ddagger$ & 7.34 & 54.2M & 4.18 & 4.47 \\
$\textsc{MultiConv}_{\texttt{weighted}}$ & 37.1M & 6.17 & 17.03 & 6.48 & 17.36 & 33.6M & 8.15 & 7.41 & 54.4M & 4.40 & 4.67 \\
$\textsc{MultiConv}_{\texttt{concat}}$ & 37.0M & 6.23 & 17.19 & 6.41 & 17.15 & 33.5M & 8.32 & 7.42 & 54.2M & \cellcolor{green!20} 4.16 $\dagger$ & \cellcolor{green!20} 4.46 $\dagger$ \\
$\textsc{MultiConv}_{\texttt{depth}}$ & 37.2M & \cellcolor{green!20} 5.87 $\ddagger$ & 16.63 & \cellcolor{green!20} 6.18 $\ddagger$ & 17.00 & 33.7M & 8.01 & \cellcolor{green!20} 7.27 $\ddagger$ & 54.7M & 4.18 & \cellcolor{green!20} 4.46 \\
\hline
\multicolumn{12}{c}{Encoder Only (Pure CTC) models} \\
\hline
Transformer [speech-transformer] & 28.9M & 12.32 & 27.49 & 12.88 & 28.26 & 27.7M & 11.97 & 11.71 & 28.7M & 6.51 & 7.02 \\
Conformer [conformer] & 25.4M & 9.33 & 22.60 & 9.76 & 23.10 & 24.2M & 9.65 & 8.80 & 39.9M & 5.98 & 6.59 \\
\arrayrulecolor{black!50}\hline\arrayrulecolor{black}
$\textsc{MultiConv}_{\texttt{depth}}$ & 25.1M & \cellcolor{green!20} 9.24 & \cellcolor{green!20} 22.26 $\dagger$ & \cellcolor{green!20} 9.47 $\dagger$ & \cellcolor{green!20} 23.09 & 24.0M & \cellcolor{green!20} 8.81 $\dagger$ & \cellcolor{green!20} 8.44 $\ddagger$ & 40.2M & \cellcolor{green!20} 5.67 $\ddagger$ & \cellcolor{green!20} 6.01 $\ddagger$ \\
\hline
\multicolumn{12}{c}{RNN Transducer (RNN-T) models} \\
\hline
Transformer [speech-transformer] & 32.5M & 9.18 & 22.40 & 9.43 & 22.95 & 28.7M & 10.26 & 9.65 & 31.8M & 5.69 & 6.21 \\
Conformer [conformer] & 28.9M & 6.79 & 18.28 & 7.19 & 18.70 & 25.2M & 8.37 & 7.96 & 43.0M & 4.93 & 5.22 \\
\arrayrulecolor{black!50}\hline\arrayrulecolor{black}
$\textsc{MultiConv}_{\texttt{depth}}$ & 28.7M & \cellcolor{green!20} 6.60 & \cellcolor{green!20} 17.53 $\ddagger$ & \cellcolor{green!20} 7.05 & \cellcolor{green!20} 18.23 $\ddagger$ & 24.9M & \cellcolor{green!20} 8.20 & \cellcolor{green!20} 7.67 & 43.4M & \cellcolor{green!20} 4.65 $\ddagger$ & \cellcolor{green!20} 5.05 \\
\hline
\end{tabular}

As illustrated in Figure 1, in a single encoder layer, the output from the multi-head attention block $A=\{\mathbf{a}_1,\mathbf{a}_2,\ldots,\mathbf{a}_T \mid \mathbf{a}_j\in\mathbb{R}^{d}\}$ is first normalized with layer normalization. Subsequently, it undergoes a channel projection that increases its dimensionality from $d$ to $d_{\text{inter}}$.\footnote{Here $d_{\text{inter}}>d$. Typically, $d_{\text{inter}}=6d$. Using a higher intermediate dimension has been shown to be an effective strategy for position-wise feed-forward layers [positionwise_ff].} To introduce non-linearity to the representation, we apply the GELU [gelu] activation. The resulting output, $\hat{A}=\{\hat{\mathbf{a}}_1,\hat{\mathbf{a}}_2,\ldots,\hat{\mathbf{a}}_T \mid \hat{\mathbf{a}}_j\in\mathbb{R}^{d_{\text{inter}}}\}$, is then passed through our Multi-kernel Convolutional Spatial Gating Unit (M-CSGU). M-CSGU is a more powerful alternative [branchformer, e_branchformer] to the standard convolution block due to its use of gates [gated_mlp] along with convolutions. Finally, we employ another channel projection layer that projects the M-CSGU output back to $\mathbb{R}^{d}$, followed by dropout for regularization.
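To make the dimensions concrete, the following minimal sketch traces the tensor shape through the MultiConv block described above. It assumes the typical setting $d_{\text{inter}}=6d$ mentioned above, and that the gated M-CSGU output keeps half of the $d_{\text{inter}}$ channels, as the split-and-gate definition in the next paragraph implies. The function name `multiconv_block_shapes` is hypothetical and is not part of ESPnet.

```python
def multiconv_block_shapes(num_frames, d, expansion=6):
    """Return (stage, shape) pairs for one MultiConv block.

    expansion=6 mirrors the typical d_inter = 6d setting; it is an
    assumption of this sketch, not a fixed property of the model.
    """
    d_inter = expansion * d   # width after the first channel projection
    d_half = d_inter // 2     # width of each half produced by the M-CSGU split
    return [
        ("layer norm", (num_frames, d)),
        ("channel projection + GELU", (num_frames, d_inter)),
        ("M-CSGU (split, multi-kernel conv, gate)", (num_frames, d_half)),
        ("channel projection + dropout", (num_frames, d)),
    ]


if __name__ == "__main__":
    for stage, shape in multiconv_block_shapes(num_frames=100, d=256):
        print(f"{stage:42s} -> {shape}")
```

With $d=256$, the intermediate width is 1536 and each half handled inside M-CSGU has 768 channels.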

Multi-kernel Convolutional Spatial Gating Unit (M-CSGU): The M-CSGU module takes $\hat{A}$ as its input. We first split each representation in $\hat{A}$ along the channel dimension into two parts, each of dimension $d'=d_{\text{inter}}/2$. One part undergoes layer normalization and is passed through $P$ depthwise convolutions with kernel sizes $K=\{k_1,k_2,\ldots,k_P\}$, whose outputs are then fused; the other part stays intact. The two parts are finally multiplied element-wise, thus creating a gate. Formally, these operations are as follows:

\begin{align}
[Z_l, Z_r] &= [\hat{A}[:, :d'],\ \texttt{LayerNorm}(\hat{A}[:, d':])] \nonumber \\
[V_1, V_2, \ldots, V_P] &= [\texttt{Conv}_{k_1}(Z_r), \ldots, \texttt{Conv}_{k_P}(Z_r)] \nonumber \\
\tilde{Z}_r &= \textsc{Fusion}([V_1, V_2, \ldots, V_P]) \nonumber \\
\hat{C} &= Z_l \odot \tilde{Z}_r
\end{align}

where $\hat{A}\in\mathbb{R}^{T\times d_{\text{inter}}}$, $Z_l, Z_r, \tilde{Z}_r, \hat{C}\in\mathbb{R}^{T\times d'}$, $\odot$ represents the element-wise product, $\texttt{Conv}_{k_i}()$ refers to a depthwise convolution with a kernel size of $k_i$, and $\hat{C}$ is the final output of this block, which is further passed to a position-wise feed-forward layer. Since $\textsc{Fusion}()$ must preserve the dimensionality of its input $Z_r$, the number of input channels to each $\texttt{Conv}$ is $d'$; the number of output channels of each $V_j$, however, depends on the nature of the $\textsc{Fusion}()$ operation. We explore four different fusion mechanisms, shown in Figure 2, that we can further group into two categories: Sum-based and Concat-based Fusion.
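The gating in Equation 1 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the ESPnet implementation: layer normalization has no learned scale or bias, the convolutions use zero padding so that the output keeps $T$ frames, and the helper names (`layer_norm`, `depthwise_conv1d`, `m_csgu_sum`) are hypothetical. The fusion used here is a plain element-wise sum of the branch outputs.

```python
import math


def layer_norm(frame, eps=1e-5):
    """Normalize one frame over its channels (no learned scale or bias)."""
    mean = sum(frame) / len(frame)
    var = sum((v - mean) ** 2 for v in frame) / len(frame)
    return [(v - mean) / math.sqrt(var + eps) for v in frame]


def depthwise_conv1d(x, kernels):
    """Depthwise 1-D convolution over time with zero 'same' padding.

    x: T frames, each a list of C channel values.
    kernels: C odd-length weight lists, one per channel.
    Implemented as cross-correlation (no kernel flip), as is common
    in deep learning toolkits.
    """
    num_frames, num_channels = len(x), len(x[0])
    out = [[0.0] * num_channels for _ in range(num_frames)]
    for c, weights in enumerate(kernels):
        half = len(weights) // 2
        for t in range(num_frames):
            for i, w in enumerate(weights):
                src = t + i - half
                if 0 <= src < num_frames:
                    out[t][c] += w * x[src][c]
    return out


def m_csgu_sum(a_hat, kernel_bank):
    """Equation 1 with an element-wise sum as the fusion.

    a_hat: T x d_inter input.
    kernel_bank: P banks, each holding d' per-channel kernels.
    """
    d_half = len(a_hat[0]) // 2
    z_l = [frame[:d_half] for frame in a_hat]               # gate half, untouched
    z_r = [layer_norm(frame[d_half:]) for frame in a_hat]   # normalized half
    branches = [depthwise_conv1d(z_r, bank) for bank in kernel_bank]  # V_1..V_P
    return [
        [z_l[t][c] * sum(v[t][c] for v in branches) for c in range(d_half)]
        for t in range(len(a_hat))
    ]
```

For example, calling `m_csgu_sum([[1.0, 2.0, 1.0, 3.0]], [[[0.0, 1.0, 0.0]] * 2, [[0.0, 0.0, 1.0, 0.0, 0.0]] * 2])` uses two identity kernels of sizes 3 and 5, so each branch returns the normalized half unchanged and the result is roughly `[[-2.0, 4.0]]`.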

Sum-based Fusion: In this mechanism, both the input and output channels are of the same size. The outputs obtained from each convolution are combined using an element-wise addition operation as shown in Figure 2(a). That is, $\tilde{Z}_r$ in Equation 1 is computed as $\tilde{Z}_r = V_1 + V_2 + \ldots + V_P$. In our experiments, we refer to this fusion approach as $\textsc{MultiConv}_{\texttt{sum}}$. We further enhance this fusion by learning weights that decide the importance of each convolution for every frame of the input, as shown in Figure 2(b) and defined formally below:

\begin{align*}
\alpha^{s} &= \{\alpha^{s}_{1}, \ldots, \alpha^{s}_{P}\} = \texttt{Softmax}(\textsc{FFN}_{d' \rightarrow P}(Z^{s}_{r})) \\
\tilde{Z}^{s}_{r} &= \alpha^{s}_{1} \cdot V^{s}_{1} + \alpha^{s}_{2} \cdot V^{s}_{2} + \ldots + \alpha^{s}_{P} \cdot V^{s}_{P} \\
\tilde{Z}_{r} &= [\tilde{Z}^{1}_{r}, \tilde{Z}^{2}_{r}, \ldots, \tilde{Z}^{T}_{r}]
\end{align*}

where $\textsc{FFN}_{a\rightarrow b}$ is a projection layer that projects $a$-dimensional inputs to $b$-dimensional outputs and $\alpha^{s}_{j}\in[0,1]$ is the importance given by the $s^{\text{th}}$ frame $Z^{s}_{r}$ to the output $V^{s}_{j}$ of the $j^{\text{th}}$ kernel. We refer to this fusion mechanism as $\textsc{MultiConv}_{\texttt{weighted}}$ in our experiments.
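The per-frame weighting can be illustrated with a short sketch. It assumes that the projection $\textsc{FFN}_{d'\rightarrow P}$ is a single linear layer with weights `W` and biases `b`; the source only calls it a projection layer, so this choice and the function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_fusion(z_r, branches, W, b):
    """Fuse P branch outputs with frame-dependent weights.

    z_r: T x d' normalized input frames.
    branches: P outputs V_1..V_P, each T x d'.
    W, b: weights (P x d') and biases (P) of the assumed linear projection.
    Returns the fused T x d' output and the T x P weights alpha.
    """
    num_kernels = len(branches)
    fused, alphas = [], []
    for s, frame in enumerate(z_r):
        logits = [
            sum(w * x for w, x in zip(W[j], frame)) + b[j]
            for j in range(num_kernels)
        ]
        alpha = softmax(logits)  # importance of each kernel for frame s
        fused.append([
            sum(alpha[j] * branches[j][s][c] for j in range(num_kernels))
            for c in range(len(frame))
        ])
        alphas.append(alpha)
    return fused, alphas
```

With all weights and biases set to zero, every kernel receives weight $1/P$ and the fusion becomes the average of the branch outputs, i.e. the sum-based fusion scaled by $1/P$.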

Concat-based Fusion: In contrast to sum-based fusion, concat-based fusion allocates a portion of the output to each convolution, so the numbers of input and output channels of these convolutions differ. Each convolution is allocated $1/P$ of the total number of features $d'$, so that it has $d'/P$ output channels, as shown in Figure 2(c). This operation is shown below:

\begin{align*}
\tilde{Z}^{s}_{r} &= \texttt{Concat}(V^{s}_{1}, V^{s}_{2}, \ldots, V^{s}_{P}) \\
\tilde{Z}_{r} &= [\tilde{Z}^{1}_{r}, \tilde{Z}^{2}_{r}, \ldots, \tilde{Z}^{T}_{r}]
\end{align*}

where $V^{s}_{j}\in\mathbb{R}^{d'/P}$ and $\tilde{Z}^{s}_{r}\in\mathbb{R}^{d'}$. In our experiments, we call this $\textsc{MultiConv}_{\texttt{concat}}$. We further enhance this fusion by introducing a depthwise convolution after the frame reconstruction to take neighboring frames into account while combining the convolution outputs, as shown in Figure 2(d). We refer to this architecture as $\textsc{MultiConv}_{\texttt{depth}}$ in our experiments.
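The channel bookkeeping of concat-based fusion can be sketched as follows. The sketch only covers the frame-wise concatenation; how each branch compresses $d'$ input channels to $d'/P$ outputs is not modelled, and the divisibility check and function names are assumptions made for illustration.

```python
def branch_channels(d_half, num_kernels):
    """Channels produced by each branch under concat-based fusion (d'/P)."""
    if d_half % num_kernels != 0:
        raise ValueError("d' must be divisible by the number of kernels P")
    return d_half // num_kernels


def concat_fusion(branches):
    """Frame-wise concatenation of P branch outputs.

    branches: P lists of T frames; branch j contributes d'/P channels per frame.
    Returns T frames with d' channels each.
    """
    num_frames = len(branches[0])
    return [
        [value for branch in branches for value in branch[t]]
        for t in range(num_frames)
    ]
```

For instance, with $d'=768$ and the four kernels $K=\{7,15,23,31\}$ of Figure 2, `branch_channels(768, 4)` gives 192 channels per branch, and concatenating the four branches restores 768 channels per frame.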

3 Experimental Setup

We conduct experiments on five datasets, namely Librispeech-100h (LS-100) [librispeech], Librispeech-960h (LS-960) [librispeech], Tedlium-2 [tedlium2], AISHELL-1 [aishell] and SLURP [slurp]. We use the ESPnet toolkit [espnet] to run all our experiments on a combination of NVIDIA RTX A6000 and NVIDIA Tesla V100 GPUs.\footnote{We ensure that the same GPU and environment are used for all experiments on a particular dataset.} All our models take 80-dimensional log-Mel features as input, extracted with a 25 ms window and a 10 ms stride. We also use 3-way speed perturbation with ratios $\{0.9, 1.0, 1.1\}$ and SpecAugment [specaug]. In all our experiments, we use the experimental settings recommended in the ESPnet recipes. For all datasets except LS-960 and SLURP, the encoder-decoder architecture consists of 12 encoder and 6 decoder layers with an attention dimension of $d=256$ and 4 attention heads. For LS-960 and SLURP, we use an 18-layer encoder with 8 attention heads and an attention dimension of $d=512$.
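As a rough illustration of the front-end settings, the sketch below counts the feature frames produced by a 25 ms window with a 10 ms stride and shows how speed perturbation changes the utterance length. The 16 kHz sampling rate, the absence of padding at the utterance edges, and the function names are assumptions of this sketch; actual toolkits may pad and therefore report slightly different frame counts.

```python
def num_frames(num_samples, sample_rate=16000, window_ms=25, stride_ms=10):
    """Number of full analysis windows that fit into the signal (no padding)."""
    window = sample_rate * window_ms // 1000   # 400 samples at 16 kHz
    stride = sample_rate * stride_ms // 1000   # 160 samples at 16 kHz
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // stride


def perturbed_length(num_samples, speed):
    """Approximate sample count after speed perturbation by factor `speed`."""
    return round(num_samples / speed)


if __name__ == "__main__":
    one_second = 16000
    for speed in (0.9, 1.0, 1.1):
        samples = perturbed_length(one_second, speed)
        print(speed, samples, num_frames(samples))
```

Under these assumptions, a one-second utterance yields 98 frames at the original speed, 109 frames at ratio 0.9 and 89 frames at ratio 1.1.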

4 Experimental Results and Analysis

Table 1 compares all four variants of our proposed Multi-Convformer (elaborated in Section 2.1) to Transformer [speech-transformer] and Conformer [conformer] models. Our method significantly outperforms both these approaches across all three datasets and multiple modelling paradigms; results that are statistically significant are marked with $\dagger$ or $\ddagger$. Additionally, we find that among the four $\textsc{Fusion}$ methods (Figure 2), $\textsc{MultiConv}_{\texttt{sum}}$ and $\textsc{MultiConv}_{\texttt{depth}}$ perform best. Henceforth, we only report results using the $\textsc{MultiConv}_{\texttt{depth}}$ fusion strategy.
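Relative improvements can be read off Table 1 with the usual relative error reduction, $100 \cdot (\text{baseline} - \text{proposed}) / \text{baseline}$. The helper below is a small sketch with a hypothetical name; the example values are the Conformer and $\textsc{MultiConv}_{\texttt{depth}}$ pure CTC results on the Tedlium2 test set.

```python
def relative_reduction(baseline, proposed):
    """Relative error-rate reduction in percent (positive means improvement)."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return 100.0 * (baseline - proposed) / baseline


if __name__ == "__main__":
    # Pure CTC, Tedlium2 test WER: Conformer 8.80, MultiConv_depth 8.44.
    print(f"{relative_reduction(8.80, 8.44):.2f}% relative WER reduction")
```

For this pair of numbers the reduction is about 4.1% relative.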

4.1 ASR and SLU Experiments

Table 2: Comparison of our system (WER %) with CgConv and E-Branchformer (reproduced using ESPnet recipes [espnet]) on the test splits of the three datasets.
\begin{tabular}{lcccc}
\hline
Method & \multicolumn{2}{c}{Librispeech-100h} & Tedlium2 & AISHELL \\
 & Test Clean & Test Other & Test & Test \\
\hline
\multicolumn{5}{c}{Attention Encoder Decoder (AED) models} \\
CgConv & 6.36 & 17.18 & 7.44 & 4.55 \\
E-Branchformer [e_branchformer] & 6.50 & 17.06 & 7.34 & 4.47 \\
$\textsc{MultiConv}_{\texttt{depth}}$ & 6.18 & 17.00 & 7.27 & 4.46 \\
\hline
\multicolumn{5}{c}{Encoder Only (Pure CTC) models} \\
CgConv & 9.66 & 23.79 & 8.78 & 6.33 \\
E-Branchformer [e_branchformer] & 10.05 & 23.87 & 8.78 & 6.04 \\
$\textsc{MultiConv}_{\texttt{depth}}$ & 9.47 & 23.09 & 8.44 & 6.01 \\
\hline
\multicolumn{5}{c}{RNN Transducer (RNN-T) models} \\
CgConv & 6.95 & 18.18 & 7.72 & 5.24 \\
E-Branchformer [e_branchformer] & 6.87 & 18.09 & 7.71 & 5.17 \\
$\textsc{MultiConv}_{\texttt{depth}}$ & 7.05 & 18.23 & 7.67 & 5.05 \\
\hline
\end{tabular}
Figure 3: Line plot showing the degree of diagonality seen in each encoder layer’s self-attention block. A higher value indicates that the attention weight matrix is more concentrated along its diagonal [attnetion_usefullness].

Comparison with Conformer variants. Table 4.1 shows the WER comparison between our system and two Conformer variants: CgConv (Conformer with convolution replaced by a Convolutional Spatial Gating Unit [cgmlp])\footnote{This is a stronger variant of the original CgMLP architecture [cgmlp], which is a pure MLP-based system.} and E-Branchformer (Conformer with disentangled attention and convolution branches) [e_branchformer, comparative_study_branchformer]. We note that $\textsc{MultiConv}_{\texttt{sum}}$ can be considered an improved version of CgConv, as it reduces to CgConv when a single fixed convolution kernel is employed instead of multiple kernels. Further, we find that using a gate in conjunction with convolution allows the model to utilize convolution selectively, thereby introducing a natural branching capability. As a result, our proposed architecture can be viewed as an enhancement to CgConv and a parameter-efficient variant of Branchformer that achieves comparable or better performance on speech-related tasks.
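The claim that the sum-fusion variant reduces to a single-kernel gating unit can be illustrated with a small NumPy sketch. This is not the authors' implementation: the channel split, the omission of layer normalization and projections, and the function names `depthwise_conv1d` and `multi_conv_sum_gate` are all assumptions made for illustration only.

```python
import numpy as np


def depthwise_conv1d(x, kernel):
    """Depthwise 1-D convolution with zero 'same' padding.

    x: (T, C) sequence; kernel: (k, C) with odd k, one filter per channel.
    """
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))  # zero-pad along time only
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        out[t] = (xp[t:t + k] * kernel).sum(axis=0)
    return out


def multi_conv_sum_gate(x, kernels):
    """Assumed gating form: split the channels into two halves and gate the
    first half with the sum of depthwise convolutions of the second half."""
    x_r, x_g = np.split(x, 2, axis=-1)
    conv = sum(depthwise_conv1d(x_g, w) for w in kernels)
    return x_r * conv


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((20, 8))
    kernels = [rng.standard_normal((k, 4)) for k in (7, 15)]
    # With one kernel this is a plain single-kernel gating unit.
    single = multi_conv_sum_gate(x, kernels[:1])
    multi = multi_conv_sum_gate(x, kernels)
    print(single.shape, multi.shape)
```

Under these assumptions, passing a list with one kernel yields exactly the single-kernel gate, while additional kernels only add terms to the convolution output before gating.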

In Figure 3, we compare the diagonal properties of self-attention blocks among Transformer, Conformer, and Multi-Convformer via the diagonality metric [attnetion_usefullness, branchformer]. This metric indicates the degree to which attention heads focus on local rather than global information. We find that Multi-Convformer allows for more global self-attention blocks: its average diagonality across all layers is 17% and 3% lower, in relative terms, than that of Transformer and Conformer, respectively.
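To make the metric concrete, the sketch below computes one common formulation of diagonality for a single attention matrix: one minus the attention-weighted distance from the diagonal, normalized per query by the largest possible distance. The exact definition is given in the cited works; the formula, the requirement of at least two time steps, and the helper names here are assumptions for illustration.

```python
import numpy as np


def diagonality(attn):
    """Diagonality of a (T, T) row-stochastic attention matrix, T >= 2.

    Returns a value in [0, 1]; 1 means all weight lies on the diagonal.
    """
    T = attn.shape[0]
    idx = np.arange(T)
    dist = np.abs(idx[:, None] - idx[None, :])   # |i - j| for query i, key j
    max_dist = dist.max(axis=1)                  # farthest key for each query
    expected = (attn * dist).sum(axis=1) / max_dist
    return float(1.0 - expected.mean())


def relative_reduction(baseline, ours):
    """Relative reduction (in %) of `ours` with respect to `baseline`."""
    return 100.0 * (baseline - ours) / baseline


if __name__ == "__main__":
    T = 6
    local = np.eye(T)                    # purely local attention
    uniform = np.full((T, T), 1.0 / T)   # fully global attention
    print(diagonality(local), diagonality(uniform))
```

A layer-level score would average this quantity over heads and utterances, and the relative reductions quoted above compare such averages between architectures.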

Performance on Librispeech-960h. Table 4.1 compares the WERs of our proposed system with those of other architectures on the full Librispeech [librispeech] dataset. For the baselines, we reuse the numbers reported by Peng et al. [e_branchformer]. We evaluate with and without an external language model (LM) during inference. In both settings, our model is on par with state-of-the-art architectures, achieving comparable or slightly better performance when trained with large amounts of data.

SLU experiments. To evaluate the effectiveness of our proposed approach on tasks other than ASR, in Table 4.1 we evaluate our method on SLU with the SLURP [slurp] dataset. Multi-Convformer matches or outperforms both Conformer and E-Branchformer on every reported metric for intent classification and entity recognition (it ties E-Branchformer on intent test accuracy), while having the fewest parameters.

Summary of results. In small-scale training data scenarios for ASR (Librispeech-100h, Tedlium2 and AISHELL, shown in Table 2.1) and SLU (SLURP), Multi-Convformer performs consistently better than Conformer. Moreover, our proposed system exhibits comparable or improved performance when compared to other state-of-the-art Conformer variants such as CgMLP and E-Branchformer. With large-scale datasets like Librispeech-960h, we find Multi-Convformer to be on par with all these architectures.
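The size of these differences is easiest to judge as a relative WER change. The following sketch computes it for a few Librispeech-100h entries copied from Table 2; the helper name `relative_wer_reduction` and the choice of rows are illustrative assumptions.

```python
def relative_wer_reduction(baseline: float, ours: float) -> float:
    """Relative WER reduction in percent; negative means `ours` is worse."""
    return 100.0 * (baseline - ours) / baseline


# (paradigm, split) -> (E-Branchformer WER, MultiConv_depth WER), from Table 2
LS100 = {
    ("AED", "test-clean"): (6.50, 6.18),
    ("CTC", "test-other"): (23.87, 23.09),
    ("RNN-T", "test-clean"): (6.87, 7.05),
}

if __name__ == "__main__":
    for (paradigm, split), (base, ours) in LS100.items():
        rel = relative_wer_reduction(base, ours)
        print(f"{paradigm:6s} {split:10s} {rel:+.2f}%")
```

A negative value, as for the RNN-T test-clean entry, indicates a setting where the baseline has the lower WER.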

Figure 4: Heatmap showing the importance given to each convolution kernel across all encoder layers. A dark blue cell indicates a high level of importance given to a specific kernel.
Table 3: Comparison of the performance (WER %) of our system against other architectures on the full Librispeech dataset.
\begin{tabular}{lccccc}
\hline
Method & Params & \multicolumn{2}{c}{Without LM} & \multicolumn{2}{c}{With LM} \\
 & & Test Clean & Test Other & Test Clean & Test Other \\
\hline
Conformer [conformer] & 147.8M & 2.16 & 4.74 & 1.84 & 3.95 \\
Branchformer [branchformer] & 146.7M & 2.25 & 4.83 & 1.93 & 4.00 \\
E-Branchformer [e_branchformer] & 148.9M & 2.14 & 4.55 & 1.97 & 4.26 \\
$\textsc{MultiConv}_{\texttt{fusion}}$ & 147.4M & 2.15 & 4.69 & 1.89 & 3.92 \\
\hline
\end{tabular}
Table 4: Comparison (accuracy % and F1) of our system against other architectures on the SLU task.
\begin{tabular}{lcccccc}
\hline
Method & Params & \multicolumn{2}{c}{Intent Classification} & \multicolumn{3}{c}{Entity Recognition} \\
 & & Valid Acc. & Test Acc. & SLU-F1 & Precision & Recall \\
\hline
Conformer [conformer] & 109M & 87.4 & 86.7 & 77.2 & 80.0 & 74.5 \\
E-Branchformer [e_branchformer] & 110M & 88.3 & 87.4 & 78.5 & 80.7 & 76.5 \\
$\textsc{MultiConv}_{\texttt{fusion}}$ & 108M & 88.8 & 87.4 & 78.9 & 80.8 & 77.1 \\
\hline
\end{tabular}

4.2 Analysis of Convolution Kernels

To better understand how the kernels are being utilized, in Figure 4, we visualize the weights learned by the gate in $\textsc{MultiConv}_{\texttt{weighted}}$ (shown in Figure 2(b)) on the test-other split of Librispeech-100h and on the Tedlium2 and AISHELL datasets. We observe that not every layer benefits from having multiple convolution kernels. While the initial few encoder layers rely mostly on the smaller kernels, the intermediate and final encoder layers reap the most benefits of having multiple convolution kernels.

Table 5: Comparison of the performance (WER %) of our system with varying number and size of convolution kernels.
\begin{tabular}{lccc}
\hline
Kernels & Params & \multicolumn{2}{c}{Librispeech-100h} \\
 & & Test Clean & Test Other \\
\hline
$K=\{3,7\}$ & 36.8M & 6.15 & 17.03 \\
$K=\{23,31\}$ & 37.0M & 6.21 & 17.06 \\
$K=\{7,31\}$ & 36.9M & 6.36 & 17.19 \\
$K=\{7,15,23,31\}$ & 37.2M & 6.18 & 17.00 \\
$K=\{7,15,31,63\}$ & 37.4M & 6.38 & 17.18 \\
$K=\{37,43,49,55\}$ & 37.8M & 6.22 & 17.13 \\
\hline
\end{tabular}

In Table 5, we evaluate the performance of our system by varying the number and size of the convolution kernels. We observe that when convolutions with large kernel sizes (i.e., greater than $32$) are used, performance is negatively impacted. Performance also diminishes when there is a large gap between the chosen kernel sizes. Finally, using four kernels instead of two further improves performance. Thus, in all our experiments, we use kernel sizes $K=\{7,15,23,31\}$.
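The small spread in parameter counts across kernel sets is what one would expect if the kernels are depthwise: each kernel of size $k$ then adds only $k$ weights per channel. The sketch below counts such weights for the kernel sets in Table 5. The channel width of the convolution module is not stated in this section, so `channels=256` is a placeholder assumption (only the 12 encoder layers come from the setup above), and the resulting numbers are not expected to reproduce the Params column exactly.

```python
def depthwise_params(kernel_sizes, channels, num_layers, bias=False):
    """Total depthwise-convolution weights across all encoder layers.

    A depthwise kernel of size k over `channels` channels has k * channels
    weights (plus `channels` biases if bias=True).
    """
    per_layer = sum(k * channels for k in kernel_sizes)
    if bias:
        per_layer += len(kernel_sizes) * channels
    return per_layer * num_layers


KERNEL_SETS = [
    (3, 7), (23, 31), (7, 31),
    (7, 15, 23, 31), (7, 15, 31, 63), (37, 43, 49, 55),
]

if __name__ == "__main__":
    # channels=256 is a hypothetical width; num_layers=12 follows the setup.
    for ks in KERNEL_SETS:
        n = depthwise_params(ks, channels=256, num_layers=12)
        print(ks, f"{n / 1e6:.2f}M depthwise weights")
```

Under this assumption, even the largest kernel set contributes well under one million weights, which is consistent in order of magnitude with the differences of at most 1.0M seen in Table 5.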

5 Conclusion

In this work, we propose Multi-Convformer, which utilizes multiple convolution kernels instead of a single fixed convolution as in Conformers. We demonstrate the effectiveness of Multi-Convformer by comparing it with Conformer and its variants on multiple datasets (Librispeech-960, Tedlium2, AISHELL and Librispeech-100), diverse modelling paradigms (AED, CTC, RNN-T) and different speech tasks (ASR, SLU). We also conduct ablations and analysis for a more comprehensive understanding of our architecture.

6 Acknowledgements

Our computing resources are supported by PSC Bridges2 and NCSA Delta via ACCESS allocation CIS210014, under National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296. Additionally, the third author would like to gratefully acknowledge support from the National Language Translation Mission (NLTM): Bhashini project funded by the Ministry of Electronics and Information Technology (MeitY), Government of India.

7 References

\printbibliography