Multi-Convformer: Extending Conformer with Multiple Convolution Kernels
Darshan Prabhu (1), Yifan Peng (2), Preethi Jyothi (1), Shinji Watanabe (2)
Abstract
Convolutions have become essential in state-of-the-art end-to-end Automatic Speech Recognition (ASR) systems due to their efficient modelling of local context. Notably, their use in Conformers has led to superior performance compared to vanilla Transformer-based ASR systems. While components other than the convolution module in the Conformer have been reexamined, altering the convolution module itself has been far less explored. To this end, we introduce Multi-Convformer, which uses multiple convolution kernels within the convolution module of the Conformer in conjunction with gating. This improves the modelling of local dependencies at varying granularities. Our model rivals existing Conformer variants such as CgMLP and E-Branchformer in performance, while being more parameter efficient. We empirically compare our approach with Conformer and its variants across four different datasets and three different modelling paradigms and show up to 8% relative word error rate (WER) improvements.
Keywords: Automatic Speech Recognition, Conformer, Multiple Convolutions, CgMLP.
1 Introduction
In recent years, end-to-end automatic speech recognition (ASR) systems have emerged as the preferred choice for obtaining state-of-the-art ASR performance. An integral component of such systems is the speech encoder [e2e_asr_survey1, e2e_asr_survey2], which uses attention via Transformers [attention] to map the input speech into high-level acoustic representations. Transformer-based ASR systems [speech_transformer2, speech_transformer3] often struggle to model local relationships effectively. To address this limitation, Gulati et al. [conformer] proposed the Conformer [conformer_impl] architecture, which combines multi-headed attention with convolutions [convolution, convolution2, convolution3]. With the advent of Conformer models, the idea of using convolutions alongside attention to independently model both local and global relationships has been widely explored [branchformer, e_branchformer, zipformer, leformer].
![Figure 1: Overall architecture of a single Multi-Convformer encoder layer.](x1.png)
Despite these notable advancements, prior work [branchformer] has shown that the use of fixed-kernel convolutions within these models creates a bottleneck, forcing the model to re-purpose some of its attention heads to function as local information extractors. This negatively impacts the performance of attention, whose primary purpose is to model global information. To address this limitation, we propose a multiple convolution-based enhancement to the Conformer architecture, which we call Multi-Convformer. Additionally, we incorporate gating, a technique that has proven to be effective in encoder architectures [cgmlp, branchformer, e_branchformer]. By combining these approaches, we achieve significant improvements (up to 8% relative WER improvement) over the original Conformer architecture and perform on par with or better than its variants such as CgMLP [cgmlp] and E-Branchformer [e_branchformer]. The use of multiple convolutions to generate better local context has been widely adopted in image-related tasks [dynamic_convolution, dynamic_convolution2, dynamic_convolution3, dynamic_convolution4, dynamic_convolution5]. Such enhancements have also been used in speech emotion recognition [ser] and robust ASR [multi_octave, multi_stream]; unlike our work, the latter use multiple convolutions within fully convolutional or TDNN-style architectures.
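![Figure 2: Fusion mechanisms for merging the outputs of multiple convolutions: (a) sum, (b) weighted sum, (c) concat, (d) concat followed by a depthwise convolution.](x2.png)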
In summary, our main contributions are as follows:
- We propose Multi-Convformer, a variant of Conformer that uses multiple convolution kernels instead of a fixed-kernel convolution to capture local context more effectively. (Our code is available in the ESPnet toolkit.)
- We show the effectiveness of our approach by comparing with Conformer and its variants on ASR and Spoken Language Understanding (SLU). We experiment with multiple datasets (Librispeech [librispeech], Tedlium2 [tedlium2], AISHELL [aishell] and SLURP [slurp]) and various modelling paradigms and obtain up to 8% relative WER improvement over Conformer. We also conduct several analyses and ablations to showcase the effectiveness and interpretability of our approach.
2 Methodology
In this work, we experiment with the three most popular ASR architectures: the attention-based encoder-decoder model (AED) [jointctcatt], the encoder-only model (pure CTC) [peng2024owsmctc] and the RNN-Transducer model (RNN-T) [rnnt]. An essential component common to all these architectures is the encoder (Enc) module that maps an input sequence of speech features $\mathbf{X}$ to a (typically shorter) sequence of contextualized representations $\mathbf{H}$. Thereafter, the manner in which $\mathbf{H}$ is trained to predict the final $L$-length ground-truth token sequence $\mathbf{Y}$ depends on the underlying architecture. AED uses an attention-based decoder and a CTC module to jointly learn a mapping from $\mathbf{H}$ to $\mathbf{Y}$. However, pure CTC and RNN-T are encoder-only models that employ CTC [ctc] and RNN-T [rnnt] losses, respectively, to generate $\mathbf{Y}$. Since our modifications are constrained to the encoder, in subsequent sections, we restrict our discussion only to the composition of the Enc module.
2.1 Multi-Convformer Encoder
Figure 1 illustrates the overall architecture of a single Multi-Convformer encoder layer and Figure 2 gives an overview of how the outputs from multiple convolutions are merged together. The Multi-Convformer encoder layer consists of four blocks that are stacked together and interspersed with layer normalization [layernorm] and residual connections. The two position-wise feed-forward layers aid in refining the point-wise information, while the multi-head attention and convolution are responsible for incorporating contextual information. This stacked architecture has been widely used in prior work [conformer, branchformer, e_branchformer, squeezeformer]. We adopt this same stack, but replace the fixed single-kernel convolution block with a more expressive multi-kernel convolution module that will henceforth be referred to as MultiConv.
Table 1: Comparison of Multi-Convformer with Transformer and Conformer on Librispeech-100h (LS-100h, WER), Tedlium2 (WER) and AISHELL (CER). Bold marks the lowest error rate in each column within a modelling paradigm.

| Method | LS-100h # params | LS-100h Dev Clean | LS-100h Dev Other | LS-100h Test Clean | LS-100h Test Other | Tedlium2 # params | Tedlium2 Dev | Tedlium2 Test | AISHELL # params | AISHELL Dev | AISHELL Test |
|---|---|---|---|---|---|---|---|---|---|---|---|
| *Attention Encoder Decoder (AED) models* | | | | | | | | | | | |
| Transformer [speech-transformer] | 40.9M | 8.02 | 20.14 | 8.39 | 20.34 | 33.8M | 10.13 | 8.83 | 47.7M | 5.16 | 5.53 |
| Conformer [conformer] | 37.4M | 6.59 | 17.24 | 6.89 | 17.27 | 33.9M | 9.30 | 7.65 | 54.2M | 4.23 | 4.63 |
| Multi-Convformer (sum) | 37.0M | 6.00 | **16.56** | 6.33 | **16.60** | 33.5M | **7.99** | 7.34 | 54.2M | 4.18 | 4.47 |
| Multi-Convformer (weighted sum) | 37.1M | 6.17 | 17.03 | 6.48 | 17.36 | 33.6M | 8.15 | 7.41 | 54.4M | 4.40 | 4.67 |
| Multi-Convformer (concat) | 37.0M | 6.23 | 17.19 | 6.41 | 17.15 | 33.5M | 8.32 | 7.42 | 54.2M | **4.16** | **4.46** |
| Multi-Convformer (concat + depthwise conv) | 37.2M | **5.87** | 16.63 | **6.18** | 17.00 | 33.7M | 8.01 | **7.27** | 54.7M | 4.18 | **4.46** |
| *Encoder Only (Pure CTC) models* | | | | | | | | | | | |
| Transformer [speech-transformer] | 28.9M | 12.32 | 27.49 | 12.88 | 28.26 | 27.7M | 11.97 | 11.71 | 28.7M | 6.51 | 7.02 |
| Conformer [conformer] | 25.4M | 9.33 | 22.60 | 9.76 | 23.10 | 24.2M | 9.65 | 8.80 | 39.9M | 5.98 | 6.59 |
| Multi-Convformer | 25.1M | **9.24** | **22.26** | **9.47** | **23.09** | 24.0M | **8.81** | **8.44** | 40.2M | **5.67** | **6.01** |
| *RNN Transducer (RNN-T) models* | | | | | | | | | | | |
| Transformer [speech-transformer] | 32.5M | 9.18 | 22.40 | 9.43 | 22.95 | 28.7M | 10.26 | 9.65 | 31.8M | 5.69 | 6.21 |
| Conformer [conformer] | 28.9M | 6.79 | 18.28 | 7.19 | 18.70 | 25.2M | 8.37 | 7.96 | 43.0M | 4.93 | 5.22 |
| Multi-Convformer | 28.7M | **6.60** | **17.53** | **7.05** | **18.23** | 24.9M | **8.20** | **7.67** | 43.4M | **4.65** | **5.05** |
As illustrated in Figure 1, in a single encoder layer, the output from the multi-head attention block is first normalized with layer normalization. Subsequently, it undergoes a channel projection that increases its dimensionality from $d$ to $d_{hidden}$ (here $d_{hidden} > d$; using a higher intermediate dimension has been shown to be an effective strategy for position-wise feed-forward layers [positionwise_ff]). To introduce non-linearity to the representation, we apply a GELU [gelu] activation. The resulting output, $\mathbf{Z}$, is then passed through our Multi-kernel Convolutional Spatial Gating Unit (M-CSGU). M-CSGU is a more powerful alternative [branchformer, e_branchformer] to the standard convolution block because it uses gates [gated_mlp] along with convolutions. Finally, we employ another channel projection layer that projects the output from $d_{hidden}/2$ back to $d$, followed by dropout for regularization.
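Multi-kernel Convolutional Spatial Gating Unit (M-CSGU): The M-CSGU module takes $\mathbf{Z}$ as its input. We first split each representation in $\mathbf{Z}$ along the channel dimension into two parts, $\mathbf{Z}_1$ and $\mathbf{Z}_2$, each having dimension $d_{hidden}/2$. Only $\mathbf{Z}_2$ undergoes layer normalization and passes through $N$ depthwise convolutions with kernel sizes $k_1, \ldots, k_N$, whose outputs are fused; $\mathbf{Z}_1$ stays intact. The two parts are then multiplied element-wise, thus creating a gate. Formally, these operations are as follows:

$$
\begin{aligned}
\mathbf{Z}_1, \mathbf{Z}_2 &= \mathrm{Split}(\mathbf{Z}) \\
\tilde{\mathbf{Z}}_2 &= \mathrm{Fusion}\big(\mathrm{Conv}_{k_1}(\mathrm{LN}(\mathbf{Z}_2)), \ldots, \mathrm{Conv}_{k_N}(\mathrm{LN}(\mathbf{Z}_2))\big) \\
\hat{\mathbf{Z}} &= \mathbf{Z}_1 \otimes \tilde{\mathbf{Z}}_2
\end{aligned}
\tag{1}
$$

where $\mathbf{Z}_1$ and $\mathbf{Z}_2$ are the two halves of $\mathbf{Z}$, $\mathrm{LN}$ denotes layer normalization, $\otimes$ represents the element-wise product, $\mathrm{Conv}_{k_i}$ refers to a depthwise convolution with a kernel size of $k_i$, and $\hat{\mathbf{Z}}$ is the final output from this block, which is further passed to a position-wise feed-forward layer. Since Fusion must preserve the dimensionality of its input, the number of input channels to each Conv is $d_{hidden}/2$; the number of output channels, however, depends on the nature of the fusion operation. We explore four different fusion mechanisms, shown in Figure 2, that we group into two categories: Sum-based and Concat-based Fusion.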
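The gating described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the ESPnet implementation: the helper names (`layer_norm`, `depthwise_conv1d`, `sum_fusion`, `m_csgu`) are hypothetical, layer normalization is applied without learnable scale or bias, the convolutions use zero "same" padding with odd kernel sizes, and a plain element-wise sum stands in for the fusion step.

```python
import math

def layer_norm(frame, eps=1e-5):
    """Normalise one frame (list of channel values) to zero mean, unit variance."""
    mean = sum(frame) / len(frame)
    var = sum((v - mean) ** 2 for v in frame) / len(frame)
    return [(v - mean) / math.sqrt(var + eps) for v in frame]

def depthwise_conv1d(x, kernels):
    """Depthwise 1-D convolution over time with zero 'same' padding.

    x:       T frames, each a list of C channel values.
    kernels: C filters (one per channel), each a list of k weights, k odd.
    """
    T, C, k = len(x), len(x[0]), len(kernels[0])
    half = k // 2
    out = [[0.0] * C for _ in range(T)]
    for t in range(T):
        for c in range(C):
            for j in range(k):
                s = t + j - half          # source frame index
                if 0 <= s < T:            # frames outside the sequence count as zero
                    out[t][c] += kernels[c][j] * x[s][c]
    return out

def sum_fusion(branches):
    """Element-wise sum of N branch outputs, each shaped [T][C]."""
    return [[sum(vals) for vals in zip(*frames)] for frames in zip(*branches)]

def m_csgu(z, kernel_bank, fusion=sum_fusion):
    """Split z into two halves, convolve the second half with every kernel set,
    fuse the branch outputs, and gate the untouched first half with the result."""
    half = len(z[0]) // 2
    z1 = [frame[:half] for frame in z]                # stays intact
    z2 = [layer_norm(frame[half:]) for frame in z]    # normalised, then convolved
    branches = [depthwise_conv1d(z2, kernels) for kernels in kernel_bank]
    fused = fusion(branches)
    return [[a * b for a, b in zip(f1, f2)] for f1, f2 in zip(z1, fused)]

# Toy input: T = 3 frames, d_hidden = 4 channels (so each half has 2 channels).
z = [[1.0, 2.0, 0.5, -0.5], [0.0, 1.0, 1.5, 0.5], [2.0, -1.0, -1.0, 1.0]]
identity3 = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]        # size-3 kernel that copies its input
mean3 = [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]]  # size-3 moving average
gated = m_csgu(z, [identity3, mean3])
```

Passing a single kernel set in `kernel_bank` gives the single-kernel gating unit; other fusion rules can be swapped in through the `fusion` argument as long as they return `d_hidden/2` channels per frame.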
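Sum-based Fusion: In this mechanism, the input and output channels of every convolution are of the same size, $d_{hidden}/2$. The outputs obtained from each convolution are combined using an element-wise addition operation, as shown in Figure 2(a). That is, the fused output $\tilde{\mathbf{Z}}_2$ of the convolved half $\mathbf{Z}_2$ in Equation 1 is computed as:

$$
\tilde{\mathbf{Z}}_2 = \sum_{i=1}^{N} \mathrm{Conv}_{k_i}(\mathrm{LN}(\mathbf{Z}_2))
$$

In our experiments, this is the sum fusion variant. We further enhance this fusion by learning weights that decide the importance of each convolution for every frame of the input, as shown in Figure 2(b) and defined formally below:

$$
\tilde{\mathbf{Z}}_{2,t} = \sum_{i=1}^{N} \alpha_{t,i}\, \mathrm{Conv}_{k_i}(\mathrm{LN}(\mathbf{Z}_2))_t
$$

where the weights $\alpha_{t,1}, \ldots, \alpha_{t,N}$ are produced by $W$, a projection layer that projects $d_{hidden}/2$-dimensional inputs to $N$-dimensional outputs, and $\alpha_{t,i}$ is the importance given by the $t$-th frame to the output of the $i$-th kernel. In our experiments, this is the weighted-sum fusion variant.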
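The frame-wise weighting can be illustrated with a short Python sketch. Two points are assumptions rather than details taken from the text: the projection scores are normalised with a softmax so that the weights of each frame sum to one, and the projection is a bias-free matrix applied to a per-frame feature vector. The function names and the toy values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def frame_weights(features, proj):
    """Per-frame kernel weights.

    features: T frames with C channels each.
    proj:     N rows of C weights (hypothetical bias-free projection, C -> N).
    Returns T rows of N weights; softmax normalisation is an assumption.
    """
    return [softmax([sum(w * v for w, v in zip(row, frame)) for row in proj])
            for frame in features]

def weighted_sum_fusion(branches, alphas):
    """Combine N branch outputs [T][C] using per-frame weights alphas [T][N]."""
    T, C, N = len(branches[0]), len(branches[0][0]), len(branches)
    return [[sum(alphas[t][i] * branches[i][t][c] for i in range(N))
             for c in range(C)]
            for t in range(T)]

# Two branches (N = 2), T = 2 frames, C = 2 channels.
branches = [[[1.0, 2.0], [3.0, 4.0]],
            [[3.0, 6.0], [5.0, 8.0]]]
features = [[0.2, -0.1], [1.0, 0.5]]
zero_proj = [[0.0, 0.0], [0.0, 0.0]]       # all-zero projection -> uniform weights
alphas = frame_weights(features, zero_proj)
fused = weighted_sum_fusion(branches, alphas)
```

With an all-zero projection every kernel receives the same weight, so the fused output is the average of the branches; a trained projection would instead let each frame favour some kernel sizes over others.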
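Concat-based Fusion: In contrast to sum, the concat-based fusion allocates a portion of the output to each convolution, causing the number of input and output channels for these convolutions to be different. This reconstruction allocates $1/N$ of the total number of features to each of the $N$ convolutions, resulting in $d_{hidden}/(2N)$ output channels per convolution, as shown in Figure 2(c). This operation is shown below:

$$
\tilde{\mathbf{Z}}_2 = \mathrm{Concat}\big(\mathrm{Conv}_{k_1}(\mathrm{LN}(\mathbf{Z}_2)), \ldots, \mathrm{Conv}_{k_N}(\mathrm{LN}(\mathbf{Z}_2))\big)
$$

where each $\mathrm{Conv}_{k_i}$ maps the $d_{hidden}/2$ channels of the convolved half $\mathbf{Z}_2$ to $d_{hidden}/(2N)$ output channels. In our experiments, this is the concat fusion variant. We further enhance this fusion by introducing a depthwise convolution after the frame reconstruction to take neighboring frames into account while combining the convolution outputs, as shown in Figure 2(d). In our experiments, this is the concat fusion variant with depthwise convolution.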
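The channel bookkeeping behind concat-based fusion can be checked with a small Python sketch. It assumes that the gated half has `d_hidden // 2` channels and that this number is divisible by the number of kernels; how each convolution reduces its channel count is not spelled out above, so the sketch starts from branch outputs that already have the reduced width. The function names are hypothetical.

```python
def branch_out_channels(d_hidden, num_kernels):
    """Output channels allotted to each convolution under concat fusion."""
    gated = d_hidden // 2                     # channels of the convolved half
    if gated % num_kernels:
        raise ValueError("d_hidden / 2 must be divisible by the number of kernels")
    return gated // num_kernels

def concat_fusion(branches):
    """Concatenate N branch outputs [T][C/N] along the channel axis -> [T][C]."""
    return [[v for frame in frames for v in frame] for frames in zip(*branches)]

# d_hidden = 8 and N = 2 kernels: each branch contributes 2 of the 4 channels.
per_branch = branch_out_channels(8, 2)
branch_a = [[1.0, 2.0], [3.0, 4.0]]           # T = 2 frames, 2 channels
branch_b = [[5.0, 6.0], [7.0, 8.0]]
fused = concat_fusion([branch_a, branch_b])   # T = 2 frames, 4 channels
```

Unlike sum fusion, every output channel here is produced by exactly one kernel size, which is why the variant in Figure 2(d) adds a depthwise convolution over time after the concatenation.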
3 Experimental Setup
We conduct experiments on five datasets, namely Librispeech-100h (LS-100) [librispeech], Librispeech-960h (LS-960) [librispeech], Tedlium-2 [tedlium2], AISHELL-1 [aishell] and SLURP [slurp]. We use the ESPnet toolkit [espnet] to run all our experiments on a combination of NVIDIA RTX A6000 and NVIDIA Tesla V100 GPUs. (We ensure that the same GPU and environment are used for all the experiments on a particular dataset.) All our models take log-Mel features as input, extracted with a 25ms window size and a 10ms stride. We also use 3-way speed perturbation and SpecAugment [specaug]. In all our experiments, we use the experimental settings recommended in ESPnet recipes. All datasets except LS-960 and SLURP share one encoder-decoder configuration (number of encoder and decoder layers, attention dimension and number of attention heads); for LS-960 and SLURP, we use an encoder with a different number of layers, attention heads and attention dimension.
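To make the front-end settings concrete, the sketch below counts how many feature frames a 25ms window with a 10ms stride produces for an utterance. The 16 kHz sampling rate, the absence of padding, and the example speed-perturbation ratio of 0.9 are assumptions made for illustration; toolkits differ in their framing conventions, so exact counts may vary by a frame or two.

```python
def num_frames(duration_s, sample_rate=16000, window_ms=25, stride_ms=10):
    """Number of analysis frames for an utterance, assuming no padding."""
    n_samples = int(round(duration_s * sample_rate))
    window = sample_rate * window_ms // 1000   # samples per window
    stride = sample_rate * stride_ms // 1000   # samples per hop
    if n_samples < window:
        return 0
    return 1 + (n_samples - window) // stride

def perturbed_duration(duration_s, ratio):
    """Duration after speed perturbation: playing faster (ratio > 1) shortens audio."""
    return duration_s / ratio

frames_original = num_frames(10.0)                           # 10-second utterance
frames_slowed = num_frames(perturbed_duration(10.0, 0.9))    # hypothetical ratio 0.9
```

Under these assumptions a 10-second utterance yields roughly 100 frames per second, and slowing the audio down lengthens the frame sequence proportionally.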
4 Experimental Results and Analysis
Table 1 compares all four variants of our proposed Multi-Convformer (elaborated in Section 2.1) to Transformer [speech-transformer] and Conformer [conformer] models. Our method significantly outperforms both these approaches across all three datasets and multiple modelling paradigms; results that are statistically significant are shown with . Additionally, we find that two of the four fusion methods (Figure 2) perform better than the others. Henceforth, we only show results for Multi-Convformer with concat fusion followed by a depthwise convolution.
4.1 ASR and SLU Experiments
Table 2: Comparison of Multi-Convformer with Conformer variants on Librispeech-100h (WER), Tedlium2 (WER) and AISHELL (CER).

| Method | Librispeech-100h Test Clean | Librispeech-100h Test Other | Tedlium2 Test | AISHELL Test |
|---|---|---|---|---|
| *Attention Encoder Decoder (AED) models* | | | | |
| CgConv | 6.36 | 17.18 | 7.44 | 4.55 |
| E-Branchformer [e_branchformer] | 6.50 | 17.06 | 7.34 | 4.47 |
| Multi-Convformer | 6.18 | 17.00 | 7.27 | 4.46 |
| *Encoder Only (Pure CTC) models* | | | | |
| CgConv | 9.66 | 23.79 | 8.78 | 6.33 |
| E-Branchformer [e_branchformer] | 10.05 | 23.87 | 8.78 | 6.04 |
| Multi-Convformer | 9.47 | 23.09 | 8.44 | 6.01 |
| *RNN Transducer (RNN-T) models* | | | | |
| CgConv | 6.95 | 18.18 | 7.72 | 5.24 |
| E-Branchformer [e_branchformer] | 6.87 | 18.09 | 7.71 | 5.17 |
| Multi-Convformer | 7.05 | 18.23 | 7.67 | 5.05 |
![Figure 3: Diagonality of the self-attention blocks of Transformer, Conformer and Multi-Convformer.](extracted/5710291/figures/diagonality.png)
Comparison with Conformer variants. Table 2 shows the WER comparison between our system and two Conformer variants: CgConv (Conformer with convolution replaced by the Convolutional Spatial Gating Unit [cgmlp]; this is a stronger variant of the original CgMLP architecture [cgmlp], which is a pure MLP-based system) and E-Branchformer (Conformer with disentangled attention and convolution branches) [e_branchformer, comparative_study_branchformer]. We note here that Multi-Convformer can be considered an improved version of CgConv, as it reduces to CgConv when a single fixed convolution kernel is employed instead of multiple kernels. Further, we find that the use of a gate in conjunction with convolution allows the model to selectively utilize convolution, thereby introducing a natural branching capability. As a result, our proposed architecture can be viewed as an enhancement to CgConv and a parameter-efficient variant of Branchformer that achieves comparable or better performance on speech-related tasks.
In Figure 3, we compare the diagonal properties of self-attention blocks among Transformer, Conformer, and Multi-Convformer via the diagonality metric [attnetion_usefullness, branchformer]. This metric is an indication of the degree to which attention heads focus on capturing local information rather than global information. We find that Multi-Convformer allows for more global self-attention blocks, resulting in a decreased diagonality value when compared to both Transformer and Conformer, yielding 17% and 3% relative reductions in average diagonality values across all layers, respectively.
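For intuition, the sketch below computes a diagonality score for a single attention matrix. The exact definition used in the cited works is not reproduced in this paper, so the formulation here is an assumption: for each query frame, the attention-weighted distance to the attended frames is divided by the largest possible distance, subtracted from one, and the result is averaged over all query frames. The function name is hypothetical.

```python
def diagonality(attn):
    """Average diagonality of a square attention matrix whose rows sum to 1.

    Returns 1.0 when every frame attends only to itself and lower values
    as attention mass moves away from the diagonal. Assumed formulation.
    """
    T = len(attn)
    if T < 2:
        raise ValueError("need at least two frames")
    total = 0.0
    for i, row in enumerate(attn):
        max_dist = max(i, T - 1 - i)                         # farthest frame from i
        mean_dist = sum(w * abs(i - j) for j, w in enumerate(row))
        total += 1.0 - mean_dist / max_dist
    return total / T

local_attn = [[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]]          # purely diagonal (local) attention
global_attn = [[1 / 3] * 3 for _ in range(3)]  # uniform (global) attention
d_local = diagonality(local_attn)
d_global = diagonality(global_attn)
```

Under this formulation a lower value means the head spreads its attention over distant frames, which matches the reading that lower diagonality corresponds to more global self-attention.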
Performance on Librispeech-960h. Table 3 compares WERs of our proposed system with other architectures on the full Librispeech [librispeech] dataset. For baselines, we reuse the numbers reported by Peng et al. [e_branchformer]. We evaluate with and without the use of an external language model (LM) during inference. In both settings, our model is on par with state-of-the-art architectures, achieving comparable or slightly better performance when trained with large amounts of data.
SLU experiments. To evaluate the effectiveness of our proposed approach on tasks other than ASR, in Table 4 we evaluate our method on SLU with the SLURP [slurp] dataset. Multi-Convformer matches or outperforms both Conformer and E-Branchformer on every reported metric for the intent classification and entity recognition tasks, while having the fewest parameters.
Summary of results. In small-scale training data scenarios for ASR (Librispeech-100h, Tedlium2 and AISHELL, shown in Table 1) and SLU (SLURP), Multi-Convformer performs consistently better than Conformer. Moreover, our proposed system exhibits comparable or improved performance relative to other state-of-the-art Conformer variants such as CgMLP and E-Branchformer. On large-scale datasets such as Librispeech-960h, we find Multi-Convformer to be on par with all these architectures.
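Relative improvements of the kind quoted in this paper are measured against the baseline's error rate rather than as an absolute difference. The helper below is a hypothetical illustration of that arithmetic, applied to one pair of values from the pure CTC results on the Tedlium2 test set (Conformer 8.80, Multi-Convformer 8.44).

```python
def relative_reduction(baseline, proposed):
    """Relative error-rate reduction in percent: 100 * (baseline - proposed) / baseline."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return 100.0 * (baseline - proposed) / baseline

# Pure CTC, Tedlium2 test WER: Conformer vs. Multi-Convformer.
tedlium_ctc = relative_reduction(8.80, 8.44)
print(f"{tedlium_ctc:.2f}% relative WER reduction")
```

The absolute gap in this example is 0.36 WER points, which corresponds to a relative reduction of about 4.1%.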
![Figure 4: Weights learned by the gate of the weighted-sum fusion variant on Librispeech-100h, Tedlium2 and AISHELL.](x3.png)
Table 3: WER on Librispeech-960h with and without an external LM.

| Method | Params | Without LM: Test Clean | Without LM: Test Other | With LM: Test Clean | With LM: Test Other |
|---|---|---|---|---|---|
| Conformer [conformer] | 147.8M | 2.16 | 4.74 | 1.84 | 3.95 |
| Branchformer [branchformer] | 146.7M | 2.25 | 4.83 | 1.93 | 4.00 |
| E-Branchformer [e_branchformer] | 148.9M | 2.14 | 4.55 | 1.97 | 4.26 |
| Multi-Convformer | 147.4M | 2.15 | 4.69 | 1.89 | 3.92 |
Table 4: SLU results on SLURP.

| Method | Params | Intent Classification: Valid Acc. | Intent Classification: Test Acc. | Entity Recognition: SLU-F1 | Entity Recognition: Precision | Entity Recognition: Recall |
|---|---|---|---|---|---|---|
| Conformer [conformer] | 109M | 87.4 | 86.7 | 77.2 | 80.0 | 74.5 |
| E-Branchformer [e_branchformer] | 110M | 88.3 | 87.4 | 78.5 | 80.7 | 76.5 |
| Multi-Convformer | 108M | 88.8 | 87.4 | 78.9 | 80.8 | 77.1 |
4.2 Analysis of Convolution Kernels
To better understand how the kernels are being utilized, in Figure 4, we visualize the weights learned by the gate in the weighted-sum fusion variant (shown in Figure 2(b)) on the test-other split of Librispeech-100h and on the Tedlium2 and AISHELL datasets. We observe that not every layer benefits from having multiple convolution kernels. While the initial few encoder layers rely mostly on the smaller kernels, the intermediate and final encoder layers reap the most benefits of having multiple convolution kernels.
Table 5: Effect of the number and size of convolution kernels on Librispeech-100h (WER).

| Kernels | Params | Test Clean | Test Other |
|---|---|---|---|
| | 36.8M | 6.15 | 17.03 |
| | 37.0M | 6.21 | 17.06 |
| | 36.9M | 6.36 | 17.19 |
| | 37.2M | 6.18 | 17.00 |
| | 37.4M | 6.38 | 17.18 |
| | 37.8M | 6.22 | 17.13 |
In Table 5, we evaluate the performance of our system by varying the number and size of the convolution kernels. We observe that performance is negatively impacted when convolutions with large kernel sizes are used. Performance also diminishes when there is a large gap between the chosen kernel sizes. Finally, using four kernels instead of two further improves performance. Thus, in all our experiments, we use the kernel configuration of the 37.2M-parameter row in Table 5 (6.18 / 17.00 WER on test-clean / test-other).
5 Conclusion
In this work, we propose Multi-Convformer that utilizes multiple convolution kernels instead of a single fixed convolution as in Conformers. We demonstrate the effectiveness of Multi-Convformer by comparing with Conformer and its variants on multiple datasets (Librispeech-960, Tedlium2, AISHELL and Librispeech-100), diverse modelling paradigms (AED, CTC, RNN-T) and different speech tasks (ASR, SLU). We also conduct ablations and analysis for a more comprehensive understanding of our architecture.
6 Acknowledgements
Our computing resources are supported by PSC Bridges2 and NCSA Delta via ACCESS allocation CIS210014, under National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296. Additionally, the third author would like to gratefully acknowledge support from the National Language Translation Mission (NLTM): Bhashini project funded by the Ministry of Electronics and Information Technology (MeitY), Government of India.