Bolaji Yusuf (1,2), Jan “Honza” Cernocky (2), Murat Saraclar (1)

Pretraining End-to-End Keyword Search
with Automatically Discovered Acoustic Units

Abstract

End-to-end (E2E) keyword search (KWS) has emerged as an alternative and complementary approach to conventional keyword search which depends on the output of automatic speech recognition (ASR) systems. While E2E methods greatly simplify the KWS pipeline, they generally have worse performance than their ASR-based counterparts, which can benefit from pretraining with untranscribed data. In this work, we propose a method for pretraining E2E KWS systems with untranscribed data, which involves using acoustic unit discovery (AUD) to obtain discrete units for untranscribed data and then learning to locate sequences of such units in the speech. We conduct experiments across languages and AUD systems: we show that finetuning such a model significantly outperforms a model trained from scratch, and the performance improvements are generally correlated with the quality of the AUD system used for pretraining.

keywords:
keyword search, spoken term detection, acoustic unit discovery

1 Introduction

Productive use of the internet relies heavily on the presence and capacity of search engines to efficiently index and search through large quantities of data. Since a significant proportion of that data is in multimedia form, it is natural to develop technologies to allow efficient search through non-textual documents. Keyword search (KWS), also known as spoken term detection (STD), is one such technology: it aims to locate where in an archive of spoken documents a user-specified query has been uttered. A KWS system takes a written query and returns a list of documents purported to contain the query, timestamps in those documents where the query is located, and scores representing the system’s confidence in its hypotheses.

KWS is traditionally done by conducting text-based retrieval on the output of an automatic speech recognition (ASR) system. Outside settings with very-low recognition error rates where one-best ASR outputs may be sufficient, it is more common to index richer structures like lattices or confusion networks [1, 2, 3, 4, 5], which improve recall by accounting for uncertainty in ASR output.

More recently, ASR-free KWS methods have sought to eschew the ASR and its concomitant complexities [6, 7, 8, 9, 10, 11, 12]. Instead of relying on the output of an ASR system for indexing and search, a neural network is trained in an end-to-end (E2E) fashion to locate written queries in large spoken archives. (Even ASR-free KWS systems generally rely on simple ASR systems to get the timing information required for training.) We take [12] as a representative of this approach, and use it as our baseline in this work. The KWS model comprises a pair of encoders: a query encoder that takes a query in the form of a sequence of letters and computes a vector representation thereof, and a document encoder for computing a compatible representation of the spoken document. The two are combined via frame-wise inner products, and locations in the document which have high inner products with the query embedding are returned as hits.

Although E2E KWS methods are able to streamline the indexing and search, they generally trail ASR-based methods in terms of search accuracy. Furthermore, ASR-based systems can benefit from the rise of semi-supervised learning methods that improve the underlying ASR model by using large amounts of untranscribed speech for pretraining. Our objective in this paper is, therefore, to design a pretraining scheme that allows E2E KWS to leverage untranscribed speech. We note here that [12] already explored pretraining for E2E KWS, but it only considered pretraining with transcribed multilingual data, while we explore pretraining with untranscribed data in the target language, with the potential to expand to multilingual untranscribed data.

The input data for training E2E KWS comprises sets of speech documents and the words (queries) they contain. Hence, to pretrain E2E KWS on an untranscribed speech corpus with the same training objective, the challenge is to get the list of queries which we can pretrain the KWS model to locate. In other words, we need sequences of discrete units corresponding to the spoken content.

Acoustic unit discovery (AUD) aims to solve this exact problem—automatically discovering an inventory of phone-like units for a language from completely unlabeled data. Several works have tackled AUD, including Bayesian methods [13, 14, 15, 16], neural-network-based methods [17, 18, 19] or hybrids thereof [20, 21].

We employ the Hierarchical Subspace Hidden Markov Model (H-SHMM) [22, 23], a non-parametric Bayesian model for AUD. H-SHMM follows the phone-loop AUD paradigm [13, 16, 24], where each acoustic unit is modeled as an HMM. In H-SHMM, the HMM parameters are constrained to a phonetic subspace of the parameter space and the parameters that define the subspace itself are allowed to vary per language within a constrained “hyper”-subspace. We choose H-SHMM as it has good performance not just on intrinsic AUD but also on unsupervised word segmentation [25], which makes it suitable for our task of word localization. After training the H-SHMM, we use it to label the untranscribed speech. Then we use sequences of acoustic unit labels as pseudo-queries for pretraining the model. Finally, we finetune the model on a small amount of transcribed data.

We conduct experiments on the English Libri-light [26] and Turkish Broadcast News [27] corpora. Our experiments show that AUD-based pretraining significantly improves KWS performance for AUD that uses MFCC features, with improvements that are correlated with the AUD system’s phonetic correspondence. Furthermore, when the AUD uses pretrained transformer features as input, we still get significant improvements on Libri-light.

2 Methods

Figure 1: E2E KWS system during (a) pretraining and (b) finetuning. Grey boxes: non-trainable components, white boxes: components which are trained from scratch, blue boxes: trainable components whose parameters are transferred from a pretrained model.

2.1 Background

2.1.1 End-to-end Keyword Search

Our work is based on the E2E KWS model of [12]. This model ingests a textual query in the form of a sequence of $L$ letters, $\mathbf{q}=(q_1,\dots,q_L)$, and a spoken utterance of $N$ frames, $\mathbf{X}=(\mathbf{x}_1,\dots,\mathbf{x}_N)$, and predicts the sequence $\mathbf{y}(\mathbf{q},\mathbf{X})=(y_1,\dots,y_N)$, where each $y_n\in\{0,1\}$ is a binary random variable indicating the existence of the query in the $n$th frame of the document, i.e.:

$$y_n=\begin{cases}1,&\text{if }\mathbf{q}\text{ is spoken in }\mathbf{X}\text{ in a time span including }n\\ 0,&\text{otherwise}.\end{cases}\tag{1}$$

The model comprises a pair of encoders:

  • The query encoder computes a fixed-length representation, $\mathbf{e}_{\mathbf{q}}$, of the query by passing it through a GRU, summing the activations from the last layer of the GRU across the sequence dimension, and applying an affine projection.

  • The document encoder computes a down-sampled representation of the document, $\mathbf{H}_{\mathbf{X}}=(\mathbf{h}_1,\dots,\mathbf{h}_N)$, by passing it through a BLSTM followed by an affine transform.

The sequence $\mathbf{z}(\mathbf{q},\mathbf{X})=(z_1,\dots,z_N)$ of occurrence probabilities, $z_n\coloneqq P(y_n=1\mid\mathbf{q},\mathbf{X})$, is then obtained via a matrix-vector product of $\mathbf{H}_{\mathbf{X}}$ and $\mathbf{e}_{\mathbf{q}}$ as:

$$z_n=\sigma(\mathbf{h}_n^{\top}\mathbf{e}_{\mathbf{q}}),\tag{2}$$

where $\sigma(\cdot)$ is the logistic sigmoid function.
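As an illustration of (2), the following minimal sketch scores every frame of a document against a query embedding. It is not the implementation of [12]; the function names and the toy encodings are assumptions made for this example, and plain Python lists stand in for the encoder outputs.

```python
import math

def sigmoid(a: float) -> float:
    """Numerically stable logistic sigmoid."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def occurrence_probabilities(H, e_q):
    """Return z_n = sigmoid(h_n . e_q) for every frame encoding h_n in H."""
    return [sigmoid(sum(h_i * e_i for h_i, e_i in zip(h_n, e_q))) for h_n in H]

# Toy document of 3 frames with 2-dimensional encodings (made-up values).
H = [[2.0, 0.0], [0.0, 0.0], [-2.0, 1.0]]
e_q = [1.0, 1.0]
z = occurrence_probabilities(H, e_q)
```

Because the query enters only through the final inner product, the document encodings can be computed once and reused for every query.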

The vector of probabilities from (2) is then post-processed to obtain the timestamps in the document hypothesized to contain the query, and the corresponding confidence scores, by detecting “islands” of probabilities above $0.5$. The procedure is as follows:

  1. Probabilities $z_n<0.5$ are zeroed out.

  2. The resulting “islands” of non-zero elements are returned as system hypotheses, and each hypothesis’ confidence score is computed as the median probability in its time span.
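The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `find_hits` is hypothetical, and hits are returned as frame indices (end exclusive), which would be multiplied by the encoder's frame duration to obtain timestamps.

```python
from statistics import median

def find_hits(z, threshold=0.5):
    """Return (start_frame, end_frame, score) per island; end_frame is exclusive."""
    z = list(z)
    hits, start = [], None
    # A trailing sentinel of 0.0 closes an island that runs to the last frame.
    for n, p in enumerate(z + [0.0]):
        if p >= threshold and start is None:
            start = n  # an island opens at frame n
        elif p < threshold and start is not None:
            # The island closes; its score is the median probability inside it.
            hits.append((start, n, median(z[start:n])))
            start = None
    return hits

# Toy probability sequence containing two islands (made-up values).
z = [0.1, 0.6, 0.9, 0.7, 0.2, 0.4, 0.8, 0.55]
hits = find_hits(z)
```

Using the median rather than the maximum makes the score less sensitive to a single spuriously confident frame inside an island.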

The model is trained with mini-batch gradient descent on a transcribed speech dataset. At each training step, $t$, the following objective is minimized:

$$J_t=\sum_{l=1}^{L}\sum_{m=1}^{M}f\Bigl(\mathbf{z}\bigl(\mathbf{q}_{t,l},\mathbf{X}_{t,l,m}\bigr),\mathbf{y}\bigl(\mathbf{q}_{t,l},\mathbf{X}_{t,l,m}\bigr)\Bigr),\tag{3}$$

where $\{\mathbf{q}_{t,1},\dots,\mathbf{q}_{t,L}\}$ is a mini-batch of $L$ queries sampled randomly from the set of unigrams, bigrams and trigrams of the dataset; $\{\mathbf{X}_{t,l,1},\dots,\mathbf{X}_{t,l,M}\}$ is a set of documents sampled from the dataset such that $\mathbf{X}_{t,l,1}$ contains $\mathbf{q}_{t,l}$ while the other $M-1$ documents are sampled randomly; and $f(\cdot)$ is a modified binary cross-entropy function defined as:

$$f(z,y)=-\sum_{n=1}^{\hat{N}}\Bigl(\mathbb{1}_{z_n>1-\phi}\cdot(1-y_n)\log(1-z_n)+\mathbb{1}_{z_n<\phi}\cdot\lambda\cdot y_n\log z_n\Bigr),\tag{4}$$

with $\phi$ controlling the tolerance of the objective to easily-classified frames and $\lambda$ controlling the relative weighting of positive to negative frames. Following [12], we set $\lambda=5$, $\phi=0.7$ and $M=4$ in all our experiments.
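To make the role of $\phi$ and $\lambda$ in (4) concrete, the following sketch evaluates the loss for one query-document pair. It is an illustration only: the function name and the clamping constant `eps` are assumptions, and a real implementation would operate on batched tensors with automatic differentiation.

```python
import math

def modified_bce(z, y, phi=0.7, lam=5.0, eps=1e-12):
    """Modified binary cross-entropy of Eq. (4) for one (query, document) pair.

    z: per-frame occurrence probabilities, y: per-frame binary labels.
    eps guards against log(0); it is an assumption of this sketch.
    """
    total = 0.0
    for z_n, y_n in zip(z, y):
        if y_n == 0 and z_n > 1.0 - phi:
            # Negative frame that is not yet confidently rejected.
            total += math.log(max(1.0 - z_n, eps))
        elif y_n == 1 and z_n < phi:
            # Positive frame that is not yet confidently accepted, weighted by lam.
            total += lam * math.log(max(z_n, eps))
    return -total

# Frames 0 and 2 are already "easy" and contribute nothing to the loss.
loss = modified_bce(z=[0.9, 0.6, 0.2, 0.4], y=[1, 1, 0, 0])
```

In this toy case only the second positive frame and the last negative frame are penalized, so the loss equals $-(5\log 0.6+\log 0.6)$.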

The word-level alignments required for training are obtained by training an HMM-GMM ASR system on the training data and using it for forced alignment.

2.1.2 Hierarchical Subspace Hidden Markov Model

Acoustic unit discovery entails learning a set of units from untranscribed data. For a language, $l$, this typically involves learning a set of parameters, $\Theta^{l}=\{\boldsymbol{\theta}^{l,u}\}_{u=1}^{U_l}$, with one parameter vector for each unit $u$, which then allows frames of an utterance $\mathbf{X}=(\mathbf{x}_1,\dots,\mathbf{x}_N)$ in that language to be labeled into discrete units $v_1,\dots,v_N$, where each $v_n\in\{1,2,\dots,U_l\}$.

We employ H-SHMM [22], a Dirichlet-process-based Bayesian nonparametric model for AUD. In H-SHMM, each acoustic unit is a 3-state, left-to-right HMM-GMM with 4 Gaussians per state, and each $\boldsymbol{\theta}^{l,u}$ is a super-vector formed by concatenating all the mean vectors, covariance matrix elements and mixture weights of the HMM-GMM. The parameters of the HMM-GMMs are constrained to dwell in a low-dimensional subspace of the full parameter space:

$$\boldsymbol{\theta}^{l,u}=g\bigl(\mathbf{W}^{l}\boldsymbol{\eta}^{l,u}+\mathbf{b}^{l}\bigr),\tag{5}$$

where $\boldsymbol{\eta}^{l,u}$ is a low-dimensional embedding of the parameter (we set the dimensionality of $\boldsymbol{\eta}^{l,u}$ to 100 in our experiments, in contrast to the dimensionality of $\boldsymbol{\theta}^{l,u}$, which is around 1000 for HMMs with MFCC features and around 25000 for HMMs with XLS-R features), $\mathbf{W}^{l}$ is a language-specific low-rank matrix which, along with the bias vector $\mathbf{b}^{l}$, defines the subspace to which the parameters are constrained, and $g(\cdot)$ is a non-linear function mapping vectors from the column space of $\mathbf{W}^{l}$ to HMM parameters, ensuring e.g. that the dimensions corresponding to each covariance matrix constitute a positive-definite matrix and that the mixture weights are non-negative and sum up to one. The subspace parameters are themselves further constrained to a $K$-dimensional “hyper”-subspace:

$$\mathbf{W}^{l}=\mathbf{M}_0+\sum_{k=1}^{K}\alpha^{l}_{k}\mathbf{M}_k,\qquad\mathbf{b}^{l}=\mathbf{m}_0+\sum_{k=1}^{K}\alpha^{l}_{k}\mathbf{m}_k,\tag{6}$$

where $\boldsymbol{\alpha}^{l}$ is a language-specific low-dimensional embedding, and $\{\mathbf{M}_k\}$ and $\{\mathbf{m}_k\}$ define language-agnostic template matrices and vectors whose linear combinations define the part of the space $\mathbf{W}^{l}$ and $\mathbf{b}^{l}$ are allowed to occupy. Thus, the subspace parameters are allowed to vary in a controlled manner from one language to another. The H-SHMM defines a distribution with trainable parameters, $\{\mathbf{M}_k\},\{\mathbf{m}_k\},\{\boldsymbol{\alpha}^{l}\},\{\boldsymbol{\eta}^{l,u}\}$, which are learned in two phases:

  1. Supervised pretraining: The model is first trained on phonetically-transcribed speech from multiple languages not including the target language. This imbues the templates $\{\mathbf{M}_k\}$ and $\{\mathbf{m}_k\}$ with phone-like characteristics.

  2. Acoustic unit discovery: The distributions $\{\mathbf{M}_k\}$ and $\{\mathbf{m}_k\}$ are kept fixed, and transferred to a target language, $l^{*}$, for which $\boldsymbol{\alpha}^{l^{*}}$ and $\{\boldsymbol{\eta}^{l^{*},u}\}$ are learned.

Both phases involve optimizing a variational lower bound on the log-likelihood of the data, which yields a Baum-Welch-like training procedure. Having obtained the distributions of the HMM parameters, the untranscribed speech can then be labelled with a variational analog of Viterbi decoding. Interested readers are referred to [23] for a thorough coverage of H-SHMM and its inference.

2.2 Pretraining KWS with AUD

In this paper, we propose using AUD to label an otherwise untranscribed speech corpus into acoustic units, and using these pseudo-labels to pretrain the E2E KWS model in a setting where we have a small transcribed corpus and a larger untranscribed speech corpus in the same language.

We train an H-SHMM on a small subset of the unlabeled speech and use it to transcribe the full corpus into sequences of acoustic units. Since the KWS model expects word sequences as input, and the AUD only returns phone-like units, we form pseudo-words from acoustic unit n-grams. Specifically, we take all sequences of 5 to 15 consecutive acoustic units as pseudo-words, and use these pseudo-words as queries to the KWS model, which we pretrain using (3) to locate them in the corpus. Note that we have the pseudo-word time boundaries for training since decoding with H-SHMM, as with any other HMM, naturally yields frame-level decisions for acoustic units.
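The pseudo-query construction described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, and the sketch assumes that frame-level labels are all that is available, so consecutive repetitions of the same unit are merged into a single segment.

```python
from itertools import groupby

def collapse_frames(frame_labels):
    """Collapse frame-level unit labels into (unit, start_frame, end_frame) segments.

    end_frame is exclusive. Adjacent repeats of one unit are merged, which is
    an assumption of this sketch.
    """
    segments, pos = [], 0
    for unit, run in groupby(frame_labels):
        length = len(list(run))
        segments.append((unit, pos, pos + length))
        pos += length
    return segments

def pseudo_queries(segments, min_len=5, max_len=15):
    """Enumerate every n-gram of consecutive units with min_len <= n <= max_len.

    Each pseudo-query is returned with the frame span it covers, which serves
    as the training target for localization.
    """
    out = []
    for i in range(len(segments)):
        for n in range(min_len, max_len + 1):
            if i + n > len(segments):
                break
            window = segments[i:i + n]
            units = tuple(u for u, _, _ in window)
            out.append((units, window[0][1], window[-1][2]))
    return out

# Toy decoding of an 11-frame utterance into 6 acoustic-unit segments.
frame_labels = [3, 3, 7, 7, 7, 1, 4, 4, 9, 2, 2]
segments = collapse_frames(frame_labels)
queries = pseudo_queries(segments)
```

Since every span of 5 to 15 units is kept, the number of pseudo-queries grows roughly linearly with the length of the unit sequence, with up to 11 pseudo-queries starting at each unit.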

After pretraining is complete, we transfer the document encoder and discard the acoustic unit query encoder. We then initialize a new query encoder for actual graphemes and train it along with the transferred document encoder to locate real queries as described in Section 2.1.1.

3 Experiments

3.1 Setup

3.1.1 Datasets

Keyword search: We test the performance on the Libri-light [26] and Turkish Broadcast News (BNTR) [27] corpora. For Libri-light, we use the 10-hour Libri-light training set as the transcribed data and test on the standard Librispeech [28] sets (dev-clean, test-clean, dev-other and test-other). To match the Libri-light training data size, we use a 10-hour subset of BNTR from the VOA programs (https://catalog.ldc.upenn.edu/LDC2012S06) for training and select two 10-hour subsets from the remaining BNTR data as dev and test sets. Since neither dataset has official query lists, we randomly select 1500 queries composed of equal proportions of unigrams, bigrams and trigrams for each of the dev and test sets. Table 1 shows the proportion of out-of-vocabulary (OOV) queries.

For Libri-light pretraining, we use the 360-hour set of Librispeech as untranscribed data. In the case of BNTR, we use the full 180-hour BNTR training set for pretraining.

Table 1: OOV rates for the query lists used for each dev/test set.

Dataset         LS-clean        LS-other        BNTR
                Dev    Test     Dev    Test     Dev    Test
OOV rate (%)    1.1    2.6      2.5    3.6      11.9   6.3

Acoustic unit discovery: For the supervised phase of H-SHMM training, where we estimate the hyper-subspaces ($\{\mathbf{M}_k\}$ and $\{\mathbf{m}_k\}$ in Section 2.1.2), we use the models from [23] (https://github.com/beer-asr/beer/tree/master/recipes/hshmm), trained on data from seven languages: French, German, Polish and Spanish from the Globalphone corpus [29], and Amharic, Swahili and Wolof from the Alffa project [30]. A 1500-utterance subset of each language (totalling around 19 hours of speech) was used.

For actual acoustic unit discovery, where we learn the acoustic units ($\{\boldsymbol{\alpha}^{l}\},\{\boldsymbol{\eta}^{l,u}\}$) for each target language, we use random 3000-utterance subsets from each respective language’s untranscribed data, and use the learned HMMs to transcribe the full corpus.

3.1.2 Acoustic features

The default acoustic inputs to our models (both AUD and KWS) are 13-dimensional MFCC features. In addition, we consider features from a pretrained 300 million parameter XLS-R model [31] (https://dl.fbaipublicfiles.com/fairseq/wav2vec/xlsr2_300m.pt), from which we use the output of the 15th layer, shown in [23] to yield considerably better AUD performance. Note that, due to computational constraints, we only use the XLS-R model as a feature extractor and we do not finetune it.

Table 2: Term weighted value comparison between the baseline and the proposed system. Dev set results are MTWVs; test set results are given as A [l, h], where A is the ATWV, and l and h are the 2.5th and 97.5th percentile ATWV estimates respectively.

KWS       AUD       LS-clean                  LS-other                  BNTR                      AUD-NMI
feature   feature   Dev    Test               Dev    Test               Dev    Test               LS     BNTR
MFCC      -         38.4   40.4 [38.7, 42.0]  16.2   14.3 [13.0, 15.7]  67.0   70.7 [69.4, 71.9]  -      -
MFCC      MFCC      44.2   45.7 [44.0, 47.4]  21.1   20.3 [18.8, 21.9]  74.5   77.6 [76.4, 78.6]  34.2   27.2
MFCC      XLS-R     56.3   56.5 [54.9, 58.2]  30.9   28.7 [27.1, 30.4]  78.2   81.4 [80.4, 82.3]  52.9   41.3
XLS-R     -         73.2   72.7 [71.0, 74.4]  62.2   63.3 [61.6, 65.0]  84.8   86.1 [85.3, 86.9]  -      -
XLS-R     XLS-R     75.8   76.2 [74.7, 77.6]  65.5   66.4 [64.9, 67.9]  84.0   85.5 [84.7, 86.2]  52.9   41.3

3.1.3 Metrics

Term Weighted Value: In our experiments, we report the term weighted value (TWV) [32, 33], which is a measure of weighted recall and precision averaged across queries. The TWV of a system on a set of queries, $\mathcal{Q}$, at a threshold, $\zeta$, is defined as:

$$\text{TWV}=100\times\Bigl(1-\frac{1}{|\mathcal{Q}|}\sum_{q\in\mathcal{Q}}\bigl(P_{miss}(q,\zeta)+\beta P_{FA}(q,\zeta)\bigr)\Bigr),\tag{7}$$

where $P_{miss}(q,\zeta)$ is the probability of misses, $P_{FA}(q,\zeta)$ is the probability of false alarms and $\beta$ is a parameter controlling the relative importance of the two. Following prior NIST evaluations [32], we set $\beta=999.9$. The threshold $\zeta$ is tuned on the dev sets. We report the maximum term weighted value (MTWV), i.e. the TWV at the threshold which maximizes it, for the dev sets, and the actual term weighted value (ATWV), computed by using the threshold tuned on the corresponding dev set, for the test sets. We adopt keyword-specific thresholding for across-query score normalization [34] in order to allow various queries with different score distributions to be compared with a single global threshold.
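The following sketch evaluates (7) from per-query detection counts obtained at a fixed threshold. The text above does not spell out how $P_{miss}$ and $P_{FA}$ are estimated; the sketch assumes the usual NIST convention of one trial per second of speech, i.e. $P_{miss}=1-N_{correct}/N_{true}$ and $P_{FA}=N_{FA}/(T-N_{true})$ with $T$ the duration of the speech in seconds. The function name and the toy counts are likewise assumptions.

```python
def twv(counts, speech_seconds, beta=999.9):
    """Term weighted value (in %) from per-query counts at a fixed threshold.

    counts maps each query to (n_true, n_correct, n_false_alarm).
    Assumed estimators (NIST-style, one trial per second of speech):
      P_miss = 1 - n_correct / n_true
      P_FA   = n_false_alarm / (speech_seconds - n_true)
    Queries with no reference occurrence are skipped (an assumption).
    """
    total, n_queries = 0.0, 0
    for n_true, n_correct, n_fa in counts.values():
        if n_true == 0:
            continue
        p_miss = 1.0 - n_correct / n_true
        p_fa = n_fa / (speech_seconds - n_true)
        total += p_miss + beta * p_fa
        n_queries += 1
    return 100.0 * (1.0 - total / n_queries)

# Toy counts for two queries on a 10-hour test set (made-up numbers).
counts = {"query a": (10, 8, 1), "query b": (4, 4, 0)}
value = twv(counts, speech_seconds=36000.0)
```

Under these assumptions, the MTWV would be obtained by recomputing the counts over a sweep of thresholds and taking the maximum, while the ATWV uses the single threshold chosen on the dev set.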

Normalized Mutual Information: We also report normalized mutual information (NMI) for the AUD systems in order to see how KWS performance correlates with the intrinsic quality of the AUD system used for pretraining. NMI is computed by normalizing the mutual information between discovered units $\mathcal{U}$ and reference phones $\mathcal{P}$ by the sum of their entropies:

$$\text{NMI}(\mathcal{P},\mathcal{U})=200\times\frac{I(\mathcal{P};\mathcal{U})}{H(\mathcal{P})+H(\mathcal{U})}\,\%.\tag{8}$$

NMI takes values in $[0,100]$, with 0 denoting completely uncorrelated units and 100 denoting perfect match.
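A minimal sketch of (8) is given below. It assumes that the reference phones and the discovered units are available as two aligned label sequences of equal length (e.g. one label per frame), which the text does not state explicitly; the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(counts, total):
    """Entropy (in nats) of the empirical distribution given by counts."""
    return -sum(c / total * math.log(c / total) for c in counts.values())

def nmi(phones, units):
    """NMI (in %) between two aligned, equal-length label sequences."""
    if len(phones) != len(units):
        raise ValueError("label sequences must be aligned")
    total = len(phones)
    h_p = entropy(Counter(phones), total)
    h_u = entropy(Counter(units), total)
    h_joint = entropy(Counter(zip(phones, units)), total)
    if h_p + h_u == 0.0:
        return 0.0  # both sequences constant; NMI is undefined, 0 by convention here
    mutual_information = h_p + h_u - h_joint
    return 200.0 * mutual_information / (h_p + h_u)

# Units that relabel the phones one-to-one versus units unrelated to them.
perfect = nmi(["a", "a", "b", "b"], [1, 1, 2, 2])
unrelated = nmi(["a", "a", "b", "b"], [1, 2, 1, 2])
```

Because NMI depends only on the co-occurrence statistics, the discovered units need not share identities with the reference phones; a consistent one-to-one relabeling already attains the maximum.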

3.1.4 Model configuration and hyper-parameters

We base the architecture of our model on [12] (code available at: https://github.com/bolajiy/golden-retriever). The query encoder is a network with a 32-dimensional embedding layer for computing vector representations of each input grapheme, followed by 2 bidirectional GRU layers with 256 output units per direction per layer, and a 400-dimensional output projection layer whose outputs are summed along the sequence dimension to obtain the vector representation of the query.

The document encoder has 6 BLSTM layers with 512-dimensional output per direction per layer, followed by a 400-dimensional output layer. We apply dropout of 0.4 between successive BLSTM layers, and down-sample by a factor of 2 after the fourth BLSTM layer. This results in document encodings with frame durations of 40ms for XLS-R features and 20ms for MFCC.

The H-SHMM uses 3-state HMM-GMMs with 4 diagonal-covariance Gaussians per HMM state, a Dirichlet process with truncation parameter of 1, 100-dimensional unit embeddings ($\boldsymbol{\eta}^{l,u}$) and 5-dimensional language embeddings ($\boldsymbol{\alpha}^{l}$).

3.1.5 Training details

The neural networks are trained with the Adam optimizer [35]. For pretraining, we use a cosine decay schedule with peak learning rate of 5e-4 and final learning rate of 1e-7, and train for 200k steps using mini-batch size of 256, with 10k warmup steps. For finetuning (and the baseline with no pretraining), we train with the same step-based learning rate scheduling as [12] using mini-batch size of 32. This starts with a learning rate of 0.002, halved whenever the validation loss (computed over 10% of the training queries) does not improve over 4 epochs. The training is stopped when validation loss does not improve for 10 epochs.
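The pretraining learning rate schedule can be sketched as below, using the stated peak and final learning rates, step count and warmup length. The shape of the warmup (linear from zero) and whether the 200k steps include the warmup are not specified above, so both are assumptions of this sketch, as is the function name.

```python
import math

def pretraining_lr(step, peak=5e-4, final=1e-7,
                   total_steps=200_000, warmup_steps=10_000):
    """Learning rate at a given step: warmup, then cosine decay from peak to final.

    Assumptions: linear warmup from 0, and total_steps counted including warmup.
    """
    if step < warmup_steps:
        return peak * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold the final rate after the last step
    return final + 0.5 * (peak - final) * (1.0 + math.cos(math.pi * progress))

# Learning rate at the start, end of warmup, midpoint of decay, and final step.
schedule = [pretraining_lr(s) for s in (0, 10_000, 105_000, 200_000)]
```

At the midpoint of the decay the cosine term vanishes, so the learning rate is the average of the peak and final values.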

3.2 Results

Table 2 shows the term weighted values of KWS with and without the proposed pretraining scheme, as well as the intrinsic AUD metrics.

When MFCCs are used as input features to KWS, we find that pretraining with AUD-learned pseudo-queries leads to significant improvements across all dev and test sets, with 5.3, 6.0 and 6.9 ATWV improvements respectively on the Librispeech-clean, Librispeech-other and BNTR test sets. Furthermore, using XLS-R features for AUD (improving the quality of the learned acoustic units by a considerable margin) leads to further significant ATWV improvements when compared to pretraining with MFCC-based acoustic units.

Replacing the KWS input with XLS-R unsurprisingly results in much better performance, even compared to the pretrained MFCC-based systems, especially on the acoustically difficult Librispeech-other sets. Furthermore, when we pretrain the XLS-R based KWS system with AUD labels, we observe ATWV gains of +3.5 and +3.1 on the Librispeech test-clean and test-other sets, respectively, and no improvement on BNTR.

4 Conclusions

In this paper, we have proposed a pretraining scheme for end-to-end keyword search. Our approach uses acoustic unit discovery to label untranscribed data and construct pseudo-queries used to pretrain the KWS model, before finetuning on a small transcribed dataset. Our experiments show that pretraining can significantly improve KWS performance.

We envision future work doing larger scale and multilingual pretraining with acoustic unit targets in conjunction with or in place of transcribed data. Another direction is to explore using more sophisticated word segmentation methods instead of our naive use of all n-grams of acoustic units.

5 Acknowledgements

The work was supported by the Czech Ministry of Interior project No. VJ01010108 “ROZKAZ” and by the European Union’s Horizon Europe project No. SEP-210943216 “ELOQUENCE”. Computing on the IT4I supercomputer was supported by the Ministry of Education, Youth and Sports of the Czech Republic through e-INFRA CZ (ID:90254). Computing on the ROYAL compute server was supported by the Turkish Directorate of Strategy and Budget under the ROYAL Project (CB SBB 2019K12-149250).

References

  • [1] K. Ng and V. W. Zue, “Subword-based approaches for spoken document retrieval,” Speech Communication, vol. 32, no. 3, pp. 157–186, 2000.
  • [2] I. Szöke, M. Fapšo, and L. Burget, “Hybrid word-subword decoding for spoken term detection,” in The 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2008, pp. 42–48.
  • [3] D. Can and M. Saraçlar, “Lattice indexing for spoken term detection,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 8, pp. 2338–2347, 2011.
  • [4] C. Chelba, T. J. Hazen, and M. Saraçlar, “Retrieval and browsing of spoken content,” IEEE Signal Processing Magazine, vol. 25, no. 3, pp. 39–49, 2008.
  • [5] L. Mangu, B. Kingsbury, H. Soltau, H.-K. Kuo, and M. Picheny, “Efficient spoken term detection using confusion networks,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE, 2014, pp. 7844–7848.
  • [6] K. Audhkhasi, A. Rosenberg, A. Sethy, B. Ramabhadran, and B. Kingsbury, “End-to-end ASR-free keyword search from speech,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1351–1359, 2017.
  • [7] B. Gündoğdu, B. Yusuf, and M. Saraçlar, “Joint learning of distance metric and query model for posteriorgram-based keyword search,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1318–1328, 2017.
  • [8] B. Yusuf, B. Gundogdu, and M. Saraclar, “Low Resource Keyword Search with Synthesized Crosslingual Exemplars,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 7, pp. 1126–1135, 2019.
  • [9] B. Yusuf, A. Gok, B. Gundogdu, and M. Saraclar, “End-to-End Open Vocabulary Keyword Search,” in Proc. Interspeech 2021, 2021, pp. 4388–4392.
  • [10] J. Švec, L. Šmídl, J. V. Psutka, and A. Pražák, “Spoken Term Detection and Relevance Score Estimation Using Dot-Product of Pronunciation Embeddings,” in Proc. Interspeech 2021, 2021, pp. 4398–4402.
  • [11] T. S. Fuchs, Y. Segal, and J. Keshet, “CNN-Based Spoken Term Detection and Localization without Dynamic Programming,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2021, pp. 6853–6857.
  • [12] B. Yusuf, J. Černocký, and M. Saraçlar, “End-to-End Open Vocabulary Keyword Search With Multilingual Neural Representations,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, no. 08, pp. 3070–3080, 2023.
  • [13] C.-y. Lee and J. Glass, “A nonparametric Bayesian approach to acoustic model discovery,” in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1.   Association for Computational Linguistics, 2012, pp. 40–49.
  • [14] H. Chen, C.-C. Leung, L. Xie, B. Ma, and H. Li, “Parallel inference of Dirichlet process Gaussian mixture models for unsupervised acoustic modeling: a feasibility study,” in Proc. Interspeech 2015, 2015, pp. 3189–3193.
  • [15] M. Heck, S. Sakti, and S. Nakamura, “Iterative training of a DPGMM-HMM acoustic unit recognizer in a zero resource scenario,” in 2016 IEEE Spoken Language Technology Workshop (SLT).   IEEE, 2016, pp. 57–63.
  • [16] L. Ondel, L. Burget, and J. Černocký, “Variational inference for acoustic unit discovery,” Procedia Computer Science, vol. 81, pp. 80–86, 2016.
  • [17] J. Chorowski, R. J. Weiss, S. Bengio, and A. van den Oord, “Unsupervised speech representation learning using wavenet autoencoders,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 12, pp. 2041–2053, 2019.
  • [18] A. Baevski, S. Schneider, and M. Auli, “vq-wav2vec: Self-supervised learning of discrete speech representations,” in 8th International Conference on Learning Representations, ICLR, 2020.
  • [19] D. Harwath, W. Hsu, and J. R. Glass, “Learning hierarchical discrete linguistic units from visually-grounded speech,” in 8th International Conference on Learning Representations, ICLR, 2020.
  • [20] J. Ebbers, J. Heymann, L. Drude, T. Glarner, R. Haeb-Umbach, and B. Raj, “Hidden Markov Model Variational Autoencoder for Acoustic Unit Discovery,” in Interspeech, 2017, pp. 488–492.
  • [21] T. Glarner, P. Hanebrink, J. Ebbers, and R. Haeb-Umbach, “Full Bayesian Hidden Markov Model Variational Autoencoder for Acoustic Unit Discovery,” in Interspeech, 2018, pp. 2688–2692.
  • [22] B. Yusuf, L. Ondel, L. Burget, J. Černocký, and M. Saraçlar, “A Hierarchical Subspace Model for Language-Attuned Acoustic Unit Discovery,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 3710–3714.
  • [23] L. Ondel, B. Yusuf, L. Burget, and M. Saraçlar, “Non-parametric Bayesian subspace models for acoustic unit discovery,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 1902–1917, 2022.
  • [24] L. Ondel, H. K. Vydana, L. Burget, and J. Černocký, “Bayesian Subspace Hidden Markov Model for Acoustic Unit Discovery,” in Interspeech, 2019, pp. 261–265.
  • [25] Z. M. Boito, B. Yusuf, F. A. L. Y. Ondel, A. Villavicencio, and L. Besacier, “Unsupervised word segmentation from discrete speech units in low-resource settings,” in Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages.   European Language Resources Association, 2022, pp. 1–9.
  • [26] J. Kahn et al., “Libri-light: A benchmark for ASR with limited or no supervision,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7669–7673, https://github.com/facebookresearch/libri-light.
  • [27] E. Arisoy, D. Can, S. Parlak, H. Sak, and M. Saraçlar, “Turkish broadcast news transcription and retrieval,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 5, pp. 874–883, 2009.
  • [28] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2015, pp. 5206–5210.
  • [29] T. Schultz, N. T. Vu, and T. Schlippe, “Globalphone: A multilingual text & speech database in 20 languages,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.   IEEE, 2013, pp. 8126–8130.
  • [30] L. Besacier et al., “Speech technologies for African languages: example of a multilingual calculator for education,” in Interspeech, 2015, pp. 1886–1887.
  • [31] A. Babu et al., “XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale,” in Proc. Interspeech 2022, 2022, pp. 2278–2282.
  • [32] J. G. Fiscus, J. Ajot, J. S. Garofolo, and G. Doddington, “Results of the 2006 spoken term detection evaluation,” in Proceedings of the ACM SIGIR Workshop on Searching Spontaneous Conversational Speech, 2007, pp. 51–57.
  • [33] S. Wegmann, A. Faria, A. Janin, K. Riedhammer, and N. Morgan, “The TAO of ATWV: Probing the mysteries of keyword search performance,” in 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, 2013, pp. 192–197.
  • [34] D. R. Miller, M. Kleber, C.-L. Kao, O. Kimball, T. Colthurst, S. A. Lowe, R. M. Schwartz, and H. Gish, “Rapid and accurate spoken term detection,” in Interspeech, 2007, pp. 314–317.
  • [35] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR 2015, 2015.