Speech-Copilot: Leveraging Large Language Models for Speech Processing via Task Decomposition, Modularization, and Program Generation

Abstract

In this work, we introduce Speech-Copilot, a modular framework for instruction-oriented speech-processing tasks that minimizes human effort in toolset construction. Unlike end-to-end methods using large audio-language models, Speech-Copilot builds speech processing-specific toolsets by analyzing pre-collected task instructions and breaking tasks into manageable sub-tasks. It features a flexible agent based on large language models that performs tasks through program generation. Our approach achieves state-of-the-art performance on the Dynamic-SUPERB benchmark, demonstrating its effectiveness across diverse speech-processing tasks. Key contributions include: 1) developing an innovative framework for speech processing-specific toolset construction, 2) establishing a high-performing agent based on large language models, and 3) offering a new perspective on addressing challenging instruction-oriented speech-processing tasks. Without additional training processes required by end-to-end approaches, our method provides a flexible and extendable solution for a wide range of speech-processing applications.

Index Terms— Large language models, speech processing, agent, program generation

1 Introduction

Nowadays, large language models (LLMs) have impacted the AI research community with their strong capabilities across a wide variety of natural language processing (NLP) tasks involving complicated reasoning [1, 2, 3], planning [4, 5, 6], and self-reflection [7, 8, 9]. These extraordinary abilities have established LLMs as powerful tools for humans and have cemented their pivotal role in recent AI research.

In particular, the potential of employing LLMs as an “assistant” or an “agent” [10, 11, 12] has been extensively explored, and several LLM-based agents that use tools, e.g. API calls, to solve tasks across diverse domains and modalities [13, 14, 15, 16] have been proposed recently. However, we notice that: 1) Most agents in prior works rely on pre-existing toolsets, which require significant manual effort to collect and maintain. Only a few works [17, 18, 19] explore the toolset construction process for LLM-based agents. 2) The development of speech-processing agents remains limited, restricting broader and more convenient applications of speech-processing technologies. These observations motivate us to develop a systematic methodology for speech-processing toolset construction that goes beyond human brainstorming, and to build an LLM-based agent for speech-processing applications.

In this paper, we introduce Speech-Copilot, a general framework consisting of two main components: 1) a toolset construction method leveraging LLMs with minimal human efforts, and 2) an LLM-based agent serving as a scalable, interpretable, and flexible interface capable of solving a wide variety of speech-processing tasks via program generation.

For the toolset construction, we propose a pipeline employing an LLM to analyze a diverse set of pre-collected task instructions that can be either collected from humans or synthesized by LLMs, identify the corresponding speech-processing tasks, and decompose these tasks into sub-tasks. This results in a set of unique and basic sub-tasks, which are subsequently formulated as code modules by LLMs and implemented by humans with suitable speech models. This approach enables near-automatic toolset development, significantly reducing the required manual effort while ensuring effectiveness and avoiding redundancy. Additionally, it is quite flexible and scalable, as users can freely choose the speech models they prefer for each module or add new modules if necessary.

An LLM-based agent capable of utilizing these modules via programming has been developed. Our results show that the developed agent can solve various tasks by appropriately combining the basic modules, achieving state-of-the-art performance on Dynamic-SUPERB [20] compared with baselines including large audio-language models and cascaded systems. This validates the efficacy of Speech-Copilot. We also find that Speech-Copilot has a strong multi-task ability: it can handle several tasks in a single user query without sacrificing performance. A demo page is available at https://sites.google.com/view/slt2024-demo-page. After review, we will release all the related code for Speech-Copilot for the community to use, hoping to make it more convenient for everyone to perform speech-processing tasks. Overall, our contributions are:

1. Proposing a new toolset construction framework for LLM-based agents that requires only minimal human effort.
2. Building a new speech-processing agent with LLMs, which achieves strong benchmark performance.
3. Releasing the agent as a public speech-processing toolkit.

Fig. 1: Overview of Speech-Copilot with the toolset construction and the program generation phases. During the toolset construction, we first conduct task decomposition to decompose diverse speech-processing task instructions into fundamental sub-tasks. Next, task modularization is performed to transform the sub-tasks into documented modules with LLM, manually implemented with scientifically grounded models. Finally, in the program generation phase, programs are generated by LLM based on the user query and executed on the audio input to get the result. Please refer to the demo page (https://sites.google.com/view/slt2024-demo-page) for more details about prompts.

2 Related works

2.1 Tool utilization of LLMs

Large language models (LLMs) have been proven to be highly effective in many natural language processing (NLP) tasks. By integrating external tools, LLMs can enhance their functionality and handle a wider range of tasks using the additional knowledge and capabilities [21, 22, 23, 24, 25, 26, 27, 28, 29]. For example, AnyTool [16] is an LLM-based agent that uses various APIs to answer user queries across different domains, such as providing specific information about a book, movie, or product, or offering personalized recommendations through connected recommendation APIs. Similarly, ViperGPT [14] employs LLMs for complex visual tasks by generating Python programs that coordinate vision-and-language models to process visual queries.

However, despite the advancements in NLP and computer vision, the exploration of using LLMs in the speech domain to integrate various speech modules and foundation models, particularly via program generation for speech/audio tasks, is relatively limited.

Moreover, while LLM-based agents have been developing rapidly, the construction of toolsets for these agents remains relatively underexplored. Existing approaches [17, 18, 19] typically create tools at the instance level, where a new tool is tailored to a single instance or a few instances, overlooking the high-level similarity of the collected instances. This may introduce redundancy into the created toolset that is difficult to de-duplicate. In addition, they typically require golden labels for the instances, which makes instances harder to collect.

In contrast, our proposed method takes a holistic approach, utilizing LLMs to identify essential sub-tasks from all collected instructions simultaneously. This approach significantly reduces the redundancy of the constructed toolsets compared to instance-level creation methods. Additionally, since our method only requires instructions that can be easily synthesized, the associated audio files and golden labels are not necessary, making it more data-efficient and simplifying the toolset construction process compared to existing methods.

2.2 Toolkit Applications in Speech Processing

Equipping LLMs with speech-processing toolkits is underexplored compared with the NLP and computer vision domains, and only a limited number of studies exist in this area. Among them, AudioGPT [15] is notable for using an LLM as a core controller to manage various pre-trained audio and speech models. Upon receiving a user query, AudioGPT analyzes it, classifies it into task families, and assigns an appropriate speech model for the task. The model’s output is then sent back to the user as the system’s response. However, AudioGPT has some limitations: 1) Limited generalizability: It assigns only one model per query, with no collaboration between speech models, which limits its ability to handle complex tasks beyond predefined task families. Additionally, the families are not sufficiently broad and diverse, further restricting the generalizability. For instance, AudioGPT cannot deal with Dynamic-SUPERB tasks, which are too complex to be addressed by a single model from its task families and model collection. 2) Lack of flexibility: The response is generated and provided to the user in a black-box manner, which prevents users from adjusting the model’s behavior and limits flexibility.

In contrast, Speech-Copilot effectively addresses these issues through a well-crafted toolset construction and program generation approach, where a toolset with fundamental speech modules is first constructed and the agent can then use these modules as basic blocks and dynamically combine them to solve various tasks based on the user query. This ensures the generalizability and versatility. By solving tasks through programming, Speech-Copilot allows users to modify the program according to their preferences, providing a higher degree of flexibility and enabling behavior manipulation to some extent. Furthermore, analyzing the agent’s reasoning steps in the programs enhances interpretability, making it easier for users to understand how solutions are derived.

3 Methods

The development of Speech-Copilot consists of two phases. The first phase is the toolset construction, in which an LLM is employed first to figure out the underlying common components of the pre-collected task instructions and then modularize the identified components into speech-processing modules. The other phase is program generation, which develops an agent solving various speech-processing tasks by writing a program to utilize the speech-processing modules. The overview of Speech-Copilot is illustrated in Fig. 1, and we detail the phases in Sec. 3.1 and 3.2.

3.1 Toolset Construction

The toolset construction phase involves two steps: task decomposition and task modularization.

3.1.1 Task Decomposition

Task decomposition aims to find the common components, i.e. sub-tasks, of a wide variety of speech-processing tasks. We start from a set of diverse task instructions, which can be collected from real humans or synthesized with LLMs. Given a set of $N$ distinct instructions $\{I_{i}\}_{i=1}^{N}$ that correspond to $N$ different speech-processing tasks $\{T_{i}\}_{i=1}^{N}$, the objective of task decomposition is to construct a set of $M$ sub-tasks $\{\mathcal{T}_{i}\}_{i=1}^{M}$ such that each task $T_{i}$ can be represented as the combination of some sub-tasks, indexed by $J\subseteq\{1,2,\ldots,M\}$, through a suitable combination function $h_{i}$:

$$\forall T_{i},\ i\in\{1,2,\ldots,N\},\ \exists J\subseteq\{1,2,\ldots,M\}\text{ and }h_{i}\ \text{ such that }\ T_{i}=h_{i}\left(\{\mathcal{T}_{j}\mid j\in J\}\right) \tag{1}$$

Here, the combination of “sub-tasks” means solving the sub-tasks and combining the corresponding results to solve the target task. For example, the speech-to-text translation (ST) can be solved by first conducting automatic speech recognition (ASR) and passing the transcription to a text-based translation model. Hence we say that the ST task is the combination of ASR and text-based translation, with simple cascading as the combination function.
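The ST example can be sketched in a few lines of Python. The functions `asr` and `translate` below are hypothetical stubs standing in for real models (they are not part of Speech-Copilot), and `cascade` is one illustrative choice of combination function $h$, namely simple cascading.

```python
from typing import Callable

# Hypothetical stand-ins for sub-task solvers; a real system would call models here.
def asr(audio_path: str) -> str:
    """Sub-task: automatic speech recognition (stubbed with a lookup)."""
    return {"hello.wav": "hello world"}.get(audio_path, "")

def translate(text: str, target_lang: str) -> str:
    """Sub-task: text-based translation (stubbed with a tiny lookup table)."""
    table = {("hello world", "fr"): "bonjour le monde"}
    return table.get((text, target_lang), text)

def cascade(first: Callable[[str], str],
            second: Callable[[str, str], str]) -> Callable[[str, str], str]:
    """Combination function h: feed the output of the first sub-task to the second."""
    def combined(audio_path: str, target_lang: str) -> str:
        return second(first(audio_path), target_lang)
    return combined

# Speech-to-text translation expressed as the combination of two sub-tasks.
speech_translation = cascade(asr, translate)
print(speech_translation("hello.wav", "fr"))  # bonjour le monde
```

Other tasks may need richer combination functions, e.g. branching on one sub-task's output before calling another, which is why $h_{i}$ is task-specific in Eq. 1.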

We adopt an LLM for task decomposition. The employed LLM is first asked to map each instruction $I_{i}$ to the task $T_{i}$ by prompting it to analyze the instruction and identify the corresponding speech-processing task through chain-of-thought reasoning [1]. As for the decomposition, besides the requirement in Eq. 1, it is desirable for the constructed sub-tasks to be fundamental enough to remain useful and transferable for unseen tasks, thereby ensuring the constructed set is compact. To achieve this, instead of using common toolset construction methods that create tools based on single instructions [19], we require the LLM to consider all instructions simultaneously. This approach allows the LLM to identify common components across different tasks and decompose them into sub-tasks that are either shared among several tasks or unique to a single task. De-duplication of sub-tasks is also conducted with self-reflection [7] to further reduce the redundancy. Finally, a set of sub-tasks $\{\mathcal{T}_{i}\}_{i=1}^{M}$ with low redundancy can be constructed, where the intention of the sub-tasks can be interpreted with the reasoning in LLM’s response.

3.1.2 Task Modularization

The task modularization involves transforming the constructed sub-tasks $\{\mathcal{T}_{i}\}_{i=1}^{M}$ from the previous task decomposition step into a set of modules $\{f_{i}\}_{i=1}^{M}$ with an LLM, and implementing those modules with existing speech-processing models. Specifically, as the modules involve solving complex speech-processing tasks rather than simple algorithmic problems, it is not required for the LLM to implement the modules directly. Instead, the modularization requires the LLM to consider all the sub-tasks and formulate each sub-task as a module, where a module is defined to be a code block with detailed documentation that should include (1) the module name, (2) the overall objective of the module, i.e. the task information associated with the module, (3) the input to the module as well as the data type and format, (4) the output of the module along with the data type and format, and (5) example usages of the module demonstrating how it can be used individually or together with other modules.

With in-context learning, the LLM is capable of generating high-quality documentation. This reduces the required efforts of human developers, and the developers can freely modify the documentation according to their preferences or for the convenience of implementation. The constructed modules $\{f_{i}\}_{i=1}^{M}$ will serve as part of the input to the LLM-based agents, as detailed in Sec. 3.2.
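As an illustration of what such a documented module might look like, the sketch below shows a stub whose docstring carries the five documentation elements, plus a small check that no element is missing. The field labels, the function name `speech_recognition`, and the helper `missing_fields` are assumptions made for this example; the actual documentation format is given by the prompts on the demo page.

```python
import inspect
from typing import Callable, List

# Assumed labels for the five documentation elements described above.
REQUIRED_FIELDS = ("Module name:", "Objective:", "Input:", "Output:", "Example usage:")

def speech_recognition(audio_path: str) -> str:
    """
    Module name: speech_recognition
    Objective: Transcribe the spoken content of an audio file into text.
    Input: audio_path (str), the path to an audio file such as a WAV file.
    Output: str, the transcription of the speech.
    Example usage:
        text = speech_recognition("sample.wav")
        answer = query_llm("Is the following sentence a question? " + text)
    """
    # The body is left to a human developer, who wires in a suitable speech model.
    raise NotImplementedError("implement with a suitable speech model")

def missing_fields(module: Callable) -> List[str]:
    """Return the documentation elements that are absent from a module's docstring."""
    doc = inspect.getdoc(module) or ""
    return [field for field in REQUIRED_FIELDS if field not in doc]

print(missing_fields(speech_recognition))  # []
```

Only the signature and docstring are shown to the agent, so the implementation behind a module can be swapped without changing the prompt.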

The modules are manually implemented with suitable speech models based on the objective of the modules. We adhere to two principles when selecting the models: 1) Scientific support: Only models with publicly available papers or technical reports are included. This ensures the reliability of the selected models and their sources. 2) Clear guidelines: As we plan to publicly release Speech-Copilot to the community, the selected models should be accompanied by clear guidelines on how to use them, e.g. the environment setup and the running command, ensuring user-friendliness.

These principles ensure that the speech models included in Speech-Copilot are not only scientifically robust but also practical and easy to use, facilitating widespread adoption and application.

3.2 Program Generation

During the program generation phase, an LLM-based program-generating agent $\pi$ is constructed. Given a textual query $q$ and audio input $a$ from the user that is unseen during the toolset construction phase, the objective of the agent $\pi$ is to generate a program $z=\pi(q\mid\{f_{i}\}_{i=1}^{M})$ based on the available modules $\{f_{i}\}_{i=1}^{M}$ from Sec. 3.1.2 that solves the target task $T$ specified by the query.

This process involves the following steps: (1) Identifying the task $T$ from the query $q$, (2) Determining the combination $h$ and a set of sub-tasks $\mathcal{T}_{q}\subseteq\{\mathcal{T}_{i}\}_{i=1}^{M}$ that satisfy $T=h(\mathcal{T}_{q})$, (3) Selecting the relevant modules $f_{q}=\{f_{i}\mid\mathcal{T}_{i}\in\mathcal{T}_{q},\,i\in\{1,2,\ldots,M\}\}$, and (4) Generating a program $z$ that utilizes the modules $f_{q}$ and integrates their results with the combination $h$. In practice, we develop the agent $\pi$ by guiding the LLM through these steps, providing the module documentation $\{f_{i}\}_{i=1}^{M}$, and specifying additional constraints in the prompt. These constraints include requiring the agent to provide reasoning and explanations for how it addresses each step and to generate the program $z$ in a specific format, enhancing the interpretability and ease of parsing the programs generated by the agent. Finally, the result $r$ of the query $q$ can be obtained by executing the program $z$ on the audio input $a$ with a Python interpreter.
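The loop described above (prompt assembly, program extraction, execution) can be sketched as follows. Everything here is an illustrative assumption rather than the actual Speech-Copilot implementation: the prompt wording, the `<program>` tags, the `execute(audio_path)` entry point, the stubbed modules, and `fake_llm`, which returns a canned response in place of a real LLM call.

```python
import inspect
import re
from typing import Callable, Dict

# Stub modules standing in for the real toolset (hypothetical implementations).
def speech_recognition(audio_path: str) -> str:
    """Transcribe the speech in the audio file and return the text."""
    return "turn on the light"

def speaker_diarization(audio_path: str) -> int:
    """Return the number of distinct speakers in the audio file."""
    return 1

MODULES: Dict[str, Callable] = {
    "speech_recognition": speech_recognition,
    "speaker_diarization": speaker_diarization,
}

def build_prompt(query: str) -> str:
    """Assemble the module documentation and the user query into one prompt."""
    docs = "\n".join(f"{name}{inspect.signature(fn)}: {inspect.getdoc(fn)}"
                     for name, fn in MODULES.items())
    return (f"Available modules:\n{docs}\n\nQuery: {query}\n"
            "Explain your reasoning, then define execute(audio_path) "
            "between <program> and </program>.")

def fake_llm(prompt: str) -> str:
    """Placeholder for the LLM call; returns a canned response."""
    return ("Reasoning: the query only needs the transcript.\n"
            "<program>\n"
            "def execute(audio_path):\n"
            "    text = speech_recognition(audio_path)\n"
            "    return 'light' in text\n"
            "</program>")

def run(query: str, audio_path: str):
    """Generate a program for the query and execute it on the audio input."""
    response = fake_llm(build_prompt(query))
    match = re.search(r"<program>\n(.*?)</program>", response, re.DOTALL)
    if match is None:
        raise ValueError("no program found in the response")
    namespace = dict(MODULES)        # make the modules visible to the program
    exec(match.group(1), namespace)  # untrusted code: sandbox this in practice
    return namespace["execute"](audio_path)

result = run("Does the speaker mention a light?", "command.wav")
print(result)  # True
```

Requiring a fixed output format, as in the tags above, is what makes the generated program easy to parse, and keeping the reasoning alongside it is what makes the agent's decisions inspectable.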

4 Experimental Setups

4.1 Evaluation Benchmark

We evaluate Speech-Copilot on Dynamic-SUPERB (https://dynamic-superb.github.io/) [20] because of its wide coverage of diverse, complex, and challenging speech and audio tasks. Dynamic-SUPERB is designed to assess universal speech models that perform diverse and complex tasks with both strong instruction-following abilities and complicated speech/audio-related understanding. It includes 55 tasks, involving complicated speech/audio-related reasoning and covering six aspects of the speech and audio modalities: audio, content, degradation, paralinguistic, semantic, and speaker. The aspects are explained as follows:

1. Audio (AUD): This aspect assesses the model’s ability to interpret audio signals. Tasks in this aspect include detecting the sound of a specific object in an audio clip and classifying the sound into some categories.
2. Content (CNT): This aspect evaluates the model’s understanding of speech content. Tasks involve speech command identification, language recognition, and so on.
3. Degradation (DEG): The goal here is to measure the model’s ability to detect noise and reverberation in speech signals. Tasks include predicting the signal-to-noise ratio (SNR) at the utterance level and determining if speech signals are affected by reverberation.
4. Paralinguistic (PRL): The tasks of this aspect aim to gauge the model’s understanding and the ability to make inferences based on paralinguistic information in speech. The tasks include but are not limited to accent identification, emotion recognition, etc.
5. Semantic (SEM): The goal is to assess the model’s semantic understanding. This aspect involves intent classification, sarcasm detection, etc.
6. Speaker (SPK): The aim is to evaluate the model’s capacity to extract speaker-related information. This involves tasks like speaker verification and multi-speaker detection.

Currently, Dynamic-SUPERB tasks are formulated as multiple-choice questions with one and only one golden label per question.

4.2 Metrics

Dynamic-SUPERB tasks are evaluated by accuracy, where a hit occurs when the response of the evaluated model aligns with the golden label. Due to the generative nature of current models, e.g. LLMs or the large audio-language models (LALMs) in Sec. 4.3, the models tend to generate long-form responses that do not follow a fixed format instead of a single option, making the conventional exact-match (EM) evaluation unsuitable. For this reason, the evaluation policy of Dynamic-SUPERB uses an LLM, whose ability as an automatic evaluator has been studied [30, 31], to determine the alignment between the models’ predictions and the labels.

In this work, GPT-4o (model version gpt-4o-2024-05-13) is used as the evaluator. During the grading process, the original task instructions, the response from the assessed models, and the golden labels are presented in the prompts to the evaluator. We also include some rules in the prompts that the evaluator should strictly follow, and the evaluator is then employed for judgment. The rules include:

1. As each question in Dynamic-SUPERB has one and only one golden answer, it should be judged incorrect if the models choose no or multiple options, meaning that they should clearly select one and only one option for each question.
2. The evaluator should provide reasons for their judgment.
3. The evaluation should be summarized with a single “Yes/No” in a specific format, where “Yes” indicates that the prediction aligns with the label, and “No” indicates it does not.

Besides automatic evaluation with GPT-4o, human verification of the evaluation results is conducted to ensure grading correctness and consistency. Finally, we follow the standard approach of Dynamic-SUPERB and report the average performances of the aspects.
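A minimal sketch of how such evaluator outputs could be turned into an accuracy score is given below. The exact summary format is not specified here, so the `Final verdict: Yes/No` line is an assumed format, and treating a missing or repeated verdict as incorrect is a conservative choice made for this example only.

```python
import re
from typing import Iterable, Optional

# Assumed summary format; the real format is defined in the evaluator prompt.
VERDICT = re.compile(r"Final verdict:\s*(Yes|No)\b", re.IGNORECASE)

def parse_verdict(evaluator_output: str) -> Optional[bool]:
    """Return True/False for a single clear verdict, or None if it is ambiguous."""
    matches = VERDICT.findall(evaluator_output)
    if len(matches) != 1:
        return None
    return matches[0].lower() == "yes"

def accuracy(evaluator_outputs: Iterable[str]) -> float:
    """Accuracy in percent; ambiguous evaluator outputs count as incorrect."""
    verdicts = [parse_verdict(output) for output in evaluator_outputs]
    if not verdicts:
        raise ValueError("no evaluator outputs given")
    return 100.0 * sum(v is True for v in verdicts) / len(verdicts)

outputs = [
    "The response selects option B, matching the label. Final verdict: Yes",
    "The response lists all options. Final verdict: No",
    "The response picks 'happy', which is the label. Final verdict: yes",
    "The judgment is unclear.",  # no verdict line, so it is counted as incorrect
]
print(accuracy(outputs))  # 50.0
```

Outputs that fail to parse are natural candidates for the human verification step, since they are the cases where the automatic grade is least reliable.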

4.3 Baselines

We compare the performances of Speech-Copilot with those of several baseline models to verify the effectiveness of Speech-Copilot, including the toolset construction and program generation. The baselines include several recent publicly available large audio language models (LALMs), e.g. Qwen-Audio-Chat [32], SALMONN [33], LTU-AS [34] and WavLLM [35], and cascaded systems that employ LLM to solve the tasks based on the information from automatic speech recognition (ASR), automatic audio captioning (AAC), and other available speech models. Large audio-language models [36, 32, 33, 34, 35, 37, 38] extend the capabilities of standard large language models by incorporating audio and speech recognition features. This integration enables LALMs to process and respond to tasks involving sound and speech.

On the other hand, in the cascaded systems, we adopt GPT-3.5 (model version gpt-3.5-turbo-0125) [39] as the LLM, and Whisper-large-v3 [40] and Qwen-Audio-Chat [32] for ASR and AAC, respectively. The cascaded systems are denoted as “ASR+LLM” when only ASR results are used and “ASR+AAC+LLM” when both ASR and AAC results are used. We also compare Speech-Copilot with a cascaded system that provides the LLM with all the information from our constructed modules listed in Sec. 5.1, except for the speaker verification module if there is only one audio input. This system is denoted as “All Attributes + LLM”. This baseline simulates the ultimate cascaded system where the LLM utilizes all the available information to make predictions.

4.4 Setup

Greedy decoding is applied to all the models in our experiments. Regarding the candidates of LLMs, in the stages of toolset construction, program generation, and evaluation, GPT-4o is adopted because of its strong language capabilities. On the other hand, when executing the generated programs, GPT-3.5 is employed if querying LLM is required for certain modules. This choice balances cost and model performance, i.e. we can use a more powerful model where it’s most needed while opting for a less expensive option elsewhere.

As for Whisper-large-v3, though prompting for Whisper is common in prior works [41, 42, 43], it is not employed in our experiments due to the unclear effect of prompting methods of Whisper [44]. For other models involved in this work, we used the default settings, with the generation strategy consistently set to greedy decoding.

5 Results

5.1 Toolset Construction

Table 1: Accuracy (%) of the models across the aspects of Dynamic-SUPERB. The best performance in each aspect is marked in bold, while the second-best one is underlined. “# of Tasks” represents the number of tasks under each aspect in Dynamic-SUPERB.

	Audio	Content	Degradation	Paralinguistics	Semantic	Speaker	Average
# of Tasks	7	11	19	7	6	5	55
Qwen-Audio-Chat [32]	73.2	63.3	31.1	29.3	48.1	41.4	45.5
SALMONN [33]	15.0	52.0	28.2	24.5	50.8	33.2	33.7
LTU-AS [34]	14.5	44.0	37.5	17.1	36.0	40.2	33.4
WavLLM [35]	22.3	53.3	36.8	24.6	51.0	22.3	36.9
ASR + LLM	9.6	74.4	44.6	33.1	71.5	42.5	47.4
ASR + AAC + LLM	60.7	81.6	48.9	32.6	72.8	46.4	57.3
All Attributes + LLM	62.4	70.7	56.8	30.6	68.5	62.5	58.7
Speech-Copilot (Ours)	73.4	90.7	64.3	56.6	70.7	86.1	72.4

We first demonstrate the results of our toolset construction phase. We compare our task decomposition method, where we take all the instructions into consideration at once, with instance-level toolset creation, in which the tools are created based on a single instruction. More than 50 task instructions generated by GPT-3.5 are used for the construction, and the results with GPT-4 [45] are shown in Table 2.

Table 2: Number of sub-tasks with different toolset creation methods. “w/ reflection” means self-reflection is used to de-duplicate similar sub-tasks, while “w/o reflection” means no reflection is used.

	Ours (w/ reflection)	Ours (w/o reflection)	Instance-level creation
# of sub-tasks ( $\downarrow$ )	16	18	25

Table 3: Selected speech/audio models used for various modules.

Modules	Selected Model
Speech Recognition	Whisper-large-v3 [40]
Language Identification	Whisper-large-v3 [40]
Speech Detection	Qwen-Audio-Chat [32]
Speech Emotion Recognition	emotion2vec [46]
Speech-to-Noise Ratio (SNR) Estimation	Brouhaha [47]
Reverberation Detection	Qwen-Audio-Chat [32]
Accent Classification	CommonAccent [48]
Stress Position Identification	Whisper-large-v3 [40], GPT-3.5 [39]
Spoofing Detection	Qwen-Audio-Chat [32]
Music Chord Classification	autochord [49]
Synthetic Speech Detection	Qwen-Audio-Chat [32]
Speaker Verification	NVIDIA TitaNet-Large [50, 51]
Speaker Diarization	pyannote speaker-diarization-3.1 [52]
Sound Classification	Qwen-Audio-Chat [32], GPT-3.5 [39]
Query LLM	GPT-3.5 [39]
Speaker Distance Estimation	Qwen-Audio-Chat [32]

As expected, our decomposition method, which considers all instructions collectively, significantly reduces the size of the sub-task set compared to instance-level tool creation. This reduction facilitates subsequent modularization and implementation. In addition, the size can be further reduced if the LLM is required to reflect on whether there are similar sub-tasks that can be unified or combined into single speech-processing tasks. This encourages the LLM to unify tasks of similar nature. For instance, the LLM can combine tasks related to the number of speakers with speaker diarization sub-tasks, thereby reducing the number of required sub-tasks.

As for the task modularization, we again employ GPT-4o to transform the collected sub-tasks into modules with detailed documentation that follows the requirements outlined in Sec. 3.1.2. We select the models for those modules based on the principles stated in Sec. 3.1.2. For modules without a suitable publicly available model, we employ Qwen-Audio-Chat as the foundation model and realize these modules by prompting Qwen-Audio-Chat with a fixed set of prompts. The resulting modules and the models selected for them are listed in Table 3.

Regarding the implementation details, most of the modules can be realized with the selected models directly. We briefly explain the implementation of the modules that require special handling. Stress position identification, a module for identifying stressed syllables in spoken words, has no specialized model available. We hypothesize that this task can be approximated by combining speech recognition with an LLM, given the strong linguistic knowledge of LLMs [53]. Thus, we implement this module using Whisper-large-v3 and GPT-3.5. Sound classification is a module for classifying a wide variety of sounds, e.g. environmental sounds, animal sounds, etc. Due to the high diversity of sound categories, it is difficult to find a single model that handles all kinds of classification tasks, so we choose to use Qwen-Audio-Chat as the backbone. To maintain generalizability, instead of using specific prompts for certain kinds of sound classification, we require Qwen-Audio-Chat to generate detailed audio captions, from which GPT-3.5 extracts the desired information based on the specific task objective.
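The caption-then-extract design of the sound classification module can be sketched as below. The captioner and the LLM are passed in as plain callables so that the example runs without any model; `fake_captioner`, `fake_llm`, the prompt wording, and the `"unknown"` fallback are all assumptions for illustration and do not reproduce the prompts actually used.

```python
from typing import Callable, Sequence

def sound_classification(audio_path: str,
                         options: Sequence[str],
                         captioner: Callable[[str], str],
                         llm: Callable[[str], str]) -> str:
    """Classify a sound by captioning the audio, then asking an LLM to pick a label."""
    caption = captioner(audio_path)  # general-purpose caption, not task-specific
    prompt = (f"Audio caption: {caption}\n"
              f"Options: {', '.join(options)}\n"
              "Answer with exactly one option.")
    answer = llm(prompt).strip()
    # Fall back to a sentinel when the LLM does not return one of the options.
    return answer if answer in options else "unknown"

# Stubs standing in for Qwen-Audio-Chat (captioning) and GPT-3.5 (extraction).
def fake_captioner(audio_path: str) -> str:
    return "A dog barks twice while cars pass by in the distance."

def fake_llm(prompt: str) -> str:
    return "dog" if "dog barks" in prompt else "unsure"

label = sound_classification("clip.wav", ["cat", "dog", "bird"], fake_captioner, fake_llm)
print(label)  # dog
```

Because the caption is generic, the same captioning call can serve different label sets; only the extraction prompt changes with the task objective.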

5.2 Benchmark Performances

We then compare the performances of Speech-Copilot with the selected baselines on Dynamic-SUPERB, as shown in Table 1. Overall, Speech-Copilot achieves the highest average score across the 55 tasks and outperforms other baselines in 5 out of 6 aspects, indicating the efficacy of the constructed toolset and the problem-solving capability of the LLM-based program-generating agent.
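The Average column of Table 1 appears to be a task-weighted mean of the aspect scores, i.e. each aspect is weighted by its number of tasks. The short check below recomputes it from the rounded table entries for two systems; since the inputs are already rounded, small deviations from the reported averages are possible for other rows.

```python
from typing import Dict

# Number of tasks per aspect, from the "# of Tasks" row of Table 1.
NUM_TASKS: Dict[str, int] = {"AUD": 7, "CNT": 11, "DEG": 19, "PRL": 7, "SEM": 6, "SPK": 5}

# Per-aspect accuracy (%) copied from Table 1.
SCORES: Dict[str, Dict[str, float]] = {
    "Qwen-Audio-Chat": {"AUD": 73.2, "CNT": 63.3, "DEG": 31.1,
                        "PRL": 29.3, "SEM": 48.1, "SPK": 41.4},
    "Speech-Copilot": {"AUD": 73.4, "CNT": 90.7, "DEG": 64.3,
                       "PRL": 56.6, "SEM": 70.7, "SPK": 86.1},
}

def weighted_average(aspect_scores: Dict[str, float]) -> float:
    """Mean accuracy over all tasks, weighting each aspect by its task count."""
    total_tasks = sum(NUM_TASKS.values())
    weighted_sum = sum(aspect_scores[aspect] * n for aspect, n in NUM_TASKS.items())
    return weighted_sum / total_tasks

for system, aspect_scores in SCORES.items():
    print(f"{system}: {weighted_average(aspect_scores):.1f}")
# Qwen-Audio-Chat: 45.5
# Speech-Copilot: 72.4
```

Under this weighting, the Degradation aspect (19 of 55 tasks) has the largest influence on the average, so gains there matter more than gains on smaller aspects such as Speaker.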

We observe that the LALM baselines, i.e. Qwen-Audio-Chat, SALMONN, LTU-AS, and WavLLM, typically encounter significant issues that impact their performances. For instance, Qwen-Audio-Chat suffers from severe hallucinations about the spoken content in the provided audio when required to answer the questions directly. However, it identifies the spoken content almost correctly if asked to provide an audio caption of the audio input instead. This observation aligns with prior work [54] on the hallucination of LALMs, indicating the vulnerability of recent LALMs. We also notice that these models struggle to clearly select one and only one option for the questions. For example, SALMONN tends to list all of the options in its response, which is judged incorrect under our evaluation rules. Moreover, these LALMs underperform compared to cascaded systems. Even the simplest “ASR+LLM” system outperforms the end-to-end LALMs on average, indicating that current LALMs are not yet capable of handling complex speech/audio understanding and reasoning tasks, such as those in Dynamic-SUPERB. In contrast, Speech-Copilot demonstrates strong robustness due to its modular design, achieving significantly better performance on the evaluation benchmark.

Comparing the cascaded systems, the average performance improves as more information from different speech models is provided to the LLM. However, they still underperform relative to Speech-Copilot. Notably, there is a significant performance gap between the “All Attributes + LLM” baseline, which is the most effective cascaded baseline in terms of the average score, and Speech-Copilot, despite using the same modules. The key difference lies in the utilization of information from these modules. The former incorporates all available information indiscriminately, regardless of the query’s purpose, while the latter selectively uses and combines relevant modules based on an analysis of the input queries. This highlights the importance of selecting related and useful information to avoid being misled by redundancy. Additionally, the computation cost of the former system is higher since it requires running all the modules for all data, which shows the benefit of efficient information selection under a computation budget. Such a selection process is achieved by query analysis and programming in Speech-Copilot, demonstrating the advantage of a program-generating agent.
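The difference in module usage between the two systems can be illustrated with a toy call counter. The three stub modules, their outputs, and the helper names below are hypothetical; the sketch only contrasts running every module with running the modules a query actually needs.

```python
from typing import Callable, Dict, List, Sequence

calls: List[str] = []  # records every module invocation

def make_module(name: str, output: str) -> Callable[[str], str]:
    """Create a stub module that logs its invocation and returns a fixed output."""
    def module(audio_path: str) -> str:
        calls.append(name)
        return output
    return module

MODULES: Dict[str, Callable[[str], str]] = {
    "speech_recognition": make_module("speech_recognition", "turn on the light"),
    "speech_emotion_recognition": make_module("speech_emotion_recognition", "neutral"),
    "accent_classification": make_module("accent_classification", "us"),
}

def all_attributes(audio_path: str) -> Dict[str, str]:
    """Cascaded baseline: run every module and hand all outputs to the LLM."""
    return {name: module(audio_path) for name, module in MODULES.items()}

def selective(audio_path: str, needed: Sequence[str]) -> Dict[str, str]:
    """Agent-style usage: run only the modules selected for the query."""
    return {name: MODULES[name](audio_path) for name in needed}

all_attributes("clip.wav")
calls_all = len(calls)

calls.clear()
selective("clip.wav", ["speech_recognition"])
calls_selected = len(calls)

print(calls_all, calls_selected)  # 3 1
```

With the 16 modules of Table 3, the gap per query is correspondingly larger, and the LLM prompt in the cascaded setting also carries many attributes that are irrelevant to the question.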

Table 4: The average number of reasoning steps and modules used.

	AUD	CNT	DEG	PRL	SEM	SPK	AVG
Avg # of steps	4.7	3.6	3.8	3.8	4.3	3.1	3.9
Avg # of modules	2.0	1.7	1.4	1.7	2.1	1.1	1.6

5.3 Further Study

5.3.1 Statistics on Reasoning Steps and Modules Used

We analyzed the complexity of the programs generated by Speech-Copilot by examining the number of reasoning steps and modules used. Reasoning steps include operations like module calls, conditional statements, iteration statements, etc., that are necessary for the LLM to solve problems. For instance, Fig. 2 shows 5 steps. The number of reasoning steps represents the overall complexity of the programs, considering the model’s diverse behaviors, such as combining modules for decision-making or processing module outputs with conditional and iteration operations. Table 4 presents the results across aspects of Dynamic-SUPERB. Speech-Copilot takes 3 to 5 reasoning steps when programming to solve these tasks, indicating its ability to perform complicated operations, like utilizing multiple modules or designing algorithms, rather than simply selecting a module and returning its output. Furthermore, the difference between the number of reasoning steps and used modules shows that our proposed pipeline not only uses modules but also performs necessary reasoning based on those modules, as these tasks cannot be solved using a single module.
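One way to approximate these statistics automatically is to walk the abstract syntax tree of a generated program. The sketch below counts calls to known modules plus `if`, `for`, and `while` statements as reasoning steps and counts distinct modules separately. This counting rule, the module names, and the sample program are assumptions for illustration; the statistics in Table 4 may have been obtained with a different definition.

```python
import ast
from typing import Set, Tuple

# Hypothetical module names exposed to the generated programs.
MODULE_NAMES: Set[str] = {"speech_recognition", "speech_emotion_recognition", "query_llm"}

def count_steps(program: str) -> Tuple[int, int]:
    """Return (number of reasoning steps, number of distinct modules used)."""
    module_calls = 0
    control_statements = 0
    used: Set[str] = set()
    for node in ast.walk(ast.parse(program)):
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in MODULE_NAMES):
            module_calls += 1
            used.add(node.func.id)
        elif isinstance(node, (ast.If, ast.For, ast.While)):
            control_statements += 1
    return module_calls + control_statements, len(used)

program = '''
def execute(audio_path):
    text = speech_recognition(audio_path)
    emotion = speech_emotion_recognition(audio_path)
    if emotion == "angry":
        return query_llm("Summarize politely: " + text)
    return text
'''

print(count_steps(program))  # (4, 3)
```

Here the program uses three modules but takes four steps, which mirrors the gap between the two rows of Table 4: the extra step is control logic applied to module outputs.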

5.3.2 In-the-wild Multi-task Examples

Fig. 2 demonstrates a practical application of our pipeline in a real-life scenario, showing how various audio and speech-processing modules collaborate to provide a comprehensive understanding of a voice message. In contrast, existing large audio-language models, while proficient in speech recognition, often struggle with multitasking. These models have difficulty handling speaker identity, emotion recognition, and sound classification simultaneously while also generating informative textual output. This comparison highlights the advantage of our proposed pipeline over existing large audio-language models: by integrating multiple audio and speech-processing modules, it can deliver a more comprehensive and informative analysis of the given audio.
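The kind of multi-module program discussed here might look like the following sketch. It does not reproduce the program in Fig. 2: the function names, the report structure, and the module interfaces (callables that map an audio path to a string) are hypothetical and serve only to show how several module outputs can be combined into one textual answer.

```python
from dataclasses import dataclass
from typing import Callable

# Assumed interface: each module maps an audio path to a text label or transcript.
Module = Callable[[str], str]


@dataclass
class VoiceMessageReport:
    transcript: str
    emotion: str
    background_sound: str


def analyze_voice_message(
    audio_path: str,
    speech_recognizer: Module,
    emotion_recognizer: Module,
    sound_classifier: Module,
) -> VoiceMessageReport:
    """Call each (hypothetical) module once and collect the outputs."""
    return VoiceMessageReport(
        transcript=speech_recognizer(audio_path),
        emotion=emotion_recognizer(audio_path),
        background_sound=sound_classifier(audio_path),
    )


def summarize(report: VoiceMessageReport) -> str:
    """Combine the module outputs into a single textual description."""
    return (
        f'The speaker sounds {report.emotion} and says "{report.transcript}". '
        f"Background sound: {report.background_sound}."
    )


if __name__ == "__main__":
    # Stub modules stand in for real models in this illustration.
    report = analyze_voice_message(
        "message.wav",
        speech_recognizer=lambda path: "I will be late tonight",
        emotion_recognizer=lambda path: "tired",
        sound_classifier=lambda path: "traffic",
    )
    print(summarize(report))
```

Passing the modules in as arguments keeps the sketch independent of any particular model, so each stub could be swapped for a real speech or audio model that follows the same assumed interface.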

6 Conclusion and Future Work

Speech-Copilot provides a practical and efficient approach to handling diverse, instruction-oriented speech-processing tasks. By breaking down complex instructions into manageable sub-tasks, formulating the sub-tasks as code modules, and employing a flexible program-generating LLM-based agent to use those modules, our framework reduces the effort needed for toolset construction and achieves state-of-the-art performance on the Dynamic-SUPERB benchmark. Future work could explore using multiple modules or foundation models for each function and applying reinforcement learning from human feedback [55] to select the best module for a given task. Furthermore, to address the evolving and diverse nature of speech-processing tasks, we could expand the coverage of modules to tasks that are challenging for current speech foundation models [56, 57]. This would further enhance the power and adaptability of Speech-Copilot.

References

[1] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in neural information processing systems, vol. 35, pp. 24824–24837, 2022.
[2] Barret Zoph, Colin Raffel, Dale Schuurmans, Dani Yogatama, Denny Zhou, Don Metzler, Ed H Chi, Jason Wei, Jeff Dean, Liam B Fedus, et al., “Emergent abilities of large language models,” TMLR, 2022.
[3] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa, “Large language models are zero-shot reasoners,” Advances in neural information processing systems, vol. 35, pp. 22199–22213, 2022.
[4] Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch, “Language models as zero-shot planners: Extracting actionable knowledge for embodied agents,” in International Conference on Machine Learning. PMLR, 2022, pp. 9118–9147.
[5] Xu Huang, Weiwen Liu, Xiaolong Chen, Xingmei Wang, Hao Wang, Defu Lian, Yasheng Wang, Ruiming Tang, and Enhong Chen, “Understanding the planning of llm agents: A survey,” 2024.
[6] Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati, “On the planning abilities of large language models-a critical investigation,” Advances in Neural Information Processing Systems, vol. 36, pp. 75993–76005, 2023.
[7] Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han, “Large language models can self-improve,” arXiv preprint arXiv:2210.11610, 2022.
[8] Liangming Pan et al., “Automatically correcting large language models: Surveying the landscape of diverse automated correction strategies,” Transactions of the Association for Computational Linguistics, vol. 12, pp. 484–506, 2024.
[9] Ning Miao, Yee Whye Teh, and Tom Rainforth, “Selfcheck: Using llms to zero-shot check their own step-by-step reasoning,” arXiv preprint arXiv:2308.00436, 2023.
[10] Philipp Schoenegger, Peter S Park, Ezra Karger, and Philip E Tetlock, “Ai-augmented predictions: Llm assistants improve human forecasting accuracy,” arXiv preprint arXiv:2402.07862, 2024.
[11] Chengxing Xie, Canyu Chen, Feiran Jia, Ziyu Ye, Kai Shu, Adel Bibi, Ziniu Hu, Philip Torr, Bernard Ghanem, and Guohao Li, “Can large language model agents simulate human trust behaviors?,” 2024.
[12] Xiangru Tang, Qiao Jin, Kunlun Zhu, Tongxin Yuan, Yichi Zhang, Wangchunshu Zhou, Meng Qu, Yilun Zhao, Jian Tang, Zhuosheng Zhang, Arman Cohan, Zhiyong Lu, and Mark Gerstein, “Prioritizing safeguarding over autonomy: Risks of llm agents for science,” 2024.
[13] Timo Schick et al., “Toolformer: Language models can teach themselves to use tools,” arXiv preprint arXiv:2302.04761, 2023.
[14] Dídac Surís, Sachit Menon, and Carl Vondrick, “Vipergpt: Visual inference via python execution for reasoning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 11888–11898.
[15] Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, Yi Ren, Zhou Zhao, and Shinji Watanabe, “Audiogpt: Understanding and generating speech, music, sound, and talking head,” 2023.
[16] Yu Du, Fangyun Wei, and Hongyang Zhang, “Anytool: Self-reflective, hierarchical agents for large-scale api calls,” 2024.
[17] Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, and Denny Zhou, “Large language models as tool makers,” arXiv preprint arXiv:2305.17126, 2023.
[18] Lifan Yuan, Yangyi Chen, Xingyao Wang, Yi R. Fung, Hao Peng, and Heng Ji, “Craft: Customizing llms by creating and retrieving from specialized toolsets,” 2024.
[19] Cheng Qian, Chi Han, Yi Fung, Yujia Qin, Zhiyuan Liu, and Heng Ji, “Creator: Tool creation for disentangling abstract and concrete reasoning of large language models,” in Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 6922–6939.
[20] Chien-yu Huang, Ke-Han Lu, Shih-Heng Wang, Chi-Yuan Hsiao, Chun-Yi Kuan, Haibin Wu, Siddhant Arora, Kai-Wei Chang, Jiatong Shi, Yifan Peng, et al., “Dynamic-superb: Towards a dynamic, collaborative, and comprehensive instruction-tuning benchmark for speech,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 12136–12140.
[21] Shijue Huang et al., “Planning, creation, usage: Benchmarking llms for comprehensive tool utilization in real-world complex scenarios,” arXiv preprint arXiv:2401.17167, 2024.
[22] Zhicheng Guo, Sijie Cheng, Hao Wang, Shihao Liang, Yujia Qin, Peng Li, Zhiyuan Liu, Maosong Sun, and Yang Liu, “Stabletoolbench: Towards stable large-scale benchmarking on tool learning of large language models,” arXiv preprint arXiv:2403.07714, 2024.
[23] Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li, “Api-bank: A comprehensive benchmark for tool-augmented llms,” in The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
[24] Siyu Yuan, Kaitao Song, Jiangjie Chen, Xu Tan, Yongliang Shen, Ren Kan, Dongsheng Li, and Deqing Yang, “Easytool: Enhancing llm-based agents with concise tool instruction,” arXiv preprint arXiv:2401.06201, 2024.
[25] Junjie Ye, Yilong Wu, Songyang Gao, Sixian Li, Guanyu Li, Xiaoran Fan, Qi Zhang, Tao Gui, and Xuanjing Huang, “Rotbench: A multi-level benchmark for evaluating the robustness of large language models in tool learning,” arXiv preprint arXiv:2401.08326, 2024.
[26] Zhengliang Shi, Shen Gao, Xiuyi Chen, Lingyong Yan, Haibo Shi, Dawei Yin, Zhumin Chen, Pengjie Ren, Suzan Verberne, and Zhaochun Ren, “Learning to use tools via cooperative and interactive agents,” arXiv preprint arXiv:2403.03031, 2024.
[27] Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez, “Gorilla: Large language model connected with massive apis,” arXiv preprint arXiv:2305.15334, 2023.
[28] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al., “Toolllm: Facilitating large language models to master 16000+ real-world apis,” arXiv preprint arXiv:2307.16789, 2023.
[29] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom, “Toolformer: Language models can teach themselves to use tools,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[30] Cheng-Han Chiang and Hung-yi Lee, “Can large language models be an alternative to human evaluations?,” arXiv preprint arXiv:2305.01937, 2023.
[31] Cheng-Han Chiang and Hung-yi Lee, “A closer look into automatic evaluation using large language models,” arXiv preprint arXiv:2310.05657, 2023.
[32] Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou, “Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models,” arXiv:2311.07919, 2023.
[33] Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang, “Salmonn: Towards generic hearing abilities for large language models,” arXiv:2310.13289, 2023.
[34] Yuan Gong, Alexander H Liu, Hongyin Luo, Leonid Karlinsky, and James Glass, “Joint audio and speech understanding,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023.
[35] Shujie Hu et al., “Wavllm: Towards robust and adaptive speech large language model,” arXiv preprint arXiv:2404.00656, 2024.
[36] Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, He Huang, Boris Ginsburg, Yu-Chiang Frank Wang, and Hung-yi Lee, “Desta: Enhancing speech language models through descriptive speech-text alignment,” arXiv preprint arXiv:2406.18871, 2024.
[37] Zhifeng Kong et al., “Audio flamingo: A novel audio language model with few-shot learning and dialogue abilities,” arXiv:2402.01831, 2024.
[38] Yuan Gong, Hongyin Luo, Alexander H Liu, Leonid Karlinsky, and James Glass, “Listen, think, and understand,” arXiv preprint arXiv:2305.10790, 2023.
[39] OpenAI, “Chatgpt: Optimizing language models for dialogue,” 2022, Accessed on October 10, 2023.
[40] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning. PMLR, 2023, pp. 28492–28518.
[41] Puyuan Peng, Brian Yan, Shinji Watanabe, and David Harwath, “Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization,” in Proc. INTERSPEECH 2023, 2023, pp. 396–400.
[42] Chih-Kai Yang, Kuan-Po Huang, Ke-Han Lu, Chun-Yi Kuan, Chi-Yuan Hsiao, and Hung-yi Lee, “Investigating zero-shot generalizability on mandarin-english code-switched asr and speech-to-text translation of recent foundation models with self-supervision and weak supervision,” 2023.
[43] Le Zhuo, Ruibin Yuan, Jiahao Pan, Yinghao Ma, Yizhi Li, Ge Zhang, Si Liu, Roger Dannenberg, Jie Fu, Chenghua Lin, et al., “Lyricwhiz: Robust multilingual zero-shot lyrics transcription by whispering to chatgpt,” arXiv preprint arXiv:2306.17103, 2023.
[44] Chih-Kai Yang, Kuan-Po Huang, and Hung-yi Lee, “Do prompts really prompt? exploring the prompt understanding capability of whisper,” 2024.
[45] OpenAI, “Gpt-4 technical report,” 2023.
[46] Ziyang Ma, Zhisheng Zheng, Jiaxin Ye, Jinchao Li, Zhifu Gao, Shiliang Zhang, and Xie Chen, “emotion2vec: Self-supervised pre-training for speech emotion representation,” arXiv preprint arXiv:2312.15185, 2023.
[47] Marvin Lavechin, Marianne Métais, Hadrien Titeux, Alodie Boissonnet, Jade Copet, Morgane Rivière, Elika Bergelson, Alejandrina Cristia, Emmanuel Dupoux, and Hervé Bredin, “Brouhaha: multi-task training for voice activity detection, speech-to-noise ratio, and c50 room acoustics estimation,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–7.
[48] Juan Zuluaga-Gomez, Sara Ahmed, Danielius Visockas, and Cem Subakan, “Commonaccent: Exploring large acoustic pretrained models for accent classification based on common voice,” Interspeech 2023, 2023.
[49] Christopher John Bayron, “Autochord: Automatic chord recognition library and chord visualization app.”
[50] Nithin Rao Koluguri, Taejin Park, and Boris Ginsburg, “Titanet: Neural model for speaker representation with 1d depth-wise separable convolutions and global context,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 8102–8106.
[51] Oleksii Kuchaiev, Jason Li, Huyen Nguyen, Oleksii Hrinchuk, Ryan Leary, Boris Ginsburg, Samuel Kriman, Stanislav Beliaev, Vitaly Lavrukhin, Jack Cook, et al., “Nemo: a toolkit for building ai applications using neural modules,” arXiv preprint arXiv:1909.09577, 2019.
[52] Alexis Plaquet and Hervé Bredin, “Powerset multi-class cross entropy loss for neural speaker diarization,” in Proc. INTERSPEECH 2023, 2023.
[53] Ashima Suvarna, Harshita Khandelwal, and Nanyun Peng, “Phonologybench: Evaluating phonological skills of large language models,” arXiv preprint arXiv:2404.02456, 2024.
[54] Chun-Yi Kuan, Wei-Ping Huang, and Hung-yi Lee, “Understanding sounds, missing the questions: The challenge of object hallucination in large audio-language models,” arXiv preprint arXiv:2406.08402, 2024.
[55] Long Ouyang et al., “Training language models to follow instructions with human feedback,” Advances in neural information processing systems, vol. 35, pp. 27730–27744, 2022.
[56] Ewan Dunbar, Mathieu Bernard, Nicolas Hamilakis, Tu Anh Nguyen, Maureen De Seyssel, Patricia Rozé, Morgane Rivière, Eugene Kharitonov, and Emmanuel Dupoux, “The zero resource speech challenge 2021: Spoken language modelling,” arXiv preprint arXiv:2104.14700, 2021.
[57] Kuan-Po Huang, Chih-Kai Yang, Yu-Kuan Fu, Ewan Dunbar, and Hung-yi Lee, “Zero resource code-switched speech benchmark using speech utterance pairs for multiple spoken languages,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 10006–10010.