-
Into the Unknown Unknowns: Engaged Human Learning through Participation in Language Model Agent Conversations
Authors:
Yucheng Jiang,
Yijia Shao,
Dekun Ma,
Sina J. Semnani,
Monica S. Lam
Abstract:
While language model (LM)-powered chatbots and generative search engines excel at answering concrete queries, discovering information in the terrain of unknown unknowns remains challenging for users. To emulate the common educational scenario where children/students learn by listening to and participating in conversations of their parents/teachers, we create Collaborative STORM (Co-STORM). Unlike QA systems that require users to ask all the questions, Co-STORM lets users observe and occasionally steer the discourse among several LM agents. The agents ask questions on the user's behalf, allowing the user to discover unknown unknowns serendipitously. To facilitate user interaction, Co-STORM assists users in tracking the discourse by organizing the uncovered information into a dynamic mind map, ultimately generating a comprehensive report as takeaways. For automatic evaluation, we construct the WildSeek dataset by collecting real information-seeking records with user goals. Co-STORM outperforms baseline methods on both discourse trace and report quality. In a further human evaluation, 70% of participants prefer Co-STORM over a search engine, and 78% favor it over a RAG chatbot.
Submitted 27 August, 2024;
originally announced August 2024.
-
Intraoperative Glioma Segmentation with YOLO + SAM for Improved Accuracy in Tumor Resection
Authors:
Samir Kassam,
Angelo Markham,
Katie Vo,
Yashas Revanakara,
Michael Lam,
Kevin Zhu
Abstract:
Gliomas, a common type of malignant brain tumor, present significant surgical challenges due to their similarity to healthy tissue. Preoperative Magnetic Resonance Imaging (MRI) images are often ineffective during surgery due to factors such as brain shift, which alters the position of brain structures and tumors. This makes real-time intraoperative MRI (ioMRI) crucial, as it provides updated imaging that accounts for these shifts, ensuring more accurate tumor localization and safer resections. This paper presents a deep learning pipeline combining You Only Look Once Version 8 (YOLOv8) and Segment Anything Model Vision Transformer-base (SAM ViT-b) to enhance glioma detection and segmentation during ioMRI. Our model was trained using the Brain Tumor Segmentation 2021 (BraTS 2021) dataset, which includes standard magnetic resonance imaging (MRI) images, and noise-augmented MRI images that simulate ioMRI images. Noised MRI images are harder for a deep learning pipeline to segment, but they are more representative of surgical conditions. Achieving a Dice Similarity Coefficient (DICE) score of 0.79, our model performs comparably to state-of-the-art segmentation models tested on noiseless data. This performance demonstrates the model's potential to assist surgeons in maximizing tumor resection and improving surgical outcomes.
Submitted 27 August, 2024;
originally announced August 2024.
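The Dice Similarity Coefficient reported in the abstract above measures overlap between a predicted mask and a ground-truth mask as 2|A∩B| / (|A| + |B|). The following is a minimal sketch of that formula only, assuming binary masks stored as nested lists; the function name `dice_coefficient` and the toy masks are illustrative and are not taken from the paper.

```python
def dice_coefficient(pred, truth):
    """Dice = 2|A∩B| / (|A| + |B|) for two binary masks of equal shape."""
    flat_pred = [bool(v) for row in pred for v in row]
    flat_truth = [bool(v) for row in truth for v in row]
    if len(flat_pred) != len(flat_truth):
        raise ValueError("masks must have the same number of pixels")

    # Pixels marked as tumor in both masks.
    intersection = sum(p and t for p, t in zip(flat_pred, flat_truth))
    total = sum(flat_pred) + sum(flat_truth)

    # Convention assumed here: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Toy 2x3 masks (hypothetical data): 3 predicted pixels, 3 true pixels, 2 shared.
predicted = [[1, 1, 0], [0, 1, 0]]
ground_truth = [[1, 0, 0], [0, 1, 1]]
score = dice_coefficient(predicted, ground_truth)
```

With these toy masks the score is 2·2 / (3 + 3), or about 0.667; a score of 1.0 means the two masks coincide exactly.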
-
Foundation Models for Music: A Survey
Authors:
Yinghao Ma,
Anders Øland,
Anton Ragni,
Bleiz MacSen Del Sette,
Charalampos Saitis,
Chris Donahue,
Chenghua Lin,
Christos Plachouras,
Emmanouil Benetos,
Elio Quinton,
Elona Shatri,
Fabio Morreale,
Ge Zhang,
György Fazekas,
Gus Xia,
Huan Zhang,
Ilaria Manco,
Jiawen Huang,
Julien Guinot,
Liwei Lin,
Luca Marinelli,
Max W. Y. Lam,
Megha Sharma,
Qiuqiang Kong,
Roger B. Dannenberg
, et al. (18 additional authors not shown)
Abstract:
In recent years, foundation models (FMs) such as large language models (LLMs) and latent diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This comprehensive review examines state-of-the-art (SOTA) pre-trained models and foundation models in music, spanning representation learning, generative learning, and multimodal learning. We first contextualise the significance of music in various industries and trace the evolution of AI in music. By delineating the modalities targeted by foundation models, we discover that many music representations are underexplored in FM development. Then, emphasis is placed on the lack of versatility of previous methods on diverse music applications, along with the potential of FMs in music understanding, generation and medical application. By comprehensively exploring the details of the model pre-training paradigm, architectural choices, tokenisation, finetuning methodologies and controllability, we emphasise the important topics that should have been well explored, like instruction tuning and in-context learning, scaling law and emergent ability, and long-sequence modelling. A dedicated section presents insights into music agents, accompanied by a thorough analysis of datasets and evaluations essential for pre-training and downstream tasks. Finally, by underscoring the vital importance of ethical considerations, we advocate that future research on FMs for music should focus more on issues such as interpretability, transparency, human responsibility, and copyright. The paper offers insights into future challenges and trends on FMs for music, aiming to shape the trajectory of human-AI collaboration in the music realm.
Submitted 27 August, 2024; v1 submitted 26 August, 2024;
originally announced August 2024.
-
Enhancing Depression Diagnosis with Chain-of-Thought Prompting
Authors:
Elysia Shi,
Adithri Manda,
London Chowdhury,
Runeema Arun,
Kevin Zhu,
Michael Lam
Abstract:
When using AI to detect signs of depressive disorder, AI models habitually draw preemptive conclusions. We theorize that using chain-of-thought (CoT) prompting to evaluate Patient Health Questionnaire-8 (PHQ-8) scores will improve the accuracy of the scores determined by AI models. In our findings, when the models reasoned with CoT, the estimated PHQ-8 scores were consistently closer on average to the accepted true scores reported by each participant compared to when not using CoT. Our goal is to expand upon AI models' understanding of the intricacies of human conversation, allowing them to more effectively assess a patient's feelings and tone, therefore being able to more accurately discern mental disorder symptoms; ultimately, we hope to augment AI models' abilities, so that they can be widely accessible and used in the medical field.
Submitted 27 August, 2024; v1 submitted 26 August, 2024;
originally announced August 2024.
-
SPINACH: SPARQL-Based Information Navigation for Challenging Real-World Questions
Authors:
Shicheng Liu,
Sina J. Semnani,
Harold Triedman,
Jialiang Xu,
Isaac Dan Zhao,
Monica S. Lam
Abstract:
Recent work integrating Large Language Models (LLMs) has led to significant improvements in the Knowledge Base Question Answering (KBQA) task. However, we posit that existing KBQA datasets that either have simple questions, use synthetically generated logical forms, or are based on small knowledge base (KB) schemas, do not capture the true complexity of KBQA tasks.
To address this, we introduce the SPINACH dataset, an expert-annotated KBQA dataset collected from forum discussions on Wikidata's "Request a Query" forum with 320 decontextualized question-SPARQL pairs. Much more complex than existing datasets, SPINACH calls for strong KBQA systems that do not rely on training data to learn the KB schema, but can dynamically explore large and often incomplete schemas and reason about them.
Along with the dataset, we introduce the SPINACH agent, a new KBQA approach that mimics how a human expert would write SPARQLs for such challenging questions. Experiments on existing datasets show SPINACH's capability in KBQA, achieving a new state of the art on the QALD-7, QALD-9 Plus and QALD-10 datasets by 30.1%, 27.0%, and 10.0% in F1, respectively, and coming within 1.6% of the fine-tuned LLaMA SOTA model on WikiWebQuestions. On our new SPINACH dataset, SPINACH agent outperforms all baselines, including the best GPT-4-based KBQA agent, by 38.1% in F1.
Submitted 16 July, 2024;
originally announced July 2024.
-
LLM-Based Open-Domain Integrated Task and Knowledge Assistants with Programmable Policies
Authors:
Harshit Joshi,
Shicheng Liu,
James Chen,
Robert Weigle,
Monica S. Lam
Abstract:
Programming LLM-based knowledge and task assistants that faithfully conform to developer-provided policies is challenging. These agents must retrieve and provide consistent, accurate, and relevant information to address users' queries and needs. Yet such agents generate unfounded responses ("hallucinate"). Traditional dialogue trees can only handle a limited number of conversation flows, making them inherently brittle. To this end, we present KITA, a programmable framework for creating task-oriented conversational agents that are designed to handle complex user interactions. Unlike LLMs, KITA provides reliable grounded responses, with controllable agent policies through its expressive specification, KITA Worksheet. In contrast to dialogue trees, it is resilient to diverse user queries, helpful with knowledge sources, and offers ease of programming policies through its declarative paradigm. Through a real-user study involving 62 participants, we show that KITA beats the GPT-4 with function calling baseline by 26.1, 22.5, and 52.4 points on execution accuracy, dialogue act accuracy, and goal completion rate, respectively. We also release 22 real-user conversations with KITA manually corrected to ensure accuracy.
Submitted 8 July, 2024;
originally announced July 2024.
-
Zero-shot Persuasive Chatbots with LLM-Generated Strategies and Information Retrieval
Authors:
Kazuaki Furumai,
Roberto Legaspi,
Julio Vizcarra,
Yudai Yamazaki,
Yasutaka Nishimura,
Sina J. Semnani,
Kazushi Ikeda,
Weiyan Shi,
Monica S. Lam
Abstract:
Persuasion plays a pivotal role in a wide range of applications from health intervention to the promotion of social good. Persuasive chatbots can accelerate the positive effects of persuasion in such applications. Existing methods rely on fine-tuning persuasive chatbots with task-specific training data, which is costly, if not infeasible, to collect. To address this issue, we propose a method to leverage the generalizability and inherent persuasive abilities of large language models (LLMs) in creating effective and truthful persuasive chatbots for any given domain in a zero-shot manner. Unlike previous studies which used pre-defined persuasion strategies, our method first uses an LLM to generate responses, then extracts the strategies used on the fly, and replaces any unsubstantiated claims in the response with retrieved facts supporting the strategies. We applied our chatbot, PersuaBot, to three significantly different domains needing persuasion skills: donation solicitation, recommendations, and health intervention. Our experiments on simulated and human conversations show that our zero-shot approach is more persuasive than prior work, while achieving factual accuracy surpassing state-of-the-art knowledge-oriented chatbots. Our study demonstrated that when persuasive chatbots are employed responsibly for social good, they are an enabler of positive individual and social change.
Submitted 3 July, 2024;
originally announced July 2024.
-
Show, Don't Tell: Aligning Language Models with Demonstrated Feedback
Authors:
Omar Shaikh,
Michelle Lam,
Joey Hejna,
Yijia Shao,
Michael Bernstein,
Diyi Yang
Abstract:
Language models are aligned to emulate the collective voice of many, resulting in outputs that align with no one in particular. Steering LLMs away from generic output is possible through supervised finetuning or RLHF, but requires prohibitively large datasets for new ad-hoc tasks. We argue that it is instead possible to align an LLM to a specific setting by leveraging a very small number ($<10$) of demonstrations as feedback. Our method, Demonstration ITerated Task Optimization (DITTO), directly aligns language model outputs to a user's demonstrated behaviors. Derived using ideas from online imitation learning, DITTO cheaply generates online comparison data by treating users' demonstrations as preferred over output from the LLM and its intermediate checkpoints. We evaluate DITTO's ability to learn fine-grained style and task alignment across domains such as news articles, emails, and blog posts. Additionally, we conduct a user study soliciting a range of demonstrations from participants ($N=16$). Across our benchmarks and user study, we find that win-rates for DITTO outperform few-shot prompting, supervised fine-tuning, and other self-play methods by an average of 19 percentage points. By using demonstrations as feedback directly, DITTO offers a novel method for effective customization of LLMs.
Submitted 2 June, 2024;
originally announced June 2024.
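The comparison-data idea in the abstract above (a user's demonstration is treated as preferred over any output sampled from the model or its intermediate checkpoints) can be illustrated with a short sketch. This is not the authors' implementation: the `Comparison` type, the `build_comparisons` function, and the sample data are all hypothetical, and the sketch covers only pair construction, not training.

```python
from typing import Dict, List, NamedTuple


class Comparison(NamedTuple):
    prompt: str
    chosen: str    # the user's demonstration
    rejected: str  # a model output sampled at some checkpoint


def build_comparisons(
    demonstrations: Dict[str, str],
    checkpoint_samples: List[Dict[str, List[str]]],
) -> List[Comparison]:
    """Pair each demonstration with every sampled output for the same prompt.

    checkpoint_samples holds one dict per checkpoint (hypothetical layout),
    mapping a prompt to the outputs sampled from that checkpoint.
    """
    pairs = []
    for prompt, demo in demonstrations.items():
        for samples in checkpoint_samples:
            for output in samples.get(prompt, []):
                # The demonstration is always the preferred side of the pair.
                pairs.append(Comparison(prompt, demo, output))
    return pairs


# Hypothetical data: one demonstration, samples from two checkpoints.
demos = {"Write a short email declining a meeting.": "Hi Sam, I can't make it Tuesday."}
samples = [
    {"Write a short email declining a meeting.": ["Dear Sir or Madam, ...", "Hello, I regret ..."]},
    {"Write a short email declining a meeting.": ["Hi, sadly I can't attend ..."]},
]
comparisons = build_comparisons(demos, samples)
```

Here a single demonstration yields three preference pairs, one per sampled output, which shows how a handful of demonstrations can produce a larger set of comparisons.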
-
SPAGHETTI: Open-Domain Question Answering from Heterogeneous Data Sources with Retrieval and Semantic Parsing
Authors:
Heidi C. Zhang,
Sina J. Semnani,
Farhad Ghassemi,
Jialiang Xu,
Shicheng Liu,
Monica S. Lam
Abstract:
We introduce SPAGHETTI: Semantic Parsing Augmented Generation for Hybrid English information from Text Tables and Infoboxes, a hybrid question-answering (QA) pipeline that utilizes information from heterogeneous knowledge sources, including knowledge base, text, tables, and infoboxes. Our LLM-augmented approach achieves state-of-the-art performance on the Compmix dataset, the most comprehensive heterogeneous open-domain QA dataset, with 56.5% exact match (EM) rate. More importantly, manual analysis on a sample of the dataset suggests that SPAGHETTI is more than 90% accurate, indicating that EM is no longer suitable for assessing the capabilities of QA systems today.
Submitted 1 June, 2024;
originally announced June 2024.
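The gap between exact match (EM) and manually judged accuracy noted in the abstract above is easy to reproduce in miniature. The sketch below assumes a common SQuAD-style normalization (lowercasing, stripping punctuation and English articles); the paper's exact normalization is not stated in the abstract, and the function names and example strings are illustrative.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """EM counts a prediction as correct only if the normalized strings are identical."""
    return normalize(prediction) == normalize(gold)


# Surface differences such as casing and articles are forgiven...
same_answer = exact_match("The Eiffel Tower", "eiffel tower")
# ...but a correct answer with extra detail still scores zero.
more_specific = exact_match("Paris, France", "Paris")
```

The second comparison is scored as a miss even though a human judge would likely accept it, which is the kind of case where EM understates a system's accuracy.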
-
A Closer Look at Logical Reasoning with LLMs: The Choice of Tool Matters
Authors:
Long Hei Matthew Lam,
Ramya Keerthy Thatikonda,
Ehsan Shareghi
Abstract:
The emergence of Large Language Models (LLMs) has demonstrated promising progress in solving logical reasoning tasks effectively. Several recent approaches have proposed to change the role of the LLM from the reasoner into a translator between natural language statements and symbolic representations which are then sent to external symbolic solvers to resolve. This paradigm has established the current state-of-the-art result in logical reasoning (i.e., deductive reasoning). However, it remains unclear whether the variance in performance of these approaches stems from the methodologies employed or the specific symbolic solvers utilized. There is a lack of consistent comparison between symbolic solvers and how they influence the overall reported performance. This is important, as each symbolic solver also has its own input symbolic language, presenting varying degrees of challenge in the translation process. To address this gap, we perform experiments on 3 deductive reasoning benchmarks with LLMs augmented with widely used symbolic solvers: Z3, Pyke, and Prover9. The tool-executable rates of symbolic translation generated by different LLMs exhibit a near 50% performance variation. This highlights a significant difference in performance rooted in very basic choices of tools. The almost linear correlation between the executable rate of translations and the accuracy of the outcomes from Prover9 highlights a strong alignment between LLMs' ability to translate into the Prover9 symbolic language and the correctness of those translations.
Submitted 11 July, 2024; v1 submitted 31 May, 2024;
originally announced June 2024.
-
Benchmarks Underestimate the Readiness of Multi-lingual Dialogue Agents
Authors:
Andrew H. Lee,
Sina J. Semnani,
Galo Castillo-López,
Gäel de Chalendar,
Monojit Choudhury,
Ashna Dua,
Kapil Rajesh Kavitha,
Sungkyun Kim,
Prashant Kodali,
Ponnurangam Kumaraguru,
Alexis Lombard,
Mehrad Moradshahi,
Gihyun Park,
Nasredine Semmar,
Jiwon Seo,
Tianhao Shen,
Manish Shrivastava,
Deyi Xiong,
Monica S. Lam
Abstract:
Creating multilingual task-oriented dialogue (TOD) agents is challenging due to the high cost of training data acquisition. Following the research trend of improving training data efficiency, we show, for the first time, that in-context learning is sufficient to tackle multilingual TOD.
To handle the challenging dialogue state tracking (DST) subtask, we break it down into simpler steps that are more compatible with in-context learning where only a handful of few-shot examples are used. We test our approach on the multilingual TOD dataset X-RiSAWOZ, which has 12 domains in Chinese, English, French, Korean, Hindi, and code-mixed Hindi-English. Our turn-by-turn DST accuracy on the 6 languages ranges from 55.6% to 80.3%, seemingly worse than the SOTA results from fine-tuned models that achieve from 60.7% to 82.8%; our BLEU scores in the response generation (RG) subtask are also significantly lower than SOTA.
However, after manual evaluation of the validation set, we find that by correcting gold label errors and improving dataset annotation schema, GPT-4 with our prompts can achieve (1) 89.6%-96.8% accuracy in DST, and (2) more than 99% correct response generation across different languages. This leads us to conclude that current automatic metrics heavily underestimate the effectiveness of in-context learning.
Submitted 16 June, 2024; v1 submitted 28 May, 2024;
originally announced May 2024.
-
Concept Induction: Analyzing Unstructured Text with High-Level Concepts Using LLooM
Authors:
Michelle S. Lam,
Janice Teoh,
James Landay,
Jeffrey Heer,
Michael S. Bernstein
Abstract:
Data analysts have long sought to turn unstructured text data into meaningful concepts. Though common, topic modeling and clustering focus on lower-level keywords and require significant interpretative work. We introduce concept induction, a computational process that instead produces high-level concepts, defined by explicit inclusion criteria, from unstructured text. For a dataset of toxic online comments, where a state-of-the-art BERTopic model outputs "women, power, female," concept induction produces high-level concepts such as "Criticism of traditional gender roles" and "Dismissal of women's concerns." We present LLooM, a concept induction algorithm that leverages large language models to iteratively synthesize sampled text and propose human-interpretable concepts of increasing generality. We then instantiate LLooM in a mixed-initiative text analysis tool, enabling analysts to shift their attention from interpreting topics to engaging in theory-driven analysis. Through technical evaluations and four analysis scenarios ranging from literature review to content moderation, we find that LLooM's concepts improve upon the prior art of topic models in terms of quality and data coverage. In expert case studies, LLooM helped researchers to uncover new insights even from familiar datasets, for example by suggesting a previously unnoticed concept of attacks on out-party stances in a political social media dataset.
Submitted 18 April, 2024;
originally announced April 2024.
-
Voice EHR: Introducing Multimodal Audio Data for Health
Authors:
James Anibal,
Hannah Huth,
Ming Li,
Lindsey Hazen,
Yen Minh Lam,
Hang Nguyen,
Phuc Hong,
Michael Kleinman,
Shelley Ost,
Christopher Jackson,
Laura Sprabery,
Cheran Elangovan,
Balaji Krishnaiah,
Lee Akst,
Ioan Lina,
Iqbal Elyazar,
Lenny Ekwati,
Stefan Jansen,
Richard Nduwayezu,
Charisse Garcia,
Jeffrey Plum,
Jacqueline Brenner,
Miranda Song,
Emily Ricotta,
David Clifton
, et al. (3 additional authors not shown)
Abstract:
Large AI models trained on audio data may have the potential to rapidly classify patients, enhancing medical decision-making and potentially improving outcomes through early detection. Existing technologies depend on limited datasets using expensive recording equipment in high-income, English-speaking countries. This challenges deployment in resource-constrained, high-volume settings where audio data may have a profound impact. This report introduces a novel data type and a corresponding collection system that captures health data through guided questions using only a mobile/web application. This application ultimately results in an audio electronic health record (voice EHR) which may contain complex biomarkers of health from conventional voice/respiratory features, speech patterns, and language with semantic meaning - compensating for the typical limitations of unimodal clinical datasets. This report introduces a consortium of partners for global work, presents the application used for data collection, and showcases the potential of informative voice EHR to advance the scalability and diversity of audio AI.
Submitted 1 June, 2024; v1 submitted 2 April, 2024;
originally announced April 2024.
-
How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments
Authors:
Jen-tse Huang,
Eric John Li,
Man Ho Lam,
Tian Liang,
Wenxuan Wang,
Youliang Yuan,
Wenxiang Jiao,
Xing Wang,
Zhaopeng Tu,
Michael R. Lyu
Abstract:
Decision-making, a complicated task requiring various types of abilities, presents an excellent framework for assessing Large Language Models (LLMs). Our research investigates LLMs' decision-making capabilities through the lens of a well-established field, Game Theory. We focus specifically on games that support the participation of more than two agents simultaneously. Subsequently, we introduce our framework, GAMA-Bench, including eight classical multi-agent games. We design a scoring scheme to assess a model's performance in these games quantitatively. Through GAMA-Bench, we investigate LLMs' robustness, generalizability, and enhancement strategies. Results reveal that while GPT-3.5 shows satisfying robustness, its generalizability is relatively limited. However, its performance can be improved through approaches such as Chain-of-Thought. Additionally, we conduct evaluations across various LLMs and find that GPT-4 outperforms other models on GAMA-Bench, achieving a score of 60.5. Moreover, Gemini-1.0-Pro and GPT-3.5 (0613, 1106, 0125) demonstrate similar intelligence on GAMA-Bench. The code and experimental results are made publicly available via https://github.com/CUHK-ARISE/GAMABench.
Submitted 25 April, 2024; v1 submitted 18 March, 2024;
originally announced March 2024.
-
Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models
Authors:
Yijia Shao,
Yucheng Jiang,
Theodore A. Kanell,
Peter Xu,
Omar Khattab,
Monica S. Lam
Abstract:
We study how to apply large language models to write grounded and organized long-form articles from scratch, with comparable breadth and depth to Wikipedia pages. This underexplored problem poses new challenges at the pre-writing stage, including how to research the topic and prepare an outline prior to writing. We propose STORM, a writing system for the Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking. STORM models the pre-writing stage by (1) discovering diverse perspectives in researching the given topic, (2) simulating conversations where writers carrying different perspectives pose questions to a topic expert grounded on trusted Internet sources, (3) curating the collected information to create an outline.
For evaluation, we curate FreshWiki, a dataset of recent high-quality Wikipedia articles, and formulate outline assessments to evaluate the pre-writing stage. We further gather feedback from experienced Wikipedia editors. Compared to articles generated by an outline-driven retrieval-augmented baseline, more of STORM's articles are deemed to be organized (by a 25% absolute increase) and broad in coverage (by 10%). The expert feedback also helps identify new challenges for generating grounded long articles, such as source bias transfer and over-association of unrelated facts.
Submitted 8 April, 2024; v1 submitted 21 February, 2024;
originally announced February 2024.
-
Clarify: Improving Model Robustness With Natural Language Corrections
Authors:
Yoonho Lee,
Michelle S. Lam,
Helena Vasconcelos,
Michael S. Bernstein,
Chelsea Finn
Abstract:
The standard way to teach models is by feeding them lots of data. However, this approach often teaches models incorrect ideas because they pick up on misleading signals in the data. To prevent such misconceptions, we must necessarily provide additional information beyond the training data. Prior methods incorporate additional instance-level supervision, such as labels for misleading features or additional labels for debiased data. However, such strategies require a large amount of labeler effort. We hypothesize that people are good at providing textual feedback at the concept level, a capability that existing teaching frameworks do not leverage. We propose Clarify, a novel interface and method for interactively correcting model misconceptions. Through Clarify, users need only provide a short text description of a model's consistent failure patterns. Then, in an entirely automated way, we use such descriptions to improve the training process. Clarify is the first end-to-end system for user model correction. Our user studies show that non-expert users can successfully describe model misconceptions via Clarify, leading to increased worst-case performance in two datasets. We additionally conduct a case study on a large-scale image dataset, ImageNet, using Clarify to find and rectify 31 novel hard subpopulations.
Submitted 21 August, 2024; v1 submitted 6 February, 2024;
originally announced February 2024.
-
Designing LLM Chains by Adapting Techniques from Crowdsourcing Workflows
Authors:
Madeleine Grunde-McLaughlin,
Michelle S. Lam,
Ranjay Krishna,
Daniel S. Weld,
Jeffrey Heer
Abstract:
LLM chains enable complex tasks by decomposing work into a sequence of subtasks. Similarly, the more established techniques of crowdsourcing workflows decompose complex tasks into smaller tasks for human crowdworkers. Chains address LLM errors analogously to the way crowdsourcing workflows address human error. To characterize opportunities for LLM chaining, we survey 107 papers across the crowdsourcing and chaining literature to construct a design space for chain development. The design space covers a designer's objectives and the tactics used to build workflows. We then surface strategies that mediate how workflows use tactics to achieve objectives. To explore how techniques from crowdsourcing may apply to chaining, we adapt crowdsourcing workflows to implement LLM chains across three case studies: creating a taxonomy, shortening text, and writing a short story. From the design space and our case studies, we identify takeaways for effective chain design and raise implications for future research and development.
Submitted 6 May, 2024; v1 submitted 18 December, 2023;
originally announced December 2023.
-
A Unifying Tensor View for Lightweight CNNs
Authors:
Jason Chun Lok Li,
Rui Lin,
Jiajun Zhou,
Edmund Yin Mun Lam,
Ngai Wong
Abstract:
Despite the decomposition of convolutional kernels for lightweight CNNs being well studied, existing works that rely on tensor network diagrams or hyperdimensional abstraction lack geometric intuition. This work devises a new perspective by linking a 3D-reshaped kernel tensor to its various slice-wise and rank-1 decompositions, permitting a straightforward connection between various tensor approximations and efficient CNN modules. Specifically, it is discovered that a pointwise-depthwise-pointwise (PDP) configuration constitutes a viable construct for lightweight CNNs. Moreover, a novel link to the latest ShiftNet is established, inspiring a first-ever shift layer pruning that achieves nearly 50% compression with < 1% drop in accuracy for ShiftResNet.
Submitted 15 December, 2023;
originally announced December 2023.
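As a rough illustration of why the pointwise-depthwise-pointwise (PDP) configuration named in the abstract above is lightweight, the sketch below compares weight counts against a dense k x k convolution. The function names, the intermediate width `c_mid`, and the example sizes are assumptions made for illustration and are not taken from the paper; biases are ignored.

```python
def dense_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (no bias)."""
    return k * k * c_in * c_out


def pdp_params(c_in: int, c_mid: int, c_out: int, k: int) -> int:
    """Weights in a pointwise -> depthwise -> pointwise stack (no bias)."""
    pointwise_in = c_in * c_mid    # 1x1 conv mapping c_in -> c_mid channels
    depthwise = k * k * c_mid      # one k x k filter per intermediate channel
    pointwise_out = c_mid * c_out  # 1x1 conv mapping c_mid -> c_out channels
    return pointwise_in + depthwise + pointwise_out


if __name__ == "__main__":
    # Hypothetical layer sizes, chosen only to make the comparison concrete.
    dense = dense_conv_params(c_in=64, c_out=64, k=3)
    pdp = pdp_params(c_in=64, c_mid=64, c_out=64, k=3)
    print(f"dense: {dense} weights, PDP: {pdp} weights")
```

The count covers parameters only; it says nothing about the accuracy trade-off, which is what the paper's tensor view is meant to analyze.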
-
SUQL: Conversational Search over Structured and Unstructured Data with Large Language Models
Authors:
Shicheng Liu,
Jialiang Xu,
Wesley Tjangnaka,
Sina J. Semnani,
Chen Jie Yu,
Monica S. Lam
Abstract:
While most conversational agents are grounded on either free-text or structured knowledge, many knowledge corpora consist of hybrid sources. This paper presents the first conversational agent that supports the full generality of hybrid data access for large knowledge corpora, through a language we developed called SUQL (Structured and Unstructured Query Language). Specifically, SUQL extends SQL with free-text primitives (summary and answer), so information retrieval can be composed with structured data accesses arbitrarily in a formal, succinct, precise, and interpretable notation. With SUQL, we propose the first semantic parser, an LLM with in-context learning, that can handle hybrid data sources.
Our in-context learning-based approach, when applied to the HybridQA dataset, comes within 8.9% exact match and 7.1% F1 of the SOTA, which was trained on 62K data samples. More significantly, unlike previous approaches, our technique is applicable to large databases and free-text corpora. We introduce a dataset consisting of crowdsourced questions and conversations on Yelp, a large, real restaurant knowledge base with structured and unstructured data. We show that our few-shot conversational agent based on SUQL finds an entity satisfying all user requirements 90.3% of the time, compared to 63.4% for a baseline based on linearization.
Submitted 13 March, 2024; v1 submitted 16 November, 2023;
originally announced November 2023.
-
3D Self-Localization of Drones using a Single Millimeter-Wave Anchor
Authors:
Maisy Lam,
Laura Dodds,
Aline Eid,
Jimmy Hester,
Fadel Adib
Abstract:
We present the design, implementation, and evaluation of MiFly, a self-localization system for autonomous drones that works across indoor and outdoor environments, including low-visibility, dark, and GPS-denied settings. MiFly performs 6DoF self-localization by leveraging a single millimeter-wave (mmWave) anchor in its vicinity - even if that anchor is visually occluded. MmWave signals are used in radar and 5G systems and can operate in the dark and through occlusions. MiFly introduces a new mmWave anchor design and mounts light-weight high-resolution mmWave radars on a drone. By jointly designing the localization algorithms and the novel low-power mmWave anchor hardware (including its polarization and modulation), the drone is capable of high-speed 3D localization. Furthermore, by intelligently fusing the location estimates from its mmWave radars and its IMUs, it can accurately and robustly track its 6DoF trajectory. We implemented and evaluated MiFly on a DJI drone. We demonstrate a median localization error of 7cm and a 90th percentile less than 15cm, even when the anchor is fully occluded (visually) from the drone.
Submitted 12 October, 2023;
originally announced October 2023.
-
Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench
Authors:
Jen-tse Huang,
Wenxuan Wang,
Eric John Li,
Man Ho Lam,
Shujie Ren,
Youliang Yuan,
Wenxiang Jiao,
Zhaopeng Tu,
Michael R. Lyu
Abstract:
Large Language Models (LLMs) have recently showcased their remarkable capacities, not only in natural language processing tasks but also across diverse domains such as clinical medicine, legal consultation, and education. LLMs become more than mere applications, evolving into assistants capable of addressing diverse user requests. This narrows the distinction between human beings and artificial intelligence agents, raising intriguing questions regarding the potential manifestation of personalities, temperaments, and emotions within LLMs. In this paper, we propose a framework, PsychoBench, for evaluating diverse psychological aspects of LLMs. Comprising thirteen scales commonly used in clinical psychology, PsychoBench further classifies these scales into four distinct categories: personality traits, interpersonal relationships, motivational tests, and emotional abilities. Our study examines five popular models, namely text-davinci-003, gpt-3.5-turbo, gpt-4, LLaMA-2-7b, and LLaMA-2-13b. Additionally, we employ a jailbreak approach to bypass the safety alignment protocols and test the intrinsic natures of LLMs. We have made PsychoBench openly accessible via https://github.com/CUHK-ARISE/PsychoBench.
Submitted 22 January, 2024; v1 submitted 2 October, 2023;
originally announced October 2023.
-
Sociotechnical Audits: Broadening the Algorithm Auditing Lens to Investigate Targeted Advertising
Authors:
Michelle S. Lam,
Ayush Pandit,
Colin H. Kalicki,
Rachit Gupta,
Poonam Sahoo,
Danaë Metaxa
Abstract:
Algorithm audits are powerful tools for studying black-box systems. While very effective in examining technical components, the method stops short of a sociotechnical frame, which would also consider users as an integral and dynamic part of the system. Addressing this gap, we propose the concept of sociotechnical auditing: auditing methods that evaluate algorithmic systems at the sociotechnical level, focusing on the interplay between algorithms and users as each impacts the other. Just as algorithm audits probe an algorithm with varied inputs and observe outputs, a sociotechnical audit (STA) additionally probes users, exposing them to different algorithmic behavior and measuring resulting attitudes and behaviors. To instantiate this method, we develop Intervenr, a platform for conducting browser-based, longitudinal sociotechnical audits with consenting, compensated participants. Intervenr investigates the algorithmic content users encounter online and coordinates systematic client-side interventions to understand how users change in response. As a case study, we deploy Intervenr in a two-week sociotechnical audit of online advertising (N=244) to investigate the central premise that personalized ad targeting is more effective on users. In the first week, we collect all browser ads delivered to users, and in the second, we deploy an ablation-style intervention that disrupts normal targeting by randomly pairing participants and swapping all their ads. We collect user-oriented metrics (self-reported ad interest and feeling of representation) and advertiser-oriented metrics (ad views, clicks, and recognition) throughout, along with a total of over 500,000 ads. Our STA finds that targeted ads indeed perform better with users, but also that users begin to acclimate to different ads in only a week, casting doubt on the primacy of personalized ad targeting given the impact of repeated exposure.
Submitted 30 August, 2023;
originally announced August 2023.
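The ablation-style intervention described in the abstract above (randomly pairing participants and swapping all their ads) can be sketched as follows. This is a minimal illustration, not Intervenr's implementation: the function name, the dictionary layout, the fixed seed, and the handling of an unpaired participant are all assumptions.

```python
import random


def pair_and_swap(ads_by_user: dict[str, list[str]], seed: int = 0) -> dict[str, list[str]]:
    """Randomly pair users and give each user their partner's ads.

    With an odd number of users, the one left unpaired after shuffling
    keeps their own ads (an assumption; the abstract does not say).
    """
    rng = random.Random(seed)
    users = list(ads_by_user)
    rng.shuffle(users)

    swapped = dict(ads_by_user)
    # Consecutive users in the shuffled order form a pair.
    for a, b in zip(users[0::2], users[1::2]):
        swapped[a], swapped[b] = ads_by_user[b], ads_by_user[a]
    return swapped


if __name__ == "__main__":
    # Hypothetical participants and ads.
    ads = {"u1": ["shoes"], "u2": ["laptops"], "u3": ["flights"], "u4": ["books"]}
    print(pair_and_swap(ads, seed=1))
```

Shuffling once and pairing neighbours gives every participant exactly one partner, so the overall pool of ads shown in the second week is unchanged and only the targeting is disrupted.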
-
Emotionally Numb or Empathetic? Evaluating How LLMs Feel Using EmotionBench
Authors:
Jen-tse Huang,
Man Ho Lam,
Eric John Li,
Shujie Ren,
Wenxuan Wang,
Wenxiang Jiao,
Zhaopeng Tu,
Michael R. Lyu
Abstract:
Evaluating Large Language Models' (LLMs) anthropomorphic capabilities has become increasingly important in contemporary discourse. Utilizing the emotion appraisal theory from psychology, we propose to evaluate the empathy ability of LLMs, i.e., how their feelings change when presented with specific situations. After a careful and comprehensive survey, we collect a dataset containing over 400 situations that have proven effective in eliciting the eight emotions central to our study. Categorizing the situations into 36 factors, we conduct a human evaluation involving more than 1,200 subjects worldwide. With the human evaluation results as references, our evaluation includes seven LLMs, covering both commercial and open-source models, including variations in model sizes, featuring the latest iterations, such as GPT-4, Mixtral-8x22B, and LLaMA-3.1. We find that, despite several misalignments, LLMs can generally respond appropriately to certain situations. Nevertheless, they fall short in alignment with the emotional behaviors of human beings and cannot establish connections between similar situations. Our EmotionBench, including the collected dataset of situations, the human evaluation results, and the code of our testing framework, is publicly available at https://github.com/CUHK-ARISE/EmotionBench.
Submitted 12 August, 2024; v1 submitted 7 August, 2023;
originally announced August 2023.
-
Point Clouds Are Specialized Images: A Knowledge Transfer Approach for 3D Understanding
Authors:
Jiachen Kang,
Wenjing Jia,
Xiangjian He,
Kin Man Lam
Abstract:
Self-supervised representation learning (SSRL) has gained increasing attention in point cloud understanding, in addressing the challenges posed by 3D data scarcity and high annotation costs. This paper presents PCExpert, a novel SSRL approach that reinterprets point clouds as "specialized images". This conceptual shift allows PCExpert to leverage knowledge derived from large-scale image modality in a more direct and deeper manner, via extensively sharing the parameters with a pre-trained image encoder in a multi-way Transformer architecture. The parameter sharing strategy, combined with a novel pretext task for pre-training, i.e., transformation estimation, empowers PCExpert to outperform the state of the art in a variety of tasks, with a remarkable reduction in the number of trainable parameters. Notably, PCExpert's performance under LINEAR fine-tuning (e.g., yielding a 90.02% overall accuracy on ScanObjectNN) has already approached the results obtained with FULL model fine-tuning (92.66%), demonstrating its effective and robust representation capability.
Submitted 23 April, 2024; v1 submitted 28 July, 2023;
originally announced July 2023.
-
Embedding Democratic Values into Social Media AIs via Societal Objective Functions
Authors:
Chenyan Jia,
Michelle S. Lam,
Minh Chau Mai,
Jeff Hancock,
Michael S. Bernstein
Abstract:
Can we design artificial intelligence (AI) systems that rank our social media feeds to consider democratic values such as mitigating partisan animosity as part of their objective functions? We introduce a method for translating established, vetted social scientific constructs into AI objective functions, which we term societal objective functions, and demonstrate the method with application to the political science construct of anti-democratic attitudes. Traditionally, we have lacked observable outcomes to use to train such models; however, the social sciences have developed survey instruments and qualitative codebooks for these constructs, and their precision facilitates translation into detailed prompts for large language models. We apply this method to create a democratic attitude model that estimates the extent to which a social media post promotes anti-democratic attitudes, and test this democratic attitude model across three studies. In Study 1, we first test the attitudinal and behavioral effectiveness of the intervention among US partisans (N=1,380) by manually annotating (alpha=.895) social media posts with anti-democratic attitude scores and testing several feed ranking conditions based on these scores. Removal (d=.20) and downranking feeds (d=.25) reduced participants' partisan animosity without compromising their experience and engagement. In Study 2, we scale up the manual labels by creating the democratic attitude model, finding strong agreement with manual labels (rho=.75). Finally, in Study 3, we replicate Study 1 using the democratic attitude model instead of manual labels to test its attitudinal and behavioral impact (N=558), and again find that the feed downranking using the societal objective function reduced partisan animosity (d=.25). This method presents a novel strategy to draw on social science theory and methods to mitigate societal harms in social media AIs.
Submitted 14 February, 2024; v1 submitted 25 July, 2023;
originally announced July 2023.
-
X-RiSAWOZ: High-Quality End-to-End Multilingual Dialogue Datasets and Few-shot Agents
Authors:
Mehrad Moradshahi,
Tianhao Shen,
Kalika Bali,
Monojit Choudhury,
Gaël de Chalendar,
Anmol Goel,
Sungkyun Kim,
Prashant Kodali,
Ponnurangam Kumaraguru,
Nasredine Semmar,
Sina J. Semnani,
Jiwon Seo,
Vivek Seshadri,
Manish Shrivastava,
Michael Sun,
Aditya Yadavalli,
Chaobin You,
Deyi Xiong,
Monica S. Lam
Abstract:
Task-oriented dialogue research has mainly focused on a few popular languages like English and Chinese, due to the high dataset creation cost for a new language. To reduce the cost, we apply manual editing to automatically translated data. We create a new multilingual benchmark, X-RiSAWOZ, by translating the Chinese RiSAWOZ to 4 languages: English, French, Hindi, Korean; and a code-mixed English-Hindi language. X-RiSAWOZ has more than 18,000 human-verified dialogue utterances for each language, and unlike most multilingual prior work, is an end-to-end dataset for building fully-functioning agents.
The many difficulties we encountered in creating X-RiSAWOZ led us to develop a toolset to accelerate the post-editing of a new language dataset after translation. This toolset improves machine translation with a hybrid entity alignment technique that combines neural with dictionary-based methods, along with many automated and semi-automated validation checks.
We establish strong baselines for X-RiSAWOZ by training dialogue agents in the zero- and few-shot settings where limited gold data is available in the target language. Our results suggest that our translation and post-editing methodology and toolset can be used to create new high-quality multilingual dialogue agents cost-effectively. Our dataset, code, and toolkit are released open-source.
Submitted 30 June, 2023;
originally announced June 2023.
-
ReactGenie: A Development Framework for Complex Multimodal Interactions Using Large Language Models
Authors:
Jackie Junrui Yang,
Yingtian Shi,
Yuhan Zhang,
Karina Li,
Daniel Wan Rosli,
Anisha Jain,
Shuning Zhang,
Tianshi Li,
James A. Landay,
Monica S. Lam
Abstract:
By combining voice and touch interactions, multimodal interfaces can surpass the efficiency of either modality alone. Traditional multimodal frameworks require laborious developer work to support rich multimodal commands where the user's multimodal command involves possibly exponential combinations of actions/function invocations. This paper presents ReactGenie, a programming framework that better separates multimodal input from the computational model to enable developers to create efficient and capable multimodal interfaces with ease. ReactGenie translates multimodal user commands into NLPL (Natural Language Programming Language), a programming language we created, using a neural semantic parser based on large-language models. The ReactGenie runtime interprets the parsed NLPL and composes primitives in the computational model to implement complex user commands. As a result, ReactGenie allows easy implementation and unprecedented richness in commands for end-users of multimodal apps. Our evaluation showed that 12 developers can learn and build a nontrivial ReactGenie application in under 2.5 hours on average. In addition, compared with a traditional GUI, end-users can complete tasks faster and with less task load using ReactGenie apps.
Submitted 2 May, 2024; v1 submitted 16 June, 2023;
originally announced June 2023.
-
Revisiting the Reliability of Psychological Scales on Large Language Models
Authors:
Jen-tse Huang,
Wenxuan Wang,
Man Ho Lam,
Eric John Li,
Wenxiang Jiao,
Michael R. Lyu
Abstract:
Recent research has extended beyond assessing the performance of Large Language Models (LLMs) to examining their characteristics from a psychological standpoint, acknowledging the necessity of understanding their behavioral characteristics. The administration of personality tests to LLMs has emerged as a noteworthy area in this context. However, the suitability of employing psychological scales, initially devised for humans, on LLMs is a matter of ongoing debate. Our study aims to determine the reliability of applying personality assessments to LLMs, explicitly investigating whether LLMs demonstrate consistent personality traits. Analyzing responses under 2,500 settings reveals that gpt-3.5-turbo shows consistency in responses to the Big Five Inventory, indicating a high degree of reliability. Furthermore, our research explores the potential of gpt-3.5-turbo to emulate diverse personalities and represent various groups, which is a capability increasingly sought after in social sciences for substituting human participants with LLMs to reduce costs. Our findings reveal that LLMs have the potential to represent different personalities with specific prompt instructions. By shedding light on the personalization of LLMs, our study endeavors to pave the way for future explorations in this field. We have made our experimental results and the corresponding code openly accessible via https://github.com/CUHK-ARISE/LLMPersonality.
Submitted 28 December, 2023; v1 submitted 31 May, 2023;
originally announced May 2023.
-
Diverse and Expressive Speech Prosody Prediction with Denoising Diffusion Probabilistic Model
Authors:
Xiang Li,
Songxiang Liu,
Max W. Y. Lam,
Zhiyong Wu,
Chao Weng,
Helen Meng
Abstract:
Expressive human speech generally abounds with rich and flexible speech prosody variations. The speech prosody predictors in existing expressive speech synthesis methods mostly produce deterministic predictions, which are learned by directly minimizing the norm of prosody prediction error. Their unimodal nature leads to a mismatch with the ground-truth distribution and harms the model's ability to make diverse predictions. Thus, we propose a novel prosody predictor based on the denoising diffusion probabilistic model to take advantage of its high-quality generative modeling and training stability. Experimental results confirm that the proposed prosody predictor outperforms the deterministic baseline on both the expressiveness and diversity of prediction results with even fewer network parameters.
Submitted 7 October, 2023; v1 submitted 26 May, 2023;
originally announced May 2023.
-
Efficient Neural Music Generation
Authors:
Max W. Y. Lam,
Qiao Tian,
Tang Li,
Zongyu Yin,
Siyuan Feng,
Ming Tu,
Yuliang Ji,
Rui Xia,
Mingbo Ma,
Xuchen Song,
Jitong Chen,
Yuping Wang,
Yuxuan Wang
Abstract:
Recent progress in music generation has been remarkably advanced by the state-of-the-art MusicLM, which comprises a hierarchy of three LMs, respectively, for semantic, coarse acoustic, and fine acoustic modeling. Yet, sampling with MusicLM requires processing through these LMs one by one to obtain the fine-grained acoustic tokens, making it computationally expensive and prohibitive for real-time generation. Efficient music generation with quality on par with MusicLM remains a significant challenge. In this paper, we present MeLoDy (M for music; L for LM; D for diffusion), an LM-guided diffusion model that generates music audio of state-of-the-art quality while reducing the forward passes in MusicLM by 95.7% or 99.6%, respectively, for sampling 10s or 30s music. MeLoDy inherits the highest-level LM from MusicLM for semantic modeling, and applies a novel dual-path diffusion (DPD) model and an audio VAE-GAN to efficiently decode the conditioning semantic tokens into waveform. DPD is proposed to simultaneously model the coarse and fine acoustics by incorporating the semantic information into segments of latents effectively via cross-attention at each denoising step. Our experimental results suggest the superiority of MeLoDy, not only in its practical advantages on sampling speed and infinitely continuable generation, but also in its state-of-the-art musicality, audio quality, and text correlation.
Our samples are available at https://Efficient-MeLoDy.github.io/.
Submitted 25 May, 2023;
originally announced May 2023.
-
WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia
Authors:
Sina J. Semnani,
Violet Z. Yao,
Heidi C. Zhang,
Monica S. Lam
Abstract:
This paper presents the first few-shot LLM-based chatbot that almost never hallucinates and has high conversationality and low latency. WikiChat is grounded on the English Wikipedia, the largest curated free-text corpus.
WikiChat generates a response from an LLM, retains only the grounded facts, and combines them with additional information it retrieves from the corpus to form factual and engaging responses. We distill WikiChat based on GPT-4 into a 7B-parameter LLaMA model with minimal loss of quality, to significantly improve its latency, cost and privacy, and facilitate research and deployment.
Using a novel hybrid human-and-LLM evaluation methodology, we show that our best system achieves 97.3% factual accuracy in simulated conversations. It significantly outperforms all retrieval-based and LLM-based baselines, and by 3.9%, 38.6% and 51.0% on head, tail and recent knowledge compared to GPT-4. Compared to previous state-of-the-art retrieval-based chatbots, WikiChat is also significantly more informative and engaging, just like an LLM.
WikiChat achieves 97.9% factual accuracy in conversations with human users about recent topics, 55.0% better than GPT-4, while receiving significantly higher user ratings and more favorable comments.
Submitted 27 October, 2023; v1 submitted 23 May, 2023;
originally announced May 2023.
-
Fine-tuned LLMs Know More, Hallucinate Less with Few-Shot Sequence-to-Sequence Semantic Parsing over Wikidata
Authors:
Silei Xu,
Shicheng Liu,
Theo Culhane,
Elizaveta Pertseva,
Meng-Hsi Wu,
Sina J. Semnani,
Monica S. Lam
Abstract:
While large language models (LLMs) can answer many questions correctly, they can also hallucinate and give wrong answers. Wikidata, with its over 12 billion facts, can be used to ground LLMs to improve their factuality. This paper presents WikiWebQuestions, a high-quality question answering benchmark for Wikidata. Ported over from WebQuestions for Freebase, it consists of real-world data with SPARQL annotation. This paper presents a few-shot sequence-to-sequence semantic parser for Wikidata. We modify SPARQL to use the unique domain and property names instead of their IDs. We train the parser to use either the results from an entity linker or mentions in the query. We fine-tune LLaMA by adding the few-shot training data to that used to fine-tune Alpaca. Our experimental results demonstrate the effectiveness of this methodology, establishing a strong baseline of 76% and 65% answer accuracy in the dev and test sets of WikiWebQuestions, respectively. By pairing our semantic parser with GPT-3, we combine verifiable results with qualified GPT-3 guesses to provide useful answers to 96% of the questions in dev. We also show that our method outperforms the state-of-the-art for the QALD-7 Wikidata dataset by 3.6% in F1 score.
Submitted 5 November, 2023; v1 submitted 23 May, 2023;
originally announced May 2023.
-
Model Sketching: Centering Concepts in Early-Stage Machine Learning Model Design
Authors:
Michelle S. Lam,
Zixian Ma,
Anne Li,
Izequiel Freitas,
Dakuo Wang,
James A. Landay,
Michael S. Bernstein
Abstract:
Machine learning practitioners often end up tunneling on low-level technical details like model architectures and performance metrics. Could early model development instead focus on high-level questions of which factors a model ought to pay attention to? Inspired by the practice of sketching in design, which distills ideas to their minimal representation, we introduce model sketching: a technical framework for iteratively and rapidly authoring functional approximations of a machine learning model's decision-making logic. Model sketching refocuses practitioner attention on composing high-level, human-understandable concepts that the model is expected to reason over (e.g., profanity, racism, or sarcasm in a content moderation task) using zero-shot concept instantiation. In an evaluation with 17 ML practitioners, model sketching reframed thinking from implementation to higher-level exploration, prompted iteration on a broader range of model designs, and helped identify gaps in the problem formulation, all in a fraction of the time ordinarily required to build a model.
Submitted 5 March, 2023;
originally announced March 2023.
-
Design and Mechanics of Cable-Driven Rolling Diaphragm Transmission for High-Transparency Robotic Motion
Authors:
Hoi Man Lam,
W. Jared Walker,
Lucas Jonasch,
Dimitri Schreiber,
Michael C. Yip
Abstract:
Applications of rolling diaphragm transmissions for medical and teleoperated robotics are of great interest, due to the low friction of rolling diaphragms combined with the power density and stiffness of hydraulic transmissions. However, the stiffness-enabling pressure preloads can form a tradeoff against bearing loading in some rolling diaphragm layouts, and transmission setup can be difficult. The use of cable drives complements the rolling diaphragm transmission's advantages, but maintaining cable tension is crucial for optimal and consistent performance. In this paper, a coaxial opposed rolling diaphragm layout with cable drive and an electronic transmission control system are investigated, with a focus on system reliability and scalability. Mechanical features are proposed which enable force balancing, decoupling of transmission pressure from bearing loads, and maintenance of cable tension. Key considerations and procedures for automation of transmission setup, phasing, and operation are also presented. We also present an analysis of system stiffness to identify key compliance contributors, and conduct experiments to validate prototype design performance.
Submitted 24 February, 2023;
originally announced February 2023.
-
Zero and Few-Shot Localization of Task-Oriented Dialogue Agents with a Distilled Representation
Authors:
Mehrad Moradshahi,
Sina J. Semnani,
Monica S. Lam
Abstract:
Task-oriented Dialogue (ToD) agents are mostly limited to a few widely-spoken languages, mainly due to the high cost of acquiring training data for each language. Existing low-cost approaches that rely on cross-lingual embeddings or naive machine translation sacrifice a lot of accuracy for data efficiency, and largely fail in creating a usable dialogue agent. We propose automatic methods that use ToD training data in a source language to build a high-quality functioning dialogue agent in another target language that has no training data (i.e. zero-shot) or a small training set (i.e. few-shot). Unlike most prior work in cross-lingual ToD that only focuses on Dialogue State Tracking (DST), we build an end-to-end agent.
We show that our approach closes the accuracy gap between few-shot and existing full-shot methods for ToD agents. We achieve this by (1) improving the dialogue data representation, (2) improving entity-aware machine translation, and (3) automatic filtering of noisy translations.
We evaluate our approach on the recent bilingual dialogue dataset BiToD. In Chinese to English transfer, in the zero-shot setting, our method achieves 46.7% and 22.0% in Task Success Rate (TSR) and Dialogue Success Rate (DSR) respectively. In the few-shot setting where 10% of the data in the target language is used, we improve the state-of-the-art by 15.2% and 14.0%, coming within 5% of full-shot training.
Submitted 18 February, 2023;
originally announced February 2023.
-
GPU-based Private Information Retrieval for On-Device Machine Learning Inference
Authors:
Maximilian Lam,
Jeff Johnson,
Wenjie Xiong,
Kiwan Maeng,
Udit Gupta,
Yang Li,
Liangzhen Lai,
Ilias Leontiadis,
Minsoo Rhu,
Hsien-Hsin S. Lee,
Vijay Janapa Reddi,
Gu-Yeon Wei,
David Brooks,
G. Edward Suh
Abstract:
On-device machine learning (ML) inference can enable the use of private user data on user devices without revealing them to remote servers. However, a pure on-device solution to private ML inference is impractical for many applications that rely on embedding tables that are too large to be stored on-device. In particular, recommendation models typically use multiple embedding tables each on the order of 1-10 GBs of data, making them impractical to store on-device. To overcome this barrier, we propose the use of private information retrieval (PIR) to efficiently and privately retrieve embeddings from servers without sharing any private information. As off-the-shelf PIR algorithms are usually too computationally intensive to directly use for latency-sensitive inference tasks, we 1) propose novel GPU-based acceleration of PIR, and 2) co-design PIR with the downstream ML application to obtain further speedup. Our GPU acceleration strategy improves system throughput by more than $20 \times$ over an optimized CPU PIR implementation, and our PIR-ML co-design provides an over $5 \times$ additional throughput improvement at fixed model quality. Together, for various on-device ML applications such as recommendation and language modeling, our system on a single V100 GPU can serve up to $100,000$ queries per second -- a $>100 \times$ throughput improvement over a CPU-based baseline -- while maintaining model accuracy.
Submitted 25 September, 2023; v1 submitted 25 January, 2023;
originally announced January 2023.
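To make the idea of privately retrieving an embedding concrete, here is a minimal sketch of a textbook two-server XOR-based PIR scheme. This is an assumption chosen purely for illustration: the abstract does not say which PIR protocol the system uses, and the names `make_queries`, `server_answer`, and `reconstruct` are hypothetical. The two servers are assumed not to collude, and all records are assumed to have equal length.

```python
import secrets

def make_queries(n, index):
    """Client: build two bit vectors that differ only at the wanted index."""
    q1 = [secrets.randbelow(2) for _ in range(n)]  # uniformly random selection bits
    q2 = q1.copy()
    q2[index] ^= 1  # flip one bit; each vector on its own looks uniformly random
    return q1, q2

def server_answer(db, query):
    """Server: XOR together every record whose query bit is 1."""
    acc = bytes(len(db[0]))  # all-zero accumulator, one record wide
    for record, bit in zip(db, query):
        if bit:
            acc = bytes(a ^ b for a, b in zip(acc, record))
    return acc

def reconstruct(answer1, answer2):
    """Client: the two answers differ by exactly the wanted record."""
    return bytes(x ^ y for x, y in zip(answer1, answer2))

# Toy "embedding table": 8 rows of 4 bytes each (hypothetical data).
db = [bytes([i] * 4) for i in range(8)]
q1, q2 = make_queries(len(db), 5)
row = reconstruct(server_answer(db, q1), server_answer(db, q2))
```

Each server scans the whole table for every query, which gives some intuition for why off-the-shelf PIR can be too slow for latency-sensitive inference without acceleration.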
-
FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis
Authors:
Rongjie Huang,
Max W. Y. Lam,
Jun Wang,
Dan Su,
Dong Yu,
Yi Ren,
Zhou Zhao
Abstract:
Denoising diffusion probabilistic models (DDPMs) have recently achieved leading performances in many generative tasks. However, the costs of the inherited iterative sampling process have hindered their application to speech synthesis. This paper proposes FastDiff, a fast conditional diffusion model for high-quality speech synthesis. FastDiff employs a stack of time-aware location-variable convolutions of diverse receptive field patterns to efficiently model long-term time dependencies with adaptive conditions. A noise schedule predictor is also adopted to reduce the sampling steps without sacrificing the generation quality. Based on FastDiff, we design an end-to-end text-to-speech synthesizer, FastDiff-TTS, which generates high-fidelity speech waveforms without any intermediate feature (e.g., Mel-spectrogram). Our evaluation of FastDiff demonstrates state-of-the-art results with higher-quality (MOS 4.28) speech samples. Also, FastDiff enables a sampling speed 58x faster than real-time on a V100 GPU, making diffusion models practically applicable to speech synthesis deployment for the first time. We further show that FastDiff generalized well to the mel-spectrogram inversion of unseen speakers, and FastDiff-TTS outperformed other competing methods in end-to-end text-to-speech synthesis. Audio samples are available at https://FastDiff.github.io/.
Submitted 21 April, 2022;
originally announced April 2022.
-
Learning to Deblur using Light Field Generated and Real Defocus Images
Authors:
Lingyan Ruan,
Bin Chen,
Jizhou Li,
Miuling Lam
Abstract:
Defocus deblurring is a challenging task due to the spatially varying nature of defocus blur. While deep learning approaches show great promise in solving image restoration problems, defocus deblurring demands accurate training data that consists of all-in-focus and defocus image pairs, which is difficult to collect. Naive two-shot capturing cannot achieve pixel-wise correspondence between the defocused and all-in-focus image pairs. Synthetic aperture of light fields is suggested to be a more reliable way to generate accurate image pairs. However, the defocus blur generated from light field data is different from that of the images captured with a traditional digital camera. In this paper, we propose a novel deep defocus deblurring network that leverages the strength and overcomes the shortcoming of light fields. We first train the network on a light field-generated dataset for its highly accurate image correspondence. Then, we fine-tune the network using feature loss on another dataset collected by the two-shot method to alleviate the differences between the defocus blur in the two domains. This strategy proves to be highly effective and achieves state-of-the-art performance both quantitatively and qualitatively on multiple test sets. Extensive ablation studies have been conducted to analyze the effect of each network module on the final performance.
Submitted 1 April, 2022;
originally announced April 2022.
-
BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis
Authors:
Max W. Y. Lam,
Jun Wang,
Dan Su,
Dong Yu
Abstract:
Diffusion probabilistic models (DPMs) and their extensions have emerged as competitive generative models yet confront challenges of efficient sampling. We propose a new bilateral denoising diffusion model (BDDM) that parameterizes both the forward and reverse processes with a schedule network and a score network, which can train with a novel bilateral modeling objective. We show that the new surrogate objective can achieve a lower bound of the log marginal likelihood tighter than a conventional surrogate. We also find that BDDM allows inheriting pre-trained score network parameters from any DPMs and consequently enables speedy and stable learning of the schedule network and optimization of a noise schedule for sampling. Our experiments demonstrate that BDDMs can generate high-fidelity audio samples with as few as three sampling steps. Moreover, compared to other state-of-the-art diffusion-based neural vocoders, BDDMs produce comparable or higher quality samples indistinguishable from human speech, notably with only seven sampling steps (143x faster than WaveGrad and 28.6x faster than DiffWave). We release our code at https://github.com/tencent-ailab/bddm.
Submitted 25 March, 2022;
originally announced March 2022.
-
ThingTalk: An Extensible, Executable Representation Language for Task-Oriented Dialogues
Authors:
Monica S. Lam,
Giovanni Campagna,
Mehrad Moradshahi,
Sina J. Semnani,
Silei Xu
Abstract:
Task-oriented conversational agents rely on semantic parsers to translate natural language to formal representations. In this paper, we propose the design and rationale of the ThingTalk formal representation, and how the design improves the development of transactional task-oriented agents.
ThingTalk is built on four core principles: (1) representing user requests directly as executable statements, covering all the functionality of the agent, (2) representing dialogues formally and succinctly to support accurate contextual semantic parsing, (3) standardizing types and interfaces to maximize reuse between agents, and (4) allowing multiple, independently-developed agents to be composed in a single virtual assistant. ThingTalk is developed as part of the Genie Framework that allows developers to quickly build transactional agents given a database and APIs.
We compare ThingTalk to existing representations: SMCalFlow, SGD, TreeDST. Compared to the others, the ThingTalk design is both more general and more cost-effective. Evaluated on the MultiWOZ benchmark, using ThingTalk and associated tools yields a new state-of-the-art accuracy of 79% turn-by-turn.
Submitted 23 March, 2022;
originally announced March 2022.
-
Tabula: Efficiently Computing Nonlinear Activation Functions for Secure Neural Network Inference
Authors:
Maximilian Lam,
Michael Mitzenmacher,
Vijay Janapa Reddi,
Gu-Yeon Wei,
David Brooks
Abstract:
Multiparty computation approaches to secure neural network inference commonly rely on garbled circuits for securely executing nonlinear activation functions. However, garbled circuits require excessive communication between server and client, impose significant storage overheads, and incur large runtime penalties. To reduce these costs, we propose an alternative to garbled circuits: Tabula, an algorithm based on secure lookup tables. Our approach precomputes lookup tables during an offline phase that contain the result of all possible nonlinear function calls. Because these tables incur exponential storage costs in the number of operands and the precision of the input values, we use quantization to reduce these storage costs to make this approach practical. This enables an online phase where securely computing the result of a nonlinear function requires just a single round of communication, with communication cost equal to twice the number of bits of the input to the nonlinear function. In practice our approach costs 2 bytes of communication per nonlinear function call in the online phase. Compared to garbled circuits with 8-bit quantized inputs, when computing individual nonlinear functions during the online phase, experiments show Tabula with 8-bit activations uses between $280$-$560 \times$ less communication, is over $100\times$ faster, and uses a comparable (within a factor of 2) amount of storage; compared against other state-of-the-art protocols Tabula achieves greater than $40\times$ communication reduction. This leads to significant performance gains over garbled circuits with quantized inputs during the online phase of secure inference of neural networks: Tabula reduces end-to-end inference communication by up to $9 \times$ and achieves an end-to-end inference speedup of up to $50 \times$, while imposing comparable storage and offline preprocessing costs.
Submitted 16 June, 2024; v1 submitted 5 March, 2022;
originally announced March 2022.
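The storage and communication figures quoted in the Tabula abstract can be sanity-checked with a small sketch. The code below builds a plain, non-secure lookup table for a quantized activation and computes the online cost stated in the abstract (twice the input bit width). The choice of ReLU, the two's-complement indexing, and the function names are assumptions made for illustration; no secret sharing or other cryptographic protection is implemented here.

```python
def build_relu_table(bits: int) -> list[int]:
    """Precompute ReLU for every signed `bits`-bit input, indexed by its
    two's-complement bit pattern. The table has 2**bits entries, so storage
    grows exponentially with input precision."""
    size = 1 << bits
    half = size >> 1
    table = []
    for idx in range(size):
        value = idx - size if idx >= half else idx  # decode two's complement
        table.append(max(value, 0))
    return table

def online_comm_bits(bits: int) -> int:
    """Online communication per call as stated in the abstract:
    twice the number of input bits."""
    return 2 * bits

table = build_relu_table(8)              # 8-bit quantized activations
entries = len(table)                     # 256 precomputed results
comm_bytes = online_comm_bits(8) // 8    # 16 bits, i.e. 2 bytes per call
```

With 8-bit inputs the table holds 256 entries and the per-call cost works out to 2 bytes, which matches the figure in the abstract; at 16 bits the same table would already need 65,536 entries, which illustrates why quantization is used to keep storage practical.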
-
Jury Learning: Integrating Dissenting Voices into Machine Learning Models
Authors:
Mitchell L. Gordon,
Michelle S. Lam,
Joon Sung Park,
Kayur Patel,
Jeffrey T. Hancock,
Tatsunori Hashimoto,
Michael S. Bernstein
Abstract:
Whose labels should a machine learning (ML) algorithm learn to emulate? For ML tasks ranging from online comment toxicity to misinformation detection to medical diagnosis, different groups in society may have irreconcilable disagreements about ground truth labels. Supervised ML today resolves these label disagreements implicitly using majority vote, which overrides minority groups' labels. We introduce jury learning, a supervised ML approach that resolves these disagreements explicitly through the metaphor of a jury: defining which people or groups, in what proportion, determine the classifier's prediction. For example, a jury learning model for online toxicity might centrally feature women and Black jurors, who are commonly targets of online harassment. To enable jury learning, we contribute a deep learning architecture that models every annotator in a dataset, samples from annotators' models to populate the jury, then runs inference to classify. Our architecture enables juries that dynamically adapt their composition, explore counterfactuals, and visualize dissent.
Submitted 7 February, 2022;
originally announced February 2022.
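The jury metaphor in the abstract above (choose which groups sit on the jury and in what proportion, sample annotators to fill the seats, predict each juror's label, then aggregate) can be sketched in a few lines. Everything below is a hypothetical toy: `predict_rating` is a keyword-matching stand-in for a learned per-annotator model, the annotator pool is made up, and the median is one assumed way to aggregate juror predictions; none of it is taken from the paper's architecture.

```python
import random
from statistics import median

def sample_jury(annotators, composition, rng):
    """Draw jurors so that each group fills its requested number of seats."""
    jury = []
    for group, seats in composition.items():
        pool = [a for a in annotators if a["group"] == group]
        jury.extend(rng.sample(pool, seats))
    return jury

def predict_rating(annotator, comment):
    """Hypothetical stand-in for a per-annotator model: a 0-4 toxicity rating."""
    hits = sum(word in comment.lower() for word in annotator["sensitive_words"])
    return min(4, annotator["baseline"] + hits)

def jury_verdict(annotators, composition, comment, rng):
    """Sample a jury with the given composition and return its median rating."""
    jury = sample_jury(annotators, composition, rng)
    return median(predict_rating(juror, comment) for juror in jury)

# Made-up annotator pool: group "A" rates this kind of comment more harshly.
annotators = (
    [{"group": "A", "baseline": 1, "sensitive_words": ["stupid"]} for _ in range(5)]
    + [{"group": "B", "baseline": 0, "sensitive_words": []} for _ in range(5)]
)
comment = "That was a stupid idea"
rng = random.Random(0)
mostly_a = jury_verdict(annotators, {"A": 2, "B": 1}, comment, rng)  # -> 2
mostly_b = jury_verdict(annotators, {"A": 1, "B": 2}, comment, rng)  # -> 0
```

The same comment receives a different verdict depending on who is seated, which is the point of making the jury composition an explicit choice rather than an implicit majority vote.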
-
The People's Speech: A Large-Scale Diverse English Speech Recognition Dataset for Commercial Usage
Authors:
Daniel Galvez,
Greg Diamos,
Juan Ciro,
Juan Felipe Cerón,
Keith Achorn,
Anjali Gopi,
David Kanter,
Maximilian Lam,
Mark Mazumder,
Vijay Janapa Reddi
Abstract:
The People's Speech is a free-to-download 30,000-hour and growing supervised conversational English speech recognition dataset licensed for academic and commercial usage under CC-BY-SA (with a CC-BY subset). The data is collected via searching the Internet for appropriately licensed audio data with existing transcriptions. We describe our data collection methodology and release our data collection system under the Apache 2.0 license. We show that a model trained on this dataset achieves a 9.98% word error rate on Librispeech's test-clean test set. Finally, we discuss the legal and ethical issues surrounding the creation of a sizable machine learning corpus and plans for continued maintenance of the project under MLCommons's sponsorship.
Submitted 17 November, 2021;
originally announced November 2021.
-
Contextual Semantic Parsing for Multilingual Task-Oriented Dialogues
Authors:
Mehrad Moradshahi,
Victoria Tsai,
Giovanni Campagna,
Monica S. Lam
Abstract:
Robust state tracking for task-oriented dialogue systems currently remains restricted to a few popular languages. This paper shows that given a large-scale dialogue data set in one language, we can automatically produce an effective semantic parser for other languages using machine translation. We propose automatic translation of dialogue datasets with alignment to ensure faithful translation of slot values and eliminate costly human supervision used in previous benchmarks. We also propose a new contextual semantic parsing model, which encodes the formal slots and values, and only the last agent and user utterances. We show that the succinct representation reduces the compounding effect of translation errors, without harming the accuracy in practice.
We evaluate our approach on several dialogue state tracking benchmarks. On RiSAWOZ, CrossWOZ, CrossWOZ-EN, and MultiWOZ-ZH datasets we improve the state of the art by 11%, 17%, 20%, and 0.3% in joint goal accuracy. We present a comprehensive error analysis for all three datasets showing erroneous annotations can lead to misguided judgments on the quality of the model.
Finally, we present RiSAWOZ English and German datasets, created using our translation methodology. On these datasets, accuracy is within 11% of the original showing that high-accuracy multilingual dialogue datasets are possible without relying on expensive human annotations. We release our datasets and software open source.
Submitted 18 February, 2023; v1 submitted 3 November, 2021;
originally announced November 2021.
-
Bilateral Denoising Diffusion Models
Authors:
Max W. Y. Lam,
Jun Wang,
Rongjie Huang,
Dan Su,
Dong Yu
Abstract:
Denoising diffusion probabilistic models (DDPMs) have emerged as competitive generative models yet brought challenges to efficient sampling. In this paper, we propose novel bilateral denoising diffusion models (BDDMs), which take significantly fewer steps to generate high-quality samples. From a bilateral modeling objective, BDDMs parameterize the forward and reverse processes with a score network and a scheduling network, respectively. We show that a new lower bound tighter than the standard evidence lower bound can be derived as a surrogate objective for training the two networks. In particular, BDDMs are efficient, simple-to-train, and capable of further improving any pre-trained DDPM by optimizing the inference noise schedules. Our experiments demonstrated that BDDMs can generate high-fidelity samples with as few as 3 sampling steps and produce comparable or even higher quality samples than DDPMs using 1000 steps with only 16 sampling steps (a 62x speedup).
Submitted 14 September, 2021; v1 submitted 26 August, 2021;
originally announced August 2021.
-
ARCSnake: Reconfigurable Snake-Like Robot with Archimedean Screw Propulsion for Multi-Domain Mobility
Authors:
Florian Richter,
Peter V. Gavrilov,
Hoi Man Lam,
Amir Degani,
Michael C. Yip
Abstract:
Exploring and navigating in extreme environments, such as caves, oceans, and planetary bodies, are often too hazardous for humans, and as such, robots are possible surrogates. These robots are met with significant locomotion challenges that require traversing a wide range of surface roughnesses and topologies. Previous locomotion strategies, involving wheels or ambulatory motion, such as snake platforms, succeed on specific surfaces but fail on others, which could be detrimental in exploration and navigation missions. In this paper, we present a novel approach that combines snake-like robots with an Archimedean screw locomotion mechanism to provide multiple, effective mobility strategies in a large range of environments, including those that are difficult to traverse for wheeled and ambulatory robots. This work develops a robotic system called ARCSnake to demonstrate this locomotion principle and tests it in a variety of different terrains and environments in order to prove its controllable, multi-domain navigation capabilities. These tests show a wide breadth of scenarios that ARCSnake can handle, hence demonstrating its ability to traverse extreme terrains.
Submitted 30 July, 2021;
originally announced July 2021.
-
Gradient Disaggregation: Breaking Privacy in Federated Learning by Reconstructing the User Participant Matrix
Authors:
Maximilian Lam,
Gu-Yeon Wei,
David Brooks,
Vijay Janapa Reddi,
Michael Mitzenmacher
Abstract:
We show that aggregated model updates in federated learning may be insecure. An untrusted central server may disaggregate user updates from sums of updates across participants given repeated observations, enabling the server to recover privileged information about individual users' private training data via traditional gradient inference attacks. Our method revolves around reconstructing participant information (e.g., which rounds of training users participated in) from aggregated model updates by leveraging summary information from device analytics commonly used to monitor, debug, and manage federated learning systems. Our attack is parallelizable and we successfully disaggregate user updates on settings with up to thousands of participants. We quantitatively and qualitatively demonstrate significant improvements in the capability of various inference attacks on the disaggregated updates. Our attack enables the attribution of learned properties to individual users, violating anonymity, and shows that a determined central server may undermine the secure aggregation protocol to break individual users' data privacy in federated learning.
Submitted 10 June, 2021;
originally announced June 2021.
-
Raw Waveform Encoder with Multi-Scale Globally Attentive Locally Recurrent Networks for End-to-End Speech Recognition
Authors:
Max W. Y. Lam,
Jun Wang,
Chao Weng,
Dan Su,
Dong Yu
Abstract:
End-to-end speech recognition generally uses hand-engineered acoustic features as input and excludes the feature extraction module from its joint optimization. To extract learnable and adaptive features and mitigate information loss, we propose a new encoder that adopts globally attentive locally recurrent (GALR) networks and directly takes raw waveform as input. We observe improved ASR performance and robustness by applying GALR on different window lengths to aggregate fine-grain temporal information into multi-scale acoustic features. Experiments are conducted on a benchmark dataset AISHELL-2 and two large-scale Mandarin speech corpora of 5,000 hours and 21,000 hours. With faster speed and comparable model size, our proposed multi-scale GALR waveform encoder achieved consistent character error rate reductions (CERRs) from 7.9% to 28.1% relative over strong baselines, including Conformer and TDNN-Conformer. In particular, our approach demonstrated notably greater robustness than the traditional handcrafted features and outperformed the baseline MFCC-based TDNN-Conformer model by a 15.2% CERR on a music-mixed real-world speech test set.
Submitted 8 June, 2021;
originally announced June 2021.
-
Widening Access to Applied Machine Learning with TinyML
Authors:
Vijay Janapa Reddi,
Brian Plancher,
Susan Kennedy,
Laurence Moroney,
Pete Warden,
Anant Agarwal,
Colby Banbury,
Massimo Banzi,
Matthew Bennett,
Benjamin Brown,
Sharad Chitlangia,
Radhika Ghosal,
Sarah Grafman,
Rupert Jaeger,
Srivatsan Krishnan,
Maximilian Lam,
Daniel Leiker,
Cara Mann,
Mark Mazumder,
Dominic Pajak,
Dhilan Ramaprasad,
J. Evan Smith,
Matthew Stewart,
Dustin Tingley
Abstract:
Broadening access to both computational and educational resources is critical to diffusing machine-learning (ML) innovation. However, today, most ML resources and experts are siloed in a few countries and organizations. In this paper, we describe our pedagogical approach to increasing access to applied ML through a massive open online course (MOOC) on Tiny Machine Learning (TinyML). We suggest that TinyML, ML on resource-constrained embedded devices, is an attractive means to widen access because TinyML both leverages low-cost and globally accessible hardware, and encourages the development of complete, self-contained applications, from data collection to deployment. To this end, a collaboration between academia (Harvard University) and industry (Google) produced a four-part MOOC that provides application-oriented instruction on how to develop solutions using TinyML. The series is openly available on the edX MOOC platform, has no prerequisites beyond basic programming, and is designed for learners from a global variety of backgrounds. It introduces pupils to real-world applications, ML algorithms, data-set engineering, and the ethical considerations of these technologies via hands-on programming and deployment of TinyML applications in both the cloud and their own microcontrollers. To facilitate continued learning, community building, and collaboration beyond the courses, we launched a standalone website, a forum, a chat, and an optional course-project competition. We also released the course materials publicly, hoping they will inspire the next generation of ML practitioners and educators and further broaden access to cutting-edge ML technologies.
Submitted 9 June, 2021; v1 submitted 7 June, 2021;
originally announced June 2021.
-
Grounding Open-Domain Instructions to Automate Web Support Tasks
Authors:
Nancy Xu,
Sam Masling,
Michael Du,
Giovanni Campagna,
Larry Heck,
James Landay,
Monica S Lam
Abstract:
Grounding natural language instructions on the web to perform previously unseen tasks enables accessibility and automation. We introduce a task and dataset to train AI agents from open-domain, step-by-step instructions originally written for people. We build RUSS (Rapid Universal Support Service) to tackle this problem. RUSS consists of two models: First, a BERT-LSTM with pointers parses instructions to ThingTalk, a domain-specific language we design for grounding natural language on the web. Then, a grounding model retrieves the unique IDs of any webpage elements requested in ThingTalk. RUSS may interact with the user through a dialogue (e.g. ask for an address) or execute a web operation (e.g. click a button) inside the web runtime. To augment training, we synthesize natural language instructions mapped to ThingTalk. Our dataset consists of 80 different customer service problems from help websites, with a total of 741 step-by-step instructions and their corresponding actions. RUSS achieves 76.7% end-to-end accuracy predicting agent actions from single instructions. It outperforms state-of-the-art models that directly map instructions to actions without ThingTalk. Our user study shows that RUSS is preferred by actual users over web navigation.
Submitted 4 April, 2021; v1 submitted 30 March, 2021;
originally announced March 2021.