Search | arXiv e-print repository

Discrete Multimodal Transformers with a Pretrained Large Language Model for Mixed-Supervision Speech Processing

Authors: Viet Anh Trinh, Rosy Southwell, Yiwen Guan, Xinlu He, Zhiyong Wang, Jacob Whitehill

Abstract: Recent work on discrete speech tokenization has paved the way for models that can seamlessly perform multiple tasks across modalities, e.g., speech recognition, text to speech, speech to speech translation. Moreover, large language models (LLMs) pretrained from vast text corpora contain rich linguistic information that can improve accuracy in a variety of tasks. In this paper, we present a decoder-only Discrete Multimodal Language Model (DMLM), which can be flexibly applied to multiple tasks (ASR, T2S, S2TT, etc.) and modalities (text, speech, vision). We explore several critical aspects of discrete multi-modal models, including the loss function, weight initialization, mixed training supervision, and codebook. Our results show that DMLM benefits significantly, across multiple tasks and datasets, from a combination of supervised and unsupervised training. Moreover, for ASR, it benefits from initializing DMLM from a pretrained LLM, and from a codebook derived from Whisper activations.

Submitted 25 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

arXiv:2310.01132 [pdf, other]

Automated Evaluation of Classroom Instructional Support with LLMs and BoWs: Connecting Global Predictions to Specific Feedback

Authors: Jacob Whitehill, Jennifer LoCasale-Crouch

Abstract: With the aim of providing teachers with more specific, frequent, and actionable feedback about their teaching, we explore how Large Language Models (LLMs) can be used to estimate "Instructional Support" domain scores of the CLassroom Assessment Scoring System (CLASS), a widely used observation protocol. We design a machine learning architecture that uses zero-shot prompting of Meta's Llama2 and/or a classic Bag of Words (BoW) model to classify individual utterances of teachers' speech (transcribed automatically using OpenAI's Whisper) for the presence of Instructional Support. Then, these utterance-level judgments are aggregated over a 15-min observation session to estimate a global CLASS score. Experiments on two CLASS-coded datasets of toddler and pre-kindergarten classrooms indicate that (1) automatic CLASS Instructional Support estimation accuracy using the proposed method (Pearson $R$ up to $0.48$) approaches human inter-rater reliability (up to $R=0.55$); (2) LLMs generally yield slightly greater accuracy than BoW for this task, though the best models often combined features extracted from both LLM and BoW; and (3) for classifying individual utterances, there is still room for improvement of automated methods compared to human-level judgments. Finally, (4) we illustrate how the model's outputs can be visualized at the utterance level to provide teachers with explainable feedback on which utterances were most positively or negatively correlated with specific CLASS dimensions.

Submitted 16 April, 2024; v1 submitted 2 October, 2023; originally announced October 2023.

arXiv:2109.04160 [pdf, other]

doi 10.1016/j.patcog.2023.109829

Compositional Clustering: Applications to Multi-Label Object Recognition and Speaker Identification

Authors: Zeqian Li, Xinlu He, Jacob Whitehill

Abstract: We consider a novel clustering task in which clusters can have compositional relationships, e.g., one cluster contains images of rectangles, one contains images of circles, and a third (compositional) cluster contains images with both objects. In contrast to hierarchical clustering in which a parent cluster represents the intersection of properties of the child clusters, our problem is about finding compositional clusters that represent the union of the properties of the constituent clusters. This task is motivated by recently developed few-shot learning and embedding models that can distinguish the label sets, not just the individual labels, assigned to the examples. We propose three new algorithms -- Compositional Affinity Propagation (CAP), Compositional k-means (CKM), and Greedy Compositional Reassignment (GCR) -- that can partition examples into coherent groups and infer the compositional structure among them. We show promising results, compared to popular algorithms such as Gaussian mixtures, Fuzzy c-means, and Agglomerative Clustering, on the OmniGlot and LibriSpeech datasets. Our work has applications to open-world multi-label object recognition and speaker identification & diarization with simultaneous speech from multiple speakers.

Submitted 21 July, 2023; v1 submitted 9 September, 2021; originally announced September 2021.

arXiv:2103.03862 [pdf, other]

Harnessing Geometric Constraints from Emotion Labels to improve Face Verification

Authors: Anand Ramakrishnan, Minh Pham, Jacob Whitehill

Abstract: For the task of face verification, we explore the utility of harnessing auxiliary facial emotion labels to impose explicit geometric constraints on the embedding space when training deep embedding models. We introduce several novel loss functions that, in conjunction with a standard Triplet Loss [43], or ArcFace loss [10], provide geometric constraints on the embedding space; the labels for our loss functions can be provided using either manually annotated or automatically detected auxiliary emotion labels. Our method is implemented purely in terms of the loss function and does not require any changes to the neural network backbone of the embedding function.

Submitted 22 July, 2021; v1 submitted 5 March, 2021; originally announced March 2021.

Comments: 8 pages, 3 figures, 2 tables

arXiv:2010.11803 [pdf, other]

Compositional embedding models for speaker identification and diarization with simultaneous speech from 2+ speakers

Authors: Zeqian Li, Jacob Whitehill

Abstract: We propose a new method for speaker diarization that can handle overlapping speech with 2+ people. Our method is based on compositional embeddings [1]: Like standard speaker embedding methods such as x-vector [2], compositional embedding models contain a function f that separates speech from different speakers. In addition, they include a composition function g to compute set-union operations in the embedding space so as to infer the set of speakers within the input audio. In an experiment on multi-person speaker identification using synthesized LibriSpeech data, the proposed method outperforms traditional embedding methods that are only trained to separate single speakers (not speaker sets). In a speaker diarization experiment on the AMI Headset Mix corpus, we achieve state-of-the-art accuracy (DER=22.93%), slightly better than the previous best result (23.82% from [3]).

Submitted 10 February, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

arXiv:2005.09525

doi 10.1109/TAFFC.2021.3059209

Toward Automated Classroom Observation: Multimodal Machine Learning to Estimate CLASS Positive Climate and Negative Climate

Authors: Anand Ramakrishnan, Brian Zylich, Erin Ottmar, Jennifer LoCasale-Crouch, Jacob Whitehill

Abstract: In this work we present a multi-modal machine learning-based system, which we call ACORN, to analyze videos of school classrooms for the Positive Climate (PC) and Negative Climate (NC) dimensions of the CLASS observation protocol that is widely used in educational research. ACORN uses convolutional neural networks to analyze spectral audio features, the faces of teachers and students, and the pixels of each image frame, and then integrates this information over time using Temporal Convolutional Networks. The audiovisual ACORN's PC and NC predictions have Pearson correlations of $0.55$ and $0.63$ with ground-truth scores provided by expert CLASS coders on the UVA Toddler dataset (cross-validation on $n=300$ 15-min video segments), and a purely auditory ACORN predicts PC and NC with correlations of $0.36$ and $0.41$ on the MET dataset (test set of $n=2000$ video segments). These numbers are similar to inter-coder reliability of human coders. Finally, using Graph Convolutional Networks we make early strides (AUC=$0.70$) toward predicting the specific moments (45-90 sec clips) when the PC is particularly weak/strong. Our findings inform the design of automatic classroom observation and also more general video activity recognition and summary recognition systems.

Submitted 23 July, 2021; v1 submitted 19 May, 2020; originally announced May 2020.

Comments: The authors discovered that the results are not reproducible

Journal ref: IEEE Transactions on Affective Computing, 2021

arXiv:2002.05242 [pdf, other]

Leveraging Affect Transfer Learning for Behavior Prediction in an Intelligent Tutoring System

Authors: Nataniel Ruiz, Hao Yu, Danielle A. Allessio, Mona Jalal, Ajjen Joshi, Thomas Murray, John J. Magee, Jacob R. Whitehill, Vitaly Ablavsky, Ivon Arroyo, Beverly P. Woolf, Stan Sclaroff, Margrit Betke

Abstract: In this work, we propose a video-based transfer learning approach for predicting problem outcomes of students working with an intelligent tutoring system (ITS). By analyzing a student's face and gestures, our method predicts the outcome of a student answering a problem in an ITS from a video feed. Our work is motivated by the reasoning that the ability to predict such outcomes enables tutoring systems to adjust interventions, such as hints and encouragement, and to ultimately yield improved student learning. We collected a large labeled dataset of student interactions with an intelligent online math tutor consisting of 68 sessions, where 54 individual students solved 2,749 problems. The dataset is public and available at https://www.cs.bu.edu/faculty/betke/research/learning/ . Working with this dataset, our transfer-learning challenge was to design a representation in the source domain of pictures obtained "in the wild" for the task of facial expression analysis, and to transfer this learned representation to the task of human behavior prediction in the domain of webcam videos of students in a classroom environment. We developed a novel facial affect representation and a user-personalized training scheme that unlocks the potential of this representation. We designed several variants of a recurrent neural network that models the temporal structure of video sequences of students solving math problems. Our final model, named ATL-BP for Affect Transfer Learning for Behavior Prediction, achieves a relative increase in mean F-score of 50% over the state-of-the-art method on this new dataset.

Submitted 8 April, 2022; v1 submitted 12 February, 2020; originally announced February 2020.

Comments: Published at IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2021 - Best Poster Award (4% award rate)

arXiv:2002.04193 [pdf, other]

Compositional Embeddings for Multi-Label One-Shot Learning

Authors: Zeqian Li, Michael C. Mozer, Jacob Whitehill

Abstract: We present a compositional embedding framework that infers not just a single class per input image, but a set of classes, in the setting of one-shot learning. Specifically, we propose and evaluate several novel models consisting of (1) an embedding function f trained jointly with a "composition" function g that computes set union operations between the classes encoded in two embedding vectors; and (2) embedding f trained jointly with a "query" function h that computes whether the classes encoded in one embedding subsume the classes encoded in another embedding. In contrast to prior work, these models must both perceive the classes associated with the input examples and encode the relationships between different class label sets, and they are trained using only weak one-shot supervision consisting of the label-set relationships among training examples. Experiments on the OmniGlot, Open Images, and COCO datasets show that the proposed compositional embedding models outperform existing embedding methods. Our compositional embedding models have applications to multi-label object recognition for both one-shot and supervised learning.

Submitted 13 November, 2020; v1 submitted 10 February, 2020; originally announced February 2020.

arXiv:1812.08255 [pdf, other]

Automatic Classifiers as Scientific Instruments: One Step Further Away from Ground-Truth

Authors: Jacob Whitehill, Anand Ramakrishnan

Abstract: Automatic machine learning-based detectors of various psychological and social phenomena (e.g., emotion, stress, engagement) have great potential to advance basic science. However, when a detector $d$ is trained to approximate an existing measurement tool (e.g., a questionnaire, observation protocol), then care must be taken when interpreting measurements collected using $d$ since they are one step further removed from the underlying construct. We examine how the accuracy of $d$, as quantified by the correlation $q$ of $d$'s outputs with the ground-truth construct $U$, impacts the estimated correlation between $U$ (e.g., stress) and some other phenomenon $V$ (e.g., academic performance). In particular: (1) We show that if the true correlation between $U$ and $V$ is $r$, then the expected sample correlation, over all vectors $\mathcal{T}^n$ whose correlation with $U$ is $q$, is $qr$. (2) We derive a formula for the probability that the sample correlation (over $n$ subjects) using $d$ is positive given that the true correlation is negative (and vice-versa); this probability can be substantial (around $20-30\%$) for values of $n$ and $q$ that have been used in recent affective computing studies. (3) With the goal to reduce the variance of correlations estimated by an automatic detector, we show that training multiple neural networks $d^{(1)},\ldots,d^{(m)}$ using different training architectures and hyperparameters for the same detection task provides only limited "coverage" of $\mathcal{T}^n$.
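
The attenuation in claim (1) can be illustrated with a small simulation. This is only a sketch under assumptions that are not taken from the paper: $U$, $V$, and the detector output are modeled as standard Gaussians, and the detector's error is assumed independent of $V$. The function names `pearson` and `simulate` and the values of `n`, `r`, and `q` are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(n, r, q, seed=0):
    """Correlation between detector output D and V when corr(U, V) = r
    and corr(D, U) = q (Gaussian toy model, assumed for illustration)."""
    rng = random.Random(seed)
    u = [rng.gauss(0, 1) for _ in range(n)]
    # V shares a fraction r of its variation with the construct U.
    v = [r * a + math.sqrt(1 - r * r) * rng.gauss(0, 1) for a in u]
    # D is a noisy proxy for U whose noise is independent of V.
    d = [q * a + math.sqrt(1 - q * q) * rng.gauss(0, 1) for a in u]
    return pearson(d, v)


estimate = simulate(50_000, r=-0.4, q=0.6)
print(f"estimated corr(D, V) = {estimate:.3f}, q*r = {0.6 * -0.4:.3f}")
```

In this toy model the measured correlation shrinks toward zero by roughly the factor $q$, which is why a weak detector applied to a small sample can plausibly report the wrong sign.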

Submitted 4 May, 2019; v1 submitted 19 December, 2018; originally announced December 2018.

arXiv:1709.02418 [pdf, other]

How Does Knowledge of the AUC Constrain the Set of Possible Ground-truth Labelings?

Authors: Jacob Whitehill

Abstract: Recent work on privacy-preserving machine learning has considered how data-mining competitions such as Kaggle could potentially be "hacked", either intentionally or inadvertently, by using information from an oracle that reports a classifier's accuracy on the test set. For binary classification tasks in particular, one of the most common accuracy metrics is the Area Under the ROC Curve (AUC), and in this paper we explore the mathematical structure of how the AUC is computed from an n-vector of real-valued "guesses" with respect to the ground-truth labels. We show how knowledge of a classifier's AUC on the test set can constrain the set of possible ground-truth labelings, and we derive an algorithm both to compute the exact number of such labelings and to enumerate efficiently over them. Finally, we provide empirical evidence that, surprisingly, the number of compatible labelings can actually decrease as n grows, until a test set-dependent threshold is reached.
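
To make concrete how a reported AUC constrains the labels, the sketch below computes the AUC as the fraction of correctly ordered positive/negative pairs and then brute-forces every labeling consistent with a reported value. The exhaustive search is an illustrative stand-in, not the efficient counting and enumeration algorithm the abstract refers to; the names `auc` and `compatible_labelings` and the example data are hypothetical.

```python
from fractions import Fraction
from itertools import product


def auc(labels, guesses):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    Ties count as half; exact rational arithmetic avoids rounding issues."""
    pos = [g for y, g in zip(labels, guesses) if y == 1]
    neg = [g for y, g in zip(labels, guesses) if y == 0]
    wins = Fraction(0)
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1
            elif p == q:
                wins += Fraction(1, 2)
    return wins / (len(pos) * len(neg))


def compatible_labelings(guesses, reported_auc):
    """All binary labelings (with both classes present) matching the AUC."""
    n = len(guesses)
    return [
        labels
        for labels in product([0, 1], repeat=n)
        if 0 < sum(labels) < n and auc(labels, guesses) == reported_auc
    ]


guesses = [0.1, 0.4, 0.35, 0.8]   # contestant's real-valued guesses
truth = (0, 0, 1, 1)              # hidden from the contestant
reported = auc(truth, guesses)    # what the oracle would reveal
candidates = compatible_labelings(guesses, reported)
print(reported, len(candidates), truth in candidates)
```

The brute force visits all $2^n$ labelings, so it is only usable for tiny $n$; it is meant to show the constraint itself, not a practical attack.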

Submitted 11 September, 2017; v1 submitted 7 September, 2017; originally announced September 2017.

arXiv:1707.01825 [pdf, other]

Climbing the Kaggle Leaderboard by Exploiting the Log-Loss Oracle

Authors: Jacob Whitehill

Abstract: In the context of data-mining competitions (e.g., Kaggle, KDDCup, ILSVRC Challenge), we show how access to an oracle that reports a contestant's log-loss score on the test set can be exploited to deduce the ground-truth of some of the test examples. By applying this technique iteratively to batches of $m$ examples (for small $m$), all of the test labels can eventually be inferred. In this paper, (1) we demonstrate this attack on the first stage of a recent Kaggle competition (Intel & MobileODT Cancer Screening) and use it to achieve a log-loss of $0.00000$ (and thus attain a rank of #4 out of 848 contestants), without ever training a classifier to solve the actual task. (2) We prove an upper bound on the batch size $m$ as a function of the floating-point resolution of the probability estimates that the contestant submits for the labels. (3) We derive, and demonstrate in simulation, a more flexible attack that can be used even when the oracle reports the accuracy on an unknown (but fixed) subset of the test set's labels. These results underline the importance of evaluating contestants based only on test data that the oracle does not examine.
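
The core idea can be sketched for the simplest case of one probed example per query ($m=1$) on a binary task. The oracle below is a hypothetical stand-in that returns the mean binary log-loss; the probe probability `p=0.9` and the names `make_oracle` and `infer_label` are assumptions for illustration, and the batching scheme described in the abstract is not reproduced here.

```python
import math


def log_loss(labels, probs):
    """Mean binary log-loss of predicted probabilities for class 1."""
    total = sum(math.log(p if y == 1 else 1.0 - p)
                for y, p in zip(labels, probs))
    return -total / len(labels)


def make_oracle(hidden_labels):
    """Hypothetical leaderboard oracle: reveals only the log-loss score."""
    return lambda probs: log_loss(hidden_labels, probs)


def infer_label(oracle, n, i, p=0.9):
    """Deduce the label of example i from a single oracle query.

    Every other example gets probability 0.5, so it contributes ln(2)
    to the sum regardless of its label; only example i moves the score.
    """
    probs = [0.5] * n
    probs[i] = p
    score = oracle(probs)
    base = (n - 1) * math.log(2.0)
    loss_if_one = (base - math.log(p)) / n
    loss_if_zero = (base - math.log(1.0 - p)) / n
    return 1 if abs(score - loss_if_one) < abs(score - loss_if_zero) else 0


hidden = [1, 0, 0, 1, 1, 0]
oracle = make_oracle(hidden)
recovered = [infer_label(oracle, len(hidden), i) for i in range(len(hidden))]
print(recovered == hidden)
```

Probing one example per query needs as many submissions as there are test examples, so an attacker would want to recover several labels per query, which is presumably why the abstract considers batches of $m$ examples and bounds $m$ by the floating-point resolution of the submitted probabilities.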

Submitted 6 July, 2017; originally announced July 2017.

arXiv:1702.06404 [pdf, other]

Delving Deeper into MOOC Student Dropout Prediction

Authors: Jacob Whitehill, Kiran Mohan, Daniel Seaton, Yigal Rosen, Dustin Tingley

Abstract: In order to obtain reliable accuracy estimates for automatic MOOC dropout predictors, it is important to train and test them in a manner consistent with how they will be used in practice. Yet most prior research on MOOC dropout prediction has measured test accuracy on the same course used for training the classifier, which can lead to overly optimistic accuracy estimates. In order to understand better how accuracy is affected by the training+testing regime, we compared the accuracy of a standard dropout prediction architecture (clickstream features + logistic regression) across 4 different training paradigms. Results suggest that (1) training and testing on the same course ("post-hoc") can overestimate accuracy by several percentage points; (2) dropout classifiers trained on proxy labels based on students' persistence are surprisingly competitive with post-hoc training (87.33% versus 90.20% AUC averaged over 8 weeks of 40 HarvardX MOOCs); and (3) classifier performance does not vary significantly with the academic discipline. Finally, we also research new dropout prediction architectures based on deep, fully-connected, feed-forward neural networks and find that (4) networks with as many as 5 hidden layers can statistically significantly increase test accuracy over that of logistic regression.

Submitted 21 February, 2017; originally announced February 2017.

arXiv:1606.09610 [pdf, other]

A Crowdsourcing Approach To Collecting Tutorial Videos -- Toward Personalized Learning-at-Scale

Authors: Jacob Whitehill, Margo Seltzer

Abstract: We investigated the feasibility of crowdsourcing full-fledged tutorial videos from ordinary people on the Web on how to solve math problems related to logarithms. This kind of approach (a form of learnersourcing) to efficiently collecting tutorial videos and other learning resources could be useful for realizing personalized learning-at-scale, whereby students receive specific learning resources -- drawn from a large and diverse set -- that are tailored to their individual and time-varying needs. Results of our study, in which we collected 399 videos from 66 unique "teachers" on Mechanical Turk, suggest that (1) approximately 100 videos -- over $80\%$ of which are mathematically fully correct -- can be crowdsourced per week for \$5/video; (2) the crowdsourced videos exhibit significant diversity in terms of language style, presentation media, and pedagogical approach; (3) the average learning gains (posttest minus pretest score) associated with watching the videos were statistically significantly higher than for a control video ($0.105$ versus $0.045$); and (4) the average learning gains ($0.1416$) from watching the best tested crowdsourced videos were comparable to the learning gains ($0.1506$) from watching a popular Khan Academy video on logarithms.

Submitted 22 April, 2017; v1 submitted 30 June, 2016; originally announced June 2016.

arXiv:1506.01339 [pdf, other]

Exploiting an Oracle that Reports AUC Scores in Machine Learning Contests

Authors: Jacob Whitehill

Abstract: In machine learning contests such as the ImageNet Large Scale Visual Recognition Challenge and the KDD Cup, contestants can submit candidate solutions and receive from an oracle (typically the organizers of the competition) the accuracy of their guesses compared to the ground-truth labels. One of the most commonly used accuracy metrics for binary classification tasks is the Area Under the Receiver Operating Characteristics Curve (AUC). In this paper we provide proofs-of-concept of how knowledge of the AUC of a set of guesses can be used, in two different kinds of attacks, to improve the accuracy of those guesses. On the other hand, we also demonstrate the intractability of one kind of AUC exploit by proving that the number of possible binary labelings of $n$ examples for which a candidate solution obtains an AUC score of $c$ grows exponentially in $n$, for every $c\in (0,1)$.

Submitted 13 November, 2015; v1 submitted 3 June, 2015; originally announced June 2015.

arXiv:1306.0125 [pdf, other]

Understanding ACT-R - an Outsider's Perspective

Authors: Jacob Whitehill

Abstract: The ACT-R theory of cognition developed by John Anderson and colleagues endeavors to explain how humans recall chunks of information and how they solve problems. ACT-R also serves as a theoretical basis for "cognitive tutors", i.e., automatic tutoring systems that help students learn mathematics, computer programming, and other subjects. The official ACT-R definition is distributed across a large body of literature spanning many articles and monographs, and hence it is difficult for an "outsider" to learn the most important aspects of the theory. This paper aims to provide a tutorial to the core components of the ACT-R theory.

Submitted 1 June, 2013; originally announced June 2013.

arXiv:1110.0585 [pdf, other]

Discriminately Decreasing Discriminability with Learned Image Filters

Authors: Jacob Whitehill, Javier Movellan

Abstract: In machine learning and computer vision, input images are often filtered to increase data discriminability. In some situations, however, one may wish to purposely decrease discriminability of one classification task (a "distractor" task), while simultaneously preserving information relevant to another (the task-of-interest): For example, it may be important to mask the identity of persons contained in face images before submitting them to a crowdsourcing site (e.g., Mechanical Turk) when labeling them for certain facial attributes. Another example is inter-dataset generalization: when training on a dataset with a particular covariance structure among multiple attributes, it may be useful to suppress one attribute while preserving another so that a trained classifier does not learn spurious correlations between attributes. In this paper we present an algorithm that finds optimal filters to give high discriminability to one task while simultaneously giving low discriminability to a distractor task. We present results showing the effectiveness of the proposed technique on both simulated data and natural face images.

Submitted 4 October, 2011; originally announced October 2011.

Showing 1–16 of 16 results for author: Whitehill, J