-
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
Authors:
Fuzhao Xue,
Yukang Chen,
Dacheng Li,
Qinghao Hu,
Ligeng Zhu,
Xiuyu Li,
Yunhao Fang,
Haotian Tang,
Shang Yang,
Zhijian Liu,
Ethan He,
Hongxu Yin,
Pavlo Molchanov,
Jan Kautz,
Linxi Fan,
Yuke Zhu,
Yao Lu,
Song Han
Abstract:
Long-context capability is critical for multi-modal foundation models, especially for long video understanding. We introduce LongVILA, a full-stack solution for long-context visual-language models by co-designing the algorithm and system. For model training, we upgrade existing VLMs to support long video understanding by incorporating two additional stages, i.e., long context extension and long supervised fine-tuning. However, training on long video is computationally and memory intensive. We introduce the long-context Multi-Modal Sequence Parallelism (MM-SP) system that efficiently parallelizes long video training and inference, enabling 2M context length training on 256 GPUs without any gradient checkpointing. LongVILA efficiently extends the number of video frames of VILA from 8 to 1024, improving the long video captioning score from 2.00 to 3.26 (out of 5), achieving 99.5% accuracy in 1400-frame (274k context length) video needle-in-a-haystack. LongVILA-8B demonstrates consistent accuracy improvements on long videos in the VideoMME benchmark as the number of frames increases. Besides, MM-SP is 2.1x - 5.7x faster than ring sequence parallelism and 1.1x - 1.4x faster than Megatron with context parallelism + tensor parallelism. Moreover, it seamlessly integrates with Hugging Face Transformers.
Submitted 21 August, 2024; v1 submitted 19 August, 2024;
originally announced August 2024.
-
Topics in Algebra of Synchronous Games, Algebraic Graph Identities and Quantum NP-hardness Reductions
Authors:
Entong He
Abstract:
We review the correspondence between a synchronous game and its associated game algebra. We slightly develop the work of Helton et al. [HMPS17] by proposing results on algebraic and locally commuting graph identities. Based on the theoretical works on noncommutative Nullstellensätze [BWHK23], we build computational tools involving the Gröbner basis method and semidefinite programming to check the existence of perfect strategies with specific models. We prove the equivalence between the hereditary and $C^*$ models proposed in [HMPS17]. We also extend Ji's reduction $\texttt{3-Coloring}^* \leq_p \texttt{3-SAT}^*$ [Ji13] and exhibit another instance of quantum-version NP-hardness reduction $\texttt{Clique}^* \leq_p \texttt{3-SAT}^*$.
Submitted 22 August, 2024; v1 submitted 19 August, 2024;
originally announced August 2024.
-
PAIL: Performance based Adversarial Imitation Learning Engine for Carbon Neutral Optimization
Authors:
Yuyang Ye,
Lu-An Tang,
Haoyu Wang,
Runlong Yu,
Wenchao Yu,
Erhu He,
Haifeng Chen,
Hui Xiong
Abstract:
Achieving carbon neutrality within industrial operations has become increasingly imperative for sustainable development. It is both a significant challenge and a key opportunity for operational optimization in industry 4.0. In recent years, Deep Reinforcement Learning (DRL) based methods have offered promising enhancements for sequential optimization processes and can be used for reducing carbon emissions. However, existing DRL methods need a pre-defined reward function to assess the impact of each action on the final sustainable development goals (SDG). In many real applications, such a reward function cannot be given in advance. To address the problem, this study proposes a Performance based Adversarial Imitation Learning (PAIL) engine. It is a novel method to acquire optimal operational policies for carbon neutrality without any pre-defined action rewards. Specifically, PAIL employs a Transformer-based policy generator to encode historical information and predict following actions within a multi-dimensional space. The entire action sequence will be iteratively updated by an environmental simulator. Then PAIL uses a discriminator to minimize the discrepancy between generated sequences and real-world samples of high SDG. In parallel, a Q-learning framework based performance estimator is designed to estimate the impact of each action on SDG. Based on these estimations, PAIL refines generated policies with the rewards from both discriminator and performance estimator. PAIL is evaluated on multiple real-world application cases and datasets. The experiment results demonstrate the effectiveness of PAIL compared to other state-of-the-art baselines. In addition, PAIL offers meaningful interpretability for the optimization in carbon neutrality.
Submitted 11 July, 2024;
originally announced July 2024.
-
Quantum Ranging Enhanced TDoA Localization
Authors:
Entong He,
Yuxiang Yang,
Chenshu Wu
Abstract:
Localization is critical to numerous applications. The performance of classical localization protocols is limited by the specific form of distance information and suffers from considerable ranging errors. This paper foresees a new opportunity by utilizing the exceptional property of entangled quantum states to measure a linear combination of target-anchor distances. Specifically, we consider localization with quantum-based TDoA measurements. Classical TDoA ranging takes the difference of two separate measurements. Instead, quantum ranging allows TDoA estimation within a single measurement, thereby reducing the ranging errors. Numerical simulations demonstrate that the new quantum-based localization significantly outperforms conventional algorithms based on classical ranging, with over 50% gains on average.
Submitted 25 April, 2024;
originally announced July 2024.
-
Recent Advances in End-to-End Simultaneous Speech Translation
Authors:
Xiaoqian Liu,
Guoqiang Hu,
Yangfan Du,
Erfeng He,
Yingfeng Luo,
Chen Xu,
Tong Xiao,
Jingbo Zhu
Abstract:
Simultaneous speech translation (SimulST) is a demanding task that involves generating translations in real-time while continuously processing speech input. This paper offers a comprehensive overview of the recent developments in SimulST research, focusing on four major challenges. Firstly, the complexities associated with processing lengthy and continuous speech streams pose significant hurdles. Secondly, satisfying real-time requirements presents inherent difficulties due to the need for immediate translation output. Thirdly, striking a balance between translation quality and latency constraints remains a critical challenge. Finally, the scarcity of annotated data adds another layer of complexity to the task. Through our exploration of these challenges and the proposed solutions, we aim to provide valuable insights into the current landscape of SimulST research and suggest promising directions for future exploration.
Submitted 20 August, 2024; v1 submitted 1 June, 2024;
originally announced June 2024.
-
FUGNN: Harmonizing Fairness and Utility in Graph Neural Networks
Authors:
Renqiang Luo,
Huafei Huang,
Shuo Yu,
Zhuoyang Han,
Estrid He,
Xiuzhen Zhang,
Feng Xia
Abstract:
Fairness-aware Graph Neural Networks (GNNs) often face a challenging trade-off, where prioritizing fairness may require compromising utility. In this work, we re-examine fairness through the lens of spectral graph theory, aiming to reconcile fairness and utility within the framework of spectral graph learning. We explore the correlation between sensitive features and spectrum in GNNs, using theoretical analysis to delineate the similarity between original sensitive features and those after convolution under different spectra. Our analysis reveals a reduction in the impact of similarity when the eigenvectors associated with the largest magnitude eigenvalue exhibit directional similarity. Based on these theoretical insights, we propose FUGNN, a novel spectral graph learning approach that harmonizes the conflict between fairness and utility. FUGNN ensures algorithmic fairness and utility by truncating the spectrum and optimizing eigenvector distribution during the encoding process. The fairness-aware eigenvector selection reduces the impact of convolution on sensitive features while concurrently minimizing the sacrifice of utility. FUGNN further optimizes the distribution of eigenvectors through a transformer architecture. By incorporating the optimized spectrum into the graph convolution network, FUGNN effectively learns node representations. Experiments on six real-world datasets demonstrate the superiority of FUGNN over baseline methods. The codes are available at https://github.com/yushuowiki/FUGNN.
Submitted 13 August, 2024; v1 submitted 27 May, 2024;
originally announced May 2024.
-
Metallic bonding in close packed structures: structural frustration from a hidden gauge symmetry
Authors:
Eric He,
C. M. Wilson,
R. Ganesh
Abstract:
Based on its simple valence electron configuration, we may expect lithium to have straightforward physical properties that are easily explained. However, solid lithium, when cooled below 77 K, develops a complex structure that has been debated for decades. A close parallel is found in sodium below 36 K where the crystal structure still remains unresolved. In this letter, we explore a possible driving force behind this complexity. We begin with the observation that Li and Na form close-packed structures at low temperatures. We demonstrate a gauge symmetry that forces \textit{all} close-packed structures to have the same electronic energy and, in fact, the very same band structure. This symmetry requires two conditions: (a) bands must arise from $s$ orbitals, and (b) hoppings beyond second-nearest neighbours must be negligible. We argue that both can be reasonably invoked in Li and Na. When these conditions are satisfied, we have extensive degeneracy with the number of competing iso-energetic structures growing exponentially with linear system size. Weak effects, such as $p$-orbital admixture, long-range hopping and phonon zero-point energy, can break this symmetry. These can play a decisive role in `selecting' one particular ordered structure. This point of view may explain the occurrence of ordered structures in Li and Na under pressure. Our results suggest that martensitic transitions may also occur in heavier alkali metals such as potassium.
Submitted 24 May, 2024;
originally announced May 2024.
-
QuERLoc: Towards Next-Generation Localization with Quantum-Enhanced Ranging
Authors:
Entong He,
Yuxiang Yang,
Chenshu Wu
Abstract:
Remarkable advances have been achieved in localization techniques in past decades, rendering it one of the most important technologies indispensable to our daily lives. In this paper, we investigate a novel localization approach for future computing by presenting QuERLoc, the first study on localization using quantum-enhanced ranging. By fine-tuning the evolution of an entangled quantum probe, quantum ranging can output the information integrated in the probe as a specific mapping of distance-related parameters. QuERLoc is inspired by this unique property to measure a special combination of distances between a target sensor and multiple anchors within one single physical measurement. Leveraging this capability, QuERLoc settles two drawbacks of classical localization approaches: (i) the target-anchor distances must be measured individually and sequentially, and (ii) the resulting optimization problems are non-convex and sensitive to noise. We first present the theoretical formulation of preparing the probing quantum state and controlling its dynamics to induce a convexified localization problem, and then solve it efficiently via optimization. We conduct extensive numerical analysis of QuERLoc under various settings. The results show that QuERLoc consistently outperforms classical approaches in accuracy and closely follows the theoretical lower bound, while maintaining low time complexity. It achieves a minimum reduction of 73% in RMSE and 97.6% in time consumption compared to baselines. By introducing range-based quantum localization to the mobile computing community and showing its superior performance, QuERLoc sheds light on next-generation localization technologies and opens up new directions for future research.
Submitted 4 May, 2024; v1 submitted 24 April, 2024;
originally announced April 2024.
-
Referee-Meta-Learning for Fast Adaptation of Locational Fairness
Authors:
Weiye Chen,
Yiqun Xie,
Xiaowei Jia,
Erhu He,
Han Bao,
Bang An,
Xun Zhou
Abstract:
When dealing with data from distinct locations, machine learning algorithms tend to demonstrate an implicit preference of some locations over the others, which constitutes biases that sabotage the spatial fairness of the algorithm. This unfairness can easily introduce biases in subsequent decision-making given broad adoptions of learning-based solutions in practice. However, locational biases in AI are largely understudied. To mitigate biases over locations, we propose a locational meta-referee (Meta-Ref) to oversee the few-shot meta-training and meta-testing of a deep neural network. Meta-Ref dynamically adjusts the learning rates for training samples of given locations to advocate a fair performance across locations, through an explicit consideration of locational biases and the characteristics of input data. We present a three-phase training framework to learn both a meta-learning-based predictor and an integrated Meta-Ref that governs the fairness of the model. Once trained with a distribution of spatial tasks, Meta-Ref is applied to samples from new spatial tasks (i.e., regions outside the training area) to promote fairness during the fine-tune step. We carried out experiments with two case studies on crop monitoring and transportation safety, which show Meta-Ref can improve locational fairness while keeping the overall prediction quality at a similar level.
Submitted 20 February, 2024;
originally announced February 2024.
-
SCALA: Sparsification-based Contrastive Learning for Anomaly Detection on Attributed Networks
Authors:
Enbo He,
Yitong Hao,
Yue Zhang,
Guisheng Yin,
Lina Yao
Abstract:
Anomaly detection on attributed networks aims to find the nodes whose behaviors are significantly different from other majority nodes. Generally, network data contains information about relationships between entities, and the anomaly is usually embodied in these relationships. Therefore, how to comprehensively model complex interaction patterns in networks is still a major focus. It can be observed that anomalies in networks violate the homophily assumption. However, most existing studies only considered this phenomenon obliquely rather than explicitly. Besides, the node representation of normal entities can be perturbed easily by the noise relationships introduced by anomalous nodes. To address the above issues, we present a novel contrastive learning framework for anomaly detection on attributed networks, SCALA, aiming to improve the embedding quality of the network and provide a new measurement of qualifying the anomaly score for each node by introducing sparsification into the conventional method. Extensive experiments are conducted on five benchmark real-world datasets and the results show that SCALA consistently outperforms all baseline methods significantly.
Submitted 8 January, 2024; v1 submitted 3 January, 2024;
originally announced January 2024.
-
Bridging the Gaps of Both Modality and Language: Synchronous Bilingual CTC for Speech Translation and Speech Recognition
Authors:
Chen Xu,
Xiaoqian Liu,
Erfeng He,
Yuhao Zhang,
Qianqian Dong,
Tong Xiao,
Jingbo Zhu,
Dapeng Man,
Wu Yang
Abstract:
In this study, we present synchronous bilingual Connectionist Temporal Classification (CTC), an innovative framework that leverages dual CTC to bridge the gaps of both modality and language in the speech translation (ST) task. Utilizing transcript and translation as concurrent objectives for CTC, our model bridges the gap between audio and text as well as between source and target languages. Building upon the recent advances in CTC application, we develop an enhanced variant, BiL-CTC+, that establishes new state-of-the-art performances on the MuST-C ST benchmarks under resource-constrained scenarios. Intriguingly, our method also yields significant improvements in speech recognition performance, revealing the effect of cross-lingual learning on transcription and demonstrating its broad applicability. The source code is available at https://github.com/xuchennlp/S2T.
Submitted 21 September, 2023;
originally announced September 2023.
-
Entity Aware Modelling: A Survey
Authors:
Rahul Ghosh,
Haoyu Yang,
Ankush Khandelwal,
Erhu He,
Arvind Renganathan,
Somya Sharma,
Xiaowei Jia,
Vipin Kumar
Abstract:
Personalized prediction of responses for individual entities caused by external drivers is vital across many disciplines. Recent machine learning (ML) advances have led to new state-of-the-art response prediction models. Models built at a population level often lead to sub-optimal performance in many personalized prediction settings due to heterogeneity in data across entities (tasks). In personalized prediction, the goal is to incorporate inherent characteristics of different entities to improve prediction performance. In this survey, we focus on the recent developments in the ML community for such entity-aware modeling approaches. ML algorithms often modulate the network using these entity characteristics when they are readily available. However, these entity characteristics are not readily available in many real-world scenarios, and different ML methods have been proposed to infer these characteristics from the data. In this survey, we have organized the current literature on entity-aware modeling based on the availability of these characteristics as well as the amount of training data. We highlight how recent innovations in other disciplines, such as uncertainty quantification, fairness, and knowledge-guided machine learning, can improve entity-aware modeling.
Submitted 16 February, 2023;
originally announced February 2023.
-
Bound states without potentials: localization at singularities
Authors:
Eric He,
R. Ganesh
Abstract:
Bound state formation is a classic feature of quantum mechanics, where a particle localizes in the vicinity of an attractive potential. This is typically understood as the particle lowering its potential energy. In this article, we discuss a paradigm where bound states arise purely due to kinetic energy considerations. This phenomenon occurs in certain non-manifold spaces that consist of multiple smooth surfaces that intersect one another. The intersection region can be viewed as a singularity where dimensionality is not defined. We demonstrate this idea in a setting where a particle moves on $M$ spaces ($M=2, 3, 4, \ldots$), each of dimensionality $D$ ($D=1, 2$ and $3$). The spaces intersect at a common point, which serves as a singularity. To study quantum behaviour in this setting, we discretize space and adopt a tight-binding approach. We generically find a ground state that is localized around the singular point, bound by the kinetic energy of `shuttling' among the $M$ surfaces. We draw a quantitative analogy between singularities on the one hand and local attractive potentials on the other. To each singularity, we assign an equivalent potential that produces the same bound state wavefunction and binding energy. The degree of a singularity ($M$, the number of intersecting surfaces) determines the strength of the equivalent potential. With $D=1$ and $D=2$, we show that any singularity creates a bound state. This is analogous to the well known fact that any attractive potential creates a bound state in 1D and 2D. In contrast, with $D=3$, bound states only appear when the degree of the singularity exceeds a threshold value. This is analogous to the fact that in three dimensions, a threshold potential strength is required for bound state formation. We discuss implications for experiments and theoretical studies in various domains of quantum physics.
Submitted 7 August, 2023; v1 submitted 6 February, 2023;
originally announced February 2023.
-
Analysis Without Data: Teaching Students to Tackle the VAST Challenge
Authors:
Edward W He,
Daniel Tolessa,
Ashley Suh,
Remco Chang
Abstract:
The VAST Challenges have been shown to be an effective tool in visual analytics education, encouraging student learning while enforcing good visualization design and development practices. However, research has observed that students often struggle at identifying a good "starting point" when tackling the VAST Challenge. Consequently, students who could not identify a good starting point failed at finding the correct solution to the challenge. In this paper, we propose a preliminary guideline for helping students approach the VAST Challenge and identify initial analysis directions. We recruited two students to analyze the VAST 2017 Challenge using a hypothesis-driven approach, where they were required to pre-register their hypotheses prior to inspecting and analyzing the full dataset. From their experience, we developed a prescriptive guideline for other students to tackle VAST Challenges. In a preliminary study, we found that the students were able to use the guideline to generate well-formed hypotheses that could lead them towards solving the challenge. Additionally, the students reported that with the guideline, they felt like they had concrete steps that they could follow, thereby alleviating the burden of identifying a good starting point in their analysis process.
Submitted 1 November, 2022;
originally announced November 2022.
-
Preserved Structure Across Vector Space Representations
Authors:
Andrei Amatuni,
Estelle He,
Elika Bergelson
Abstract:
Certain concepts, words, and images are intuitively more similar than others (dog vs. cat, dog vs. spoon), though quantifying such similarity is notoriously difficult. Indeed, this kind of computation is likely a critical part of learning the category boundaries for words within a given language. Here, we use a set of 27 items (e.g. 'dog') that are highly common in infants' input, and use both image- and word-based algorithms to independently compute similarity among them. We find three key results. First, the pairwise item similarities derived within image-space and word-space are correlated, suggesting preserved structure among these extremely different representational formats. Second, the closest 'neighbors' for each item, within each space, showed significant overlap (e.g. both found 'egg' as a neighbor of 'apple'). Third, items with the most overlapping neighbors are later-learned by infants and toddlers. We conclude that this approach, which does not rely on human ratings of similarity, may nevertheless reflect stable within-class structure across these two spaces. We speculate that such invariance might aid lexical acquisition, by serving as an informative marker of category boundaries.
Submitted 14 May, 2018; v1 submitted 2 February, 2018;
originally announced February 2018.