-
Platypus: A Generalized Specialist Model for Reading Text in Various Forms
Authors:
Peng Wang,
Zhaohai Li,
Jun Tang,
Humen Zhong,
Fei Huang,
Zhibo Yang,
Cong Yao
Abstract:
Reading text from images (either natural scenes or documents) has been a long-standing research topic for decades, due to the high technical challenge and wide application range. Previously, individual specialist models are developed to tackle the sub-tasks of text reading (e.g., scene text recognition, handwritten text recognition and mathematical expression recognition). However, such specialist models usually cannot effectively generalize across different sub-tasks. Recently, generalist models (such as GPT-4V), trained on tremendous data in a unified way, have shown enormous potential in reading text in various scenarios, but with the drawbacks of limited accuracy and low efficiency. In this work, we propose Platypus, a generalized specialist model for text reading. Specifically, Platypus combines the best of both worlds: being able to recognize text of various forms with a single unified architecture, while achieving excellent accuracy and high efficiency. To better exploit the advantage of Platypus, we also construct a text reading dataset (called Worms), the images of which are curated from previous datasets and partially re-labeled. Experiments on standard benchmarks demonstrate the effectiveness and superiority of the proposed Platypus model. Model and data will be made publicly available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/Platypus.
Submitted 27 August, 2024;
originally announced August 2024.
-
LipidBERT: A Lipid Language Model Pre-trained on METiS de novo Lipid Library
Authors:
Tianhao Yu,
Cai Yao,
Zhuorui Sun,
Feng Shi,
Lin Zhang,
Kangjie Lyu,
Xuan Bai,
Andong Liu,
Xicheng Zhang,
Jiali Zou,
Wenshou Wang,
Chris Lai,
Kai Wang
Abstract:
In this study, we generate and maintain a database of 10 million virtual lipids through METiS's in-house de novo lipid generation algorithms and lipid virtual screening techniques. These virtual lipids serve as a corpus for pre-training, lipid representation learning, and downstream task knowledge transfer, culminating in state-of-the-art LNP property prediction performance. We propose LipidBERT, a BERT-like model pre-trained with the Masked Language Model (MLM) and various secondary tasks. Additionally, we compare the performance of embeddings generated by LipidBERT and PhatGPT, our GPT-like lipid generation model, on downstream tasks. The proposed bilingual LipidBERT model operates in two languages: the language of ionizable lipid pre-training, using in-house dry-lab lipid structures, and the language of LNP fine-tuning, utilizing in-house LNP wet-lab data. This dual capability positions LipidBERT as a key AI-based filter for future screening tasks, including new versions of METiS de novo lipid libraries and, more importantly, candidates for in vivo testing for organ-targeting LNPs. To the best of our knowledge, this is the first successful demonstration of the capability of a pre-trained language model on virtual lipids and its effectiveness in downstream tasks using wet-lab data. This work showcases the clever utilization of METiS's in-house de novo lipid library as well as the power of dry-wet lab integration.
Submitted 19 August, 2024; v1 submitted 12 August, 2024;
originally announced August 2024.
-
Exotic thermoelectric properties of coronene-cyclobutadienoid graphene nanoribbons
Authors:
C. Yao,
Chen Kong,
H. F. Feng,
Y. Dong,
L. Huang,
X. Zhang,
Z. X. Song,
Zhi-Xin Guo
Abstract:
Thermoelectric materials traditionally incorporate heavy metals to achieve low lattice thermal conductivity. However, elements such as Te, Bi, and Pb are costly and pose environmental hazards. In this study, we introduce a novel design strategy for thermoelectric materials, focusing on room-temperature, light-element, and high-ZT materials such as coronene-cyclobutadienoid graphene nanoribbons (cor4GNRs). This material demonstrates a ZT value exceeding 2.1, attributed to its exceptionally low phonon thermal conductivity resulting from its unique edge structure. Importantly, its electrical conductance and Seebeck coefficient remain relatively high and nearly unaffected by the edge structure. This distinct behavior in phonon and electronic transport properties leads to a remarkably high ZT value. Additionally, we discover that applying strain can significantly reduce phonon thermal conductivity, potentially increasing the ZT value to over 3.0. Our findings provide innovative insights for the design and application of advanced thermoelectric materials.
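The abstract does not write out the figure of merit it reports. A minimal sketch, assuming the standard conductance-based definition ZT = S²GT / (κ_el + κ_ph), shows why lowering the phonon thermal conductance raises ZT when the Seebeck coefficient and electrical conductance stay roughly fixed. All numerical values below are made-up placeholders chosen for easy arithmetic, not results from the paper.

```python
def figure_of_merit(seebeck_v_per_k, conductance_s, kappa_el_w_per_k,
                    kappa_ph_w_per_k, temperature_k):
    """Standard thermoelectric figure of merit ZT = S^2 * G * T / (k_el + k_ph).

    Conductance form (G in siemens, thermal conductances in W/K), as commonly
    used for nanoribbon transport. Assumed here; the abstract does not state it.
    """
    power_term = seebeck_v_per_k ** 2 * conductance_s * temperature_k
    return power_term / (kappa_el_w_per_k + kappa_ph_w_per_k)


# Hypothetical placeholder inputs (NOT taken from the paper).
S = 200e-6        # Seebeck coefficient, V/K
G = 5e-5          # electrical conductance, S
K_EL = 1e-10      # electronic thermal conductance, W/K
T = 300.0         # room temperature, K

zt_unstrained = figure_of_merit(S, G, K_EL, kappa_ph_w_per_k=5e-10, temperature_k=T)
zt_strained = figure_of_merit(S, G, K_EL, kappa_ph_w_per_k=2e-10, temperature_k=T)

# Only the phonon term changed, so ZT rises as kappa_ph falls.
print(zt_unstrained, zt_strained)
```

Because only the denominator changes between the two calls, the sketch isolates the mechanism the abstract describes: suppressing phonon transport without degrading electronic transport.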
Submitted 7 August, 2024;
originally announced August 2024.
-
Chip-scale sensor for spectroscopic metrology
Authors:
Chunhui Yao,
Wanlu Zhang,
Peng Bao,
Jie Ma,
Wei Zhuo,
Minjia Chen,
Zhitian Shi,
Jingwen Zhou,
Yuxiao Ye,
Liang Ming,
Ting Yan,
Richard Penty,
Qixiang Cheng
Abstract:
Miniaturized spectrometers hold great promise for in situ, in vitro, and even in vivo sensing applications. However, their size reduction imposes vital performance constraints in meeting the rigorous demands of spectroscopy, including fine resolution, high accuracy, and ultra-wide observation window. The prevailing view in the community holds that miniaturized spectrometers are most suitable for the coarse identification of signature peaks. In this paper, we present an integrated reconstructive spectrometer that enables near-infrared (NIR) spectroscopic metrology, and demonstrate a fully packaged sensor with auxiliary electronics. Such a sensor operates over a 520 nm bandwidth together with a resolution of less than 8 pm, which translates into a record-breaking bandwidth-to-resolution ratio of over 65,000. The classification of different types of solid substances and the concentration measurement of aqueous and organic solutions are performed, all achieving approximately 100% accuracy. Notably, the detection limit of our sensor matches that of the commercial benchtop counterparts, which is as low as 0.1% (i.e. 100 mg/dL) for identifying the concentration of glucose solution.
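The two headline figures in this abstract can be checked with simple unit arithmetic. The sketch below assumes the quoted 0.1% glucose concentration is a weight/volume percentage (grams per 100 mL), which is what the parenthetical "100 mg/dL" implies; the function names are illustrative only.

```python
PM_PER_NM = 1000  # 1 nm = 1000 pm


def bandwidth_to_resolution(bandwidth_nm, resolution_pm):
    """Dimensionless ratio of operating bandwidth to spectral resolution."""
    return bandwidth_nm * PM_PER_NM / resolution_pm


def percent_wv_to_mg_per_dl(percent):
    """Convert a weight/volume percentage (g per 100 mL) to mg/dL.

    1 dL = 100 mL, so x % w/v equals x g/dL, i.e. 1000 * x mg/dL.
    """
    return percent * 1000


# 520 nm bandwidth at exactly 8 pm resolution; a resolution finer than
# 8 pm therefore gives a ratio above this value.
ratio = bandwidth_to_resolution(520, 8)
glucose_mg_per_dl = percent_wv_to_mg_per_dl(0.1)

print(ratio)
print(glucose_mg_per_dl)
```

Since the stated resolution is "less than 8 pm", 65,000 is a lower bound on the ratio, consistent with the abstract's "over 65,000".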
Submitted 12 August, 2024; v1 submitted 25 July, 2024;
originally announced July 2024.
-
SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency
Authors:
Yiming Xie,
Chun-Han Yao,
Vikram Voleti,
Huaizu Jiang,
Varun Jampani
Abstract:
We present Stable Video 4D (SV4D), a latent video diffusion model for multi-frame and multi-view consistent dynamic 3D content generation. Unlike previous methods that rely on separately trained generative models for video generation and novel view synthesis, we design a unified diffusion model to generate novel view videos of dynamic 3D objects. Specifically, given a monocular reference video, SV4D generates novel views for each video frame that are temporally consistent. We then use the generated novel view videos to optimize an implicit 4D representation (dynamic NeRF) efficiently, without the need for cumbersome SDS-based optimization used in most prior works. To train our unified novel view video generation model, we curated a dynamic 3D object dataset from the existing Objaverse dataset. Extensive experimental results on multiple datasets and user studies demonstrate SV4D's state-of-the-art performance on novel-view video synthesis as well as 4D generation compared to prior works.
Submitted 24 July, 2024;
originally announced July 2024.
-
WebRPG: Automatic Web Rendering Parameters Generation for Visual Presentation
Authors:
Zirui Shao,
Feiyu Gao,
Hangdi Xing,
Zepeng Zhu,
Zhi Yu,
Jiajun Bu,
Qi Zheng,
Cong Yao
Abstract:
In the era of content creation revolution propelled by advancements in generative models, the field of web design remains unexplored despite its critical role in modern digital communication. The web design process is complex and often time-consuming, especially for those with limited expertise. In this paper, we introduce Web Rendering Parameters Generation (WebRPG), a new task that aims at automating the generation for visual presentation of web pages based on their HTML code. WebRPG would contribute to a faster web development workflow. Since there is no existing benchmark available, we develop a new dataset for WebRPG through an automated pipeline. Moreover, we present baseline models, utilizing VAE to manage numerous elements and rendering parameters, along with custom HTML embedding for capturing essential semantic and hierarchical information from HTML. Extensive experiments, including customized quantitative evaluations for this specific task, are conducted to evaluate the quality of the generated results.
Submitted 22 July, 2024;
originally announced July 2024.
-
A Random Matrix Model for a Family of Cusp Forms
Authors:
Owen Barrett,
Zoë X. Batterman,
Aditya Jambhale,
Steven J. Miller,
Akash L. Narayanan,
Kishan Sharma,
Chris Yao
Abstract:
The Katz-Sarnak philosophy states that statistics of zeros of $L$-function families near the central point as the conductors tend to infinity agree with those of eigenvalues of random matrix ensembles as the matrix size tends to infinity. While numerous results support this conjecture, S. J. Miller observed that for finite conductors, very different behavior can occur for zeros near the central point in elliptic curve $L$-function families. This led to the creation of the excised model of Dueñez, Huynh, Keating, Miller, and Snaith, whose predictions for quadratic twists of a given elliptic curve are well fit by the data. The key ingredients are relating the discretization of central values of the $L$-functions to excising matrices based on the value of the characteristic polynomials at 1 and using lower order terms (in statistics such as the one-level density and pair-correlation) to adjust the matrix size. We extended this model for a family of twists of an $L$-function associated to a given holomorphic cuspidal newform of odd prime level and arbitrary weight. We derive the corresponding "effective" matrix size for a given form by computing the one-level density and pair-correlation statistics for a chosen family of twists, and we show there is no repulsion for forms with weight greater than 2 and principal nebentype. We experimentally verify the accuracy of the model, and as expected, our model recovers the elliptic curve model.
Submitted 8 July, 2024;
originally announced July 2024.
-
Visual Text Generation in the Wild
Authors:
Yuanzhi Zhu,
Jiawei Liu,
Feiyu Gao,
Wenyu Liu,
Xinggang Wang,
Peng Wang,
Fei Huang,
Cong Yao,
Zhibo Yang
Abstract:
Recently, with the rapid advancements of generative models, the field of visual text generation has witnessed significant progress. However, it is still challenging to render high-quality text images in real-world scenarios, as three critical criteria should be satisfied: (1) Fidelity: the generated text images should be photo-realistic and the contents are expected to be the same as specified in the given conditions; (2) Reasonability: the regions and contents of the generated text should cohere with the scene; (3) Utility: the generated text images can facilitate related tasks (e.g., text detection and recognition). Upon investigation, we find that existing methods, either rendering-based or diffusion-based, can hardly meet all these aspects simultaneously, limiting their application range. Therefore, we propose in this paper a visual text generator (termed SceneVTG), which can produce high-quality text images in the wild. Following a two-stage paradigm, SceneVTG leverages a Multimodal Large Language Model to recommend reasonable text regions and contents across multiple scales and levels, which are used by a conditional diffusion model as conditions to generate text images. Extensive experiments demonstrate that the proposed SceneVTG significantly outperforms traditional rendering-based methods and recent diffusion-based methods in terms of fidelity and reasonability. Besides, the generated images provide superior utility for tasks involving text detection and text recognition. Code and datasets are available at AdvancedLiterateMachinery.
Submitted 19 July, 2024;
originally announced July 2024.
-
ProcTag: Process Tagging for Assessing the Efficacy of Document Instruction Data
Authors:
Yufan Shen,
Chuwei Luo,
Zhaoqing Zhu,
Yang Chen,
Qi Zheng,
Zhi Yu,
Jiajun Bu,
Cong Yao
Abstract:
Recently, large language models (LLMs) and multimodal large language models (MLLMs) have demonstrated promising results on the document visual question answering (VQA) task, particularly after training on document instruction datasets. An effective evaluation method for document instruction data is crucial in constructing instruction data with high efficacy, which, in turn, facilitates the training of LLMs and MLLMs for document VQA. However, most existing evaluation methods for instruction data are limited to the textual content of the instructions themselves, thereby hindering the effective assessment of document instruction datasets and constraining their construction. In this paper, we propose ProcTag, a data-oriented method that assesses the efficacy of document instruction data. ProcTag innovatively performs tagging on the execution process of instructions rather than the instruction text itself. By leveraging the diversity and complexity of these tags to assess the efficacy of the given dataset, ProcTag enables selective sampling or filtering of document instructions. Furthermore, DocLayPrompt, a novel semi-structured layout-aware document prompting strategy, is proposed for effectively representing documents. Experiments demonstrate that sampling existing open-sourced and generated document VQA/instruction datasets with ProcTag significantly outperforms current methods for evaluating instruction data. Impressively, with ProcTag-based sampling in the generated document datasets, only 30.5% of the document instructions are required to achieve 100% efficacy compared to the complete dataset. The code is publicly available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/DocumentUnderstanding/ProcTag.
Submitted 17 July, 2024;
originally announced July 2024.
-
Temporally Consistent Stereo Matching
Authors:
Jiaxi Zeng,
Chengtang Yao,
Yuwei Wu,
Yunde Jia
Abstract:
Stereo matching provides depth estimation from binocular images for downstream applications. These applications mostly take video streams as input and require temporally consistent depth maps. However, existing methods mainly focus on the estimation at the single-frame level. This commonly leads to temporally inconsistent results, especially in ill-posed regions. In this paper, we aim to leverage temporal information to improve the temporal consistency, accuracy, and efficiency of stereo matching. To achieve this, we formulate video stereo matching as a process of temporal disparity completion followed by continuous iterative refinements. Specifically, we first project the disparity of the previous timestamp to the current viewpoint, obtaining a semi-dense disparity map. Then, we complete this map through a disparity completion module to obtain a well-initialized disparity map. The state features from the current completion module and from the past refinement are fused together, providing a temporally coherent state for subsequent refinement. Based on this coherent state, we introduce a dual-space refinement module to iteratively refine the initialized result in both disparity and disparity gradient spaces, improving estimations in ill-posed regions. Extensive experiments demonstrate that our method effectively alleviates temporal inconsistency while enhancing both accuracy and efficiency.
Submitted 16 July, 2024;
originally announced July 2024.
-
MPCODER: Multi-user Personalized Code Generator with Explicit and Implicit Style Representation Learning
Authors:
Zhenlong Dai,
Chang Yao,
WenKang Han,
Ying Yuan,
Zhipeng Gao,
Jingyuan Chen
Abstract:
Large Language Models (LLMs) have demonstrated great potential for assisting developers in their daily development. However, most research focuses on generating correct code; how to use LLMs to generate personalized code has seldom been investigated. To bridge this gap, we propose MPCoder (Multi-user Personalized Code Generator) to generate personalized code for multiple users. To better learn coding style features, we utilize explicit coding style residual learning to capture the syntax code style standards and implicit style learning to capture the semantic code style conventions. We train a multi-user style adapter to better differentiate the implicit feature representations of different users through contrastive learning, ultimately enabling personalized code generation for multiple users. We further propose a novel evaluation metric for estimating similarities between codes of different coding styles. The experimental results show the effectiveness of our approach for this novel task.
Submitted 24 June, 2024;
originally announced June 2024.
-
Baryon-number-violating nucleon decays in ALP effective field theories
Authors:
Tong Li,
Michael A. Schmidt,
Chang-Yuan Yao
Abstract:
The search for baryon-number-violating (BNV) nucleon decay is an intriguing probe of new physics beyond the SM in future neutrino experiments with enhanced sensitivity. The dark sector states such as an axion or axion-like particle (ALP) can induce nucleon decays with distinct signature and kinematics from the conventional nucleon decays. In this work, we study the ALP effective field theories (EFTs) with baryon number violation and the impact of light ALP on BNV nucleon decays. We revisit the dimension-8 BNV operators in the extended EFTs with an ALP field $a$ respecting shift symmetry. The low-energy EFT operators with $|Δ(B-L)|=2$ and $|Δ(B-L)|=0$ are matched to the baryon chiral perturbation theory. We obtain the effective chiral Lagrangian and the BNV interactions between ALP and baryons/mesons. The ALP interactions lead to two-body baryon decays $B\to \ell~({\rm or}~ν)~a$ and three-body nucleon decays $N\to M~\ell~({\rm or}~ν)~a$. We obtain the constraints on the UV scale from the invisible $Λ^0$ decay search at BESIII, the invisible neutron decay search at KamLAND and proton decay search at Super-K. We also show the projections of some other baryon/nucleon decays and present the distinct distributions of kinematic observable.
Submitted 16 August, 2024; v1 submitted 17 June, 2024;
originally announced June 2024.
-
ProcessPainter: Learn Painting Process from Sequence Data
Authors:
Yiren Song,
Shijie Huang,
Chen Yao,
Xiaojun Ye,
Hai Ci,
Jiaming Liu,
Yuxuan Zhang,
Mike Zheng Shou
Abstract:
The painting process of artists is inherently stepwise and varies significantly among different painters and styles. Generating detailed, step-by-step painting processes is essential for art education and research, yet remains largely underexplored. Traditional stroke-based rendering methods break down images into sequences of brushstrokes, yet they fall short of replicating the authentic processes of artists, with limitations confined to basic brushstroke modifications. Text-to-image models, which generate images through iterative denoising in a diffusion process, also diverge substantially from artists' painting processes. To address these challenges, we introduce ProcessPainter, a text-to-video model that is initially pre-trained on synthetic data and subsequently fine-tuned with a select set of artists' painting sequences using the LoRA model. This approach successfully generates painting processes from text prompts for the first time. Furthermore, we introduce an Artwork Replication Network capable of accepting arbitrary-frame input, which facilitates the controlled generation of painting processes, decomposing images into painting sequences, and completing semi-finished artworks. This paper offers new perspectives and tools for advancing art education and image generation technology.
Submitted 20 July, 2024; v1 submitted 10 June, 2024;
originally announced June 2024.
-
Gravitating vortices and Symplectic Reduction by Stages
Authors:
L. Álvarez-Cónsul,
M. Garcia-Fernandez,
O. García-Prada,
V. P. Pingali,
C. -J. Yao
Abstract:
We undertake a novel approach to the existence problem for gravitating vortices on a Riemann surface based on symplectic reduction by stages, which seems to be new in the PDE as well as the gauge theory literature. The main technical tool for our study is the reduced $α$-K-energy, for which we establish convexity properties by means of finite-energy pluripotential theory, as recently applied to the study of constant scalar curvature Kähler metrics. Using these methods, we prove that the existence of solutions to the gravitating vortex equations on the sphere implies the polystability of the effective divisor defined by the zeroes of the Higgs field. This approach also enables us to establish the uniqueness of gravitating vortices in any admissible Kähler class, in the absence of automorphisms. Lastly, we also prove the existence of solutions for the gravitating vortex equations for genus $g\geq 1$ for certain ranges of the coupling constant $α$ and the volume.
Submitted 5 June, 2024;
originally announced June 2024.
-
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
Authors:
Philip Anastassiou,
Jiawei Chen,
Jitong Chen,
Yuanzhe Chen,
Zhuo Chen,
Ziyi Chen,
Jian Cong,
Lelai Deng,
Chuang Ding,
Lu Gao,
Mingqing Gong,
Peisong Huang,
Qingqing Huang,
Zhiying Huang,
Yuanyuan Huo,
Dongya Jia,
Chumin Li,
Feiya Li,
Hui Li,
Jiaxin Li,
Xiaoyang Li,
Xingxing Li,
Lin Liu,
Shouda Liu,
Sichao Liu
, et al. (21 additional authors not shown)
Abstract:
We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models capable of generating speech that is virtually indistinguishable from human speech. Seed-TTS serves as a foundation model for speech generation and excels in speech in-context learning, achieving performance in speaker similarity and naturalness that matches ground truth human speech in both objective and subjective evaluations. With fine-tuning, we achieve even higher subjective scores across these metrics. Seed-TTS offers superior controllability over various speech attributes such as emotion and is capable of generating highly expressive and diverse speech for speakers in the wild. Furthermore, we propose a self-distillation method for speech factorization, as well as a reinforcement learning approach to enhance model robustness, speaker similarity, and controllability. We additionally present a non-autoregressive (NAR) variant of the Seed-TTS model, named $\text{Seed-TTS}_\text{DiT}$, which utilizes a fully diffusion-based architecture. Unlike previous NAR-based TTS systems, $\text{Seed-TTS}_\text{DiT}$ does not depend on pre-estimated phoneme durations and performs speech generation through end-to-end processing. We demonstrate that this variant achieves comparable performance to the language model-based variant and showcase its effectiveness in speech editing. We encourage readers to listen to demos at https://bytedancespeech.github.io/seedtts_tech_report.
Submitted 4 June, 2024;
originally announced June 2024.
-
The flavor invariants of the $ν$SM
Authors:
Christophe Grojean,
Jonathan Kley,
Damien Leflot,
Chang-Yuan Yao
Abstract:
Sixty years after the experimental discovery of CP violation in the quark sector, the existence of a similar CP violation in the lepton sector is still to be established. Actually, the structure of such a violation depends crucially on the origin of the neutrino masses. In an attempt at categorizing the leptonic sources of CP violation, we studied the $ν$SM, the Standard Model extended with three generations of sterile neutrinos, that can interpolate continuously between the Dirac and Majorana scenarios of neutrino masses. In particular, we perform a classification of the Jarlskog-like flavor invariants entering CP-violating observables and we study their suppression with the heavy Majorana mass in the seesaw limit of the model. To simplify the construction of the invariants, we introduce a graph-based method. With the guidance of the Hilbert series and plethystic logarithm of the theory, we construct the \emph{generating} and \emph{primary} sets of invariants for the $ν$SM for the first time. Unlike in the Standard Model and some other theories, we find that the numbers of generating invariants and the syzygies among them cannot immediately be read off from the plethystic logarithm, but require a more careful examination. Our analysis reveals that the \emph{generating} set contains 459 invariants, out of which 208 are CP-even and 251 are CP-odd. In the seesaw limit of the $ν$SM, we show that all parameters of the UV theory can be captured in the effective theory with a certain suppression with the heavy Majorana mass, while these parameters can only appear in a \emph{flavor-invariant} way with a \emph{higher} mass suppression. Furthermore, we discuss how the necessary and sufficient conditions for CP violation can be captured by utilizing these invariants. Along the way, we present useful algorithms to enumerate and build the flavor invariants.
Submitted 31 May, 2024;
originally announced June 2024.
-
Asymmetrical estimator for training encapsulated deep photonic neural networks
Authors:
Yizhi Wang,
Minjia Chen,
Chunhui Yao,
Jie Ma,
Ting Yan,
Richard Penty,
Qixiang Cheng
Abstract:
Scalable isomorphic physical neural networks (PNNs) are emerging NN acceleration paradigms for their high-bandwidth, in-propagation computation. Although backpropagation (BP)-based training is often the industry standard for its robustness and fast gradient convergence, existing BP-PNN training methods need to truncate the propagation of the analogue signal at each layer and acquire accurate hidden neuron readouts for deep networks. This compromises the incentive of PNNs for fast in-propagation processing. In addition, the required readouts introduce massive bottlenecks due to the conversions across analogue-digital interfaces needed to shuttle information. These factors limit both the time and energy efficiency during training. Here we introduce the asymmetrical training (AT) method, a BP-based method that can perform training on an encapsulated deep network, where the information propagation is maintained within the analogue domain until the output layer. AT's minimal information access bypasses the analogue-digital interface bottleneck wherever possible. For any deep network structure, AT offers significantly improved time and energy efficiency compared to existing BP-PNN methods, and scales well for large network sizes. We demonstrate AT's error-tolerant and calibration-free training for encapsulated integrated photonic deep networks, achieving near-ideal BP performance. AT's well-behaved training is demonstrated repeatably across different datasets and network structures.
Submitted 15 August, 2024; v1 submitted 28 May, 2024;
originally announced May 2024.
-
I$^2$VC: A Unified Framework for Intra- & Inter-frame Video Compression
Authors:
Meiqin Liu,
Chenming Xu,
Yukai Gu,
Chao Yao,
Yao Zhao
Abstract:
Video compression aims to reconstruct seamless frames by encoding the motion and residual information from existing frames. Previous neural video compression methods necessitate distinct codecs for three types of frames (I-frame, P-frame and B-frame), which hinders a unified approach and generalization across different video contexts. Intra-codec techniques lack the advanced Motion Estimation and Motion Compensation (MEMC) found in inter-codec, leading to fragmented frameworks lacking uniformity. Our proposed Intra- & Inter-frame Video Compression (I$^2$VC) framework employs a single spatio-temporal codec that guides feature compression rates according to content importance. This unified codec transforms the dependence across frames into a conditional coding scheme, thus integrating intra- and inter-frame compression into one cohesive strategy. Given the absence of explicit motion data, achieving competent inter-frame compression with only a conditional codec poses a challenge. To resolve this, our approach includes an implicit inter-frame alignment mechanism. With the pre-trained diffusion denoising process, the utilization of a diffusion-inverted reference feature rather than random noise supports the initial compression state. This process allows for selective denoising of motion-rich regions based on decoded features, facilitating accurate alignment without the need for MEMC. Our experimental findings, across various compression configurations (AI, LD and RA) and frame types, prove that I$^2$VC outperforms the state-of-the-art perceptual learned codecs. Impressively, it exhibits a 58.4% enhancement in perceptual reconstruction performance when benchmarked against the H.266/VVC standard (VTM). Official implementation can be found at https://github.com/GYukai/I2VC.
Submitted 1 June, 2024; v1 submitted 23 May, 2024;
originally announced May 2024.
-
Physics-informed Data-driven Cavitation Model for a Specific MG EOS
Authors:
Minsheng Huang,
Chengbao Yao,
Pan Wang,
Lidong Cheng,
Wenjun Ying
Abstract:
We present a novel one-fluid cavitation model for a specific Mie-Grüneisen equation of state (EOS), named the polynomial EOS, based on an artificial neural network. Both the physics-informed equation and the experimental data are embedded into the proposed model through an optimization problem. The physics-informed data-driven model provides the pressure of interest within the cavitation region, where the density tends to zero when the pressure falls below the saturated pressure. The model is then applied to challenging compressible multi-phase flow simulations, such as nuclear and underwater explosions. Numerical simulations, ranging from one dimension to three dimensions and using the $h$-adaptive mesh refinement algorithm and load-balancing techniques on structured and unstructured grids, show that our model agrees well with the corresponding experimental data.
Submitted 5 April, 2024;
originally announced May 2024.
-
The strong-coupling quantum thermodynamics of quantum Brownian motion based on the exact solution of its reduced density matrix
Authors:
Chuan-Zhe Yao,
Wei-Min Zhang
Abstract:
We derive the quantum thermodynamics of quantum Brownian motion from the exact solution of its reduced density matrix. We start from the total equilibrium thermal state of the Brownian particle and its reservoir, and solve analytically and exactly for the reduced density matrix of the system by taking the partial trace over all the reservoir states. We find that the reduced Hamiltonian and the reduced partition function of the Brownian motion must be renormalized significantly, as shown in the general nonperturbative renormalization theory of quantum thermodynamics for open quantum systems we developed recently [Phys. Rev. Res. 4, 023141 (2022)]. The reduced Hamiltonian contains not only a frequency shift but also a squeezing pairing interaction, where a momentum-dependent potential is generated naturally from the strong coupling between the Brownian particle and the reservoir after tracing over all the reservoir states. The resulting exact reduced density matrix of the Brownian motion is given by a squeezing thermal state. Moreover, beyond the weak-coupling limit, to correctly obtain the reduced partition function of the Brownian motion, one must take into account the non-negligible changes of the reservoir state induced by the system-reservoir coupling. Using the exact solutions of the reduced density matrix, the reduced Hamiltonian, and the reduced partition function of the Brownian motion, we show that the controversial results obtained from the different definitions of internal energy and the issue of the negative heat capacity in previous studies of strong-coupling quantum thermodynamics are resolved.
Submitted 5 July, 2024; v1 submitted 30 April, 2024;
originally announced May 2024.
-
Convergence of the hypersymplectic flow on $T^4$ with $T^3$-symmetry
Authors:
Joel Fine,
Weiyong He,
Chengjian Yao
Abstract:
A hypersymplectic structure on a 4-manifold is a triple $ω_1, ω_2, ω_3$ of 2-forms for which every non-trivial linear combination $a^1ω_1 + a^2 ω_2 + a^3 ω_3$ is a symplectic form. Donaldson has conjectured that when the underlying manifold is compact, any such structure is isotopic in its cohomology class to a hyperkähler triple. We prove this conjecture for a hypersymplectic structure on $T^4$ which is invariant under the standard $T^3$ action. The proof uses the hypersymplectic flow, a geometric flow which attempts to deform a given hypersymplectic structure to a hyperkähler triple. We prove that on $T^4$, when starting from a $T^3$-invariant hypersymplectic structure, the flow exists for all time and converges modulo diffeomorphisms to the unique cohomologous hyperkähler structure.
Submitted 23 April, 2024;
originally announced April 2024.
-
Are We Ready for Planetary Exploration Robots? The TAIL-Plus Dataset for SLAM in Granular Environments
Authors:
Zirui Wang,
Chen Yao,
Yangtao Ge,
Guowei Shi,
Ningbo Yang,
Zheng Zhu,
Kewei Dong,
Hexiang Wei,
Zhenzhong Jia,
Jing Wu
Abstract:
So far, planetary surface exploration has depended on various mobile robot platforms. The autonomous navigation and decision-making of these mobile robots in complex terrains largely rely on their terrain-aware perception, localization and mapping capabilities. In this paper we release the TAIL-Plus dataset, a new challenging dataset in deformable granular environments for planetary exploration robots, which is an extension of our previous work, the TAIL (Terrain-Aware multI-modaL) dataset. We conducted field experiments on beaches, which are considered planetary surface analog environments for diverse sandy terrains. In the TAIL-Plus dataset, we provide more sequences with multiple loops and expand the scene from day to night. Benefiting from the modular design of our sensor suite, we use both wheeled and quadruped robots for data collection. The sensors include a 3D LiDAR, three downward RGB-D cameras, a pair of global-shutter color cameras that can be used as a forward-looking stereo camera, an RTK-GPS device and an extra IMU. Our datasets are intended to help researchers develop multi-sensor simultaneous localization and mapping (SLAM) algorithms for robots in unstructured, deformable granular terrains. Our datasets and supplementary materials will be available at https://tailrobot.github.io/.
Submitted 21 April, 2024;
originally announced April 2024.
-
Thermal conversion of ultrathin nickel hydroxide for wide bandgap 2D nickel oxides
Authors:
Lu Ping,
Nicholas Russo,
Zifan Wang,
Ching-Hsiang Yao,
Kevin E. Smith,
Xi Ling
Abstract:
Wide bandgap (WBG) semiconductors (Eg > 2.0 eV) are integral to the advancement of next-generation electronics, optoelectronics, and power industries, owing to their capability for high-temperature operation, high breakdown voltage and efficient light emission. Enhanced power efficiency and functional performance can be attained through miniaturization, specifically via the integration of device fabrication into two-dimensional (2D) structures enabled by WBG 2D semiconductors. However, as an essential subgroup of WBG semiconductors, 2D transition metal oxides (TMOs) remain largely underexplored in terms of physical properties and applications in 2D optoelectronic devices, primarily due to the scarcity of sufficiently large 2D crystals. Thus, our goal is to develop synthesis pathways for 2D TMOs possessing large crystal domains (e.g. >10 nm), expanding the 2D TMO family and providing insights for future engineering of 2D TMOs. Here, we demonstrate the synthesis of WBG 2D nickel oxide (NiO) (Eg > 2.7 eV) thermally converted from 2D nickel hydroxide (Ni(OH)2) with a lateral domain size larger than 10 μm. Moreover, the conversion process is investigated using various microscopic techniques such as atomic force microscopy (AFM), Raman spectroscopy, transmission electron microscopy (TEM) and X-ray photoelectron spectroscopy (XPS), providing significant insights into the morphology and structure variation under different oxidative conditions. The electronic structure of the converted NixOy is further investigated using multiple soft X-ray spectroscopies, such as X-ray absorption (XAS) and emission spectroscopies (XES).
Submitted 15 April, 2024;
originally announced April 2024.
-
Revealing mechanism of pore defect formation in laser directed energy deposition of aluminum alloy via in-situ synchrotron X-ray imaging
Authors:
Wei Liu,
Yuxiao Li,
Chunxia Yao,
Dongsheng Zhang,
Darui Sun,
Sen Chen,
Yu Wu,
Jun Wang,
Lei Lud,
Sheng-Nian Luo,
Ye Tao,
Bingbing Zhang
Abstract:
Laser metal additive manufacturing technology is capable of producing components with complex geometries and compositions that cannot be realized by conventional manufacturing methods. However, the large number of pores generated during the additive manufacturing process greatly affects the mechanical properties of the additively manufactured parts, and the mechanism of such pore generation has not been clearly revealed by direct observation. Here, we report the mechanism of pore generation in the laser directed energy deposition process as revealed by in-situ high-speed high-resolution synchrotron X-ray imaging. We found that dissolution and re-precipitation of external gases and precipitation of metal vapors are the two main mechanisms of pore formation. We further explored the effects of different process parameters on the generation of pores and optimized the process to suppress pore generation. This work provides important insights into the formation of porosity defects during laser metal additive manufacturing, and can provide guidance for related process optimization.
Submitted 10 April, 2024;
originally announced April 2024.
-
LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding
Authors:
Chuwei Luo,
Yufan Shen,
Zhaoqing Zhu,
Qi Zheng,
Zhi Yu,
Cong Yao
Abstract:
Recently, leveraging large language models (LLMs) or multimodal large language models (MLLMs) for document understanding has proven very promising. However, previous works that employ LLMs/MLLMs for document understanding have not fully explored and utilized the document layout information, which is vital for precise document understanding. In this paper, we propose LayoutLLM, an LLM/MLLM-based method for document understanding. The core of LayoutLLM is a layout instruction tuning strategy, which is specially designed to enhance the comprehension and utilization of document layouts. The proposed layout instruction tuning strategy consists of two components: Layout-aware Pre-training and Layout-aware Supervised Fine-tuning. To capture the characteristics of document layout in Layout-aware Pre-training, three groups of pre-training tasks, corresponding to document-level, region-level and segment-level information, are introduced. Furthermore, a novel module called layout chain-of-thought (LayoutCoT) is devised to enable LayoutLLM to focus on regions relevant to the question and generate accurate answers. LayoutCoT is effective for boosting the performance of document understanding. Meanwhile, it brings a certain degree of interpretability, which could facilitate manual inspection and correction. Experiments on standard benchmarks show that the proposed LayoutLLM significantly outperforms existing methods that adopt open-source 7B LLMs/MLLMs for document understanding. The training data of LayoutLLM is publicly available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/DocumentUnderstanding/LayoutLLM
Submitted 8 April, 2024;
originally announced April 2024.
-
OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition
Authors:
Jianqiang Wan,
Sibo Song,
Wenwen Yu,
Yuliang Liu,
Wenqing Cheng,
Fei Huang,
Xiang Bai,
Cong Yao,
Zhibo Yang
Abstract:
Recently, visually-situated text parsing (VsTP) has experienced notable advancements, driven by the increasing demand for automated document understanding and the emergence of Generative Large Language Models (LLMs) capable of processing document-based questions. Various methods have been proposed to address the challenging problem of VsTP. However, due to the diversified targets and heterogeneous schemas, previous works usually design task-specific architectures and objectives for individual tasks, which inadvertently leads to modal isolation and complex workflow. In this paper, we propose a unified paradigm for parsing visually-situated text across diverse scenarios. Specifically, we devise a universal model, called OmniParser, which can simultaneously handle three typical visually-situated text parsing tasks: text spotting, key information extraction, and table recognition. In OmniParser, all tasks share the unified encoder-decoder architecture, the unified objective: point-conditioned text generation, and the unified input & output representation: prompt & structured sequences. Extensive experiments demonstrate that the proposed OmniParser achieves state-of-the-art (SOTA) or highly competitive performances on 7 datasets for the three visually-situated text parsing tasks, despite its unified, concise design. The code is available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery.
Submitted 27 March, 2024;
originally announced March 2024.
-
Experimental Realization of Discrete Time Quasi-Crystals
Authors:
Guanghui He,
Bingtian Ye,
Ruotian Gong,
Changyu Yao,
Zhongyuan Liu,
Kater W. Murch,
Norman Y. Yao,
Chong Zu
Abstract:
Floquet (periodically driven) systems can give rise to unique non-equilibrium phases of matter without equilibrium analogs. The most prominent example is the realization of discrete time crystals. An intriguing question emerges: what other novel phases can manifest when the constraint of time periodicity is relaxed? In this study, we explore quantum systems subjected to a quasi-periodic drive. Leveraging a strongly interacting spin ensemble in diamond, we identify the emergence of long-lived discrete time quasi-crystals. Unlike conventional time crystals, time quasi-crystals exhibit robust sub-harmonic responses at multiple incommensurate frequencies. Furthermore, we show that the multi-frequency nature of the quasi-periodic drive allows for the formation of diverse patterns associated with different discrete time quasi-crystalline phases. Our findings demonstrate the existence of non-equilibrium phases in quasi-Floquet settings, significantly broadening the catalog of novel phenomena in driven many-body quantum systems.
Submitted 26 March, 2024;
originally announced March 2024.
-
TAIL: A Terrain-Aware Multi-Modal SLAM Dataset for Robot Locomotion in Deformable Granular Environments
Authors:
Chen Yao,
Yangtao Ge,
Guowei Shi,
Zirui Wang,
Ningbo Yang,
Zheng Zhu,
Hexiang Wei,
Yuntian Zhao,
Jing Wu,
Zhenzhong Jia
Abstract:
Terrain-aware perception holds the potential to improve the robustness and accuracy of autonomous robot navigation in the wild, thereby facilitating effective off-road traversals. However, the lack of multi-modal perception across various motion patterns hinders solutions to Simultaneous Localization And Mapping (SLAM), especially when confronting non-geometric hazards in demanding landscapes. In this paper, we first propose a Terrain-Aware multI-modaL (TAIL) dataset tailored to deformable and sandy terrains. It incorporates various types of robotic proprioception and distinct ground interactions to capture the unique challenges of multi-sensor fusion SLAM and to serve as a benchmark for it. The versatile sensor suite comprises stereo frame cameras, multiple ground-pointing RGB-D cameras, a rotating 3D LiDAR, an IMU, and an RTK device. This ensemble is hardware-synchronized, well-calibrated, and self-contained. Utilizing both wheeled and quadrupedal locomotion, we efficiently collect comprehensive sequences to capture rich unstructured scenarios. The dataset spans the spectrum of scope, terrain interactions, scene changes, ground-level properties, and dynamic robot characteristics. We benchmark several state-of-the-art SLAM methods against ground truth and provide performance validations. Corresponding challenges and limitations are also reported. All associated resources are accessible upon request at https://tailrobot.github.io/.
Submitted 25 March, 2024;
originally announced March 2024.
-
RU22Fact: Optimizing Evidence for Multilingual Explainable Fact-Checking on Russia-Ukraine Conflict
Authors:
Yirong Zeng,
Xiao Ding,
Yi Zhao,
Xiangyu Li,
Jie Zhang,
Chao Yao,
Ting Liu,
Bing Qin
Abstract:
Fact-checking is the task of verifying the factuality of a given claim by examining the available evidence. High-quality evidence plays a vital role in enhancing fact-checking systems and facilitating the generation of explanations that are understandable to humans. However, providing evidence that is both sufficient and relevant for explainable fact-checking systems poses a challenge. To tackle this challenge, we propose a method based on a Large Language Model to automatically retrieve and summarize evidence from the Web. Furthermore, we construct RU22Fact, a novel multilingual explainable fact-checking dataset of 16K samples on the 2022 Russia-Ukraine conflict, each containing real-world claims, optimized evidence, and a referenced explanation. To establish a baseline for our dataset, we also develop an end-to-end explainable fact-checking system to verify claims and generate explanations. Experimental results demonstrate the promise of optimized evidence for increasing fact-checking performance and also indicate the possibility of further progress in the end-to-end claim verification and explanation generation tasks.
Submitted 26 March, 2024; v1 submitted 25 March, 2024;
originally announced March 2024.
-
A system capable of verifiably and privately screening global DNA synthesis
Authors:
Carsten Baum,
Jens Berlips,
Walther Chen,
Hongrui Cui,
Ivan Damgard,
Jiangbin Dong,
Kevin M. Esvelt,
Mingyu Gao,
Dana Gretton,
Leonard Foner,
Martin Kysel,
Kaiyi Zhang,
Juanru Li,
Xiang Li,
Omer Paneth,
Ronald L. Rivest,
Francesca Sage-Ling,
Adi Shamir,
Yue Shen,
Meicen Sun,
Vinod Vaikuntanathan,
Lynn Van Hauwe,
Theia Vogel,
Benjamin Weinstein-Raun,
Yun Wang,
et al. (5 additional authors not shown)
Abstract:
Printing custom DNA sequences is essential to scientific and biomedical research, but the technology can be used to manufacture plagues as well as cures. Just as ink printers recognize and reject attempts to counterfeit money, DNA synthesizers and assemblers should deny unauthorized requests to make viral DNA that could be used to ignite a pandemic. There are three complications. First, we don't need to quickly update printers to deal with newly discovered currencies, whereas we regularly learn of new viruses and other biological threats. Second, anti-counterfeiting specifications on a local printer can't be extracted and misused by malicious actors, unlike information on biological threats. Finally, any screening must keep the inspected DNA sequences private, as they may constitute valuable trade secrets. Here we describe SecureDNA, a free, privacy-preserving, and fully automated system capable of verifiably screening all DNA synthesis orders of 30+ base pairs against an up-to-date database of hazards, and its operational performance and specificity when applied to 67 million base pairs of DNA synthesized by providers in the United States, Europe, and China.
Submitted 20 March, 2024;
originally announced March 2024.
-
HierCode: A Lightweight Hierarchical Codebook for Zero-shot Chinese Text Recognition
Authors:
Yuyi Zhang,
Yuanzhi Zhu,
Dezhi Peng,
Peirong Zhang,
Zhenhua Yang,
Zhibo Yang,
Cong Yao,
Lianwen Jin
Abstract:
Text recognition, especially for complex scripts like Chinese, faces unique challenges due to its intricate character structures and vast vocabulary. Traditional one-hot encoding methods struggle with the representation of hierarchical radicals, recognition of Out-Of-Vocabulary (OOV) characters, and on-device deployment due to their computational intensity. To address these challenges, we propose HierCode, a novel and lightweight codebook that exploits the innate hierarchical nature of Chinese characters. HierCode employs a multi-hot encoding strategy, leveraging hierarchical binary tree encoding and prototype learning to create distinctive, informative representations for each character. This approach not only facilitates zero-shot recognition of OOV characters by utilizing shared radicals and structures but also excels in line-level recognition tasks by computing similarity with visual features, a notable advantage over existing methods. Extensive experiments across diverse benchmarks, including handwritten, scene, document, web, and ancient text, have showcased HierCode's superiority for both conventional and zero-shot Chinese character or text recognition, exhibiting state-of-the-art performance with significantly fewer parameters and fast inference speed.
Submitted 20 March, 2024;
originally announced March 2024.
-
Aligned Yet Large Dipoles: a SMEFT Study
Authors:
Quentin Bonnefoy,
Jonathan Kley,
Di Liu,
Alejo N. Rossia,
Chang-Yuan Yao
Abstract:
We study a non-universal flavor scenario at the level of the Standard Model Effective Field Theory, according to which the matrix of Wilson coefficients $c_{uW}$ of an up-type electroweak quark dipole operator is aligned with the up-type Yukawa coupling. Such an alignment usually follows from the assumption of Minimal Flavor Violation (MFV), away from which we step by allowing the entries of $c_{uW}$ to be sizable along the first quark generations. A particular example, which we refer to as "inverse hierarchy MFV", features Wilson coefficients inversely proportional to quark masses, and arises from BSM models respecting MFV and containing heavy fields that replicate the mass hierarchy of SM quarks. We then analyze the phenomenology driven by $c_{uW}$ at colliders and at lower-energy flavor experiments. We show that precision measurements of the process $pp\rightarrow W h\rightarrow γγ\ellν$ at FCC-$hh$ could set an upper bound on $|c_{uW}|\lesssim\mathcal{O}(10^{-2})(Λ/{\rm TeV})^{2}$, with $Λ$ the cutoff of the effective field theory. This bound is an order of magnitude stronger than the existing LHC bounds. Moreover, we estimate that $W h\rightarrow b\bar b \ellν$ at HL-LHC could also give competitive bounds. In the low-energy regime, we consider bounds arising from rare kaon decays, which turn out to be loose, $|c_{uW}^{11}|<\mathcal{O}(1)(Λ/{\rm TeV})^{2}$. We finally demonstrate that our flavor and operator assumptions can be derived from a weakly-coupled UV model, which we choose to simultaneously illustrate the UV origin of inverse hierarchy MFV.
Submitted 19 March, 2024;
originally announced March 2024.
-
SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion
Authors:
Vikram Voleti,
Chun-Han Yao,
Mark Boss,
Adam Letts,
David Pankratz,
Dmitry Tochilkin,
Christian Laforte,
Robin Rombach,
Varun Jampani
Abstract:
We present Stable Video 3D (SV3D) -- a latent video diffusion model for high-resolution, image-to-multi-view generation of orbital videos around a 3D object. Recent work on 3D generation proposes techniques to adapt 2D generative models for novel view synthesis (NVS) and 3D optimization. However, these methods have several disadvantages due to either limited views or inconsistent NVS, thereby affecting the performance of 3D object generation. In this work, we propose SV3D, which adapts an image-to-video diffusion model for novel multi-view synthesis and 3D generation, thereby leveraging the generalization and multi-view consistency of the video models, while further adding explicit camera control for NVS. We also propose improved 3D optimization techniques to use SV3D and its NVS outputs for image-to-3D generation. Extensive experimental results on multiple datasets with 2D and 3D metrics, as well as a user study, demonstrate SV3D's state-of-the-art performance on NVS as well as 3D reconstruction compared to prior works.
Submitted 18 March, 2024;
originally announced March 2024.
-
Lion: Minimizing Distributed Transactions through Adaptive Replica Provision (Extended Version)
Authors:
Qiushi Zheng,
Zhanhao Zhao,
Wei Lu,
Chang Yao,
Yuxing Chen,
Anqun Pan,
Xiaoyong Du
Abstract:
Distributed transaction processing often involves multiple rounds of cross-node communications, and therefore tends to be slow. To improve performance, existing approaches convert distributed transactions into single-node transactions by either migrating co-accessed partitions onto the same nodes or establishing a super node housing replicas of the entire database. However, migration-based methods might cause transactions to be blocked due to waiting for data migration, while the super node can become a bottleneck. In this paper, we present Lion, a novel transaction processing protocol that utilizes partition-based replication to reduce the occurrence of distributed transactions. Lion aims to assign a node with one replica from each partition involved in a given transaction's read or write operations. To ensure such a node is available, we propose an adaptive replica provision mechanism, enhanced with an LSTM-based workload prediction algorithm, to determine the appropriate node for locating replicas of co-accessed partitions. The adaptation of replica placement is conducted preemptively and asynchronously, thereby minimizing its impact on performance. By employing this adaptive replica placement strategy, we ensure that the majority of transactions can be efficiently processed on a single node without additional overhead. Only a small fraction of transactions will need to be treated as regular distributed transactions when such a node is unavailable. Consequently, Lion effectively minimizes distributed transactions while avoiding any disruption caused by data migration or the creation of a super node. We conduct extensive experiments to compare Lion against various transaction processing protocols. The results show that Lion achieves up to 2.7x higher throughput and 76.4% better scalability against these state-of-the-art approaches.
Submitted 17 March, 2024;
originally announced March 2024.
-
ANIM: Accurate Neural Implicit Model for Human Reconstruction from a single RGB-D image
Authors:
Marco Pesavento,
Yuanlu Xu,
Nikolaos Sarafianos,
Robert Maier,
Ziyan Wang,
Chun-Han Yao,
Marco Volino,
Edmond Boyer,
Adrian Hilton,
Tony Tung
Abstract:
Recent progress in human shape learning shows that neural implicit models are effective in generating 3D human surfaces from a limited number of views, and even from a single RGB image. However, existing monocular approaches still struggle to recover fine geometric details such as face, hands or cloth wrinkles. They are also prone to depth ambiguities that result in distorted geometries along the camera optical axis. In this paper, we explore the benefits of incorporating depth observations in the reconstruction process by introducing ANIM, a novel method that reconstructs arbitrary 3D human shapes from single-view RGB-D images with an unprecedented level of accuracy. Our model learns geometric details from both multi-resolution pixel-aligned and voxel-aligned features to leverage depth information and enable spatial relationships, mitigating depth ambiguities. We further enhance the quality of the reconstructed shape by introducing a depth-supervision strategy, which improves the accuracy of the signed distance field estimation of points that lie on the reconstructed surface. Experiments demonstrate that ANIM outperforms state-of-the-art works that use RGB, surface normals, point cloud or RGB-D data as input. In addition, we introduce ANIM-Real, a new multi-modal dataset comprising high-quality scans paired with consumer-grade RGB-D camera, and our protocol to fine-tune ANIM, enabling high-quality reconstruction from real-world human capture.
Submitted 18 March, 2024; v1 submitted 15 March, 2024;
originally announced March 2024.
-
Benchmarking reconstructive spectrometer with multi-resonant cavities
Authors:
Chunhui Yao,
Kangning Xu,
Tianhua Lin,
Jie Ma,
Chumeng Yao,
Peng Bao,
Zhitian Shi,
Richard Penty,
Qixiang Cheng
Abstract:
Recent years have seen the rapid development of miniaturized reconstructive spectrometers (RSs), yet they still confront a range of technical challenges, such as bandwidth/resolution ratio, sensing speed, and/or power efficiency. Reported RS designs often suffer from insufficient decorrelation between sampling channels, which results in limited compressive sampling efficiency, in essence, due to inadequate engineering of sampling responses. This in turn leads to poor spectral-pixel-to-channel ratios (SPCRs), typically restricted to single digits. So far, a general guideline for manipulating RS sampling responses for effective spectral information acquisition has been lacking. In this study, we shed light on a fundamental parameter from the compressive sensing theory - the average mutual correlation coefficient v - and provide insight into how it serves as a critical benchmark in RS design with regard to the SPCR and reconstruction accuracy. To this end, we propose a novel RS design with multi-resonant cavities, consisting of a series of partial reflective interfaces. Such multi-cavity configuration offers an expansive parameter space, facilitating the superlative optimization of sampling matrices with minimized v. As a proof-of-concept demonstration, a single-shot, dual-band RS is implemented on a SiN platform, tailored for capturing signature spectral shapes across different wavelength regions, with customized photonic crystal nanobeam mirrors. Experimentally, the device demonstrates an overall operation bandwidth of 270 nm and a <0.5 nm resolution with only 15 sampling channels per band, leading to a record high SPCR of 18.0. Moreover, the proposed multi-cavity design can be readily adapted to various photonic platforms. For instance, we showcase that by employing multi-layer coatings, an ultra-broadband RS can be optimized to exhibit a 700 nm bandwidth with an SPCR of over 100.
Submitted 1 March, 2024;
originally announced March 2024.
-
Two-scale Neural Networks for Partial Differential Equations with Small Parameters
Authors:
Qiao Zhuang,
Chris Ziyi Yao,
Zhongqiang Zhang,
George Em Karniadakis
Abstract:
We propose a two-scale neural network method for solving partial differential equations (PDEs) with small parameters using physics-informed neural networks (PINNs). We directly incorporate the small parameters into the architecture of neural networks. The proposed method enables solving PDEs with small parameters in a simple fashion, without adding Fourier features or other computationally taxing searches of truncation parameters. Various numerical examples demonstrate reasonable accuracy in capturing features of large derivatives in the solutions caused by small parameters.
Submitted 13 August, 2024; v1 submitted 27 February, 2024;
originally announced February 2024.
-
Improved Regret for Bandit Convex Optimization with Delayed Feedback
Authors:
Yuanyu Wan,
Chang Yao,
Mingli Song,
Lijun Zhang
Abstract:
We investigate bandit convex optimization (BCO) with delayed feedback, where only the loss value of the action is revealed under an arbitrary delay. Let $n,T,\bar{d}$ denote the dimensionality, time horizon, and average delay, respectively. Previous studies have achieved an $O(\sqrt{n}T^{3/4}+(n\bar{d})^{1/3}T^{2/3})$ regret bound for this problem, whose delay-independent part matches the regret of the classical non-delayed bandit gradient descent algorithm. However, there is a large gap between its delay-dependent part, i.e., $O((n\bar{d})^{1/3}T^{2/3})$, and an existing $Ω(\sqrt{\bar{d}T})$ lower bound. In this paper, we illustrate that this gap can be filled in the worst case, where $\bar{d}$ is very close to the maximum delay $d$. Specifically, we first develop a novel algorithm, and prove that it enjoys a regret bound of $O(\sqrt{n}T^{3/4}+\sqrt{dT})$ in general. Compared with the previous result, our regret bound is better for $d=O((n\bar{d})^{2/3}T^{1/3})$, and the delay-dependent part is tight in the worst case. The primary idea is to decouple the joint effect of the delays and the bandit feedback on the regret by carefully incorporating the delayed bandit feedback with a blocking update mechanism. Furthermore, we show that the proposed algorithm can improve the regret bound to $O((nT)^{2/3}\log^{1/3}T+d\log T)$ for strongly convex functions. Finally, if the action sets are unconstrained, we demonstrate that it can be simply extended to achieve an $O(n\sqrt{T\log T}+d\log T)$ regret bound for strongly convex and smooth functions.
Submitted 23 June, 2024; v1 submitted 14 February, 2024;
originally announced February 2024.
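The comparison of delay-dependent terms in the abstract above can be checked numerically. The sketch below is only an illustration, not the authors' code: the function names are hypothetical, and all constants hidden by the $O(\cdot)$ notation are assumed to be 1. Squaring $\sqrt{dT} \le (n\bar{d})^{1/3}T^{2/3}$ gives the stated condition $d \le (n\bar{d})^{2/3}T^{1/3}$.

```python
import math


def previous_delay_term(n, d_bar, T):
    """Delay-dependent part of the earlier bound: (n * d_bar)^(1/3) * T^(2/3)."""
    return (n * d_bar) ** (1 / 3) * T ** (2 / 3)


def new_delay_term(d, T):
    """Delay-dependent part of the new bound: sqrt(d * T)."""
    return math.sqrt(d * T)


def crossover_delay(n, d_bar, T):
    """Maximum delay d at which the two terms coincide (constants assumed to be 1)."""
    return (n * d_bar) ** (2 / 3) * T ** (1 / 3)
```

For maximum delays below `crossover_delay(n, d_bar, T)` the new term is the smaller of the two; above it, the earlier term is smaller under these simplifying assumptions.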
-
Autonomous Data Selection with Language Models for Mathematical Texts
Authors:
Yifan Zhang,
Yifan Luo,
Yang Yuan,
Andrew Chi-Chih Yao
Abstract:
To improve language models' proficiency in mathematical reasoning via continual pretraining, we introduce a novel strategy that leverages base language models for autonomous data selection. Departing from conventional supervised fine-tuning or trained classifiers with human-annotated data, our approach Autonomous Data Selection (AutoDS) utilizes meta-prompted language models as zero-shot verifiers to evaluate and select high-quality mathematical content autonomously. To demonstrate the efficacy of our method, we continuously pretrained a 7B-parameter language model on our curated dataset, achieving substantial improvements in downstream performance on the MATH, GSM8K, and BIG-Bench Hard (BBH) tasks with a token amount reduced by orders of magnitude compared to previous continual pretraining works. Our method showcases a 2 times increase in pretraining token efficiency compared to state-of-the-art baselines, underscoring the potential of our approach in enhancing models' mathematical reasoning capabilities. The AutoMathText dataset is available at https://huggingface.co/datasets/math-ai/AutoMathText. The code is available at https://github.com/yifanzhang-pro/AutoMathText.
Submitted 2 April, 2024; v1 submitted 12 February, 2024;
originally announced February 2024.
-
A Survey of a Random Matrix Model for a Family of Cusp Forms
Authors:
Owen Barrett,
Zoë X. Batterman,
Aditya Jambhale,
Steven J. Miller,
Akash L. Narayanan,
Kishan Sharma,
Chris Yao
Abstract:
The Katz-Sarnak philosophy states that statistics of zeros of $L$-function families near the central point as the conductors tend to infinity agree with those of eigenvalues of random matrix ensembles as the matrix size tends to infinity. While numerous results support this conjecture, S. J. Miller observed that for finite conductors, very different behavior can occur for zeros near the central point in elliptic curve families. This led to the excised model of Dueñez, Huynh, Keating, Miller, and Snaith, whose predictions for quadratic twists of a given elliptic curve are beautifully fit by the data. The key ingredients are relating the discretization of central values of the $L$-functions to excising matrices based on the value of the characteristic polynomials at 1 and using lower order terms (in statistics such as the one-level density and pair-correlation) to adjust the matrix size. We discuss recent successes by the authors in extending this model to a family of quadratic twists of finite conductor of a given holomorphic cuspidal newform of odd prime level. In particular, we predict very little repulsion for forms with weight greater than 2.
Submitted 17 April, 2024; v1 submitted 28 January, 2024;
originally announced February 2024.
-
Optically-Trapped Nanodiamond-Relaxometry Detection of Nanomolar Paramagnetic Spins in Aqueous Environments
Authors:
Shiva Iyer,
Changyu Yao,
Olivia Lazorik,
Pengyun Wang,
Gianna Glenn,
Michael Mohs,
Yinyao Shi,
Michael Mansour,
Erik Henriksen,
Kater Murch,
Shankar Mukherji,
Chong Zu
Abstract:
Probing electrical and magnetic properties in aqueous environments remains a frontier challenge in nanoscale sensing. Our inability to do so with quantitative accuracy imposes severe limitations, for example, on our understanding of the ionic environments in a diverse array of systems, ranging from novel materials to the living cell. The Nitrogen-Vacancy (NV) center in fluorescent nanodiamonds (FNDs) has emerged as a good candidate to sense temperature, pH, and the concentration of paramagnetic species at the nanoscale, but comes with several hurdles, such as particle-to-particle variation, which renders calibrated measurements difficult, and the challenge of tightly confining and precisely positioning sensors in aqueous environments. To address this, we demonstrate relaxometry with NV centers within optically-trapped FNDs. In a proof of principle experiment, we show that optically-trapped FNDs enable highly reproducible nanomolar sensitivity to the paramagnetic ion $\mathrm{Gd}^{3+}$. We capture the three distinct phases of our experimental data by devising a model analogous to nanoscale Langmuir adsorption combined with spin coherence dynamics. Our work provides a basis for routes to sense free paramagnetic ions and molecules in biologically relevant conditions.
Submitted 20 February, 2024; v1 submitted 30 January, 2024;
originally announced January 2024.
-
Limiting Behavior in Missing Sums of Sumsets
Authors:
Aditya Jambhale,
Rauan Kaldybayev,
Steven J. Miller,
Chris Yao
Abstract:
We study $|A + A|$ as a random variable, where $A \subseteq \{0, \dots, N\}$ is a random subset such that each $0 \le n \le N$ is included with probability $0 < p < 1$, and where $A + A$ is the set of sums $a + b$ for $a,b$ in $A$. Lazarev, Miller, and O'Bryant studied the distribution of $2N + 1 - |A + A|$, the number of summands not represented in $A + A$ when $p = 1/2$. A recent paper by Chu, King, Luntzlara, Martinez, Miller, Shao, Sun, and Xu generalizes this to all $p\in (0,1)$, calculating the first and second moments of the number of missing summands and establishing exponential upper and lower bounds on the probability of missing exactly $n$ summands, mostly working in the limit of large $N$. We provide exponential bounds on the probability of missing at least $n$ summands, find another expression for the second moment of the number of missing summands, extract its leading-order behavior in the limit of small $p$, and show that the variance grows asymptotically slower than the mean, proving that for small $p$, the number of missing summands is very likely to be near its expected value.
Submitted 1 February, 2024; v1 submitted 30 January, 2024;
originally announced January 2024.
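The random variable studied in the abstract above, $2N + 1 - |A + A|$, is easy to sample. The following is a minimal Monte Carlo sketch, not the authors' method: the function names, the trial count, and the seed are hypothetical choices made here for illustration, and the estimates it returns are empirical, not the closed-form moments derived in the paper.

```python
import random


def missing_sums(n_max, p, rng):
    """Sample A from {0, ..., n_max}, keeping each element independently
    with probability p, and return 2 * n_max + 1 - |A + A|."""
    a = [n for n in range(n_max + 1) if rng.random() < p]
    # A + A is the set of all sums a + b with a, b in A (a == b allowed).
    sums = {x + y for x in a for y in a}
    return 2 * n_max + 1 - len(sums)


def estimate_moments(n_max, p, trials, seed=0):
    """Empirical mean and variance of the number of missing sums."""
    rng = random.Random(seed)
    samples = [missing_sums(n_max, p, rng) for _ in range(trials)]
    mean = sum(samples) / trials
    var = sum((s - mean) ** 2 for s in samples) / trials
    return mean, var
```

Calling `estimate_moments(n_max, p, trials)` for decreasing `p` gives a rough way to compare how the empirical variance and mean behave, in the spirit of the small-$p$ regime discussed in the abstract.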
-
Augmenting Math Word Problems via Iterative Question Composing
Authors:
Haoxiong Liu,
Yifan Zhang,
Yifan Luo,
Andrew Chi-Chih Yao
Abstract:
Despite the advancements in large language models (LLMs) for mathematical reasoning, solving competition-level math problems remains a significant challenge, especially for open-source LLMs without external tools. We introduce the MMIQC dataset, comprising a mixture of processed web data and synthetic question-response pairs, aimed at enhancing the mathematical reasoning capabilities of base language models. Models fine-tuned on MMIQC consistently surpass their counterparts in performance on the MATH benchmark across various model sizes. Notably, Qwen-72B-MMIQC achieves a 45.0% accuracy, exceeding the previous open-source state-of-the-art by 8.2% and outperforming the initial version GPT-4 released in 2023. Extensive evaluation results on Hungarian high school finals suggest that such improvement can generalize to unseen data. Our ablation study on MMIQC reveals that a large part of the improvement can be attributed to our novel augmentation method, Iterative Question Composing (IQC), which involves iteratively composing new questions from seed problems using an LLM and applying rejection sampling through another LLM. The MMIQC dataset is available on the HuggingFace hub at https://huggingface.co/datasets/Vivacem/MMIQC. Our code is available at https://github.com/iiis-ai/IterativeQuestionComposing.
Submitted 10 February, 2024; v1 submitted 17 January, 2024;
originally announced January 2024.
-
Subsonic Euler flows in a three-dimensional finitely long cylinder with arbitrary cross section
Authors:
Shangkun Weng,
Changkui Yao
Abstract:
This paper concerns the well-posedness of subsonic flows in a three-dimensional finitely long cylinder with arbitrary cross section. We establish the existence and uniqueness of subsonic flows in the Sobolev space by prescribing the normal component of the momentum, the vorticity, the entropy, the Bernoulli's quantity at the entrance and the normal component of the momentum at the exit. One of the key points in the analysis is to utilize the deformation-curl decomposition for the steady Euler system introduced in \cite{WX19} to deal with the hyperbolic and elliptic modes. Another one is to employ the separation of variables to improve the regularity of solutions to a deformation-curl system near the intersection between the entrance and exit with the cylinder wall.
Submitted 13 January, 2024;
originally announced January 2024.
-
MatSAM: Efficient Extraction of Microstructures of Materials via Visual Large Model
Authors:
Changtai Li,
Xu Han,
Chao Yao,
Xiaojuan Ban
Abstract:
Efficient and accurate extraction of microstructures in micrographs of materials is essential in process optimization and the exploration of structure-property relationships. Deep learning-based image segmentation techniques that rely on manual annotation are laborious and time-consuming and hardly meet the demand for model transferability and generalization on various source images. Segment Anything Model (SAM), a large visual model with powerful deep feature representation and zero-shot generalization capabilities, has provided new solutions for image segmentation. In this paper, we propose MatSAM, a general and efficient microstructure extraction solution based on SAM. A simple yet effective point-based prompt generation strategy is designed, grounded on the distribution and shape of microstructures. Specifically, in an unsupervised and training-free way, it adaptively generates prompt points for different microscopy images, fuses the centroid points of the coarsely extracted region of interest (ROI) and native grid points, and integrates corresponding post-processing operations for quantitative characterization of microstructures of materials. For common microstructures including grain boundary and multiple phases, MatSAM achieves superior zero-shot segmentation performance to conventional rule-based methods and is even preferable to supervised learning methods evaluated on 16 microscopy datasets whose micrographs are imaged by the optical microscope (OM) and scanning electron microscope (SEM). Especially, on 4 public datasets, MatSAM shows unexpected competitive segmentation performance against their specialist models. We believe that, without the need for human labeling, MatSAM can significantly reduce the cost of quantitative characterization and statistical analysis of extensive microstructures of materials, and thus accelerate the design of new materials.
Submitted 2 March, 2024; v1 submitted 10 January, 2024;
originally announced January 2024.
-
Spatial-Related Sensors Matters: 3D Human Motion Reconstruction Assisted with Textual Semantics
Authors:
Xueyuan Yang,
Chao Yao,
Xiaojuan Ban
Abstract:
Leveraging wearable devices for motion reconstruction has emerged as an economical and viable technique. Certain methodologies employ sparse Inertial Measurement Units (IMUs) on the human body and harness data-driven strategies to model human poses. However, the reconstruction of motion based solely on sparse IMUs data is inherently fraught with ambiguity, a consequence of numerous identical IMU readings corresponding to different poses. In this paper, we explore the spatial importance of multiple sensors, supervised by text that describes specific actions. Specifically, uncertainty is introduced to derive weighted features for each IMU. We also design a Hierarchical Temporal Transformer (HTT) and apply contrastive learning to achieve precise temporal and feature alignment of sensor data with textual semantics. Experimental results demonstrate our proposed approach achieves significant improvements in multiple metrics compared to existing methods. Notably, with textual supervision, our method not only differentiates between ambiguous actions such as sitting and standing but also produces more precise and natural motion.
Submitted 26 December, 2023;
originally announced January 2024.
-
LORE++: Logical Location Regression Network for Table Structure Recognition with Pre-training
Authors:
Rujiao Long,
Hangdi Xing,
Zhibo Yang,
Qi Zheng,
Zhi Yu,
Cong Yao,
Fei Huang
Abstract:
Table structure recognition (TSR) aims at extracting tables in images into machine-understandable formats. Recent methods solve this problem by predicting the adjacency relations of detected cell boxes or learning to directly generate the corresponding markup sequences from the table images. However, existing approaches either count on additional heuristic rules to recover the table structures, or face challenges in capturing long-range dependencies within tables, resulting in increased complexity. In this paper, we propose an alternative paradigm. We model TSR as a logical location regression problem and propose a new TSR framework called LORE, standing for LOgical location REgression network, which for the first time regresses logical location as well as spatial location of table cells in a unified network. Our proposed LORE is conceptually simpler, easier to train, and more accurate than other paradigms of TSR. Moreover, inspired by the persuasive success of pre-trained models on a number of computer vision and natural language processing tasks, we propose two pre-training tasks to enrich the spatial and logical representations at the feature level of LORE, resulting in an upgraded version called LORE++. The incorporation of pre-training in LORE++ has proven to enjoy significant advantages, leading to a substantial enhancement in terms of accuracy, generalization, and few-shot capability compared to its predecessor. Experiments on standard benchmarks against methods of previous paradigms demonstrate the superiority of LORE++, which highlights the potential and promising prospect of the logical location regression paradigm for TSR.
Submitted 2 January, 2024;
originally announced January 2024.
-
FontDiffuser: One-Shot Font Generation via Denoising Diffusion with Multi-Scale Content Aggregation and Style Contrastive Learning
Authors:
Zhenhua Yang,
Dezhi Peng,
Yuxin Kong,
Yuyi Zhang,
Cong Yao,
Lianwen Jin
Abstract:
Automatic font generation is an imitation task, which aims to create a font library that mimics the style of reference images while preserving the content from source images. Although existing font generation methods have achieved satisfactory performance, they still struggle with complex characters and large style variations. To address these issues, we propose FontDiffuser, a diffusion-based image-to-image one-shot font generation method, which innovatively models the font imitation task as a noise-to-denoise paradigm. In our method, we introduce a Multi-scale Content Aggregation (MCA) block, which effectively combines global and local content cues across different scales, leading to enhanced preservation of intricate strokes of complex characters. Moreover, to better manage the large variations in style transfer, we propose a Style Contrastive Refinement (SCR) module, which is a novel structure for style representation learning. It utilizes a style extractor to disentangle styles from images, subsequently supervising the diffusion model via a meticulously designed style contrastive loss. Extensive experiments demonstrate FontDiffuser's state-of-the-art performance in generating diverse characters and styles. It consistently excels on complex characters and large style changes compared to previous methods. The code is available at https://github.com/yeungchenwa/FontDiffuser.
Submitted 19 December, 2023;
originally announced December 2023.
-
Rethinking Causal Relationships Learning in Graph Neural Networks
Authors:
Hang Gao,
Chengyu Yao,
Jiangmeng Li,
Lingyu Si,
Yifan Jin,
Fengge Wu,
Changwen Zheng,
Huaping Liu
Abstract:
Graph Neural Networks (GNNs) demonstrate their significance by effectively modeling complex interrelationships within graph-structured data. To enhance the credibility and robustness of GNNs, it becomes exceptionally crucial to bolster their ability to capture causal relationships. However, despite recent advancements that have indeed strengthened GNNs from a causal learning perspective, conducting an in-depth analysis specifically targeting the causal modeling prowess of GNNs remains an unresolved issue. In order to comprehensively analyze various GNN models from a causal learning perspective, we constructed an artificially synthesized dataset with known and controllable causal relationships between data and labels. The rationality of the generated data is further ensured through theoretical foundations. Drawing insights from analyses conducted using our dataset, we introduce a lightweight and highly adaptable GNN module designed to strengthen GNNs' causal learning capabilities across a diverse range of tasks. Through a series of experiments conducted on both synthetic datasets and other real-world datasets, we empirically validate the effectiveness of the proposed module.
Submitted 15 December, 2023;
originally announced December 2023.
-
Semantic Lens: Instance-Centric Semantic Alignment for Video Super-Resolution
Authors:
Qi Tang,
Yao Zhao,
Meiqin Liu,
Jian Jin,
Chao Yao
Abstract:
As a critical clue of video super-resolution (VSR), inter-frame alignment significantly impacts overall performance. However, accurate pixel-level alignment is a challenging task due to the intricate motion interweaving in the video. In response to this issue, we introduce a novel paradigm for VSR named Semantic Lens, predicated on semantic priors drawn from degraded videos. Specifically, video is modeled as instances, events, and scenes via a Semantic Extractor. Those semantics assist the Pixel Enhancer in understanding the recovered contents and generating more realistic visual results. The distilled global semantics embody the scene information of each frame, while the instance-specific semantics assemble the spatial-temporal contexts related to each instance. Furthermore, we devise a Semantics-Powered Attention Cross-Embedding (SPACE) block to bridge the pixel-level features with semantic knowledge, composed of a Global Perspective Shifter (GPS) and an Instance-Specific Semantic Embedding Encoder (ISEE). Concretely, the GPS module generates pairs of affine transformation parameters for pixel-level feature modulation conditioned on global semantics. After that, the ISEE module harnesses the attention mechanism to align the adjacent frames in the instance-centric semantic space. In addition, we incorporate a simple yet effective pre-alignment module to alleviate the difficulty of model training. Extensive experiments demonstrate the superiority of our model over existing state-of-the-art VSR methods.
Submitted 19 January, 2024; v1 submitted 12 December, 2023;
originally announced December 2023.