-
DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search
Authors:
Huajian Xin,
Z. Z. Ren,
Junxiao Song,
Zhihong Shao,
Wanjia Zhao,
Haocheng Wang,
Bo Liu,
Liyue Zhang,
Xuan Lu,
Qiushi Du,
Wenjun Gao,
Qihao Zhu,
Dejian Yang,
Zhibin Gou,
Z. F. Wu,
Fuli Luo,
Chong Ruan
Abstract:
We introduce DeepSeek-Prover-V1.5, an open-source language model designed for theorem proving in Lean 4, which enhances DeepSeek-Prover-V1 by optimizing both training and inference processes. Pre-trained on DeepSeekMath-Base with specialization in formal mathematical languages, the model undergoes supervised fine-tuning using an enhanced formal theorem proving dataset derived from DeepSeek-Prover-V1. Further refinement is achieved through reinforcement learning from proof assistant feedback (RLPAF). Beyond the single-pass whole-proof generation approach of DeepSeek-Prover-V1, we propose RMaxTS, a variant of Monte-Carlo tree search that employs an intrinsic-reward-driven exploration strategy to generate diverse proof paths. DeepSeek-Prover-V1.5 demonstrates significant improvements over DeepSeek-Prover-V1, achieving new state-of-the-art results on the test set of the high school level miniF2F benchmark ($63.5\%$) and the undergraduate level ProofNet benchmark ($25.3\%$).
Submitted 15 August, 2024;
originally announced August 2024.
-
Are Large Language Models a Good Replacement of Taxonomies?
Authors:
Yushi Sun,
Hao Xin,
Kai Sun,
Yifan Ethan Xu,
Xiao Yang,
Xin Luna Dong,
Nan Tang,
Lei Chen
Abstract:
Large language models (LLMs) demonstrate an impressive ability to internalize knowledge and answer natural language questions. Although previous studies validate that LLMs perform well on general knowledge while presenting poor performance on long-tail nuanced knowledge, the community is still doubtful about whether the traditional knowledge graphs should be replaced by LLMs. In this paper, we ask if the schema of knowledge graph (i.e., taxonomy) is made obsolete by LLMs. Intuitively, LLMs should perform well on common taxonomies and at taxonomy levels that are common to people. Unfortunately, there lacks a comprehensive benchmark that evaluates the LLMs over a wide range of taxonomies from common to specialized domains and at levels from root to leaf so that we can draw a confident conclusion. To narrow the research gap, we constructed a novel taxonomy hierarchical structure discovery benchmark named TaxoGlimpse to evaluate the performance of LLMs over taxonomies. TaxoGlimpse covers ten representative taxonomies from common to specialized domains with in-depth experiments of different levels of entities in this taxonomy from root to leaf. Our comprehensive experiments of eighteen state-of-the-art LLMs under three prompting settings validate that LLMs can still not well capture the knowledge of specialized taxonomies and leaf-level entities.
Submitted 20 June, 2024; v1 submitted 16 June, 2024;
originally announced June 2024.
-
"I see it as a wellspring for my positive and upward journey in life.": Understanding Current Practices of Assistive Technology's Customized Modification in China
Authors:
Kexin Yang,
Junyi Wu,
Haokun Xin,
Jiangtao Gong
Abstract:
Due to the significant differences in physical conditions and living environments of people with disabilities, standardized assistive technologies (ATs) often fail to meet their needs. Modified AT, especially DIY (Do It Yourself) ATs, are a popular solution in many high-income countries, but there is a lack of documentation for low- and middle-income areas, especially in China, where the culture of philanthropy is undeveloped. To understand the current situation in this paper, we conducted semi-structured interviews with 10 individuals with disabilities using modified ATs and 10 individuals involved in providing these, including family members, standard assistive device manufacturers, and individuals employed for their modification skills, etc. Based on the results of the thematic analysis, we have summarized the general process of modified ATs for people with disabilities in China and the benefits these devices bring. We found that modified ATs not only make the lives of people with disabilities more comfortable and convenient but also bring them confidence, reduce social pressure, and even help them achieve self-realization. Additionally, we summarized the challenges they encountered before, during, and after the modification, including awareness gaps, family resistance, a lack of a business model, and so on. Specifically, we conducted a special case study about the typical business models and challenges currently faced by AT modification organizations in China. Our research provides important design foundations and research insights for the future of universal and personalized production of AT.
Submitted 13 June, 2024;
originally announced June 2024.
-
CRAG -- Comprehensive RAG Benchmark
Authors:
Xiao Yang,
Kai Sun,
Hao Xin,
Yushi Sun,
Nikita Bhalla,
Xiangsen Chen,
Sajal Choudhary,
Rongze Daniel Gui,
Ziran Will Jiang,
Ziyu Jiang,
Lingkun Kong,
Brian Moran,
Jiaqi Wang,
Yifan Ethan Xu,
An Yan,
Chenyu Yang,
Eting Yuan,
Hanwen Zha,
Nan Tang,
Lei Chen,
Nicolas Scheffer,
Yue Liu,
Nirav Shah,
Rakesh Wanga,
Anuj Kumar
, et al. (2 additional authors not shown)
Abstract:
Retrieval-Augmented Generation (RAG) has recently emerged as a promising solution to alleviate the lack of knowledge in Large Language Models (LLMs). Existing RAG datasets, however, do not adequately represent the diverse and dynamic nature of real-world Question Answering (QA) tasks. To bridge this gap, we introduce the Comprehensive RAG Benchmark (CRAG), a factual question answering benchmark of 4,409 question-answer pairs and mock APIs to simulate web and Knowledge Graph (KG) search. CRAG is designed to encapsulate a diverse array of questions across five domains and eight question categories, reflecting varied entity popularity from popular to long-tail, and temporal dynamisms ranging from years to seconds. Our evaluation on this benchmark highlights the gap to fully trustworthy QA. Whereas most advanced LLMs achieve <=34% accuracy on CRAG, adding RAG in a straightforward manner improves the accuracy only to 44%. State-of-the-art industry RAG solutions answer only 63% of questions without any hallucination. CRAG also reveals much lower accuracy in answering questions regarding facts with higher dynamism, lower popularity, or higher complexity, suggesting future research directions. The CRAG benchmark laid the groundwork for a KDD Cup 2024 challenge, attracting thousands of participants and submissions within the first 50 days of the competition. We commit to maintaining CRAG to serve research communities in advancing RAG solutions and general QA solutions.
Submitted 7 June, 2024;
originally announced June 2024.
-
KGLink: A column type annotation method that combines knowledge graph and pre-trained language model
Authors:
Yubo Wang,
Hao Xin,
Lei Chen
Abstract:
The semantic annotation of tabular data plays a crucial role in various downstream tasks. Previous research has proposed knowledge graph (KG)-based and deep learning-based methods, each with its inherent limitations. KG-based methods encounter difficulties annotating columns when there is no match for column cells in the KG. Moreover, KG-based methods can provide multiple predictions for one column, making it challenging to determine the semantic type with the most suitable granularity for the dataset. This type granularity issue limits their scalability.
On the other hand, deep learning-based methods face challenges related to the valuable context missing issue. This occurs when the information within the table is insufficient for determining the correct column type.
This paper presents KGLink, a method that combines WikiData KG information with a pre-trained deep learning language model for table column annotation, effectively addressing both type granularity and valuable context missing issues. Through comprehensive experiments on widely used tabular datasets encompassing numeric and string columns with varying type granularity, we showcase the effectiveness and efficiency of KGLink. By leveraging the strengths of KGLink, we successfully surmount challenges related to type granularity and valuable context issues, establishing it as a robust solution for the semantic annotation of tabular data.
Submitted 1 June, 2024;
originally announced June 2024.
-
Proving Theorems Recursively
Authors:
Haiming Wang,
Huajian Xin,
Zhengying Liu,
Wenda Li,
Yinya Huang,
Jianqiao Lu,
Zhicheng Yang,
Jing Tang,
Jian Yin,
Zhenguo Li,
Xiaodan Liang
Abstract:
Recent advances in automated theorem proving leverage language models to explore expanded search spaces by step-by-step proof generation. However, such approaches are usually based on short-sighted heuristics (e.g., log probability or value function scores) that potentially lead to suboptimal or even distracting subgoals, preventing us from finding longer proofs. To address this challenge, we propose POETRY (PrOvE Theorems RecursivelY), which proves theorems in a recursive, level-by-level manner in the Isabelle theorem prover. Unlike previous step-by-step methods, POETRY searches for a verifiable sketch of the proof at each level and focuses on solving the current level's theorem or conjecture. Detailed proofs of intermediate conjectures within the sketch are temporarily replaced by a placeholder tactic called sorry, deferring their proofs to subsequent levels. This approach allows the theorem to be tackled incrementally by outlining the overall theorem at the first level and then solving the intermediate conjectures at deeper levels. Experiments are conducted on the miniF2F and PISA datasets and significant performance gains are observed in our POETRY approach over state-of-the-art methods. POETRY on miniF2F achieves an average proving success rate improvement of 5.1%. Moreover, we observe a substantial increase in the maximum proof length found by POETRY, from 10 to 26.
Submitted 23 May, 2024;
originally announced May 2024.
-
DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data
Authors:
Huajian Xin,
Daya Guo,
Zhihong Shao,
Zhizhou Ren,
Qihao Zhu,
Bo Liu,
Chong Ruan,
Wenda Li,
Xiaodan Liang
Abstract:
Proof assistants like Lean have revolutionized mathematical proof verification, ensuring high accuracy and reliability. Although large language models (LLMs) show promise in mathematical reasoning, their advancement in formal theorem proving is hindered by a lack of training data. To address this issue, we introduce an approach to generate extensive Lean 4 proof data derived from high-school and undergraduate-level mathematical competition problems. This approach involves translating natural language problems into formal statements, filtering out low-quality statements, and generating proofs to create synthetic data. After fine-tuning the DeepSeekMath 7B model on this synthetic dataset, which comprises 8 million formal statements with proofs, our model achieved whole-proof generation accuracies of 46.3% with 64 samples and 52% cumulatively on the Lean 4 miniF2F test, surpassing the baseline GPT-4 at 23.0% with 64 samples and a tree search reinforcement learning method at 41.0%. Additionally, our model successfully proved 5 out of 148 problems in the Lean 4 Formalized International Mathematical Olympiad (FIMO) benchmark, while GPT-4 failed to prove any. These results demonstrate the potential of leveraging large-scale synthetic data to enhance theorem-proving capabilities in LLMs. Both the synthetic dataset and the model will be made available to facilitate further research in this promising field.
Submitted 23 May, 2024;
originally announced May 2024.
-
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Authors:
DeepSeek-AI,
Aixin Liu,
Bei Feng,
Bin Wang,
Bingxuan Wang,
Bo Liu,
Chenggang Zhao,
Chengqi Dengr,
Chong Ruan,
Damai Dai,
Daya Guo,
Dejian Yang,
Deli Chen,
Dongjie Ji,
Erhang Li,
Fangyun Lin,
Fuli Luo,
Guangbo Hao,
Guanting Chen,
Guowei Li,
H. Zhang,
Hanwei Xu,
Hao Yang,
Haowei Zhang,
Honghui Ding
, et al. (132 additional authors not shown)
Abstract:
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.
Submitted 19 June, 2024; v1 submitted 7 May, 2024;
originally announced May 2024.
-
LitSim: A Conflict-aware Policy for Long-term Interactive Traffic Simulation
Authors:
Haojie Xin,
Xiaodong Zhang,
Renzhi Tang,
Songyang Yan,
Qianrui Zhao,
Chunze Yang,
Wen Cui,
Zijiang Yang
Abstract:
Simulation is pivotal in evaluating the performance of autonomous driving systems due to the advantages of high efficiency and low cost compared to on-road testing. Bridging the gap between simulation and the real world requires realistic agent behaviors. However, the existing works have the following shortcomings in achieving this goal: (1) log replay offers realistic scenarios but often leads to collisions due to the absence of dynamic interactions, and (2) both heuristic-based and data-based solutions, which are parameterized and trained on real-world datasets, encourage interactions but often deviate from real-world data over long horizons. In this work, we propose LitSim, a long-term interactive simulation approach that maximizes realism by minimizing the interventions in the log. Specifically, our approach primarily uses log replay to ensure realism and intervenes only when necessary to prevent potential conflicts. We then encourage interactions among the agents and resolve the conflicts, thereby reducing the risk of unrealistic behaviors. We train and validate our model on the real-world dataset NGSIM, and the experimental results demonstrate that LitSim outperforms the currently popular approaches in terms of realism and reactivity.
Submitted 1 May, 2024; v1 submitted 7 March, 2024;
originally announced March 2024.
-
MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data
Authors:
Yinya Huang,
Xiaohan Lin,
Zhengying Liu,
Qingxing Cao,
Huajian Xin,
Haiming Wang,
Zhenguo Li,
Linqi Song,
Xiaodan Liang
Abstract:
Recent large language models (LLMs) have witnessed significant advancement in various tasks, including mathematical reasoning and theorem proving. As these two tasks require strict and formal multi-step inference, they are appealing domains for exploring the reasoning ability of LLMs but still face important challenges. Previous studies such as Chain-of-Thought (CoT) have revealed the effectiveness of intermediate steps guidance. However, such step-wise annotation requires heavy labor, leading to insufficient training steps for current benchmarks. To fill this gap, this work introduces MUSTARD, a data generation framework that masters uniform synthesis of theorem and proof data of high quality and diversity. MUSTARD synthesizes data in three stages: (1) It samples a few mathematical concept seeds as the problem category. (2) Then, it prompts a generative language model with the sampled concepts to obtain both the problems and their step-wise formal solutions. (3) Lastly, the framework utilizes a proof assistant (e.g., Lean Prover) to filter the valid proofs. With the proposed MUSTARD, we present a theorem-and-proof benchmark MUSTARDSAUCE with 5,866 valid data points. Each data point contains an informal statement, an informal proof, and a translated formal proof that passes the prover validation. We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data. We further apply the MUSTARDSAUCE for fine-tuning smaller language models. The fine-tuned Llama 2-7B achieves a 15.41% average relative performance gain in automated theorem proving, and 8.18% in math word problems. Codes and data are available at https://github.com/Eleanor-H/MUSTARD.
Submitted 22 May, 2024; v1 submitted 14 February, 2024;
originally announced February 2024.
-
Human Perception-Inspired Grain Segmentation Refinement Using Conditional Random Fields
Authors:
Doruk Aksoy,
Huolin L. Xin,
Timothy J. Rupert,
William J. Bowman
Abstract:
Accurate segmentation of interconnected line networks, such as grain boundaries in polycrystalline material microstructures, poses a significant challenge due to the fragmented masks produced by conventional computer vision algorithms, including convolutional neural networks. These algorithms struggle with thin masks, often necessitating intricate post-processing for effective contour closure and continuity. Addressing this issue, this paper introduces a fast, high-fidelity post-processing technique, leveraging domain knowledge about grain boundary connectivity and employing conditional random fields and perceptual grouping rules. This approach significantly enhances segmentation mask accuracy, achieving a 79% segment identification accuracy in validation with a U-Net model on electron microscopy images of a polycrystalline oxide. Additionally, a novel grain alignment metric is introduced, showing a 51% improvement in grain alignment, providing a more detailed assessment of segmentation performance for complex microstructures. This method not only enables rapid and accurate segmentation but also facilitates an unprecedented level of data analysis, significantly improving the statistical representation of grain boundary networks, making it suitable for a range of disciplines where precise segmentation of interconnected line networks is essential.
Submitted 15 December, 2023;
originally announced December 2023.
-
LEGO-Prover: Neural Theorem Proving with Growing Libraries
Authors:
Haiming Wang,
Huajian Xin,
Chuanyang Zheng,
Lin Li,
Zhengying Liu,
Qingxing Cao,
Yinya Huang,
Jing Xiong,
Han Shi,
Enze Xie,
Jian Yin,
Zhenguo Li,
Heng Liao,
Xiaodan Liang
Abstract:
Despite the success of large language models (LLMs), the task of theorem proving still remains one of the hardest reasoning tasks that is far from being fully solved. Prior methods using language models have demonstrated promising results, but they still struggle to prove even middle school level theorems. One common limitation of these methods is that they assume a fixed theorem library during the whole theorem proving process. However, as we all know, creating new useful theorems or even new theories is not only helpful but crucial and necessary for advancing mathematics and proving harder and deeper results. In this work, we present LEGO-Prover, which employs a growing skill library containing verified lemmas as skills to augment the capability of LLMs used in theorem proving. By constructing the proof modularly, LEGO-Prover enables LLMs to utilize existing skills retrieved from the library and to create new skills during the proving process. These skills are further evolved (by prompting an LLM) to enrich the library on another scale. Modular and reusable skills are constantly added to the library to enable tackling increasingly intricate mathematical problems. Moreover, the learned library further bridges the gap between human proofs and formal proofs by making it easier to impute missing steps. LEGO-Prover advances the state-of-the-art pass rate on miniF2F-valid (48.0% to 57.0%) and miniF2F-test (45.5% to 47.1%). During the proving process, LEGO-Prover also manages to generate over 20,000 skills (theorems/lemmas) and adds them to the growing library. Our ablation study indicates that these newly added skills are indeed helpful for proving theorems, resulting in an improvement from a success rate of 47.1% to 50.4%. We also release our code and all the generated skills.
Submitted 27 October, 2023; v1 submitted 1 October, 2023;
originally announced October 2023.
-
Lyra: Orchestrating Dual Correction in Automated Theorem Proving
Authors:
Chuanyang Zheng,
Haiming Wang,
Enze Xie,
Zhengying Liu,
Jiankai Sun,
Huajian Xin,
Jianhao Shen,
Zhenguo Li,
Yu Li
Abstract:
Large Language Models (LLMs) present an intriguing avenue for exploration in the field of formal theorem proving. Nevertheless, their full potential, particularly concerning the mitigation of hallucinations and refinement through prover error messages, remains an area that has yet to be thoroughly investigated. To enhance the effectiveness of LLMs in the field, we introduce Lyra, a new framework that employs two distinct correction mechanisms: Tool Correction (TC) and Conjecture Correction (CC). To implement Tool Correction in the post-processing of formal proofs, we leverage prior knowledge to utilize predefined prover tools (e.g., Sledgehammer) for guiding the replacement of incorrect tools. Tool Correction significantly contributes to mitigating hallucinations, thereby improving the overall accuracy of the proof. In addition, we introduce Conjecture Correction, an error feedback mechanism designed to interact with the prover to refine formal proof conjectures with prover error messages. Compared to the previous refinement framework, the proposed Conjecture Correction refines generation with instruction but does not collect paired (generation, error & refinement) prompts. Our method has achieved state-of-the-art (SOTA) performance on both miniF2F validation (48.0% -> 55.3%) and test (45.5% -> 51.2%). We also present 3 IMO problems solved by Lyra. We believe Tool Correction (post-process for hallucination mitigation) and Conjecture Correction (subgoal adjustment from interaction with environment) could provide a promising avenue for future research in this field.
Submitted 24 August, 2024; v1 submitted 27 September, 2023;
originally announced September 2023.
-
FIMO: A Challenge Formal Dataset for Automated Theorem Proving
Authors:
Chengwu Liu,
Jianhao Shen,
Huajian Xin,
Zhengying Liu,
Ye Yuan,
Haiming Wang,
Wei Ju,
Chuanyang Zheng,
Yichun Yin,
Lin Li,
Ming Zhang,
Qun Liu
Abstract:
We present FIMO, an innovative dataset comprising formal mathematical problem statements sourced from the International Mathematical Olympiad (IMO) Shortlisted Problems. Designed to facilitate advanced automated theorem proving at the IMO level, FIMO is currently tailored for the Lean formal language. It comprises 149 formal problem statements, accompanied by both informal problem descriptions and their corresponding LaTeX-based informal proofs. Through initial experiments involving GPT-4, our findings underscore the existing limitations in current methodologies, indicating a substantial journey ahead before achieving satisfactory IMO-level automated theorem proving outcomes.
Submitted 5 December, 2023; v1 submitted 8 September, 2023;
originally announced September 2023.
-
Multi-level Multiple Instance Learning with Transformer for Whole Slide Image Classification
Authors:
Ruijie Zhang,
Qiaozhe Zhang,
Yingzhuang Liu,
Hao Xin,
Yan Liu,
Xinggang Wang
Abstract:
Whole slide image (WSI) refers to a type of high-resolution scanned tissue image, which is extensively employed in computer-assisted diagnosis (CAD). The extremely high resolution and limited availability of region-level annotations make employing deep learning methods for WSI-based digital diagnosis challenging. Recently, integrating multiple instance learning (MIL) and Transformers for WSI analysis has shown very promising results. However, designing effective Transformers for this weakly-supervised high-resolution image analysis is an underexplored yet important problem. In this paper, we propose a Multi-level MIL (MMIL) scheme by introducing a hierarchical structure to MIL, which enables efficient handling of MIL tasks involving a large number of instances. Based on MMIL, we instantiated MMIL-Transformer, an efficient Transformer model with windowed exact self-attention for large-scale MIL tasks. To validate its effectiveness, we conducted a set of experiments on WSI classification tasks, where MMIL-Transformer demonstrates superior performance compared to existing state-of-the-art methods, i.e., 96.80% test AUC and 97.67% test accuracy on the CAMELYON16 dataset, 99.04% test AUC and 94.37% test accuracy on the TCGA-NSCLC dataset, respectively. All code and pre-trained models are available at: https://github.com/hustvl/MMIL-Transformer
Submitted 5 September, 2023; v1 submitted 8 June, 2023;
originally announced June 2023.
-
EmotionGesture: Audio-Driven Diverse Emotional Co-Speech 3D Gesture Generation
Authors:
Xingqun Qi,
Chen Liu,
Lincheng Li,
Jie Hou,
Haoran Xin,
Xin Yu
Abstract:
Generating vivid and diverse 3D co-speech gestures is crucial for various applications in animating virtual avatars. While most existing methods can generate gestures from audio directly, they usually overlook that emotion is one of the key factors of authentic co-speech gesture generation. In this work, we propose EmotionGesture, a novel framework for synthesizing vivid and diverse emotional co-speech 3D gestures from audio. Considering emotion is often entangled with the rhythmic beat in speech audio, we first develop an Emotion-Beat Mining module (EBM) to extract the emotion and audio beat features as well as model their correlation via a transcript-based visual-rhythm alignment. Then, we propose an initial pose based Spatial-Temporal Prompter (STP) to generate future gestures from the given initial poses. STP effectively models the spatial-temporal correlations between the initial poses and the future gestures, thus producing the spatial-temporal coherent pose prompt. Once we obtain pose prompts, emotion, and audio beat features, we will generate 3D co-speech gestures through a transformer architecture. However, considering the poses of existing datasets often contain jittering effects, this would lead to generating unstable gestures. To address this issue, we propose an effective objective function, dubbed Motion-Smooth Loss. Specifically, we model motion offset to compensate for jittering ground-truth by forcing gestures to be smooth. Last, we present an emotion-conditioned VAE to sample emotion features, enabling us to generate diverse emotional results. Extensive experiments demonstrate that our framework outperforms the state-of-the-art, achieving vivid and diverse emotional co-speech 3D gestures. Our code and dataset will be released at the project page: https://xingqunqi-lab.github.io/Emotion-Gesture-Web/
Submitted 3 January, 2024; v1 submitted 30 May, 2023;
originally announced May 2023.
-
WeakTr: Exploring Plain Vision Transformer for Weakly-supervised Semantic Segmentation
Authors:
Lianghui Zhu,
Yingyue Li,
Jiemin Fang,
Yan Liu,
Hao Xin,
Wenyu Liu,
Xinggang Wang
Abstract:
This paper explores the properties of the plain Vision Transformer (ViT) for Weakly-supervised Semantic Segmentation (WSSS). The class activation map (CAM) is of critical importance for understanding a classification network and launching WSSS. We observe that different attention heads of ViT focus on different image areas. Thus a novel weight-based method is proposed to end-to-end estimate the importance of attention heads, while the self-attention maps are adaptively fused for high-quality CAM results that tend to have more complete objects. Besides, we propose a ViT-based gradient clipping decoder for online retraining with the CAM results to complete the WSSS task. We name this plain Transformer-based Weakly-supervised learning framework WeakTr. It achieves the state-of-the-art WSSS performance on standard benchmarks, i.e., 78.4% mIoU on the val set of PASCAL VOC 2012 and 50.3% mIoU on the val set of COCO 2014. Code is available at https://github.com/hustvl/WeakTr.
Submitted 26 April, 2023; v1 submitted 3 April, 2023;
originally announced April 2023.
-
Machine learning for discovering laws of nature
Authors:
Lizhi Xin,
Kevin Xin,
Houwen Xin
Abstract:
Based on Darwin's natural selection, we developed "machine scientists" to discover the laws of nature by learning from raw data. "Machine scientists" construct physical theories by applying a logic tree (state Decision Tree) and a value tree (observation Function Tree); the logic tree determines the state of the entity, and the value tree determines the absolute value between the two observations of the entity. A logic tree and a value tree together can reconstruct an entity's trajectory and make predictions about its future outcomes. Our proposed algorithmic model has an emphasis on machine learning, where "machine scientists" build up their experience by being rewarded or punished for each decision they make, eventually leading to rediscovering Newton's equation (classical physics) and Born's rule (quantum mechanics).
Submitted 8 July, 2023; v1 submitted 18 March, 2023;
originally announced March 2023.
-
MnEdgeNet -- Accurate Decomposition of Mixed Oxidation States for Mn XAS and EELS L2,3 Edges without Reference and Calibration
Authors:
Huolin L. Xin,
Mike Hu
Abstract:
Accurate decomposition of the mixed Mn oxidation states is highly important for characterizing the electronic structures, charge transfer, and redox centers for electronic, electrocatalytic, and energy storage materials that contain Mn. Electron energy loss spectroscopy (EELS) and soft X-ray absorption spectroscopy (XAS) measurements of the Mn L2,3 edges are widely used for this purpose. To date, although the measurement of the Mn L2,3 edges is straightforward provided the sample is prepared properly, an accurate decomposition of the mixed valence states of Mn remains non-trivial. For both EELS and XAS, 2+, 3+, and 4+ reference spectra need to be taken on the same instrument/beamline and preferably in the same experimental session because the instrumental resolution and the energy axis offset could vary from one session to another. To circumvent this hurdle, in this study, we adopted a deep learning approach and developed a calibration-free and reference-free method to decompose the oxidation state of Mn L2,3 edges for both EELS and XAS. To synthesize physics-informed and ground-truth labeled training datasets, we created a forward model that takes into account plural scattering, instrumentation broadening, noise, and energy axis offset. With that, we created a 1.2 million-spectrum database with a three-element oxidation state composition label. The library includes a sufficient variety of data including both EELS and XAS spectra. By training on this large database, our convolutional neural network achieves 85% accuracy on the validation dataset. We tested the model and found it is robust against noise (down to PSNR of 10) and plural scattering (up to t/λ = 1). We further validated the model against spectral data that were not used in training.
Submitted 20 October, 2022;
originally announced October 2022.
-
Periodic Artifact Reduction in Fourier transforms of Full Field Atomic Resolution Images
Authors:
Robert Hovden,
Yi Jiang,
Huolin L. Xin,
Lena F. Kourkoutis
Abstract:
The discrete Fourier transform is among the most routine tools used in high-resolution scanning / transmission electron microscopy (S/TEM). However, when calculating a Fourier transform, periodic boundary conditions are imposed and sharp discontinuities between the edges of an image cause a cross patterned artifact along the reciprocal space axes. This artifact can interfere with the analysis of reciprocal lattice peaks of an atomic resolution image. Here we demonstrate that the recently developed Periodic Plus Smooth Decomposition technique provides a simple, efficient method for reliable removal of artifacts caused by edge discontinuities. In this method, edge artifacts are reduced by subtracting a smooth background that solves Poisson's equation with boundary conditions set by the image's edges. Unlike the traditional windowed Fourier transforms, Periodic Plus Smooth Decomposition maintains sharp reciprocal lattice peaks from the image's entire field of view.
Submitted 14 October, 2022;
originally announced October 2022.
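The decomposition described in the abstract above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the standard formulation of periodic plus smooth decomposition (a boundary image built from the edge discontinuities, followed by a Fourier-space solve of the discrete Poisson equation). The function name `periodic_smooth_decomposition` and all variable names are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def periodic_smooth_decomposition(image):
    """Split a 2D image into (periodic, smooth) with image = periodic + smooth.

    Illustrative sketch only; names and structure are assumptions.
    """
    u = np.asarray(image, dtype=float)
    rows, cols = u.shape

    # Boundary image: nonzero only on the outer rows and columns. It holds the
    # intensity jumps that appear when the image is wrapped periodically.
    v = np.zeros_like(u)
    v[0, :] = u[-1, :] - u[0, :]
    v[-1, :] = u[0, :] - u[-1, :]
    v[:, 0] += u[:, -1] - u[:, 0]
    v[:, -1] += u[:, 0] - u[:, -1]

    # Solve the discrete Poisson equation (Laplacian of smooth = v) in Fourier
    # space, where the periodic 5-point Laplacian is a simple multiplier.
    q = np.arange(rows).reshape(rows, 1)
    r = np.arange(cols).reshape(1, cols)
    denom = (2.0 * np.cos(2.0 * np.pi * q / rows)
             + 2.0 * np.cos(2.0 * np.pi * r / cols) - 4.0)
    denom[0, 0] = 1.0  # avoid 0/0; the zero-frequency term is overwritten below
    smooth_hat = np.fft.fft2(v) / denom
    smooth_hat[0, 0] = 0.0  # fix the free constant: the smooth part has zero mean

    smooth = np.real(np.fft.ifft2(smooth_hat))
    periodic = u - smooth
    return periodic, smooth
```

In use, one would take the Fourier transform of the returned `periodic` component instead of the raw image. Because no window function is applied, the whole field of view still contributes to the reciprocal lattice peaks, which is the property the abstract emphasizes.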
-
Electron energy loss spectroscopy database synthesis and automation of core-loss edge recognition by deep-learning neural networks
Authors:
Lingli Kong,
Zhengran Ji,
Huolin L. Xin
Abstract:
The ionization edges encoded in electron energy loss spectroscopy (EELS) spectra enable advanced material analysis, including composition analysis and elemental quantification. The development of parallel EELS instruments and fast, sensitive detectors has greatly improved the acquisition speed of EELS spectra. However, the traditional way of core-loss edge recognition is experience based and human labor dependent, which limits the processing speed. So far, the low signal-to-noise ratio and the low jump ratio of the core-loss edges in raw EELS spectra have been challenging for the automation of edge recognition. In this work, a convolutional-bidirectional long short-term memory neural network (CNN-BiLSTM) is proposed to automate the detection and elemental identification of core-loss edges from raw spectra. An EELS spectral database is synthesized by using our forward model to assist in the training and validation of the neural network. To make the synthesized spectra resemble the real spectra, we collected a large library of experimentally acquired EELS core edges. To synthesize the training library, the edges are modeled by fitting a multi-Gaussian model to the real edges from experiments, and the noise and instrumental imperfections are simulated and added. The well-trained CNN-BiLSTM network is tested against both the simulated spectra and real spectra collected from experiments. The high accuracy of the network, 94.9%, shows that, without complicated preprocessing of the raw spectra, the proposed CNN-BiLSTM network automates core-loss edge recognition for EELS spectra.
Submitted 26 September, 2022;
originally announced September 2022.
-
Intelligent Electric Vehicle Charging Recommendation Based on Multi-Agent Reinforcement Learning
Authors:
Weijia Zhang,
Hao Liu,
Fan Wang,
Tong Xu,
Haoran Xin,
Dejing Dou,
Hui Xiong
Abstract:
Electric Vehicle (EV) has become a preferable choice in the modern transportation system due to its environmental and energy sustainability. However, in many large cities, EV drivers often fail to find the proper spots for charging, because of the limited charging infrastructures and the spatiotemporally unbalanced charging demands. Indeed, the recent emergence of deep reinforcement learning provides great potential to improve the charging experience from various aspects over a long-term horizon. In this paper, we propose a framework, named Multi-Agent Spatio-Temporal Reinforcement Learning (Master), for intelligently recommending public accessible charging stations by jointly considering various long-term spatiotemporal factors. Specifically, by regarding each charging station as an individual agent, we formulate this problem as a multi-objective multi-agent reinforcement learning task. We first develop a multi-agent actor-critic framework with the centralized attentive critic to coordinate the recommendation between geo-distributed agents. Moreover, to quantify the influence of future potential charging competition, we introduce a delayed access strategy to exploit the knowledge of future charging competition during training. After that, to effectively optimize multiple learning objectives, we extend the centralized attentive critic to multi-critics and develop a dynamic gradient re-weighting strategy to adaptively guide the optimization direction. Finally, extensive experiments on two real-world datasets demonstrate that Master achieves the best comprehensive performance compared with nine baseline approaches.
Submitted 15 February, 2021;
originally announced February 2021.
-
Out-of-Town Recommendation with Travel Intention Modeling
Authors:
Haoran Xin,
Xinjiang Lu,
Tong Xu,
Hao Liu,
Jingjing Gu,
Dejing Dou,
Hui Xiong
Abstract:
Out-of-town recommendation is designed for those users who leave their home-town areas and visit the areas they have never been to before. It is challenging to recommend Point-of-Interests (POIs) for out-of-town users since the out-of-town check-in behavior is determined by not only the user's home-town preference but also the user's travel intention. Besides, the user's travel intentions are complex and dynamic, which leads to big difficulties in understanding such intentions precisely. In this paper, we propose a TRAvel-INtention-aware Out-of-town Recommendation framework, named TRAINOR. The proposed TRAINOR framework distinguishes itself from existing out-of-town recommenders in three aspects. First, graph neural networks are explored to represent users' home-town check-in preference and geographical constraints in out-of-town check-in behaviors. Second, a user-specific travel intention is formulated as an aggregation combining home-town preference and generic travel intention together, where the generic travel intention is regarded as a mixture of inherent intentions that can be learned by Neural Topic Model (NTM). Third, a non-linear mapping function, as well as a matrix factorization method, are employed to transfer users' home-town preference and estimate out-of-town POI's representation, respectively. Extensive experiments on real-world data sets validate the effectiveness of the TRAINOR framework. Moreover, the learned travel intention can deliver meaningful explanations for understanding a user's travel purposes.
Submitted 6 February, 2021; v1 submitted 29 January, 2021;
originally announced January 2021.
-
TEMImageNet Training Library and AtomSegNet Deep-Learning Models for High-Precision Atom Segmentation, Localization, Denoising, and Super-Resolution Processing of Atomic-Resolution Images
Authors:
Ruoqian Lin,
Rui Zhang,
Chunyang Wang,
Xiao-Qing Yang,
Huolin L. Xin
Abstract:
Atom segmentation and localization, noise reduction, and deblurring of atomic-resolution scanning transmission electron microscopy (STEM) images with high precision and robustness are challenging tasks. Although several conventional algorithms, such as thresholding, edge detection, and clustering, can achieve reasonable performance in some predefined scenarios, they tend to fail when interferences from the background are strong and unpredictable. Particularly, for atomic-resolution STEM images, so far there is no well-established algorithm that is robust enough to segment or detect all atomic columns when there is large thickness variation in a recorded image. Herein, we report the development of a training library and a deep learning method that can perform robust and precise atom segmentation, localization, denoising, and super-resolution processing of experimental images. Despite using simulated images as training datasets, the deep-learning model can self-adapt to experimental STEM images and shows outstanding performance in atom detection and localization in challenging contrast conditions, and the precision consistently outperforms the state-of-the-art two-dimensional Gaussian fit method. Taking a step further, we have deployed our deep-learning models to a desktop app with a graphical user interface, and the app is free and open-source. We have also built a TEM ImageNet project website for easy browsing and downloading of the training data.
Submitted 20 February, 2021; v1 submitted 16 December, 2020;
originally announced December 2020.
-
Attacking and Defending Machine Learning Applications of Public Cloud
Authors:
Dou Goodman,
Hao Xin
Abstract:
Adversarial attacks break the boundaries of traditional security defense. Considering adversarial attacks and the characteristics of cloud services, we propose a Security Development Lifecycle for Machine Learning applications, i.e., SDL for ML. The SDL for ML helps developers build more secure software by reducing the number and severity of vulnerabilities in ML-as-a-service, while reducing development cost.
Submitted 27 July, 2020;
originally announced August 2020.
-
Automatic Cross-Domain Transfer Learning for Linear Regression
Authors:
Liu Xinshun,
He Xin,
Mao Hui,
Liu Jing,
Lai Weizhong,
Ye Qingwen
Abstract:
Transfer learning research attempts to make model induction transferable across different domains. This method assumes that specific information regarding to which domain each instance belongs is known. This paper helps to extend the capability of transfer learning for linear regression problems to situations where the domain information is uncertain or unknown; in fact, the framework can be extended to classification problems. For normal datasets, we assume that some latent domain information is available for transfer learning. The instances in each domain can be inferred by different parameters. We obtain this domain information from the distribution of the regression coefficients corresponding to the explanatory variable $x$ as well as the response variable $y$ based on a Dirichlet process, which is more reasonable. As a result, we transfer not only variable $x$ as usual but also variable $y$, which is challenging since the testing data have no response value. Previous work mainly overcomes the problem via pseudo-labelling based on transductive learning, which introduces serious bias. We provide a novel framework for analysing the problem and considering this general situation: the joint distribution of variable $x$ and variable $y$. Furthermore, our method controls the bias well compared with previous work. We perform linear regression on the new feature space that consists of different latent domains and the target domain, which is from the testing data. The experimental results show that the proposed model performs well on real datasets.
Submitted 8 May, 2020;
originally announced May 2020.
-
Advbox: a toolbox to generate adversarial examples that fool neural networks
Authors:
Dou Goodman,
Hao Xin,
Wang Yang,
Wu Yuesheng,
Xiong Junfeng,
Zhang Huan
Abstract:
In recent years, neural networks have been extensively deployed for computer vision tasks, particularly visual classification problems, where new algorithms are reported to achieve or even surpass human performance. Recent studies have shown that they are all vulnerable to the attack of adversarial examples. Small and often imperceptible perturbations to the input images are sufficient to fool the most powerful neural networks. \emph{Advbox} is a toolbox to generate adversarial examples that fool neural networks in PaddlePaddle, PyTorch, Caffe2, MxNet, Keras, and TensorFlow, and it can benchmark the robustness of machine learning models. Compared to previous work, our platform supports black box attacks on Machine-Learning-as-a-service, as well as more attack scenarios, such as Face Recognition Attack, Stealth T-shirt, and DeepFake Face Detect. The code is licensed under the Apache 2.0 license and is openly available at https://github.com/advboxes/AdvBox. Advbox now supports Python 3.
Submitted 26 August, 2020; v1 submitted 13 January, 2020;
originally announced January 2020.
-
Memristor Crossbars with 4.5 Terabits-per-Inch-Square Density and Two Nanometer Dimension
Authors:
Shuang Pi,
Can Li,
Hao Jiang,
Weiwei Xia,
Huolin Xin,
J. Joshua Yang,
Qiangfei Xia
Abstract:
The memristor is a promising building block for next-generation nonvolatile random access memory and bio-inspired computing systems. Organizing memristors into high density crossbar arrays, although challenging, is critical to meet the ever-growing high capacity and low energy demands of these applications, especially in the big data era. Here, we construct memristor crossbars with a single-layer density up to 4.5 terabits per inch square, an order of magnitude denser than the state-of-the-art 64-layer triple level cell NAND flash technology. The memristors in the crossbars are 2 $\times$ 2 nm$^2$ in size, capable of switching with tens of nanoamperes of electric current. The densely packed memristor crossbars of extremely small working devices provide a power-efficient solution for high density information storage and processing.
Submitted 27 May, 2018; v1 submitted 25 April, 2018;
originally announced April 2018.
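As a rough worked check on the density figure quoted in the abstract above, the area available per bit and the implied array pitch can be estimated. This assumes "tera" means $10^{12}$, one bit per crosspoint, and a square array layout; none of these assumptions is stated in the abstract.

```latex
\begin{align*}
A_{\text{bit}} &= \frac{(2.54\times10^{7}\ \text{nm})^{2}}{4.5\times10^{12}\ \text{bits}}
               \approx 143\ \text{nm}^{2}\ \text{per bit} \\
p &= \sqrt{A_{\text{bit}}} \approx 12\ \text{nm}
\end{align*}
```

Under these assumptions, each 2 $\times$ 2 nm$^2$ device (4 nm$^2$) occupies roughly 3% of its cell area, so the quoted density is set by the array pitch and not by the device footprint.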
-
GRIM-Filter: Fast Seed Location Filtering in DNA Read Mapping Using Processing-in-Memory Technologies
Authors:
Jeremie S. Kim,
Damla Senol Cali,
Hongyi Xin,
Donghyuk Lee,
Saugata Ghose,
Mohammed Alser,
Hasan Hassan,
Oguz Ergin,
Can Alkan,
Onur Mutlu
Abstract:
Motivation: Seed location filtering is critical in DNA read mapping, a process where billions of DNA fragments (reads) sampled from a donor are mapped onto a reference genome to identify genomic variants of the donor. State-of-the-art read mappers 1) quickly generate possible mapping locations for seeds (i.e., smaller segments) within each read, 2) extract reference sequences at each of the mapping locations, and 3) check similarity between each read and its associated reference sequences with a computationally-expensive algorithm (i.e., sequence alignment) to determine the origin of the read. A seed location filter comes into play before alignment, discarding seed locations that alignment would deem a poor match. The ideal seed location filter would discard all poor match locations prior to alignment such that there is no wasted computation on unnecessary alignments.
Results: We propose a novel seed location filtering algorithm, GRIM-Filter, optimized to exploit 3D-stacked memory systems that integrate computation within a logic layer stacked under memory layers, to perform processing-in-memory (PIM). GRIM-Filter quickly filters seed locations by 1) introducing a new representation of coarse-grained segments of the reference genome, and 2) using massively-parallel in-memory operations to identify read presence within each coarse-grained segment. Our evaluations show that for a sequence alignment error tolerance of 0.05, GRIM-Filter 1) reduces the false negative rate of filtering by 5.59x--6.41x, and 2) provides an end-to-end read mapper speedup of 1.81x--3.65x, compared to a state-of-the-art read mapper employing the best previous seed location filtering algorithm.
Availability: The code is available online at: https://github.com/CMU-SAFARI/GRIM
Submitted 2 November, 2017;
originally announced November 2017.
-
Subjective Knowledge Acquisition and Enrichment Powered By Crowdsourcing
Authors:
Rui Meng,
Hao Xin,
Lei Chen,
Yangqiu Song
Abstract:
Knowledge bases (KBs) have attracted increasing attention due to their great success in various areas, such as Web and mobile search. Existing KBs are restricted to objective factual knowledge, such as city population or fruit shape, whereas subjective knowledge, such as big city, which is commonly mentioned in Web and mobile queries, has been neglected. Subjective knowledge differs from objective knowledge in that it has no documented or observed ground truth. Instead, the truth relies on people's dominant opinion. Thus, we can use the crowdsourcing technique to get opinions from the crowd. In our work, we propose a system, called crowdsourced subjective knowledge acquisition (CoSKA), for subjective knowledge acquisition powered by crowdsourcing and existing KBs. The acquired knowledge can be used to enrich existing KBs in the subjective dimension, which bridges the gap between existing objective knowledge and subjective queries. The main challenge of CoSKA is the conflict between large scale knowledge facts and limited crowdsourcing resources. To address this challenge, in this work, we define knowledge inference rules and then select the seed knowledge judiciously for crowdsourcing to maximize the inference power under the resource constraint. Our experimental results on a real knowledge base and crowdsourcing platform verify the effectiveness of the CoSKA system.
Submitted 16 May, 2017;
originally announced May 2017.
-
Capacity of the Gaussian Two-Pair Two-Way Relay Channel to Within 1/2 Bit
Authors:
Xiaojun Yuan,
Haiyang Xin,
Soung-Chang Liew,
Yong Li
Abstract:
This paper studies the transceiver design of the Gaussian two-pair two-way relay channel (TWRC), where two pairs of users exchange information through a common relay in a pairwise manner. Our main contribution is to show that the capacity of the Gaussian two-pair TWRC is achievable to within 1/2 bit for arbitrary channel conditions. In the proof, we develop a hybrid coding scheme involving Gaussian random coding, nested lattice coding, superposition coding, and network-coded decoding. Further, we present a message-reassembling strategy to decouple the coding design for the user-to-relay and relay-to-user links, so as to provide flexibility to fully exploit the channel randomness. Finally, judicious power allocation at the relay is necessary to approach the channel capacity under various channel conditions.
Submitted 31 May, 2018; v1 submitted 15 April, 2017;
originally announced April 2017.
-
GateKeeper: A New Hardware Architecture for Accelerating Pre-Alignment in DNA Short Read Mapping
Authors:
Mohammed Alser,
Hasan Hassan,
Hongyi Xin,
Oğuz Ergin,
Onur Mutlu,
Can Alkan
Abstract:
Motivation: High throughput DNA sequencing (HTS) technologies generate an excessive number of small DNA segments -- called short reads -- that cause significant computational burden. To analyze the entire genome, each of the billions of short reads must be mapped to a reference genome based on the similarity between a read and "candidate" locations in that reference genome. The similarity measurement, called alignment, formulated as an approximate string matching problem, is the computational bottleneck because: (1) it is implemented using quadratic-time dynamic programming algorithms, and (2) the majority of candidate locations in the reference genome do not align with a given read due to high dissimilarity. Calculating the alignment of such incorrect candidate locations consumes an overwhelming majority of a modern read mapper's execution time. Therefore, it is crucial to develop a fast and effective filter that can detect incorrect candidate locations and eliminate them before invoking computationally costly alignment operations. Results: We propose GateKeeper, a new hardware accelerator that functions as a pre-alignment step that quickly filters out most incorrect candidate locations. GateKeeper is the first design to accelerate pre-alignment using Field-Programmable Gate Arrays (FPGAs), which can perform pre-alignment much faster than software. GateKeeper can be integrated with any mapper that performs sequence alignment for verification. When implemented on a single FPGA chip, GateKeeper maintains high accuracy (on average >96%) while providing up to 90-fold and 130-fold speedup over the state-of-the-art software pre-alignment techniques, Adjacency Filter and Shifted Hamming Distance (SHD), respectively. The addition of GateKeeper as a pre-alignment step can reduce the verification time of the mrFAST mapper by a factor of 10. Availability: https://github.com/BilkentCompGen/GateKeeper
Submitted 26 September, 2020; v1 submitted 6 April, 2016;
originally announced April 2016.
-
Optimal Seed Solver: Optimizing Seed Selection in Read Mapping
Authors:
Hongyi Xin,
Richard Zhu,
Sunny Nahar,
John Emmons,
Gennady Pekhimenko,
Carl Kingsford,
Can Alkan,
Onur Mutlu
Abstract:
Motivation: Optimizing seed selection is an important problem in read mapping. The number of non-overlapping seeds a mapper selects determines the sensitivity of the mapper while the total frequency of all selected seeds determines the speed of the mapper. Modern seed-and-extend mappers usually select seeds with either an equal and fixed-length scheme or with an inflexible placement scheme, both of which limit the potential of the mapper to select less frequent seeds to speed up the mapping process. Therefore, it is crucial to develop a new algorithm that can adjust both the individual seed length and the seed placement, as well as derive less frequent seeds.
Results: We present the Optimal Seed Solver (OSS), a dynamic programming algorithm that discovers the least frequently-occurring set of x seeds in an L-bp read in $O(x \times L)$ operations on average and in $O(x \times L^{2})$ operations in the worst case. We compared OSS against four state-of-the-art seed selection schemes and observed that OSS provides a 3-fold reduction of average seed frequency over the best previous seed selection optimizations.
Submitted 26 June, 2015;
originally announced June 2015.
-
Data Processing For Atomic Resolution EELS
Authors:
Paul Cueva,
Robert Hovden,
Julia A. Mundy,
Huolin L. Xin,
David A. Muller
Abstract:
The high beam current and sub-angstrom resolution of aberration-corrected scanning transmission electron microscopes have enabled electron energy loss spectroscopic (EELS) mapping with atomic resolution. These spectral maps are often dose-limited and spatially oversampled, which leads to low counts per channel and makes them highly sensitive to errors in background estimation. However, by taking advantage of redundancy in the dataset map, one can improve background estimation and increase chemical sensitivity. We consider two such approaches, linear combination of power laws and local background averaging, that reduce background error and improve signal extraction. Principal components analysis (PCA) can also be used to analyze spectrum images, but the poor peak-to-background ratio in EELS can lead to serious artifacts if raw EELS data is PCA filtered. We identify common artifacts and discuss alternative approaches. These algorithms are implemented within the Cornell Spectrum Imager, an open source software package for spectroscopic analysis.
Submitted 13 December, 2011;
originally announced December 2011.