Tell Me What's Next: Textual Foresight for Generic UI Representations

Authors: Andrea Burns, Kate Saenko, Bryan A. Plummer

Abstract: Mobile app user interfaces (UIs) are rich with action, text, structure, and image content that can be utilized to learn generic UI representations for tasks like automating user commands, summarizing content, and evaluating the accessibility of user interfaces. Prior work has learned strong visual representations with local or global captioning losses, but fails to retain both granularities. To combat this, we propose Textual Foresight, a novel pretraining objective for learning UI screen representations. Textual Foresight generates global text descriptions of future UI states given a current UI and local action taken. Our approach requires joint reasoning over elements and entire screens, resulting in improved UI features: on generation tasks, UI agents trained with Textual Foresight outperform state-of-the-art by 2% with 28x fewer images. We train with our newly constructed mobile app dataset, OpenApp, which results in the first public dataset for app UI representation learning. OpenApp enables new baselines, and we find Textual Foresight improves average task performance over them by 5.7% while having access to 2x less data.

Submitted 7 August, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

Comments: Accepted to ACL 2024 Findings. Data and code to be released at https://github.com/aburns4/textualforesight

arXiv:2405.02793 [pdf, other]

ImageInWords: Unlocking Hyper-Detailed Image Descriptions

Authors: Roopal Garg, Andrea Burns, Burcu Karagol Ayan, Yonatan Bitton, Ceslee Montgomery, Yasumasa Onoe, Andrew Bunner, Ranjay Krishna, Jason Baldridge, Radu Soricut

Abstract: Despite the longstanding adage "an image is worth a thousand words," creating accurate and hyper-detailed image descriptions for training Vision-Language models remains challenging. Current datasets typically have web-scraped descriptions that are short, low-granularity, and often contain details unrelated to the visual content. As a result, models trained on such data generate descriptions replete with missing information, visual inconsistencies, and hallucinations. To address these issues, we introduce ImageInWords (IIW), a carefully designed human-in-the-loop annotation framework for curating hyper-detailed image descriptions and a new dataset resulting from this process. We validate the framework through evaluations focused on the quality of the dataset and its utility for fine-tuning with considerations for readability, comprehensiveness, specificity, hallucinations, and human-likeness. Our dataset significantly improves across these dimensions compared to recently released datasets (+66%) and GPT-4V outputs (+48%). Furthermore, models fine-tuned with IIW data excel by +31% against prior work along the same human evaluation dimensions. Given our fine-tuned models, we also evaluate text-to-image generation and vision-language reasoning. Our model's descriptions can generate images closest to the original, as judged by both automated and human metrics. We also find our model produces more compositionally rich descriptions, outperforming the best baseline by up to 6% on ARO, SVO-Probes, and Winoground datasets.

Submitted 4 May, 2024; originally announced May 2024.

Comments: Webpage (https://google.github.io/imageinwords), GitHub (https://github.com/google/imageinwords), HuggingFace (https://huggingface.co/datasets/google/imageinwords)

arXiv:2403.05530 [pdf, other]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Authors: Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru, Christina Sorokin, Andrea Tacchetti, Colin Gaffney, Samira Daruki, Olcan Sercinoglu, Zach Gleicher, Juliette Love, et al. (1110 additional authors not shown)

Abstract: In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.

Submitted 8 August, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

arXiv:2312.00171 [pdf, ps, other]

doi 10.1007/s10703-023-00441-y

Extending Rely-Guarantee thinking to handle Real-Time Scheduling

Authors: Cliff B. Jones, Alan Burns

Abstract: The reference point for developing any artefact is its specification; to develop software formally, a formal specification is required. For sequential programs, pre and post conditions (together with abstract objects) suffice; rely and guarantee conditions extend the scope of formal development approaches to tackle concurrency. In addition, real-time systems need ways of both requiring progress and relating that progress to some notion of time. This paper extends rely-guarantee ideas to cope with specifications of -- and assumptions about -- real-time schedulers. Furthermore it shows how the approach helps identify and specify fault-tolerance aspects of such schedulers by systematically challenging the assumptions.

Submitted 30 November, 2023; originally announced December 2023.

Comments: Published on-line (2023-11-30) in "Formal Methods in System Design"

ACM Class: D.2; F.3

arXiv:2305.05432 [pdf, other]

WikiWeb2M: A Page-Level Multimodal Wikipedia Dataset

Authors: Andrea Burns, Krishna Srinivasan, Joshua Ainslie, Geoff Brown, Bryan A. Plummer, Kate Saenko, Jianmo Ni, Mandy Guo

Abstract: Webpages have been a rich resource for language and vision-language tasks. Yet only pieces of webpages are kept: image-caption pairs, long text articles, or raw HTML, never all in one place. Webpage tasks have resultingly received little attention and structured image-text data underused. To study multimodal webpage understanding, we introduce the Wikipedia Webpage 2M (WikiWeb2M) suite; the first to retain the full set of images, text, and structure data available in a page. WikiWeb2M can be used for tasks like page description generation, section summarization, and contextual image captioning.

Submitted 9 May, 2023; originally announced May 2023.

Comments: Accepted at the WikiWorkshop 2023. Data is readily available at https://github.com/google-research-datasets/wit/blob/main/wikiweb2m.md. arXiv admin note: text overlap with arXiv:2305.03668

arXiv:2305.03668 [pdf, other]

A Suite of Generative Tasks for Multi-Level Multimodal Webpage Understanding

Authors: Andrea Burns, Krishna Srinivasan, Joshua Ainslie, Geoff Brown, Bryan A. Plummer, Kate Saenko, Jianmo Ni, Mandy Guo

Abstract: Webpages have been a rich, scalable resource for vision-language and language only tasks. Yet only pieces of webpages are kept in existing datasets: image-caption pairs, long text articles, or raw HTML, never all in one place. Webpage tasks have resultingly received little attention and structured image-text data left underused. To study multimodal webpage understanding, we introduce the Wikipedia Webpage suite (WikiWeb2M) containing 2M pages with all of the associated image, text, and structure data. We verify its utility on three generative tasks: page description generation, section summarization, and contextual image captioning. We design a novel attention mechanism Prefix Global, which selects the most relevant image and text content as global tokens to attend to the rest of the webpage for context. By using page structure to separate such tokens, it performs better than full attention with lower computational complexity. Extensive experiments show that the new data in WikiWeb2M improves task performance compared to prior work.

Submitted 20 October, 2023; v1 submitted 5 May, 2023; originally announced May 2023.

Comments: Accepted in EMNLP 2023, revision contains camera ready edits. Data can be downloaded at https://github.com/google-research-datasets/wit/blob/main/wikiweb2m.md

arXiv:2303.16342 [pdf, other]

Language-Guided Audio-Visual Source Separation via Trimodal Consistency

Authors: Reuben Tan, Arijit Ray, Andrea Burns, Bryan A. Plummer, Justin Salamon, Oriol Nieto, Bryan Russell, Kate Saenko

Abstract: We propose a self-supervised approach for learning to perform audio source separation in videos based on natural language queries, using only unlabeled video and audio pairs as training data. A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform, all without access to annotations during training. To overcome this challenge, we adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions and encourage a stronger alignment between the audio, visual and natural language modalities. During inference, our approach can separate sounds given text, video and audio input, or given text and audio input alone. We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets, including MUSIC, SOLOS and AudioSet, where we outperform state-of-the-art strongly supervised approaches despite not using object detectors or text labels during training.

Submitted 23 September, 2023; v1 submitted 28 March, 2023; originally announced March 2023.

Comments: Accepted at CVPR 2023

arXiv:2209.10430 [pdf, other]

Real-Time Guarantees in Routerless Networks-on-Chip

Authors: Leandro Soares Indrusiak, Alan Burns

Abstract: This paper considers the use of routerless networks-on-chip as an alternative on-chip interconnect for multiprocessor systems requiring hard real-time guarantees for inter-processor communication. It presents a novel analytical framework that can provide latency upper bounds to real-time packet flows sent over routerless networks-on-chip, and it uses that framework to evaluate the ability of such networks to provide real-time guarantees. Extensive comparative analysis is provided, considering different architectures for routerless networks and a state-of-the-art wormhole network based on priority-preemptive routers as a baseline.

Submitted 21 September, 2022; originally announced September 2022.

arXiv:2206.05994 [pdf, other]

doi 10.23919/ACC53348.2022.9867252

Discretization and Stabilization of Energy-Based Controller for Period Switching Control and Flexible Scheduling

Authors: Seyed Amir Tafrishi, Xiaotian Dai, Yasuhisa Hirata, Alan Burns

Abstract: Emerging advanced control applications, with increased complexity in software but limited computing resources, suggest that real-time controllers should have adaptable designs. These control strategies also should be designed with consideration of the run-time behavior of the system. One of such research attempts is to design the controller along with the task scheduler, known as control-scheduling co-design, for more predictable timing behavior as well as surviving system overloads. Unlike traditional controller designs, which have equal-distance sampling periods, the co-design approach increases the system flexibility and resilience by explicitly considering timing properties, for example using an event-based controller or with multiple sampling times (non-uniform sampling and control). Within this context, we introduce the first work on the discretization of an energy-based controller that can switch arbitrarily between multiple periods and adjust the control parameters accordingly without destabilizing the system. A digital controller design based on this paradigm for a DC motor with an elastic load as an example is introduced and the stability condition is given based on the proposed Lyapunov function. The method is evaluated with various computer-based simulations which demonstrate its effectiveness.

Submitted 13 June, 2022; originally announced June 2022.

Comments: Accepted to 2022 American Control Conference (ACC), 6 pages, 8 figures

arXiv:2202.02312 [pdf, other]

A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility

Authors: Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, Bryan A. Plummer

Abstract: Vision-language navigation (VLN), in which an agent follows language instruction in a visual environment, has been studied under the premise that the input command is fully feasible in the environment. Yet in practice, a request may not be possible due to language ambiguity or environment changes. To study VLN with unknown command feasibility, we introduce a new dataset Mobile app Tasks with Iterative Feedback (MoTIF), where the goal is to complete a natural language command in a mobile app. Mobile apps provide a scalable domain to study real downstream uses of VLN methods. Moreover, mobile app commands provide instruction for interactive navigation, as they result in action sequences with state changes via clicking, typing, or swiping. MoTIF is the first to include feasibility annotations, containing both binary feasibility labels and fine-grained labels for why tasks are unsatisfiable. We further collect follow-up questions for ambiguous queries to enable research on task uncertainty resolution. Equipped with our dataset, we propose the new problem of feasibility prediction, in which a natural language instruction and multimodal app environment are used to predict command feasibility. MoTIF provides a more realistic app dataset as it contains many diverse environments, high-level goals, and longer action sequences than prior work. We evaluate interactive VLN methods using MoTIF, quantify the generalization ability of current approaches to new app environments, and measure the effect of task feasibility on navigation performance.

Submitted 14 August, 2022; v1 submitted 4 February, 2022; originally announced February 2022.

Comments: Accepted at the European Conference on Computer Vision (ECCV) 2022. This is a new version of the paper with additional experimental results and a few prior implementation bugs fixed

arXiv:2108.13270 [pdf, ps, other]

Making the Invisible Visible: Risks and Benefits of Disclosing Metadata in Visualization

Authors: Alyxander Burns, Thai On, Christiana Lee, Rachel Shapiro, Cindy Xiong, Narges Mahyar

Abstract: Accompanying a data visualization with metadata may benefit readers by facilitating content understanding, strengthening trust, and providing accountability. However, providing this kind of information may also have negative, unintended consequences, such as biasing readers' interpretations, a loss of trust as a result of too much transparency, and the possibility of opening visualization creators with minoritized identities up to undeserved critique. To help future visualization researchers and practitioners decide what kinds of metadata to include, we discuss some of the potential benefits and risks of disclosing five kinds of metadata: metadata about the source of the underlying data; the cleaning and processing conducted; the marks, channels, and other design elements used; the people who directly created the visualization; and the people for whom the visualization was created. We conclude by proposing a few open research questions related to how to communicate metadata about visualizations.

Submitted 30 August, 2021; originally announced August 2021.

Comments: To appear in the Visualization for Social Good Workshop at VIS 2021

arXiv:2108.06613 [pdf, other]

Unsupervised Disentanglement without Autoencoding: Pitfalls and Future Directions

Authors: Andrea Burns, Aaron Sarna, Dilip Krishnan, Aaron Maschinot

Abstract: Disentangled visual representations have largely been studied with generative models such as Variational AutoEncoders (VAEs). While prior work has focused on generative methods for disentangled representation learning, these approaches do not scale to large datasets due to current limitations of generative models. Instead, we explore regularization methods with contrastive learning, which could result in disentangled representations that are powerful enough for large scale datasets and downstream applications. However, we find that unsupervised disentanglement is difficult to achieve due to optimization and initialization sensitivity, with trade-offs in task performance. We evaluate disentanglement with downstream tasks, analyze the benefits and disadvantages of each regularization used, and discuss future directions.

Submitted 14 August, 2021; originally announced August 2021.

Comments: Accepted at the ICML 2021 Self-Supervised Learning for Reasoning and Perception Workshop

arXiv:2104.08560 [pdf, other]

Mobile App Tasks with Iterative Feedback (MoTIF): Addressing Task Feasibility in Interactive Visual Environments

Authors: Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, Bryan A. Plummer

Abstract: In recent years, vision-language research has shifted to study tasks which require more complex reasoning, such as interactive question answering, visual common sense reasoning, and question-answer plausibility prediction. However, the datasets used for these problems fail to capture the complexity of real inputs and multimodal environments, such as ambiguous natural language requests and diverse digital domains. We introduce Mobile app Tasks with Iterative Feedback (MoTIF), a dataset with natural language commands for the greatest number of interactive environments to date. MoTIF is the first to contain natural language requests for interactive environments that are not satisfiable, and we obtain follow-up questions on this subset to enable research on task uncertainty resolution. We perform initial feasibility classification experiments and only reach an F1 score of 37.3, verifying the need for richer vision-language representations and improved architectures to reason about task feasibility.

Submitted 17 April, 2021; originally announced April 2021.

Comments: Accepted at the workshop on Visually Grounded Interaction and Language (ViGIL) at NAACL 2021

arXiv:2103.06997 [pdf, other]

The Location of Optimal Object Colors with More Than Two Transitions (Preprint)

Authors: Scott A. Burns

Abstract: The chromaticity diagram associated with the CIE 1931 color matching functions is shown to be slightly non-convex. While having no impact on practical colorimetric computations, the non-convexity does have a significant impact on the shape of some optimal object color reflectance distributions associated with the outer surface of the object color solid. Instead of the usual two-transition Schrodinger form, many optimal colors exhibit higher transition counts. A linear programming formulation is developed and is used to locate where these higher-transition optimal object colors reside on the object color solid surface. The regions of higher transition count appear to have a point-symmetric complementary structure. The final peer-reviewed version (to appear) contains additional material concerning convexification of the color-matching functions and additional analysis of modern "physiologically-relevant" CMFs transformed from cone fundamentals.

Submitted 14 May, 2021; v1 submitted 11 March, 2021; originally announced March 2021.

Comments: 5/14/21 version adds notice of acceptance for publication and changes made in final version

arXiv:2012.01493 [pdf, ps, other]

A Rely-Guarantee Specification of Mixed-Criticality Scheduling

Authors: Cliff B Jones, Alan Burns

Abstract: The application considered is mixed-criticality scheduling. The core formal approaches used are Rely-Guarantee conditions and the Timeband framework; these are applied to give a layered description of job scheduling which includes resilience to jobs overrunning their expected execution time. A novel formal modelling idea is proposed to handle the relationship between actual time and its approximation in hardware clocks.

Submitted 21 August, 2021; v1 submitted 2 December, 2020; originally announced December 2020.

Comments: This paper will appear in a Festschrift - on publication we will insert a pointer to the book

Journal ref: Mathematical Foundations of Software Engineering, College Publication, 2022, Chap 6

arXiv:2009.01747 [pdf, other]

How to evaluate data visualizations across different levels of understanding

Authors: Alyxander Burns, Cindy Xiong, Steven Franconeri, Alberto Cairo, Narges Mahyar

Abstract: Understanding a visualization is a multi-level process. A reader must extract and extrapolate from numeric facts, understand how those facts apply to both the context of the data and other potential contexts, and draw or evaluate conclusions from the data. A well-designed visualization should support each of these levels of understanding. We diagnose levels of understanding of visualized data by adapting Bloom's taxonomy, a common framework from the education literature. We describe each level of the framework and provide examples for how it can be applied to evaluate the efficacy of data visualizations along six levels of knowledge acquisition - knowledge, comprehension, application, analysis, synthesis, and evaluation. We present three case studies showing that this framework expands on existing methods to comprehensively measure how a visualization design facilitates a viewer's understanding of visualizations. Although Bloom's original taxonomy suggests a strong hierarchical structure for some domains, we found few examples of dependent relationships between performance at different levels for our three case studies. If this level-independence holds across new tested visualizations, the taxonomy could serve to inspire more targeted evaluations of levels of understanding that are relevant to a communication goal.

Submitted 3 September, 2020; originally announced September 2020.

Comments: 8 pages, 3 figures, accepted for presentation at BELIV 2020

arXiv:2004.04312 [pdf, other]

Learning to Scale Multilingual Representations for Vision-Language Tasks

Authors: Andrea Burns, Donghyun Kim, Derry Wijaya, Kate Saenko, Bryan A. Plummer

Abstract: Current multilingual vision-language models either require a large number of additional parameters for each supported language, or suffer performance degradation as languages are added. In this paper, we propose a Scalable Multilingual Aligned Language Representation (SMALR) that supports many languages with few model parameters without sacrificing downstream task performance. SMALR learns a fixed size language-agnostic representation for most words in a multilingual vocabulary, keeping language-specific features for just a few. We use a masked cross-language modeling loss to align features with context from other languages. Additionally, we propose a cross-lingual consistency module that ensures predictions made for a query and its machine translation are comparable. The effectiveness of SMALR is demonstrated with ten diverse languages, over twice the number supported in vision-language tasks to date. We evaluate on multilingual image-sentence retrieval and outperform prior work by 3-4% with less than 1/5th the training parameters compared to other word embedding methods.

Submitted 27 August, 2020; v1 submitted 8 April, 2020; originally announced April 2020.

Comments: ECCV 2020 accepted spotlight paper

arXiv:1908.06327 [pdf, other]

Language Features Matter: Effective Language Representations for Vision-Language Tasks

Authors: Andrea Burns, Reuben Tan, Kate Saenko, Stan Sclaroff, Bryan A. Plummer

Abstract: Shouldn't language and vision features be treated equally in vision-language (VL) tasks? Many VL approaches treat the language component as an afterthought, using simple language models that are either built upon fixed word embeddings trained on text-only data or are learned from scratch. We believe that language features deserve more attention, and conduct experiments which compare different word embeddings, language models, and embedding augmentation steps on five common VL tasks: image-sentence retrieval, image captioning, visual question answering, phrase grounding, and text-to-clip retrieval. Our experiments provide some striking results; an average embedding language model outperforms an LSTM on retrieval-style tasks; state-of-the-art representations such as BERT perform relatively poorly on vision-language tasks. From this comprehensive set of experiments we propose a set of best practices for incorporating the language component of VL tasks. To further elevate language features, we also show that knowledge in vision-language problems can be transferred across tasks to gain performance with multi-task training. This multi-task training is applied to a new Graph Oriented Vision-Language Embedding (GrOVLE), which we adapt from Word2Vec using WordNet and an original visual-language graph built from Visual Genome, providing a ready-to-use vision-language embedding: http://ai.bu.edu/grovle.

Submitted 17 August, 2019; originally announced August 2019.

Comments: ICCV 2019 accepted paper

arXiv:1902.10160 [pdf]

doi 10.1002/col.22384

Chromatic Adaptation Transform by Spectral Reconstruction (Preprint)

Authors: Scott A Burns

Abstract: A color appearance model (CAM) is an advanced colorimetric tool used to predict color appearance under a wide variety of viewing conditions. A chromatic adaptation transform (CAT) is an integral part of a CAM. Its role is to predict "corresponding colors," that is, a pair of colors that have the same color appearance when viewed under different illuminants, after partial or full adaptation to each illuminant. Modern CATs perform well when applied to a limited range of illuminant pairs and a limited range of source (test) colors. However, they can fail if operated outside these ranges. For imaging applications, it is important to have a CAT that can operate on any real color and illuminant pair without failure. This paper proposes a new CAT that does not operate on the standard von Kries model of adaptation. Instead it relies on spectral reconstruction and how these reconstructions behave with respect to different illuminants. It is demonstrated that the proposed CAT is immune to some of the limitations of existing CATs (such as producing colors with negative tristimulus values). The proposed CAT does not use established empirical corresponding-color datasets to optimize performance, as most modern CATs do, yet it performs as well as or better than the most recent CATs when tested against the corresponding-color datasets. This increase in robustness comes at the expense of additional complexity and computational effort. If robustness is of prime importance, then the proposed method may be justifiable.

Submitted 28 September, 2019; v1 submitted 26 February, 2019; originally announced February 2019.

Comments: Ver 2 adds the abstract. Ver 3 gives attribution to Eq 1. Ver 4 adds publication notice. Ver 5 corrects Table 4. Ver 6 adds email address, date, and updates publication notice. Ver 7 adds link to full text of the final published version at Color Res Appl. Ver 8 adds citation of final publication

Journal ref: Color Res Appl. 2019;44(5):682-693

arXiv:1710.06364 [pdf]

Subtractive Color Mixture Computation

Authors: Scott Allen Burns

Abstract: Modeling subtractive color mixture (e.g., the way that paints mix) is difficult when working with colors described only by three-dimensional color space values, such as RGB. Although RGB values are sufficient to describe a specific color sensation, they do not contain enough information to predict the RGB color that would result from a subtractive mixture of two specified RGB colors. Methods do exist for accurately modeling subtractive mixture, such as the Kubelka-Munk equations, but require extensive spectrophotometric measurements of the mixed components, making them unsuitable for many computer graphics applications. This paper presents a strategy for modeling subtractive color mixture given only the RGB information of the colors being mixed, written for a general audience. The RGB colors are first transformed to generic, representative spectral distributions, and then this spectral information is used to perform the subtractive mixture, using the weighted arithmetic-geometric mean. This strategy provides reasonable, representative subtractive mixture colors with only modest computational effort and no experimental measurements. As such, it provides a useful way to model subtractive color mixture in computer graphics applications.

Submitted 17 October, 2017; originally announced October 2017.

ACM Class: I.3.7

arXiv:1710.05732 [pdf]

Generating Reflectance Curves from sRGB Triplets

Authors: Scott Allen Burns

Abstract: The color sensation evoked by an object depends on both the spectral power distribution of the illumination and the reflectance properties of the object being illuminated. The color sensation can be characterized by three color-space values, such as XYZ, RGB, HSV, L*a*b*, etc. It is straightforward to compute the three values given the illuminant and reflectance curves. The converse process of computing a reflectance curve given the color-space values and the illuminant is complicated by the fact that an infinite number of different reflectance curves can give rise to a single set of color-space values (metamerism). This paper presents five algorithms for generating a reflectance curve from a specified sRGB triplet, written for a general audience. The algorithms are designed to generate reflectance curves that are similar to those found with naturally occurring colored objects. The computed reflectance curves are compared to a database of thousands of reflectance curves measured from paints and pigments available both commercially and in nature, and the similarity is quantified. One particularly useful application of these algorithms is in the field of computer graphics, where modeling color transformations sometimes requires wavelength-specific information, such as when modeling subtractive color mixture.

Submitted 9 January, 2020; v1 submitted 11 October, 2017; originally announced October 2017.

Comments: v3 minor editing to clarify some points, and some webpage link updates, v4 adds the LHTSS method, v5 indicates LHTSS should be preferred to ILLSS generally

ACM Class: I.2.10; I.3.7; I.4.8

arXiv:1702.05398 [pdf, other]

Experiment Segmentation in Scientific Discourse as Clause-level Structured Prediction using Recurrent Neural Networks

Authors: Pradeep Dasigi, Gully A. P. C. Burns, Eduard Hovy, Anita de Waard

Abstract: We propose a deep learning model for identifying structure within experiment narratives in scientific literature. We take a sequence labeling approach to this problem, and label clauses within experiment narratives to identify the different parts of the experiment. Our dataset consists of paragraphs taken from open access PubMed papers labeled with rhetorical information as a result of our pilot annotation. Our model is a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) cells that labels clauses. The clause representations are computed by combining word representations using a novel attention mechanism that involves a separate RNN. We compare this model against LSTMs where the input layer has simple or no attention and a feature rich CRF model. Furthermore, we describe how our work could be useful for information extraction from scientific literature.

Submitted 17 February, 2017; originally announced February 2017.

arXiv:1606.02942 [pdf, other]

Analysis of buffering effects on hard real-time priority-preemptive wormhole networks

Authors: Leandro Soares Indrusiak, Alan Burns, Borislav Nikolic

Abstract: There are several approaches to analyse the worst-case response times of sporadic packets transmitted over priority-preemptive wormhole networks. In this paper, we provide an overview of the different approaches, discuss their strengths and weaknesses, and propose an approach that captures all effects considered by previous approaches while providing tight yet safe upper bounds for packet response times. We specifically address the problems created by buffering and backpressure in wormhole networks, which amplifies the problem of indirect interference in a way that has not been considered by the early analysis approaches. Didactic examples and large-scale experiments with synthetically generated packet flow sets provide evidence of the strength of the proposed approach.

Submitted 9 June, 2016; originally announced June 2016.

arXiv:1209.5922 [pdf]

Towards structured sharing of raw and derived neuroimaging data across existing resources

Authors: D. B. Keator, K. Helmer, J. Steffener, J. A. Turner, T. G. M. Van Erp, S. Gadde, N. Ashish, G. A. Burns, B. N. Nichols, S. S. Ghosh

Abstract: Data sharing efforts increasingly contribute to the acceleration of scientific discovery. Neuroimaging data is accumulating in distributed domain-specific databases and there is currently no integrated access mechanism nor an accepted format for the critically important meta-data that is necessary for making use of the combined, available neuroimaging data. In this manuscript, we present work from the Derived Data Working Group, an open-access group sponsored by the Biomedical Informatics Research Network (BIRN) and the International Neuroimaging Coordinating Facility (INCF) focused on practical tools for distributed access to neuroimaging data. The working group develops models and tools facilitating the structured interchange of neuroimaging meta-data and is making progress towards a unified set of tools for such data and meta-data exchange. We report on the key components required for integrated access to raw and derived neuroimaging data as well as associated meta-data and provenance across neuroimaging resources. The components include (1) a structured terminology that provides semantic context to data, (2) a formal data model for neuroimaging with robust tracking of data provenance, (3) a web service-based application programming interface (API) that provides a consistent mechanism to access and query the data model, and (4) a provenance library that can be used for the extraction of provenance data by image analysts and imaging software developers. We believe that the framework and set of tools outlined in this manuscript have great potential for solving many of the issues the neuroimaging community faces when sharing raw and derived neuroimaging data across the various existing database systems for the purpose of accelerating scientific discovery.

Submitted 6 March, 2013; v1 submitted 26 September, 2012; originally announced September 2012.

Showing 1–24 of 24 results for author: Burns, A