Boosting Online 3D Multi-Object Tracking through Camera-Radar Cross Check

Authors: Sheng-Yao Kuan, Jen-Hao Cheng, Hsiang-Wei Huang, Wenhao Chai, Cheng-Yen Yang, Hugo Latapie, Gaowen Liu, Bing-Fei Wu, Jenq-Neng Hwang

Abstract: In the domain of autonomous driving, the integration of multi-modal perception techniques based on data from diverse sensors has demonstrated substantial progress. Effectively surpassing the capabilities of state-of-the-art single-modality detectors through sensor fusion remains an active challenge. This work leverages the respective advantages of cameras in perspective view and radars in Bird's Eye View (BEV) to greatly enhance overall detection and tracking performance. Our approach, Camera-Radar Associated Fusion Tracking Booster (CRAFTBooster), represents a pioneering effort to enhance radar-camera fusion in the tracking stage, contributing to improved 3D MOT accuracy. The superior experimental results on the K-Radar dataset, which exhibit a 5-6% gain in IDF1 tracking performance, validate the potential of effective sensor fusion in advancing autonomous driving.

Submitted 18 July, 2024; originally announced July 2024.

Comments: 2024 IEEE Intelligent Vehicles Symposium (IV)

arXiv:2407.06167 [pdf, other]

DεpS: Delayed ε-Shrinking for Faster Once-For-All Training

Authors: Aditya Annavajjala, Alind Khare, Animesh Agrawal, Igor Fedorov, Hugo Latapie, Myungjin Lee, Alexey Tumanov

Abstract: CNNs are increasingly deployed across different hardware, dynamic environments, and low-power embedded devices. This has led to the design and training of CNN architectures with the goal of maximizing accuracy subject to such variable deployment constraints. As the number of deployment scenarios grows, there is a need to find scalable solutions to design and train specialized CNNs. Once-for-all training has emerged as a scalable approach that jointly co-trains many models (subnets) at once with a constant training cost and finds specialized CNNs later. The scalability is achieved by training the full model and simultaneously reducing it to smaller subnets that share model weights (weight-shared shrinking). However, existing once-for-all training approaches incur huge training costs reaching 1200 GPU hours. We argue this is because they either start the process of shrinking the full model too early or too late. Hence, we propose Delayed ε-Shrinking (DεpS) that starts the process of shrinking the full model when it is partially trained (~50%), which leads to training cost improvement and better in-place knowledge distillation to smaller models. The proposed approach also consists of novel heuristics that dynamically adjust subnet learning rates incrementally (ε), leading to improved weight-shared knowledge distillation from larger to smaller subnets as well. As a result, DεpS outperforms state-of-the-art once-for-all training techniques across different datasets including CIFAR10/100, ImageNet-100, and ImageNet-1k on accuracy and cost. It achieves 1.83% higher ImageNet-1k top-1 accuracy or the same accuracy with 1.3x reduction in FLOPs and 2.5x drop in training cost (GPU*hrs).

Submitted 8 July, 2024; originally announced July 2024.

Comments: Accepted to the 18th European Conference on Computer Vision (ECCV 2024)

arXiv:2403.08108 [pdf, other]

TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection

Authors: Hanning Chen, Wenjun Huang, Yang Ni, Sanggeon Yun, Yezi Liu, Fei Wen, Alvaro Velasquez, Hugo Latapie, Mohsen Imani

Abstract: Task-oriented object detection aims to find objects suitable for accomplishing specific tasks. As a challenging task, it requires simultaneous visual data processing and reasoning under ambiguous semantics. Recent solutions are mainly all-in-one models. However, the object detection backbones are pre-trained without text supervision. Thus, to incorporate task requirements, their intricate models undergo extensive learning on a highly imbalanced and scarce dataset, resulting in capped performance, laborious training, and poor generalizability. In contrast, we propose TaskCLIP, a more natural two-stage design composed of general object detection and task-guided object selection. Particularly for the latter, we resort to the recently successful large Vision-Language Models (VLMs) as our backbone, which provides rich semantic knowledge and a uniform embedding space for images and texts. Nevertheless, the naive application of VLMs leads to sub-optimal quality, due to the misalignment between embeddings of object images and their visual attributes, which are mainly adjective phrases. To this end, we design a transformer-based aligner after the pre-trained VLMs to re-calibrate both embeddings. Finally, we employ a trainable score function to post-process the VLM matching results for object selection. Experimental results demonstrate that our TaskCLIP outperforms the state-of-the-art DETR-based model TOIST by 3.5% and only requires a single NVIDIA RTX 4090 for both training and inference.

Submitted 6 September, 2024; v1 submitted 12 March, 2024; originally announced March 2024.

arXiv:2403.05763 [pdf, other]

HDReason: Algorithm-Hardware Codesign for Hyperdimensional Knowledge Graph Reasoning

Authors: Hanning Chen, Yang Ni, Ali Zakeri, Zhuowen Zou, Sanggeon Yun, Fei Wen, Behnam Khaleghi, Narayan Srinivasa, Hugo Latapie, Mohsen Imani

Abstract: In recent times, a plethora of hardware accelerators have been put forth for graph learning applications such as vertex classification and graph classification. However, previous works have paid little attention to Knowledge Graph Completion (KGC), a task that is well-known for its significantly higher algorithm complexity. The state-of-the-art KGC solutions based on graph convolution neural network (GCN) involve extensive vertex/relation embedding updates and complicated score functions, which are inherently cumbersome for acceleration. As a result, existing accelerator designs are no longer optimal, and a novel algorithm-hardware co-design for KG reasoning is needed. Recently, brain-inspired HyperDimensional Computing (HDC) has been introduced as a promising solution for lightweight machine learning, particularly for graph learning applications. In this paper, we leverage HDC for an intrinsically more efficient and acceleration-friendly KGC algorithm. We also co-design an acceleration framework named HDReason targeting FPGA platforms. On the algorithm level, HDReason achieves a balance between high reasoning accuracy, strong model interpretability, and lower computational complexity. In terms of architecture, HDReason offers reconfigurability, high training throughput, and low energy consumption. When compared with NVIDIA RTX 4090 GPU, the proposed accelerator achieves an average 10.6x speedup and 65x energy efficiency improvement. In cross-model and cross-platform comparisons, HDReason yields an average 4.2x higher performance and 3.4x better energy efficiency with similar accuracy versus the state-of-the-art FPGA-based GCN training platform.

Submitted 8 March, 2024; originally announced March 2024.

arXiv:2402.14672 [pdf, other]

Middleware for LLMs: Tools Are Instrumental for Language Agents in Complex Environments

Authors: Yu Gu, Yiheng Shu, Hao Yu, Xiao Liu, Yuxiao Dong, Jie Tang, Jayanth Srinivasa, Hugo Latapie, Yu Su

Abstract: The applications of large language models (LLMs) have expanded well beyond the confines of text processing, signaling a new era where LLMs are envisioned as generalist language agents capable of operating within complex real-world environments. These environments are often highly expansive, making it impossible for the LLM to process them within its short-term memory. Motivated by recent research on extending the capabilities of LLMs with tools, this paper investigates the intriguing potential of tools to augment LLMs in handling such complexity. To this end, we design customized tools to aid in the proactive exploration within these massive environments. Such tools can serve as a middleware layer shielding the LLM from environmental complexity. In two representative complex environments -- knowledge bases (KBs) and databases -- we demonstrate the significant potential of augmenting language agents with tools in complex environments. Notably, equipped with these tools, GPT-4 achieves 2.8X the performance of the best baseline in tasks requiring access to database content and 2.2X in KB tasks. Our findings illuminate the path for advancing language agents in complex real-world applications.

Submitted 22 February, 2024; originally announced February 2024.

Comments: 16 pages, 8 figures, 4 tables

ACM Class: I.2.7

arXiv:2311.01623 [pdf, other]

VQPy: An Object-Oriented Approach to Modern Video Analytics

Authors: Shan Yu, Zhenting Zhu, Yu Chen, Hanchen Xu, Pengzhan Zhao, Yang Wang, Arthi Padmanabhan, Hugo Latapie, Harry Xu

Abstract: Video analytics is widely used in contemporary systems and services. At the forefront of video analytics are video queries that users develop to find objects of particular interest. Building upon the insight that video objects (e.g., humans, animals, and cars), the center of video analytics, are similar in spirit to objects modeled by traditional object-oriented languages, we propose to develop an object-oriented approach to video analytics. This approach, named VQPy, consists of a frontend (a Python variant with constructs that make it easy for users to express video objects and their interactions) as well as an extensible backend that can automatically construct and optimize pipelines based on video objects. We have implemented and open-sourced VQPy, which has been productized in Cisco as part of its DeepVision framework.

Submitted 3 June, 2024; v1 submitted 3 November, 2023; originally announced November 2023.

Comments: MLSys'24

arXiv:2311.01423 [pdf, other]

CenterRadarNet: Joint 3D Object Detection and Tracking Framework using 4D FMCW Radar

Authors: Jen-Hao Cheng, Sheng-Yao Kuan, Hugo Latapie, Gaowen Liu, Jenq-Neng Hwang

Abstract: Robust perception is a vital component for ensuring safe autonomous and assisted driving. Automotive radar (77 to 81 GHz), which offers weather-resilient sensing, provides a complementary capability to the vision- or LiDAR-based autonomous driving systems. Raw radio-frequency (RF) radar tensors contain rich spatiotemporal semantics besides 3D location information. The majority of previous methods take in 3D (Doppler-range-azimuth) RF radar tensors, allowing prediction of an object's location, heading angle, and size in bird's-eye-view (BEV). However, they cannot simultaneously infer objects' size, orientation, and identity in 3D space. To overcome this limitation, we propose an efficient joint architecture called CenterRadarNet, designed to facilitate high-resolution representation learning from 4D (Doppler-range-azimuth-elevation) radar data for 3D object detection and re-identification (re-ID) tasks. As a single-stage 3D object detector, CenterRadarNet directly infers the BEV object distribution confidence maps, corresponding 3D bounding box attributes, and appearance embedding for each pixel. Moreover, we build an online tracker utilizing the learned appearance embedding for re-ID. CenterRadarNet achieves the state-of-the-art result on the K-Radar 3D object detection benchmark. In addition, we present the first 3D object-tracking result using radar on the K-Radar dataset V2. In diverse driving scenarios, CenterRadarNet shows consistent, robust performance, emphasizing its wide applicability.

Submitted 4 November, 2023; v1 submitted 2 November, 2023; originally announced November 2023.

arXiv:2307.10577 [pdf, other]

Ethosight: A Reasoning-Guided Iterative Learning System for Nuanced Perception based on Joint-Embedding & Contextual Label Affinity

Authors: Hugo Latapie, Shan Yu, Patrick Hammer, Kristinn R. Thorisson, Vahagn Petrosyan, Brandon Kynoch, Alind Khare, Payman Behnam, Alexey Tumanov, Aksheit Saxena, Anish Aralikatti, Hanning Chen, Mohsen Imani, Mike Archbold, Tangrui Li, Pei Wang, Justin Hart

Abstract: Traditional computer vision models often necessitate extensive data acquisition, annotation, and validation. These models frequently struggle in real-world applications, resulting in high false positive and negative rates, and exhibit poor adaptability to new scenarios, often requiring costly retraining. To address these issues, we present Ethosight, a flexible and adaptable zero-shot video analytics system. Ethosight begins from a clean slate based on user-defined video analytics, specified through natural language or keywords, and leverages joint embedding models and reasoning mechanisms informed by ontologies such as WordNet and ConceptNet. Ethosight operates effectively on low-cost edge devices and supports enhanced runtime adaptation, thereby offering a new approach to continuous learning without catastrophic forgetting. We provide empirical validation of Ethosight's promising effectiveness across diverse and complex use cases, while highlighting areas for further improvement. A significant contribution of this work is the release of all source code and datasets to enable full reproducibility and to foster further innovation in both the research and commercial domains.

Submitted 20 August, 2023; v1 submitted 20 July, 2023; originally announced July 2023.

arXiv:2307.02738 [pdf, other]

RecallM: An Adaptable Memory Mechanism with Temporal Understanding for Large Language Models

Authors: Brandon Kynoch, Hugo Latapie, Dwane van der Sluis

Abstract: Large Language Models (LLMs) have made extraordinary progress in the field of Artificial Intelligence and have demonstrated remarkable capabilities across a large variety of tasks and domains. However, as we venture closer to creating Artificial General Intelligence (AGI) systems, we recognize the need to supplement LLMs with long-term memory to overcome the context window limitation and more importantly, to create a foundation for sustained reasoning, cumulative learning and long-term user interaction. In this paper we propose RecallM, a novel architecture for providing LLMs with an adaptable and updatable long-term memory mechanism. Unlike previous methods, the RecallM architecture is particularly effective at belief updating and maintaining a temporal understanding of the knowledge provided to it. We demonstrate through various experiments the effectiveness of this architecture. Furthermore, through our own temporal understanding and belief updating experiments, we show that RecallM is four times more effective than using a vector database for updating knowledge previously stored in long-term memory. We also demonstrate that RecallM shows competitive performance on general question-answering and in-context learning tasks.

Submitted 2 October, 2023; v1 submitted 5 July, 2023; originally announced July 2023.

Comments: 8 pages, 7 figures, 1 table. Our code is publicly available online at: https://github.com/cisco-open/DeepVision/tree/main/recallm

arXiv:2301.10879 [pdf, other]

SuperFedNAS: Cost-Efficient Federated Neural Architecture Search for On-Device Inference

Authors: Alind Khare, Animesh Agrawal, Aditya Annavajjala, Payman Behnam, Myungjin Lee, Hugo Latapie, Alexey Tumanov

Abstract: Neural Architecture Search (NAS) for Federated Learning (FL) is an emerging field. It automates the design and training of Deep Neural Networks (DNNs) when data cannot be centralized due to privacy, communication costs, or regulatory restrictions. Recent federated NAS methods not only reduce manual effort but also help achieve higher accuracy than traditional FL methods like FedAvg. Despite the success, existing federated NAS methods still fall short in satisfying diverse deployment targets common in on-device inference like hardware, latency budgets, or variable battery levels. Most federated NAS methods search for only a limited range of neuro-architectural patterns and repeat them in a DNN, thereby restricting achievable performance. Moreover, these methods incur prohibitive training costs to satisfy deployment targets. They perform the training and search of DNN architectures repeatedly for each case. SuperFedNAS addresses these challenges by decoupling the training and search in federated NAS. SuperFedNAS co-trains a large number of diverse DNN architectures contained inside one supernet in the FL setting. Post-training, clients perform NAS locally to find specialized DNNs by extracting different parts of the trained supernet with no additional training. SuperFedNAS takes O(1) (instead of O(N)) cost to find specialized DNN architectures in FL for any N deployment targets. As part of SuperFedNAS, we introduce MaxNet, a novel FL training algorithm that performs multi-objective federated optimization of a large number of DNN architectures ($\approx 5 \times 10^8$) under different client data distributions. Overall, SuperFedNAS achieves up to 37.7% higher accuracy for the same MACs or up to 8.13x reduction in MACs for the same accuracy than existing federated NAS methods.

Submitted 11 July, 2024; v1 submitted 25 January, 2023; originally announced January 2023.

Comments: Accepted at ECCV 2024

arXiv:2301.07099 [pdf, other]

Adaptive Deep Neural Network Inference Optimization with EENet

Authors: Fatih Ilhan, Ka-Ho Chow, Sihao Hu, Tiansheng Huang, Selim Tekin, Wenqi Wei, Yanzhao Wu, Myungjin Lee, Ramana Kompella, Hugo Latapie, Gaowen Liu, Ling Liu

Abstract: Well-trained deep neural networks (DNNs) treat all test samples equally during prediction. Adaptive DNN inference with early exiting leverages the observation that some test examples can be easier to predict than others. This paper presents EENet, a novel early-exiting scheduling framework for multi-exit DNN models. Instead of having every sample go through all DNN layers during prediction, EENet learns an early exit scheduler, which can intelligently terminate inference earlier for predictions on which the model is sufficiently confident to exit early. As opposed to previous early-exiting solutions with heuristics-based methods, our EENet framework optimizes an early-exiting policy to maximize model accuracy while satisfying the given per-sample average inference budget. Extensive experiments are conducted on four computer vision datasets (CIFAR-10, CIFAR-100, ImageNet, Cityscapes) and two NLP datasets (SST-2, AgNews). The results demonstrate that the adaptive inference by EENet can outperform the representative existing early exit techniques. We also perform a detailed visualization analysis of the comparison results to interpret the benefits of EENet.

Submitted 1 December, 2023; v1 submitted 14 January, 2023; originally announced January 2023.

arXiv:2212.09724 [pdf, other]

doi 10.1145/3583780.3614769

A Retrieve-and-Read Framework for Knowledge Graph Link Prediction

Authors: Vardaan Pahuja, Boshi Wang, Hugo Latapie, Jayanth Srinivasa, Yu Su

Abstract: Knowledge graph (KG) link prediction aims to infer new facts based on existing facts in the KG. Recent studies have shown that using the graph neighborhood of a node via graph neural networks (GNNs) provides more useful information compared to just using the query information. Conventional GNNs for KG link prediction follow the standard message-passing paradigm on the entire KG, which leads to superfluous computation, over-smoothing of node representations, and also limits their expressive power. On a large scale, it becomes computationally expensive to aggregate useful information from the entire KG for inference. To address the limitations of existing KG link prediction frameworks, we propose a novel retrieve-and-read framework, which first retrieves a relevant subgraph context for the query and then jointly reasons over the context and the query with a high-capacity reader. As part of our exemplar instantiation for the new framework, we propose a novel Transformer-based GNN as the reader, which incorporates graph-based attention structure and cross-attention between query and context for deep fusion. This simple yet effective design enables the model to focus on salient context information relevant to the query. Empirical results on two standard KG link prediction datasets demonstrate the competitive performance of the proposed method. Furthermore, our analysis yields valuable insights for designing improved retrievers within the framework.

Submitted 22 October, 2023; v1 submitted 19 December, 2022; originally announced December 2022.

Comments: Accepted to CIKM'23; Published version DOI: https://doi.org/10.1145/3583780.3614769 ; 12 pages, 4 figures

Journal ref: CIKM (2023) 1992-2002

arXiv:2208.03620 [pdf, other]

Learning Omnidirectional Flow in 360-degree Video via Siamese Representation

Authors: Keshav Bhandari, Bin Duan, Gaowen Liu, Hugo Latapie, Ziliang Zong, Yan Yan

Abstract: Optical flow estimation in omnidirectional videos faces two significant issues: the lack of benchmark datasets and the challenge of adapting perspective video-based methods to accommodate the omnidirectional nature. This paper proposes the first perceptually natural-synthetic omnidirectional benchmark dataset with a 360-degree field of view, FLOW360, with 40 different videos and 4,000 video frames. We conduct comprehensive characteristic analysis and comparisons between our dataset and existing optical flow datasets, which manifest perceptual realism, uniqueness, and diversity. To accommodate the omnidirectional nature, we present a novel Siamese representation Learning framework for Omnidirectional Flow (SLOF). We train our network in a contrastive manner with a hybrid loss function that combines contrastive loss and optical flow loss. Extensive experiments verify the proposed framework's effectiveness and show up to 40% performance improvement over the state-of-the-art approaches. Our FLOW360 dataset and code are available at https://siamlof.github.io/.

Submitted 6 August, 2022; originally announced August 2022.

Comments: Accepted to ECCV22

arXiv:2112.01603 [pdf]

Neurosymbolic Systems of Perception & Cognition: The Role of Attention

Authors: Hugo Latapie, Ozkan Kilic, Kristinn R. Thorisson, Pei Wang, Patrick Hammer

Abstract: A cognitive architecture aimed at cumulative learning must provide the necessary information and control structures to allow agents to learn incrementally and autonomously from their experience. This involves managing an agent's goals as well as continuously relating sensory information to these in its perception-cognition information stack. The more varied the environment of a learning agent is, the more general and flexible must be these mechanisms to handle a wider variety of relevant patterns, tasks, and goal structures. While many researchers agree that information at different levels of abstraction likely differs in its makeup and structure and processing mechanisms, agreement on the particulars of such differences is not generally shared in the research community. A binary processing architecture (often referred to as System-1 and System-2) has been proposed as a model of cognitive processing for low- and high-level information, respectively. We posit that cognition is not binary in this way and that knowledge at any level of abstraction involves what we refer to as neurosymbolic information, meaning that data at both high and low levels must contain both symbolic and subsymbolic information. Further, we argue that the main differentiating factor between the processing of high and low levels of data abstraction can be largely attributed to the nature of the involved attention mechanisms. We describe the key arguments behind this view and review relevant evidence from the literature.

Submitted 2 December, 2021; originally announced December 2021.

arXiv:2107.03120 [pdf, other]

Cross-View Exocentric to Egocentric Video Synthesis

Authors: Gaowen Liu, Hao Tang, Hugo Latapie, Jason Corso, Yan Yan

Abstract: The cross-view video synthesis task seeks to generate video sequences of one view from another dramatically different view. In this paper, we investigate the exocentric (third-person) view to egocentric (first-person) view video generation task. This is challenging because the egocentric view is sometimes remarkably different from the exocentric view. Thus, transforming the appearances across the two different views is a non-trivial task. Particularly, we propose a novel Bi-directional Spatial Temporal Attention Fusion Generative Adversarial Network (STA-GAN) to learn both spatial and temporal information to generate egocentric video sequences from the exocentric view. The proposed STA-GAN consists of three parts: temporal branch, spatial branch, and attention fusion. First, the temporal and spatial branches generate a sequence of fake frames and their corresponding features. The fake frames are generated in both downstream and upstream directions for both temporal and spatial branches. Next, the generated four different fake frames and their corresponding features (spatial and temporal branches in two directions) are fed into a novel multi-generation attention fusion module to produce the final video sequence. Meanwhile, we also propose a novel temporal and spatial dual-discriminator for more robust network optimization. Extensive experiments on the Side2Ego and Top2Ego datasets show that the proposed STA-GAN significantly outperforms the existing methods.

Submitted 7 July, 2021; originally announced July 2021.

Comments: ACM MM 2021

arXiv:2102.06112 [pdf, other]

A Metamodel and Framework for Artificial General Intelligence From Theory to Practice

Authors: Hugo Latapie, Ozkan Kilic, Gaowen Liu, Yan Yan, Ramana Kompella, Pei Wang, Kristinn R. Thorisson, Adam Lawrence, Yuhong Sun, Jayanth Srinivasa

Abstract: This paper introduces a new metamodel-based knowledge representation that significantly improves autonomous learning and adaptation. While interest in hybrid machine learning / symbolic AI systems leveraging, for example, reasoning and knowledge graphs, is growing, we find there remains a need for both a clear definition of knowledge and a metamodel to guide the creation and manipulation of knowledge. Some of the benefits of the metamodel we introduce in this paper include a solution to the symbol grounding problem, cumulative learning, and federated learning. We have applied the metamodel to problems ranging from time series analysis and computer vision to natural language understanding and have found that the metamodel enables a wide variety of learning mechanisms, ranging from machine learning to graph network analysis and learning by reasoning engines, to interoperate in a highly synergistic way. Our metamodel-based projects have consistently exhibited unprecedented accuracy, performance, and ability to generalize. This paper is inspired by the state-of-the-art approaches to AGI, recent AGI-aspiring work, the granular computing community, as well as Alfred Korzybski's general semantics. One surprising consequence of the metamodel is that it not only enables a new level of autonomous learning and optimal functioning for machine intelligences, but may also shed light on a path to better understanding how to improve human cognition.

Submitted 11 February, 2021; originally announced February 2021.

Comments: arXiv admin note: text overlap with arXiv:2008.12879

arXiv:2102.03424 [pdf, other]

Learning Audio-Visual Correlations from Variational Cross-Modal Generation

Authors: Ye Zhu, Yu Wu, Hugo Latapie, Yi Yang, Yan Yan

Abstract: People can easily imagine the potential sound while seeing an event. This natural synchronization between audio and visual signals reveals their intrinsic correlations. To this end, we propose to learn the audio-visual correlations from the perspective of cross-modal generation in a self-supervised manner; the learned correlations can then be readily applied in multiple downstream tasks such as audio-visual cross-modal localization and retrieval. We introduce a novel Variational AutoEncoder (VAE) framework that consists of Multiple encoders and a Shared decoder (MS-VAE) with an additional Wasserstein distance constraint to tackle the problem. Extensive experiments demonstrate that the optimized latent representation of the proposed MS-VAE can effectively learn the audio-visual correlations and can be readily applied in multiple audio-visual downstream tasks to achieve competitive performance even without any given label information during training.

Submitted 14 February, 2021; v1 submitted 5 February, 2021; originally announced February 2021.

Comments: Accepted to ICASSP 2021

arXiv:2010.08055 [pdf, other]

Egok360: A 360 Egocentric Kinetic Human Activity Video Dataset

Authors: Keshav Bhandari, Mario A. DeLaGarza, Ziliang Zong, Hugo Latapie, Yan Yan

Abstract: Recently, there has been a growing interest in wearable sensors, which provide new research perspectives for 360° video analysis. However, the lack of 360° datasets in the literature hinders research in this field. To bridge this gap, in this paper we propose a novel Egocentric (first-person) 360° Kinetic human activity video dataset (EgoK360). The EgoK360 dataset contains annotations of human activity with different sub-actions, e.g., the activity Ping-Pong with four sub-actions: pickup-ball, hit, bounce-ball and serve. To the best of our knowledge, EgoK360 is the first dataset in the domain of first-person activity recognition with a 360° environmental setup, which will facilitate egocentric 360° video understanding. We provide experimental results and comprehensive analysis of variants of the two-stream network for 360° egocentric activity recognition. The EgoK360 dataset can be downloaded from https://egok360.github.io/.

Submitted 15 October, 2020; originally announced October 2020.

Comments: 5 pages, 5 figures, 1 table, 2020 IEEE International Conference on Image Processing (ICIP)

arXiv:2008.12879 [pdf, ps, other]

A Metamodel and Framework for AGI

Authors: Hugo Latapie, Ozkan Kilic

Abstract: Can artificial intelligence systems exhibit superhuman performance, but in critical ways, lack the intelligence of even a single-celled organism? The answer is clearly 'yes' for narrow AI systems. Animals, plants, and even single-celled organisms learn to reliably avoid danger and move towards food. This is accomplished via a physical knowledge preserving metamodel that autonomously generates useful models of the world. We posit that preserving the structure of knowledge is critical for higher intelligences that manage increasingly higher levels of abstraction, be they human or artificial. This is the key lesson learned from applying AGI subsystems to complex real-world problems that require continuous learning and adaptation. In this paper, we introduce the Deep Fusion Reasoning Engine (DFRE), which implements a knowledge-preserving metamodel and framework for constructing applied AGI systems. The DFRE metamodel exhibits some important fundamental knowledge preserving properties such as clear distinctions between symmetric and antisymmetric relations, and the ability to create a hierarchical knowledge representation that clearly delineates between levels of abstraction. The DFRE metamodel, which incorporates these capabilities, demonstrates how this approach benefits AGI in specific ways such as managing combinatorial explosion and enabling cumulative, distributed and federated learning. Our experiments show that the proposed framework achieves 94% accuracy on average on unsupervised object detection and recognition. This work is inspired by the state-of-the-art approaches to AGI, recent AGI-aspiring work, the granular computing community, as well as Alfred Korzybski's general semantics.

Submitted 6 September, 2020; v1 submitted 28 August, 2020; originally announced August 2020.

arXiv:2002.03219 [pdf, other]

Exocentric to Egocentric Image Generation via Parallel Generative Adversarial Network

Authors: Gaowen Liu, Hao Tang, Hugo Latapie, Yan Yan

Abstract: Cross-view image generation has recently been proposed to generate images of one view from another dramatically different view. In this paper, we investigate exocentric (third-person) view to egocentric (first-person) view image generation. This is a challenging task since the egocentric view is sometimes remarkably different from the exocentric view. Thus, transforming the appearances across the two views is a non-trivial task. To this end, we propose a novel Parallel Generative Adversarial Network (P-GAN) with a novel cross-cycle loss to learn the shared information for generating egocentric images from the exocentric view. We also incorporate a novel contextual feature loss in the learning procedure to capture the contextual information in images. Extensive experiments on the Exo-Ego datasets show that our model outperforms the state-of-the-art approaches.

Submitted 8 February, 2020; originally announced February 2020.

Comments: It has been accepted by ICASSP 2020

arXiv:1907.01826 [pdf, other]

Cascade Attention Guided Residue Learning GAN for Cross-Modal Translation

Authors: Bin Duan, Wei Wang, Hao Tang, Hugo Latapie, Yan Yan

Abstract: From infancy, we intuitively develop the ability to correlate the input from different cognitive sensors such as vision, audio, and text. However, in machine learning, this cross-modal learning is a nontrivial task because different modalities have no homogeneous properties. Previous work has found that there should be bridges among different modalities. From the perspectives of neurology and psychology, humans have the capacity to link one modality with another, e.g., associating a picture of a bird with only the sound of its singing and vice versa. Is it possible for machine learning algorithms to recover the scene given the audio signal? In this paper, we propose a novel Cascade Attention-Guided Residue GAN (CAR-GAN), aiming at reconstructing the scenes given the corresponding audio signals. Particularly, we present a residue module to mitigate the gap between different modalities progressively. Moreover, a cascade attention guided network with a novel classification loss function is designed to tackle the cross-modal learning task. Our model keeps the consistency in the high-level semantic label domain and is able to balance two different modalities. The experimental results demonstrate that our model achieves state-of-the-art cross-modal audio-visual generation on the challenging Sub-URMP dataset. Code will be available at https://github.com/tuffr5/CAR-GAN.

Submitted 10 December, 2021; v1 submitted 3 July, 2019; originally announced July 2019.

Comments: 9 pages, 6 figures, update template

arXiv:1807.10591 [pdf, other]

Metric Embedding Autoencoders for Unsupervised Cross-Dataset Transfer Learning

Authors: Alexey Potapov, Sergey Rodionov, Hugo Latapie, Enzo Fenoglio

Abstract: Cross-dataset transfer learning is an important problem in person re-identification (Re-ID). Unfortunately, not too many deep transfer Re-ID models exist for realistic settings of practical Re-ID systems. We propose a purely deep transfer Re-ID model consisting of a deep convolutional neural network and an autoencoder. The latent code is divided into metric embedding and nuisance variables. We then utilize an unsupervised training method that does not rely on co-training with non-deep models. Our experiments show improvements over both the baseline and competitors' transfer learning models.

Submitted 18 July, 2018; originally announced July 2018.

Comments: ICANN 2018 (The 27th International Conference on Artificial Neural Networks) proceeding

arXiv:1807.08526 [pdf]

doi 10.1007/978-3-319-92007-8_7

Improving Deep Models of Person Re-identification for Cross-Dataset Usage

Authors: Sergey Rodionov, Alexey Potapov, Hugo Latapie, Enzo Fenoglio, Maxim Peterson

Abstract: Person re-identification (Re-ID) is the task of matching humans across cameras with non-overlapping views, which has important applications in visual surveillance. Like other computer vision tasks, this task has gained much from the use of deep learning methods. However, existing solutions based on deep learning are usually trained and tested on samples taken from the same datasets, while in practice one needs to deploy Re-ID systems for new sets of cameras for which labeled data is unavailable. Here, we mitigate this problem for one state-of-the-art model, namely, metric embedding trained with the triplet loss function, although our results can be extended to other models. The contribution of our work consists in developing a method of training the model on multiple datasets, and a method for its online, practically unsupervised fine-tuning. These methods yield up to 19.1% improvement in Rank-1 score in the cross-dataset evaluation.

Submitted 23 July, 2018; originally announced July 2018.

Comments: AIAI 2018 (14th International Conference on Artificial Intelligence Applications and Innovations) proceeding. The final publication is available at link.springer.com
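
For readers unfamiliar with the two terms used in the abstract above, the following is a minimal sketch of a triplet loss and a Rank-1 score. It is not the authors' implementation: the Euclidean distance, the margin value of 0.2, and the function names `triplet_loss` and `rank1_accuracy` are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss (margin value is an assumption).

    The loss is zero once the anchor is closer to the positive (same
    identity) than to the negative (different identity) by at least `margin`.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def rank1_accuracy(queries, gallery):
    """Fraction of queries whose nearest gallery embedding has the same identity.

    `queries` and `gallery` are lists of (identity, embedding) pairs.
    """
    hits = 0
    for query_id, query_emb in queries:
        nearest_id, _ = min(gallery, key=lambda item: euclidean(query_emb, item[1]))
        hits += nearest_id == query_id
    return hits / len(queries)
```

In a cross-dataset evaluation of the kind the abstract describes, the gallery and queries would come from a camera set not seen during training; the sketch only shows how the score itself is computed.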

arXiv:1806.06946 [pdf]

Semantic Image Retrieval by Uniting Deep Neural Networks and Cognitive Architectures

Authors: Alexey Potapov, Innokentii Zhdanov, Oleg Scherbakov, Nikolai Skorobogatko, Hugo Latapie, Enzo Fenoglio

Abstract: Image and video retrieval by their semantic content has been an important and challenging task for years, because it ultimately requires bridging the symbolic/subsymbolic gap. Recent successes in deep learning enabled detection of objects belonging to many classes greatly outperforming traditional computer vision techniques. However, deep learning solutions capable of executing retrieval queries are still not available. We propose a hybrid solution consisting of a deep neural network for object detection and a cognitive architecture for query execution. Specifically, we use YOLOv2 and OpenCog. Queries allowing the retrieval of video frames containing objects of specified classes and specified spatial arrangement are implemented.

Submitted 14 June, 2018; originally announced June 2018.

arXiv:1606.06222 [pdf, other]

doi 10.1145/3138808.3138810

Knowledge-Defined Networking

Authors: Albert Mestres, Alberto Rodriguez-Natal, Josep Carner, Pere Barlet-Ros, Eduard Alarcón, Marc Solé, Victor Muntés, David Meyer, Sharon Barkai, Mike J Hibbett, Giovani Estrada, Khaldun Ma`ruf, Florin Coras, Vina Ermagan, Hugo Latapie, Chris Cassar, John Evans, Fabio Maino, Jean Walrand, Albert Cabellos

Abstract: The research community has in the past considered the application of Artificial Intelligence (AI) techniques to control and operate networks. A notable example is the Knowledge Plane proposed by D. Clark et al. However, such techniques have not yet been extensively prototyped or deployed in the field. In this paper, we explore the reasons for the lack of adoption and posit that the rise of two recent paradigms, Software-Defined Networking (SDN) and Network Analytics (NA), will facilitate the adoption of AI techniques in the context of network operation and control. We describe a new paradigm that accommodates and exploits SDN, NA and AI, and provide use cases that illustrate its applicability and benefits. We also present simple experimental results that support its feasibility. We refer to this new paradigm as Knowledge-Defined Networking (KDN).

Submitted 23 June, 2016; v1 submitted 20 June, 2016; originally announced June 2016.

Comments: 8 pages, 22 references, 6 figures and 1 table

Journal ref: ACM SIGCOMM Computer Communication Review, Volume 47, Issue 3, July 2017

Showing 1–25 of 25 results for author: Latapie, H