Search | arXiv e-print repository

doi 10.1142/S0219025724400022

Continuous Krishna-Parthasarathy Entropic Uncertainty Principle

Abstract: In 2002, Krishna and Parthasarathy [\textit{Sankhyā Ser. A}] derived discrete quantum version of Maassen-Uffink [\textit{Phys. Rev. Lett., 1988}] entropic uncertainty principle. In this paper, using the notion of continuous operator-valued frames, we derive an entropic uncertainty principle for arbitrary family of operators indexed by measure spaces having finite measure. We give an application to… ▽ More In 2002, Krishna and Parthasarathy [\textit{Sankhyā Ser. A}] derived discrete quantum version of Maassen-Uffink [\textit{Phys. Rev. Lett., 1988}] entropic uncertainty principle. In this paper, using the notion of continuous operator-valued frames, we derive an entropic uncertainty principle for arbitrary family of operators indexed by measure spaces having finite measure. We give an application to the special case of compact groups. △ Less

Submitted 7 May, 2024; originally announced May 2024.

Comments: 7 pages, 0 Figures

MSC Class: 81P15; 94A17; 42C15

Journal ref: Special issue of Infinite Dimensional Analysis, Quantum Probability and Related Topics in honour of Prof. K. R. Parthasarathy, 18 March 2024

arXiv:2404.17922 [pdf, other]

Open-Set 3D Semantic Instance Maps for Vision Language Navigation -- O3D-SIM

Authors: Laksh Nanwani, Kumaraditya Gupta, Aditya Mathur, Swayam Agrawal, A. H. Abdul Hafez, K. Madhava Krishna

Abstract: Humans excel at forming mental maps of their surroundings, equipping them to understand object relationships and navigate based on language queries. Our previous work SI Maps [1] showed that having instance-level information and the semantic understanding of an environment helps significantly improve performance for language-guided tasks. We extend this instance-level approach to 3D while increasi… ▽ More Humans excel at forming mental maps of their surroundings, equipping them to understand object relationships and navigate based on language queries. Our previous work SI Maps [1] showed that having instance-level information and the semantic understanding of an environment helps significantly improve performance for language-guided tasks. We extend this instance-level approach to 3D while increasing the pipeline's robustness and improving quantitative and qualitative results. Our method leverages foundational models for object recognition, image segmentation, and feature extraction. We propose a representation that results in a 3D point cloud map with instance-level embeddings, which bring in the semantic understanding that natural language commands can query. Quantitatively, the work improves upon the success rate of language-guided tasks. At the same time, we qualitatively observe the ability to identify instances more clearly and leverage the foundational models and language and image-aligned embeddings to identify objects that, otherwise, a closed-set approach wouldn't be able to identify. △ Less

Submitted 27 April, 2024; originally announced April 2024.

arXiv:2404.17842 [pdf, other]

Using LLMs in Software Requirements Specifications: An Empirical Evaluation

Authors: Madhava Krishna, Bhagesh Gaur, Arsh Verma, Pankaj Jalote

Abstract: The creation of a Software Requirements Specification (SRS) document is important for any software development project. Given the recent prowess of Large Language Models (LLMs) in answering natural language queries and generating sophisticated textual outputs, our study explores their capability to produce accurate, coherent, and structured drafts of these documents to accelerate the software deve… ▽ More The creation of a Software Requirements Specification (SRS) document is important for any software development project. Given the recent prowess of Large Language Models (LLMs) in answering natural language queries and generating sophisticated textual outputs, our study explores their capability to produce accurate, coherent, and structured drafts of these documents to accelerate the software development lifecycle. We assess the performance of GPT-4 and CodeLlama in drafting an SRS for a university club management system and compare it against human benchmarks using eight distinct criteria. Our results suggest that LLMs can match the output quality of an entry-level software engineer to generate an SRS, delivering complete and consistent drafts. We also evaluate the capabilities of LLMs to identify and rectify problems in a given requirements document. Our experiments indicate that GPT-4 is capable of identifying issues and giving constructive feedback for rectifying them, while CodeLlama's results for validation were not as encouraging. We repeated the generation exercise for four distinct use cases to study the time saved by employing LLMs for SRS generation. The experiment demonstrates that LLMs may facilitate a significant reduction in development time for entry-level software engineers. Hence, we conclude that the LLMs can be gainfully used by software engineers to increase productivity by saving time and effort in generating, validating and rectifying software requirements. △ Less

Submitted 27 April, 2024; originally announced April 2024.

Comments: Accepted to RE@Next! at the IEEE International Requirements Engineering Conference 2024 at Reykjavik, Iceland

arXiv:2404.06442 [pdf, other]

QueSTMaps: Queryable Semantic Topological Maps for 3D Scene Understanding

Authors: Yash Mehan, Kumaraditya Gupta, Rohit Jayanti, Anirudh Govil, Sourav Garg, Madhava Krishna

Abstract: Understanding the structural organisation of 3D indoor scenes in terms of rooms is often accomplished via floorplan extraction. Robotic tasks such as planning and navigation require a semantic understanding of the scene as well. This is typically achieved via object-level semantic segmentation. However, such methods struggle to segment out topological regions like "kitchen" in the scene. In this w… ▽ More Understanding the structural organisation of 3D indoor scenes in terms of rooms is often accomplished via floorplan extraction. Robotic tasks such as planning and navigation require a semantic understanding of the scene as well. This is typically achieved via object-level semantic segmentation. However, such methods struggle to segment out topological regions like "kitchen" in the scene. In this work, we introduce a two-step pipeline. First, we extract a topological map, i.e., floorplan of the indoor scene using a novel multi-channel occupancy representation. Then, we generate CLIP-aligned features and semantic labels for every room instance based on the objects it contains using a self-attention transformer. Our language-topology alignment supports natural language querying, e.g., a "place to cook" locates the "kitchen". We outperform the current state-of-the-art on room segmentation by ~20% and room classification by ~12%. Our detailed qualitative analysis and ablation studies provide insights into the problem of joint structural and semantic 3D scene understanding. △ Less

Submitted 9 April, 2024; originally announced April 2024.

arXiv:2404.04643 [pdf, other]

Constrained 6-DoF Grasp Generation on Complex Shapes for Improved Dual-Arm Manipulation

Authors: Gaurav Singh, Sanket Kalwar, Md Faizal Karim, Bipasha Sen, Nagamanikandan Govindan, Srinath Sridhar, K Madhava Krishna

Abstract: Efficiently generating grasp poses tailored to specific regions of an object is vital for various robotic manipulation tasks, especially in a dual-arm setup. This scenario presents a significant challenge due to the complex geometries involved, requiring a deep understanding of the local geometry to generate grasps efficiently on the specified constrained regions. Existing methods only explore set… ▽ More Efficiently generating grasp poses tailored to specific regions of an object is vital for various robotic manipulation tasks, especially in a dual-arm setup. This scenario presents a significant challenge due to the complex geometries involved, requiring a deep understanding of the local geometry to generate grasps efficiently on the specified constrained regions. Existing methods only explore settings involving table-top/small objects and require augmented datasets to train, limiting their performance on complex objects. We propose CGDF: Constrained Grasp Diffusion Fields, a diffusion-based grasp generative model that generalizes to objects with arbitrary geometries, as well as generates dense grasps on the target regions. CGDF uses a part-guided diffusion approach that enables it to get high sample efficiency in constrained grasping without explicitly training on massive constraint-augmented datasets. We provide qualitative and quantitative comparisons using analytical metrics and in simulation, in both unconstrained and constrained settings to show that our method can generalize to generate stable grasps on complex objects, especially useful for dual-arm manipulation settings, while existing methods struggle to do so. △ Less

Submitted 15 July, 2024; v1 submitted 6 April, 2024; originally announced April 2024.

Comments: Project Page: https://constrained-grasp-diffusion.github.io/

arXiv:2404.03587 [pdf, other]

Anticipate & Collab: Data-driven Task Anticipation and Knowledge-driven Planning for Human-robot Collaboration

Authors: Shivam Singh, Karthik Swaminathan, Raghav Arora, Ramandeep Singh, Ahana Datta, Dipanjan Das, Snehasis Banerjee, Mohan Sridharan, Madhava Krishna

Abstract: An agent assisting humans in daily living activities can collaborate more effectively by anticipating upcoming tasks. Data-driven methods represent the state of the art in task anticipation, planning, and related problems, but these methods are resource-hungry and opaque. Our prior work introduced a proof of concept framework that used an LLM to anticipate 3 high-level tasks that served as goals f… ▽ More An agent assisting humans in daily living activities can collaborate more effectively by anticipating upcoming tasks. Data-driven methods represent the state of the art in task anticipation, planning, and related problems, but these methods are resource-hungry and opaque. Our prior work introduced a proof of concept framework that used an LLM to anticipate 3 high-level tasks that served as goals for a classical planning system that computed a sequence of low-level actions for the agent to achieve these goals. This paper describes DaTAPlan, our framework that significantly extends our prior work toward human-robot collaboration. Specifically, DaTAPlan planner computes actions for an agent and a human to collaboratively and jointly achieve the tasks anticipated by the LLM, and the agent automatically adapts to unexpected changes in human action outcomes and preferences. We evaluate DaTAPlan capabilities in a realistic simulation environment, demonstrating accurate task anticipation, effective human-robot collaboration, and the ability to adapt to unexpected changes. Project website: https://dataplan-hrc.github.io △ Less

Submitted 4 April, 2024; originally announced April 2024.

arXiv:2404.03307 [pdf, other]

Bi-level Trajectory Optimization on Uneven Terrains with Differentiable Wheel-Terrain Interaction Model

Authors: Amith Manoharan, Aditya Sharma, Himani Belsare, Kaustab Pal, K. Madhava Krishna, Arun Kumar Singh

Abstract: Navigation of wheeled vehicles on uneven terrain necessitates going beyond the 2D approaches for trajectory planning. Specifically, it is essential to incorporate the full 6dof variation of vehicle pose and its associated stability cost in the planning process. To this end, most recent works aim to learn a neural network model to predict the vehicle evolution. However, such approaches are data-int… ▽ More Navigation of wheeled vehicles on uneven terrain necessitates going beyond the 2D approaches for trajectory planning. Specifically, it is essential to incorporate the full 6dof variation of vehicle pose and its associated stability cost in the planning process. To this end, most recent works aim to learn a neural network model to predict the vehicle evolution. However, such approaches are data-intensive and fraught with generalization issues. In this paper, we present a purely model-based approach that just requires the digital elevation information of the terrain. Specifically, we express the wheel-terrain interaction and 6dof pose prediction as a non-linear least squares (NLS) problem. As a result, trajectory planning can be viewed as a bi-level optimization. The inner optimization layer predicts the pose on the terrain along a given trajectory, while the outer layer deforms the trajectory itself to reduce the stability and kinematic costs of the pose. We improve the state-of-the-art in the following respects. First, we show that our NLS based pose prediction closely matches the output from a high-fidelity physics engine. This result coupled with the fact that we can query gradients of the NLS solver, makes our pose predictor, a differentiable wheel-terrain interaction model. We further leverage this differentiability to efficiently solve the proposed bi-level trajectory optimization problem. Finally, we perform extensive experiments, and comparison with a baseline to showcase the effectiveness of our approach in obtaining smooth, stable trajectories. △ Less

Submitted 11 April, 2024; v1 submitted 4 April, 2024; originally announced April 2024.

Comments: 8 pages, 7 figures, submitted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024)

arXiv:2404.00910 [pdf, ps, other]

Unexpected Uncertainty Principle for Disc Banach Spaces

Authors: K. Mahesh Krishna

Abstract: Let $(\{f_n\}_{n=1}^\infty, \{τ_n\}_{n=1}^\infty)$ and $(\{g_n\}_{n=1}^\infty, \{ω_n\}_{n=1}^\infty)$ be unbounded continuous p-Schauder frames ($0<p<1$) for a disc Banach space $\mathcal{X}$. Then for every $x \in ( \mathcal{D}(θ_f) \cap\mathcal{D}(θ_g))\setminus\{0\}$, we show that \begin{align}\label{UB} (1) \quad \quad \quad \quad \|θ_f x\|_0\|θ_g x\|_0 \geq \frac{1}{\left(\displaystyle\sup_{n… ▽ More Let $(\{f_n\}_{n=1}^\infty, \{τ_n\}_{n=1}^\infty)$ and $(\{g_n\}_{n=1}^\infty, \{ω_n\}_{n=1}^\infty)$ be unbounded continuous p-Schauder frames ($0<p<1$) for a disc Banach space $\mathcal{X}$. Then for every $x \in ( \mathcal{D}(θ_f) \cap\mathcal{D}(θ_g))\setminus\{0\}$, we show that \begin{align}\label{UB} (1) \quad \quad \quad \quad \|θ_f x\|_0\|θ_g x\|_0 \geq \frac{1}{\left(\displaystyle\sup_{n,m \in \mathbb{N} }|f_n(ω_m)|\right)^p\left(\displaystyle\sup_{n, m \in \mathbb{N}}|g_m(τ_n)|\right)^p}, \end{align} where \begin{align*} & θ_f: \mathcal{D}(θ_f) \ni x \mapsto θ_fx := \{f_n(x)\}_{n=1}^\infty\in \ell^p(\mathbb{N}), \quad θ_g: \mathcal{D}(θ_g) \ni x \mapsto θ_gx := \{g_n(x)\}_{n=1}^\infty\in \ell^p(\mathbb{N}). \end{align*} Inequality (1) is unexpectedly different from both bounded uncertainty principle arXiv:2308.00312v1 and unbounded uncertainty principle arXiv:2312.00366v1 for Banach spaces. △ Less

Submitted 1 April, 2024; originally announced April 2024.

Comments: 6 Pages, 0 Figures

MSC Class: 42C15

arXiv:2403.20116 [pdf, other]

LeGo-Drive: Language-enhanced Goal-oriented Closed-Loop End-to-End Autonomous Driving

Authors: Pranjal Paul, Anant Garg, Tushar Choudhary, Arun Kumar Singh, K. Madhava Krishna

Abstract: Existing Vision-Language models (VLMs) estimate either long-term trajectory waypoints or a set of control actions as a reactive solution for closed-loop planning based on their rich scene comprehension. However, these estimations are coarse and are subjective to their "world understanding" which may generate sub-optimal decisions due to perception errors. In this paper, we introduce LeGo-Drive, wh… ▽ More Existing Vision-Language models (VLMs) estimate either long-term trajectory waypoints or a set of control actions as a reactive solution for closed-loop planning based on their rich scene comprehension. However, these estimations are coarse and are subjective to their "world understanding" which may generate sub-optimal decisions due to perception errors. In this paper, we introduce LeGo-Drive, which aims to address this issue by estimating a goal location based on the given language command as an intermediate representation in an end-to-end setting. The estimated goal might fall in a non-desirable region, like on top of a car for a parking-like command, leading to inadequate planning. Hence, we propose to train the architecture in an end-to-end manner, resulting in iterative refinement of both the goal and the trajectory collectively. We validate the effectiveness of our method through comprehensive experiments conducted in diverse simulated environments. We report significant improvements in standard autonomous driving metrics, with a goal reaching Success Rate of 81%. We further showcase the versatility of LeGo-Drive across different driving scenarios and linguistic inputs, underscoring its potential for practical deployment in autonomous vehicles and intelligent transportation systems. △ Less

Submitted 29 March, 2024; originally announced March 2024.

arXiv:2403.17946 [pdf, ps, other]

Nonlinear Heisenberg-Robertson-Schrodinger Uncertainty Principle

Authors: K. Mahesh Krishna

Abstract: We derive an uncertainty principle for Lipschitz maps acting on subsets of Banach spaces. We show that this nonlinear uncertainty principle reduces to the Heisenberg-Robertson-Schrodinger uncertainty principle for linear operators acting on Hilbert spaces. We derive an uncertainty principle for Lipschitz maps acting on subsets of Banach spaces. We show that this nonlinear uncertainty principle reduces to the Heisenberg-Robertson-Schrodinger uncertainty principle for linear operators acting on Hilbert spaces. △ Less

Submitted 1 March, 2024; originally announced March 2024.

Comments: 4 Pages, 0 Figures

MSC Class: 26A16; 46B99

arXiv:2402.08591 [pdf, ps, other]

Nonlinear Maccone-Pati Uncertainty Principle

Authors: K. Mahesh Krishna

Abstract: We show that one of the two important uncertainty principles derived by Maccone and Pati \textit{[Phys. Rev. Lett., 2014]} can be derived for arbitrary maps defined on subsets of $\mathcal{L}^p$ spaces for $1< p<\infty$. Our main tool is the Clarkson inequalities. We also derive a nonlinear uncertainty principle for weak parallelogram spaces and Type-p Banach spaces. We show that one of the two important uncertainty principles derived by Maccone and Pati \textit{[Phys. Rev. Lett., 2014]} can be derived for arbitrary maps defined on subsets of $\mathcal{L}^p$ spaces for $1< p<\infty$. Our main tool is the Clarkson inequalities. We also derive a nonlinear uncertainty principle for weak parallelogram spaces and Type-p Banach spaces. △ Less

Submitted 1 February, 2024; originally announced February 2024.

Comments: 6 pages, 0 figures

MSC Class: 46B20; 46E30

arXiv:2402.04255 [pdf, ps, other]

Functional Kuppinger-Durisi-Bölcskei Uncertainty Principle

Authors: K. Mahesh Krishna

Abstract: Let $\mathcal{X}$ be a Banach space. Let $\{τ_j\}_{j=1}^n, \{ω_k\}_{k=1}^m\subseteq \mathcal{X}$ and $\{f_j\}_{j=1}^n$, $\{g_k\}_{k=1}^m\subseteq \mathcal{X}^*$ satisfy $ |f_j(τ_j)|\geq 1$ for all $ 1\leq j \leq n$, $|g_k(ω_k)|\geq 1 $ for all $1\leq k \leq m$. If $x \in \mathcal{X}\setminus \{0\}$ is such that $x=θ_τθ_f x=θ_ωθ_g x$, then we show that \begin{align}\label{FKDB} (1) \quad\quad\quad\… ▽ More Let $\mathcal{X}$ be a Banach space. Let $\{τ_j\}_{j=1}^n, \{ω_k\}_{k=1}^m\subseteq \mathcal{X}$ and $\{f_j\}_{j=1}^n$, $\{g_k\}_{k=1}^m\subseteq \mathcal{X}^*$ satisfy $ |f_j(τ_j)|\geq 1$ for all $ 1\leq j \leq n$, $|g_k(ω_k)|\geq 1 $ for all $1\leq k \leq m$. If $x \in \mathcal{X}\setminus \{0\}$ is such that $x=θ_τθ_f x=θ_ωθ_g x$, then we show that \begin{align}\label{FKDB} (1) \quad\quad\quad\quad \|θ_fx\|_0\|θ_gx\|_0\geq \frac{\bigg[1-(\|θ_fx\|_0-1)\max\limits_{1\leq j,r \leq n,j\neq r}|f_j(τ_r)|\bigg]^+\bigg[1-(\|θ_g x\|_0-1)\max\limits_{1\leq k,s \leq m,k\neq s}|g_k(ω_s)|\bigg]^+}{\left(\displaystyle\max_{1\leq j \leq n, 1\leq k \leq m}|f_j(ω_k)|\right)\left(\displaystyle\max_{1\leq j \leq n, 1\leq k \leq m}|g_k(τ_j)|\right)}. \end{align} We call Inequality (1) as \textbf{Functional Kuppinger-Durisi-Bölcskei Uncertainty Principle}. Inequality (1) improves the uncertainty principle obtained by Kuppinger, Durisi and Bölcskei \textit{[IEEE Trans. Inform. Theory (2012)]} (which improved the Donoho-Stark-Elad-Bruckstein uncertainty principle \textit{[SIAM J. Appl. Math. (1989), IEEE Trans. Inform. Theory (2002)]}). We also derive functional form of the uncertainity principle obtained by Studer, Kuppinger, Pope and Bölcskei \textit{[EEE Trans. Inform. Theory (2012)]}. △ Less

Submitted 1 January, 2024; originally announced February 2024.

Comments: 9 Pages, 0 Figures

MSC Class: 46A45; 46B45; 42C15

arXiv:2401.17399 [pdf, other]

ATPPNet: Attention based Temporal Point cloud Prediction Network

Authors: Kaustab Pal, Aditya Sharma, Avinash Sharma, K. Madhava Krishna

Abstract: Point cloud prediction is an important yet challenging task in the field of autonomous driving. The goal is to predict future point cloud sequences that maintain object structures while accurately representing their temporal motion. These predicted point clouds help in other subsequent tasks like object trajectory estimation for collision avoidance or estimating locations with the least odometry d… ▽ More Point cloud prediction is an important yet challenging task in the field of autonomous driving. The goal is to predict future point cloud sequences that maintain object structures while accurately representing their temporal motion. These predicted point clouds help in other subsequent tasks like object trajectory estimation for collision avoidance or estimating locations with the least odometry drift. In this work, we present ATPPNet, a novel architecture that predicts future point cloud sequences given a sequence of previous time step point clouds obtained with LiDAR sensor. ATPPNet leverages Conv-LSTM along with channel-wise and spatial attention dually complemented by a 3D-CNN branch for extracting an enhanced spatio-temporal context to recover high quality fidel predictions of future point clouds. We conduct extensive experiments on publicly available datasets and report impressive performance outperforming the existing methods. We also conduct a thorough ablative study of the proposed architecture and provide an application study that highlights the potential of our model for tasks like odometry estimation. △ Less

Submitted 30 January, 2024; originally announced January 2024.

Comments: Accepted for presentation at the 2024 IEEE International Conference on Robotics and Automation (ICRA)

arXiv:2401.02472 [pdf, ps, other]

Code Generation for a Variety of Accelerators for a Graph DSL

Authors: Ashwina Kumar, M. Venkata Krishna, Prasanna Bartakke, Rahul Kumar, Rajesh Pandian M, Nibedita Behera, Rupesh Nasre

Abstract: Sparse graphs are ubiquitous in real and virtual worlds. With the phenomenal growth in semi-structured and unstructured data, sizes of the underlying graphs have witnessed a rapid growth over the years. Analyzing such large structures necessitates parallel processing, which is challenged by the intrinsic irregularity of sparse computation, memory access, and communication. It would be ideal if pro… ▽ More Sparse graphs are ubiquitous in real and virtual worlds. With the phenomenal growth in semi-structured and unstructured data, sizes of the underlying graphs have witnessed a rapid growth over the years. Analyzing such large structures necessitates parallel processing, which is challenged by the intrinsic irregularity of sparse computation, memory access, and communication. It would be ideal if programmers and domain-experts get to focus only on the sequential computation and a compiler takes care of auto-generating the parallel code. On the other side, there is a variety in the number of target hardware devices, and achieving optimal performance often demands coding in specific languages or frameworks. Our goal in this work is to focus on a graph DSL which allows the domain-experts to write almost-sequential code, and generate parallel code for different accelerators from the same algorithmic specification. In particular, we illustrate code generation from the StarPlat graph DSL for NVIDIA, AMD, and Intel GPUs using CUDA, OpenCL, SYCL, and OpenACC programming languages. Using a suite of ten large graphs and four popular algorithms, we present the efficacy of StarPlat's versatile code generator. △ Less

Submitted 4 January, 2024; originally announced January 2024.

Comments: arXiv admin note: text overlap with arXiv:2305.03317

arXiv:2312.16648 [pdf, other]

LIP-Loc: LiDAR Image Pretraining for Cross-Modal Localization

Authors: Sai Shubodh Puligilla, Mohammad Omama, Husain Zaidi, Udit Singh Parihar, Madhava Krishna

Abstract: Global visual localization in LiDAR-maps, crucial for autonomous driving applications, remains largely unexplored due to the challenging issue of bridging the cross-modal heterogeneity gap. Popular multi-modal learning approach Contrastive Language-Image Pre-Training (CLIP) has popularized contrastive symmetric loss using batch construction technique by applying it to multi-modal domains of text a… ▽ More Global visual localization in LiDAR-maps, crucial for autonomous driving applications, remains largely unexplored due to the challenging issue of bridging the cross-modal heterogeneity gap. Popular multi-modal learning approach Contrastive Language-Image Pre-Training (CLIP) has popularized contrastive symmetric loss using batch construction technique by applying it to multi-modal domains of text and image. We apply this approach to the domains of 2D image and 3D LiDAR points on the task of cross-modal localization. Our method is explained as follows: A batch of N (image, LiDAR) pairs is constructed so as to predict what is the right match between N X N possible pairings across the batch by jointly training an image encoder and LiDAR encoder to learn a multi-modal embedding space. In this way, the cosine similarity between N positive pairings is maximized, whereas that between the remaining negative pairings is minimized. Finally, over the obtained similarity scores, a symmetric cross-entropy loss is optimized. To the best of our knowledge, this is the first work to apply batched loss approach to a cross-modal setting of image & LiDAR data and also to show Zero-shot transfer in a visual localization setting. We conduct extensive analyses on standard autonomous driving datasets such as KITTI and KITTI-360 datasets. Our method outperforms state-of-the-art recall@1 accuracy on the KITTI-360 dataset by 22.4%, using only perspective images, in contrast to the state-of-the-art approach, which utilizes the more informative fisheye images. Additionally, this superior performance is achieved without resorting to complex architectures. Moreover, we demonstrate the zero-shot capabilities of our model and we beat SOTA by 8% without even training on it. Furthermore, we establish the first benchmark for cross-modal localization on the KITTI dataset. △ Less

Submitted 27 December, 2023; originally announced December 2023.

Comments: To be presented at WACV-W 2024. Project page: https://shubodhs.ai/liploc

arXiv:2312.00366 [pdf, ps, other]

Unbounded Donoho-Stark-Elad-Bruckstein-Ricaud-Torrésani Uncertainty Principles

Authors: K. Mahesh Krishna

Abstract: Let $(Ω, μ)$, $(Δ, ν)$ be measure spaces and $p=1$ or $p=\infty$. Let $(\{f_α\}_{α\in Ω}, \{τ_α\}_{α\in Ω})$ and $(\{g_β\}_{β\in Δ}, \{ω_β\}_{β\in Δ})$ be unbounded continuous p-Schauder frames for a Banach space $\mathcal{X}$. Then for every $x \in ( \mathcal{D}(θ_f) \cap\mathcal{D}(θ_g))\setminus\{0\}$, we show that \begin{align}\label{UB} (1) \quad \quad \quad \quad μ(\operatorname{supp}(θ_f… ▽ More Let $(Ω, μ)$, $(Δ, ν)$ be measure spaces and $p=1$ or $p=\infty$. Let $(\{f_α\}_{α\in Ω}, \{τ_α\}_{α\in Ω})$ and $(\{g_β\}_{β\in Δ}, \{ω_β\}_{β\in Δ})$ be unbounded continuous p-Schauder frames for a Banach space $\mathcal{X}$. Then for every $x \in ( \mathcal{D}(θ_f) \cap\mathcal{D}(θ_g))\setminus\{0\}$, we show that \begin{align}\label{UB} (1) \quad \quad \quad \quad μ(\operatorname{supp}(θ_f x))ν(\operatorname{supp}(θ_g x)) \geq \frac{1}{\left(\displaystyle\sup_{α\in Ω, β\in Δ}|f_α(ω_β)|\right)\left(\displaystyle\sup_{α\in Ω, β\in Δ}|g_β(τ_α)|\right)}, \end{align} where \begin{align*} &θ_f:\mathcal{D}(θ_f) \ni x \mapsto θ_fx \in \mathcal{L}^p(Ω, μ); \quad θ_fx: Ω\ni α\mapsto (θ_fx) (α):= f_α(x) \in \mathbb{K},\\ &θ_g: \mathcal{D}(θ_g) \ni x \mapsto θ_gx \in \mathcal{L}^p(Δ, ν); \quad θ_gx: Δ\ni β\mapsto (θ_gx) (β):= g_β(x) \in \mathbb{K}. \end{align*} We call Inequality (1) as \textbf{Unbounded Donoho-Stark-Elad-Bruckstein-Ricaud-Torrésani Uncertainty Principle}. Along with recent \textbf{Functional Continuous Uncertainty Principle} [arXiv:2308.00312], Inequality (1) also improves Ricaud-Torrésani uncertainty principle [IEEE Trans. Inform. Theory, 2013]. In particular, it improves Elad-Bruckstein uncertainty principle [IEEE Trans. Inform. Theory, 2002] and Donoho-Stark uncertainty principle [SIAM J. Appl. Math., 1989]. △ Less

Submitted 1 December, 2023; originally announced December 2023.

Comments: 6 Figures, 0 Figures

MSC Class: 42C15

arXiv:2311.14635 [pdf]

Automated Detection and Counting of Windows using UAV Imagery based Remote Sensing

Authors: Dhruv Patel, Shivani Chepuri, Sarvesh Thakur, K. Harikumar, Ravi Kiran S., K. Madhava Krishna

Abstract: Despite the technological advancements in the construction and surveying sector, the inspection of salient features like windows in an under-construction or existing building is predominantly a manual process. Moreover, the number of windows present in a building is directly related to the magnitude of deformation it suffers under earthquakes. In this research, a method to accurately detect and co… ▽ More Despite the technological advancements in the construction and surveying sector, the inspection of salient features like windows in an under-construction or existing building is predominantly a manual process. Moreover, the number of windows present in a building is directly related to the magnitude of deformation it suffers under earthquakes. In this research, a method to accurately detect and count the number of windows of a building by deploying an Unmanned Aerial Vehicle (UAV) based remote sensing system is proposed. The proposed two-stage method automates the identification and counting of windows by developing computer vision pipelines that utilize data from UAV's onboard camera and other sensors. Quantitative and Qualitative results show the effectiveness of our proposed approach in accurately detecting and counting the windows compared to the existing method. △ Less

Submitted 24 November, 2023; originally announced November 2023.

arXiv:2310.13077 [pdf, other]

doi 10.1109/CASE56687.2023.10260513

NeuroSMPC: A Neural Network guided Sampling Based MPC for On-Road Autonomous Driving

Authors: Kaustab Pal, Aditya Sharma, Mohd Omama, Parth N. Shah, K. Madhava Krishna

Abstract: In this paper we show an effective means of integrating data driven frameworks to sampling based optimal control to vastly reduce the compute time for easy adoption and adaptation to real time applications such as on-road autonomous driving in the presence of dynamic actors. Presented with training examples, a spatio-temporal CNN learns to predict the optimal mean control over a finite horizon tha… ▽ More In this paper we show an effective means of integrating data driven frameworks to sampling based optimal control to vastly reduce the compute time for easy adoption and adaptation to real time applications such as on-road autonomous driving in the presence of dynamic actors. Presented with training examples, a spatio-temporal CNN learns to predict the optimal mean control over a finite horizon that precludes further resampling, an iterative process that makes sampling based optimal control formulations difficult to adopt in real time settings. Generating control samples around the network-predicted optimal mean retains the advantage of sample diversity while enabling real time rollout of trajectories that avoids multiple dynamic obstacles in an on-road navigation setting. Further the 3D CNN architecture implicitly learns the future trajectories of the dynamic agents in the scene resulting in successful collision free navigation despite no explicit future trajectory prediction. We show performance gain over multiple baselines in a number of on-road scenes through closed loop simulations in CARLA. We also showcase the real world applicability of our system by running it on our custom Autonomous Driving Platform (AutoDP). △ Less

Submitted 19 October, 2023; originally announced October 2023.

Comments: Published in 2023 IEEE 19th International Conference on Automation Science and Engineering (CASE)

arXiv:2310.08270 [pdf, other]

Hilbert Space Embedding-based Trajectory Optimization for Multi-Modal Uncertain Obstacle Trajectory Prediction

Authors: Basant Sharma, Aditya Sharma, K. Madhava Krishna, Arun Kumar Singh

Abstract: Safe autonomous driving critically depends on how well the ego-vehicle can predict the trajectories of neighboring vehicles. To this end, several trajectory prediction algorithms have been presented in the existing literature. Many of these approaches output a multi-modal distribution of obstacle trajectories instead of a single deterministic prediction to account for the underlying uncertainty. H… ▽ More Safe autonomous driving critically depends on how well the ego-vehicle can predict the trajectories of neighboring vehicles. To this end, several trajectory prediction algorithms have been presented in the existing literature. Many of these approaches output a multi-modal distribution of obstacle trajectories instead of a single deterministic prediction to account for the underlying uncertainty. However, existing planners cannot handle the multi-modality based on just sample-level information of the predictions. With this motivation, this paper proposes a trajectory optimizer that can leverage the distributional aspects of the prediction in a computationally tractable and sample-efficient manner. Our optimizer can work with arbitrarily complex distributions and thus can be used with output distribution represented as a deep neural network. The core of our approach is built on embedding distribution in Reproducing Kernel Hilbert Space (RKHS), which we leverage in two ways. First, we propose an RKHS embedding approach to select probable samples from the obstacle trajectory distribution. Second, we rephrase chance-constrained optimization as distribution matching in RKHS and propose a novel sampling-based optimizer for its solution. We validate our approach with hand-crafted and neural network-based predictors trained on real-world datasets and show improvement over the existing stochastic optimization approaches in safety metrics. △ Less

Submitted 12 October, 2023; originally announced October 2023.

arXiv:2310.04802 [pdf, other]

Hierarchical Unsupervised Topological SLAM

Authors: Ayush Sharma, Yash Mehan, Pradyumna Dasu, Sourav Garg, Madhava Krishna

Abstract: In this paper we present a novel framework for unsupervised topological clustering resulting in improved loop. In this paper we present a novel framework for unsupervised topological clustering resulting in improved loop detection and closure for SLAM. A navigating mobile robot clusters its traversal into visually similar topologies where each cluster (topology) contains a set of similar looking i… ▽ More In this paper we present a novel framework for unsupervised topological clustering resulting in improved loop. In this paper we present a novel framework for unsupervised topological clustering resulting in improved loop detection and closure for SLAM. A navigating mobile robot clusters its traversal into visually similar topologies where each cluster (topology) contains a set of similar looking images typically observed from spatially adjacent locations. Each such set of spatially adjacent and visually similar grouping of images constitutes a topology obtained without any supervision. We formulate a hierarchical loop discovery strategy that first detects loops at the level of topologies and subsequently at the level of images between the looped topologies. We show over a number of traversals across different Habitat environments that such a hierarchical pipeline significantly improves SOTA image based loop detection and closure methods. Further, as a consequence of improved loop detection, we enhance the loop closure and backend SLAM performance. Such a rendering of a traversal into topological segments is beneficial for downstream tasks such as navigation that can now build a topological graph where spatially adjacent topological clusters are connected by an edge and navigate over such topological graphs. △ Less

Submitted 7 October, 2023; originally announced October 2023.

Comments: Accepted to IEEE ITSC 2023

arXiv:2310.04181 [pdf, other]

DiffPrompter: Differentiable Implicit Visual Prompts for Semantic-Segmentation in Adverse Conditions

Authors: Sanket Kalwar, Mihir Ungarala, Shruti Jain, Aaron Monis, Krishna Reddy Konda, Sourav Garg, K Madhava Krishna

Abstract: Semantic segmentation in adverse weather scenarios is a critical task for autonomous driving systems. While foundation models have shown promise, the need for specialized adaptors becomes evident for handling more challenging scenarios. We introduce DiffPrompter, a novel differentiable visual and latent prompting mechanism aimed at expanding the learning capabilities of existing adaptors in founda… ▽ More Semantic segmentation in adverse weather scenarios is a critical task for autonomous driving systems. While foundation models have shown promise, the need for specialized adaptors becomes evident for handling more challenging scenarios. We introduce DiffPrompter, a novel differentiable visual and latent prompting mechanism aimed at expanding the learning capabilities of existing adaptors in foundation models. Our proposed $\nabla$HFC image processing block excels particularly in adverse weather conditions, where conventional methods often fall short. Furthermore, we investigate the advantages of jointly training visual and latent prompts, demonstrating that this combined approach significantly enhances performance in out-of-distribution scenarios. Our differentiable visual prompts leverage parallel and series architectures to generate prompts, effectively improving object segmentation tasks in adverse conditions. Through a comprehensive series of experiments and evaluations, we provide empirical evidence to support the efficacy of our approach. Project page at https://diffprompter.github.io. △ Less

Submitted 26 March, 2024; v1 submitted 6 October, 2023; originally announced October 2023.

arXiv:2310.02324 [pdf, other]

ALT-Pilot: Autonomous navigation with Language augmented Topometric maps

Authors: Mohammad Omama, Pranav Inani, Pranjal Paul, Sarat Chandra Yellapragada, Krishna Murthy Jatavallabhula, Sandeep Chinchali, Madhava Krishna

Abstract: We present an autonomous navigation system that operates without assuming HD LiDAR maps of the environment. Our system, ALT-Pilot, relies only on publicly available road network information and a sparse (and noisy) set of crowdsourced language landmarks. With the help of onboard sensors and a language-augmented topometric map, ALT-Pilot autonomously pilots the vehicle to any destination on the roa… ▽ More We present an autonomous navigation system that operates without assuming HD LiDAR maps of the environment. Our system, ALT-Pilot, relies only on publicly available road network information and a sparse (and noisy) set of crowdsourced language landmarks. With the help of onboard sensors and a language-augmented topometric map, ALT-Pilot autonomously pilots the vehicle to any destination on the road network. We achieve this by leveraging vision-language models pre-trained on web-scale data to identify potential landmarks in a scene, incorporating vision-language features into the recursive Bayesian state estimation stack to generate global (route) plans, and a reactive trajectory planner and controller operating in the vehicle frame. We implement and evaluate ALT-Pilot in simulation and on a real, full-scale autonomous vehicle and report improvements over state-of-the-art topometric navigation systems by a factor of 3x on localization accuracy and 5x on goal reachability △ Less

Submitted 3 October, 2023; originally announced October 2023.

arXiv:2310.02251 [pdf, other]

Talk2BEV: Language-enhanced Bird's-eye View Maps for Autonomous Driving

Authors: Tushar Choudhary, Vikrant Dewangan, Shivam Chandhok, Shubham Priyadarshan, Anushka Jain, Arun K. Singh, Siddharth Srivastava, Krishna Murthy Jatavallabhula, K. Madhava Krishna

Abstract: Talk2BEV is a large vision-language model (LVLM) interface for bird's-eye view (BEV) maps in autonomous driving contexts. While existing perception systems for autonomous driving scenarios have largely focused on a pre-defined (closed) set of object categories and driving scenarios, Talk2BEV blends recent advances in general-purpose language and vision models with BEV-structured map representation… ▽ More Talk2BEV is a large vision-language model (LVLM) interface for bird's-eye view (BEV) maps in autonomous driving contexts. While existing perception systems for autonomous driving scenarios have largely focused on a pre-defined (closed) set of object categories and driving scenarios, Talk2BEV blends recent advances in general-purpose language and vision models with BEV-structured map representations, eliminating the need for task-specific models. This enables a single system to cater to a variety of autonomous driving tasks encompassing visual and spatial reasoning, predicting the intents of traffic actors, and decision-making based on visual cues. We extensively evaluate Talk2BEV on a large number of scene understanding tasks that rely on both the ability to interpret free-form natural language queries, and in grounding these queries to the visual context embedded into the language-enhanced BEV map. To enable further research in LVLMs for autonomous driving scenarios, we develop and release Talk2BEV-Bench, a benchmark encompassing 1000 human-annotated BEV scenarios, with more than 20,000 questions and ground-truth responses from the NuScenes dataset. △ Less

Submitted 14 November, 2023; v1 submitted 3 October, 2023; originally announced October 2023.

Comments: Project page at https://llmbev.github.io/talk2bev/

arXiv:2309.11414 [pdf, other]

EDMP: Ensemble-of-costs-guided Diffusion for Motion Planning

Authors: Kallol Saha, Vishal Mandadi, Jayaram Reddy, Ajit Srikanth, Aditya Agarwal, Bipasha Sen, Arun Singh, Madhava Krishna

Abstract: Classical motion planning for robotic manipulation includes a set of general algorithms that aim to minimize a scene-specific cost of executing a given plan. This approach offers remarkable adaptability, as they can be directly used off-the-shelf for any new scene without needing specific training datasets. However, without a prior understanding of what diverse valid trajectories are and without s… ▽ More Classical motion planning for robotic manipulation includes a set of general algorithms that aim to minimize a scene-specific cost of executing a given plan. This approach offers remarkable adaptability, as they can be directly used off-the-shelf for any new scene without needing specific training datasets. However, without a prior understanding of what diverse valid trajectories are and without specially designed cost functions for a given scene, the overall solutions tend to have low success rates. While deep-learning-based algorithms tremendously improve success rates, they are much harder to adopt without specialized training datasets. We propose EDMP, an Ensemble-of-costs-guided Diffusion for Motion Planning that aims to combine the strengths of classical and deep-learning-based motion planning. Our diffusion-based network is trained on a set of diverse kinematically valid trajectories. Like classical planning, for any new scene at the time of inference, we compute scene-specific costs such as "collision cost" and guide the diffusion to generate valid trajectories that satisfy the scene-specific constraints. Further, instead of a single cost function that may be insufficient in capturing diversity across scenes, we use an ensemble of costs to guide the diffusion process, significantly improving the success rate compared to classical planners. EDMP performs comparably with SOTA deep-learning-based methods while retaining the generalization capabilities primarily associated with classical planners. △ Less

Submitted 20 September, 2023; originally announced September 2023.

Comments: 8 pages, 8 figures, submitted to ICRA 2024 (International Conference on Robotics and Automation)

arXiv:2308.00688 [pdf, other]

AnyLoc: Towards Universal Visual Place Recognition

Authors: Nikhil Keetha, Avneesh Mishra, Jay Karhade, Krishna Murthy Jatavallabhula, Sebastian Scherer, Madhava Krishna, Sourav Garg

Abstract: Visual Place Recognition (VPR) is vital for robot localization. To date, the most performant VPR approaches are environment- and task-specific: while they exhibit strong performance in structured environments (predominantly urban driving), their performance degrades severely in unstructured environments, rendering most approaches brittle to robust real-world deployment. In this work, we develop a… ▽ More Visual Place Recognition (VPR) is vital for robot localization. To date, the most performant VPR approaches are environment- and task-specific: while they exhibit strong performance in structured environments (predominantly urban driving), their performance degrades severely in unstructured environments, rendering most approaches brittle to robust real-world deployment. In this work, we develop a universal solution to VPR -- a technique that works across a broad range of structured and unstructured environments (urban, outdoors, indoors, aerial, underwater, and subterranean environments) without any re-training or fine-tuning. We demonstrate that general-purpose feature representations derived from off-the-shelf self-supervised models with no VPR-specific training are the right substrate upon which to build such a universal VPR solution. Combining these derived features with unsupervised feature aggregation enables our suite of methods, AnyLoc, to achieve up to 4X significantly higher performance than existing approaches. We further obtain a 6% improvement in performance by characterizing the semantic properties of these features, uncovering unique domains which encapsulate datasets from similar environments. Our detailed experiments and analysis lay a foundation for building VPR solutions that may be deployed anywhere, anytime, and across anyview. We encourage the readers to explore our project page and interactive demos: https://anyloc.github.io/. △ Less

Submitted 28 November, 2023; v1 submitted 1 August, 2023; originally announced August 2023.

Comments: IEEE RA-L 2023 (Presented at ICRA 2024)

arXiv:2307.01215 [pdf, ps, other]

Functional Donoho-Stark Approximate Support Uncertainty Principle

Authors: K. Mahesh Krishna

Abstract: Let $(\{f_j\}_{j=1}^n, \{τ_j\}_{j=1}^n)$ and $(\{g_k\}_{k=1}^n, \{ω_k\}_{k=1}^n)$ be two p-orthonormal bases for a finite dimensional Banach space $\mathcal{X}$. If $ x \in \mathcal{X}\setminus\{0\}$ is such that $θ_fx$ is $\varepsilon$-supported on $M\subseteq \{1,\dots, n\}$ w.r.t. p-norm and $θ_gx$ is $δ$-supported on $N\subseteq \{1,\dots, n\}$ w.r.t. p-norm, then we show that \begin{align}\la… ▽ More Let $(\{f_j\}_{j=1}^n, \{τ_j\}_{j=1}^n)$ and $(\{g_k\}_{k=1}^n, \{ω_k\}_{k=1}^n)$ be two p-orthonormal bases for a finite dimensional Banach space $\mathcal{X}$. If $ x \in \mathcal{X}\setminus\{0\}$ is such that $θ_fx$ is $\varepsilon$-supported on $M\subseteq \{1,\dots, n\}$ w.r.t. p-norm and $θ_gx$ is $δ$-supported on $N\subseteq \{1,\dots, n\}$ w.r.t. p-norm, then we show that \begin{align}\label{ME} (1) \quad \quad \quad \quad &o(M)^\frac{1}{p}o(N)^\frac{1}{q}\geq \frac{1}{\displaystyle \max_{1\leq j,k\leq n}|f_j(ω_k) |}\max \{1-\varepsilon-δ, 0\},\\ (2) \quad \quad \quad \quad&o(M)^\frac{1}{q}o(N)^\frac{1}{p}\geq \frac{1}{\displaystyle \max_{1\leq j,k\leq n}|g_k(τ_j) |}\max \{1-\varepsilon-δ, 0\},\label{ME2} \end{align} where \begin{align*} θ_f: \mathcal{X} \ni x \mapsto (f_j(x) )_{j=1}^n \in \ell^p([n]); \quad θ_g: \mathcal{X} \ni x \mapsto (g_k(x) )_{k=1}^n \in \ell^p([n]) \end{align*} and $q$ is the conjugate index of $p$. We call Inequalities (1) and (2) as \textbf{Functional Donoho-Stark Approximate Support Uncertainty Principle}. Inequalities (1) and (2) improve the finite approximate support uncertainty principle obtained by Donoho and Stark \textit{[SIAM J. Appl. Math., 1989]}. △ Less

Submitted 1 July, 2023; originally announced July 2023.

Comments: 7 Pages, 0 Figures

MSC Class: 42C15; 46B03; 46B04

arXiv:2306.06093 [pdf, other]

HyP-NeRF: Learning Improved NeRF Priors using a HyperNetwork

Authors: Bipasha Sen, Gaurav Singh, Aditya Agarwal, Rohith Agaram, K Madhava Krishna, Srinath Sridhar

Abstract: Neural Radiance Fields (NeRF) have become an increasingly popular representation to capture high-quality appearance and shape of scenes and objects. However, learning generalizable NeRF priors over categories of scenes or objects has been challenging due to the high dimensionality of network weight space. To address the limitations of existing work on generalization, multi-view consistency and to… ▽ More Neural Radiance Fields (NeRF) have become an increasingly popular representation to capture high-quality appearance and shape of scenes and objects. However, learning generalizable NeRF priors over categories of scenes or objects has been challenging due to the high dimensionality of network weight space. To address the limitations of existing work on generalization, multi-view consistency and to improve quality, we propose HyP-NeRF, a latent conditioning method for learning generalizable category-level NeRF priors using hypernetworks. Rather than using hypernetworks to estimate only the weights of a NeRF, we estimate both the weights and the multi-resolution hash encodings resulting in significant quality gains. To improve quality even further, we incorporate a denoise and finetune strategy that denoises images rendered from NeRFs estimated by the hypernetwork and finetunes it while retaining multiview consistency. These improvements enable us to use HyP-NeRF as a generalizable prior for multiple downstream tasks including NeRF reconstruction from single-view or cluttered scenes and text-to-NeRF. We provide qualitative comparisons and evaluate HyP-NeRF on three tasks: generalization, compression, and retrieval, demonstrating our state-of-the-art results. △ Less

Submitted 23 December, 2023; v1 submitted 9 June, 2023; originally announced June 2023.

Comments: Project Page: https://hyp-nerf.github.io

arXiv:2306.04939 [pdf, other]

UAP-BEV: Uncertainty Aware Planning using Bird's Eye View generated from Surround Monocular Images

Authors: Vikrant Dewangan, Basant Sharma, Tushar Choudhary, Sarthak Sharma, Aakash Aanegola, Arun K. Singh, K. Madhava Krishna

Abstract: Autonomous driving requires accurate reasoning of the location of objects from raw sensor data. Recent end-to-end learning methods go from raw sensor data to a trajectory output via Bird's Eye View(BEV) segmentation as an interpretable intermediate representation. Motion planning over cost maps generated via Birds Eye View (BEV) segmentation has emerged as a prominent approach in autonomous drivin… ▽ More Autonomous driving requires accurate reasoning of the location of objects from raw sensor data. Recent end-to-end learning methods go from raw sensor data to a trajectory output via Bird's Eye View(BEV) segmentation as an interpretable intermediate representation. Motion planning over cost maps generated via Birds Eye View (BEV) segmentation has emerged as a prominent approach in autonomous driving. However, the current approaches have two critical gaps. First, the optimization process is simplistic and involves just evaluating a fixed set of trajectories over the cost map. The trajectory samples are not adapted based on their associated cost values. Second, the existing cost maps do not account for the uncertainty in the cost maps that can arise due to noise in RGB images, and BEV annotations. As a result, these approaches can struggle in challenging scenarios where there is abrupt cut-in, stopping, overtaking, merging, etc from the neighboring vehicles. In this paper, we propose UAP-BEV: A novel approach that models the noise in Spatio-Temporal BEV predictions to create an uncertainty-aware occupancy grid map. Using queries of the distance to the closest occupied cell, we obtain a sample estimate of the collision probability of the ego-vehicle. Subsequently, our approach uses gradient-free sampling-based optimization to compute low-cost trajectories over the cost map. Importantly, the sampling distribution is adapted based on the optimal cost values of the sampled trajectories. By explicitly modeling probabilistic collision avoidance in the BEV space, our approach is able to outperform the cost-map-based baselines in collision avoidance, route completion, time to completion, and smoothness. To further validate our method, we also show results on the real-world dataset NuScenes, where we report improvements in collision avoidance and smoothness. △ Less

Submitted 8 June, 2023; originally announced June 2023.

Comments: Accepted to CASE 2023. Project video available at https://vikr-182.github.io/UAP-BEV

arXiv:2306.01540 [pdf, other]

CLIPGraphs: Multimodal Graph Networks to Infer Object-Room Affinities

Authors: Ayush Agrawal, Raghav Arora, Ahana Datta, Snehasis Banerjee, Brojeshwar Bhowmick, Krishna Murthy Jatavallabhula, Mohan Sridharan, Madhava Krishna

Abstract: This paper introduces a novel method for determining the best room to place an object in, for embodied scene rearrangement. While state-of-the-art approaches rely on large language models (LLMs) or reinforcement learned (RL) policies for this task, our approach, CLIPGraphs, efficiently combines commonsense domain knowledge, data-driven methods, and recent advances in multimodal learning. Specifica… ▽ More This paper introduces a novel method for determining the best room to place an object in, for embodied scene rearrangement. While state-of-the-art approaches rely on large language models (LLMs) or reinforcement learned (RL) policies for this task, our approach, CLIPGraphs, efficiently combines commonsense domain knowledge, data-driven methods, and recent advances in multimodal learning. Specifically, it (a)encodes a knowledge graph of prior human preferences about the room location of different objects in home environments, (b) incorporates vision-language features to support multimodal queries based on images or text, and (c) uses a graph network to learn object-room affinities based on embeddings of the prior knowledge and the vision-language features. We demonstrate that our approach provides better estimates of the most appropriate location of objects from a benchmark set of object categories in comparison with state-of-the-art baselines △ Less

Submitted 2 June, 2023; originally announced June 2023.

Journal ref: RO-MAN 2023 Conference

arXiv:2306.01014 [pdf, ps, other]

Functional Ghobber-Jaming Uncertainty Principle

Authors: K. Mahesh Krishna

Abstract: Let $(\{f_j\}_{j=1}^n, \{τ_j\}_{j=1}^n)$ and $(\{g_k\}_{k=1}^n, \{ω_k\}_{k=1}^n)$ be two p-orthonormal bases for a finite dimensional Banach space $\mathcal{X}$. Let $M,N\subseteq \{1, \dots, n\}$ be such that \begin{align*} o(M)^\frac{1}{q}o(N)^\frac{1}{p}< \frac{1}{\displaystyle \max_{1\leq j,k\leq n}|g_k(τ_j) |}, \end{align*} where $q$ is the conjugate index of $p$. Then for all… ▽ More Let $(\{f_j\}_{j=1}^n, \{τ_j\}_{j=1}^n)$ and $(\{g_k\}_{k=1}^n, \{ω_k\}_{k=1}^n)$ be two p-orthonormal bases for a finite dimensional Banach space $\mathcal{X}$. Let $M,N\subseteq \{1, \dots, n\}$ be such that \begin{align*} o(M)^\frac{1}{q}o(N)^\frac{1}{p}< \frac{1}{\displaystyle \max_{1\leq j,k\leq n}|g_k(τ_j) |}, \end{align*} where $q$ is the conjugate index of $p$. Then for all $x \in \mathcal{X}$, we show that \begin{align}\label{FGJU} (1) \quad \quad \quad \quad \|x\|\leq \left(1+\frac{1}{1-o(M)^\frac{1}{q}o(N)^\frac{1}{p}\displaystyle\max_{1\leq j,k\leq n}|g_k(τ_j)|}\right)\left[\left(\sum_{j\in M^c}|f_j(x)|^p\right)^\frac{1}{p}+\left(\sum_{k\in N^c}|g_k(x) |^p\right)^\frac{1}{p}\right]. \end{align} We call Inequality (1) as \textbf{Functional Ghobber-Jaming Uncertainty Principle}. Inequality (1) improves the uncertainty principle obtained by Ghobber and Jaming \textit{[Linear Algebra Appl., 2011]}. △ Less

Submitted 1 June, 2023; originally announced June 2023.

Comments: 7 Pages, 0 Figures

MSC Class: 42C15; 46B03; 46B04

arXiv:2305.12363 [pdf, other]

doi 10.1109/RO-MAN57019.2023.10309534

Instance-Level Semantic Maps for Vision Language Navigation

Authors: Laksh Nanwani, Anmol Agarwal, Kanishk Jain, Raghav Prabhakar, Aaron Monis, Aditya Mathur, Krishna Murthy, Abdul Hafez, Vineet Gandhi, K. Madhava Krishna

Abstract: Humans have a natural ability to perform semantic associations with the surrounding objects in the environment. This allows them to create a mental map of the environment, allowing them to navigate on-demand when given linguistic instructions. A natural goal in Vision Language Navigation (VLN) research is to impart autonomous agents with similar capabilities. Recent works take a step towards this… ▽ More Humans have a natural ability to perform semantic associations with the surrounding objects in the environment. This allows them to create a mental map of the environment, allowing them to navigate on-demand when given linguistic instructions. A natural goal in Vision Language Navigation (VLN) research is to impart autonomous agents with similar capabilities. Recent works take a step towards this goal by creating a semantic spatial map representation of the environment without any labeled data. However, their representations are limited for practical applicability as they do not distinguish between different instances of the same object. In this work, we address this limitation by integrating instance-level information into spatial map representation using a community detection algorithm and utilizing word ontology learned by large language models (LLMs) to perform open-set semantic associations in the mapping representation. The resulting map representation improves the navigation performance by two-fold (233%) on realistic language commands with instance-specific descriptions compared to the baseline. We validate the practicality and effectiveness of our approach through extensive qualitative and quantitative experiments. △ Less

Submitted 1 July, 2023; v1 submitted 21 May, 2023; originally announced May 2023.

Journal ref: IEEE RO-MAN 2023

arXiv:2305.06178 [pdf]

Sequence-Agnostic Multi-Object Navigation

Authors: Nandiraju Gireesh, Ayush Agrawal, Ahana Datta, Snehasis Banerjee, Mohan Sridharan, Brojeshwar Bhowmick, Madhava Krishna

Abstract: The Multi-Object Navigation (MultiON) task requires a robot to localize an instance (each) of multiple object classes. It is a fundamental task for an assistive robot in a home or a factory. Existing methods for MultiON have viewed this as a direct extension of Object Navigation (ON), the task of localising an instance of one object class, and are pre-sequenced, i.e., the sequence in which the obj… ▽ More The Multi-Object Navigation (MultiON) task requires a robot to localize an instance (each) of multiple object classes. It is a fundamental task for an assistive robot in a home or a factory. Existing methods for MultiON have viewed this as a direct extension of Object Navigation (ON), the task of localising an instance of one object class, and are pre-sequenced, i.e., the sequence in which the object classes are to be explored is provided in advance. This is a strong limitation in practical applications characterized by dynamic changes. This paper describes a deep reinforcement learning framework for sequence-agnostic MultiON based on an actor-critic architecture and a suitable reward specification. Our framework leverages past experiences and seeks to reward progress toward individual as well as multiple target object classes. We use photo-realistic scenes from the Gibson benchmark dataset in the AI Habitat 3D simulation environment to experimentally show that our method performs better than a pre-sequenced approach and a state of the art ON method extended to MultiON. △ Less

Submitted 10 May, 2023; originally announced May 2023.

Journal ref: ICRA 2023 conference

arXiv:2304.03324 [pdf, ps, other]

Functional Donoho-Stark-Elad-Bruckstein-Ricaud-Torrésani Uncertainty Principle

Authors: K. Mahesh Krishna

Abstract: Let $(\{f_j\}_{j=1}^n, \{τ_j\}_{j=1}^n)$ and $(\{g_k\}_{k=1}^m, \{ω_k\}_{k=1}^m)$ be p-Schauder frames for a finite dimensional Banach space $\mathcal{X}$. Then for every $x \in \mathcal{X}\setminus\{0\}$, we show that \begin{align} (1) \quad \|θ_f x\|_0^\frac{1}{p}\|θ_g x\|_0^\frac{1}{q} \geq \frac{1}{\displaystyle\max_{1\leq j\leq n, 1\leq k\leq m}|f_j(ω_k)|}\quad \text{and} \quad \|θ_g x\|_0^\f… ▽ More Let $(\{f_j\}_{j=1}^n, \{τ_j\}_{j=1}^n)$ and $(\{g_k\}_{k=1}^m, \{ω_k\}_{k=1}^m)$ be p-Schauder frames for a finite dimensional Banach space $\mathcal{X}$. Then for every $x \in \mathcal{X}\setminus\{0\}$, we show that \begin{align} (1) \quad \|θ_f x\|_0^\frac{1}{p}\|θ_g x\|_0^\frac{1}{q} \geq \frac{1}{\displaystyle\max_{1\leq j\leq n, 1\leq k\leq m}|f_j(ω_k)|}\quad \text{and} \quad \|θ_g x\|_0^\frac{1}{p}\|θ_f x\|_0^\frac{1}{q}\geq \frac{1}{\displaystyle\max_{1\leq j\leq n, 1\leq k\leq m}|g_k(τ_j)|}. \end{align} where \begin{align*} θ_f: \mathcal{X} \ni x \mapsto (f_j(x) )_{j=1}^n \in \ell^p([n]); \quad θ_g: \mathcal{X} \ni x \mapsto (g_k(x) )_{k=1}^m \in \ell^p([m]) \end{align*} and $q$ is the conjugate index of $p$. We call Inequality (1) as \textbf{Functional Donoho-Stark-Elad-Bruckstein-Ricaud-Torrésani Uncertainty Principle}. Inequality (1) improves Ricaud-Torrésani uncertainty principle \textit{[IEEE Trans. Inform. Theory, 2013]}. In particular, it improves Elad-Bruckstein uncertainty principle \textit{[IEEE Trans. Inform. Theory, 2002]} and Donoho-Stark uncertainty principle \textit{[SIAM J. Appl. Math., 1989]}. △ Less

Submitted 5 April, 2023; originally announced April 2023.

Comments: 5 Pages, 0 Figures

MSC Class: 42C15

arXiv:2304.01074 [pdf, other]

FinderNet: A Data Augmentation Free Canonicalization aided Loop Detection and Closure technique for Point clouds in 6-DOF separation

Authors: Sudarshan S Harithas, Gurkirat Singh, Aneesh Chavan, Sarthak Sharma, Suraj Patni, Chetan Arora, K. Madhava Krishna

Abstract: We focus on the problem of LiDAR point cloud based loop detection (or Finding) and closure (LDC) in a multi-agent setting. State-of-the-art (SOTA) techniques directly generate learned embeddings of a given point cloud, require large data transfers, and are not robust to wide variations in 6 Degrees-of-Freedom (DOF) viewpoint. Moreover, absence of strong priors in an unstructured point cloud leads… ▽ More We focus on the problem of LiDAR point cloud based loop detection (or Finding) and closure (LDC) in a multi-agent setting. State-of-the-art (SOTA) techniques directly generate learned embeddings of a given point cloud, require large data transfers, and are not robust to wide variations in 6 Degrees-of-Freedom (DOF) viewpoint. Moreover, absence of strong priors in an unstructured point cloud leads to highly inaccurate LDC. In this original approach, we propose independent roll and pitch canonicalization of the point clouds using a common dominant ground plane. Discretization of the canonicalized point cloud along the axis perpendicular to the ground plane leads to an image similar to Digital Elevation Maps (DEMs), which exposes strong spatial priors in the scene. Our experiments show that LDC based on learnt embeddings of such DEMs is not only data efficient but also significantly more robust, and generalizable than the current SOTA. We report significant performance gain in terms of Average Precision for loop detection and absolute translation/rotation error for relative pose estimation (or loop closure) on Kitti, GPR and Oxford Robot Car over multiple SOTA LDC methods. Our encoder technique allows to compress the original point cloud by over 830 times. To further test the robustness of our technique we create and opensource a custom dataset called Lidar-UrbanFly Dataset (LUF) which consists of point clouds obtained from a LiDAR mounted on a quadrotor. △ Less

Submitted 3 April, 2023; originally announced April 2023.

arXiv:2302.07241 [pdf, other]

ConceptFusion: Open-set Multimodal 3D Mapping

Authors: Krishna Murthy Jatavallabhula, Alihusein Kuwajerwala, Qiao Gu, Mohd Omama, Tao Chen, Alaa Maalouf, Shuang Li, Ganesh Iyer, Soroush Saryazdi, Nikhil Keetha, Ayush Tewari, Joshua B. Tenenbaum, Celso Miguel de Melo, Madhava Krishna, Liam Paull, Florian Shkurti, Antonio Torralba

Abstract: Building 3D maps of the environment is central to robot navigation, planning, and interaction with objects in a scene. Most existing approaches that integrate semantic concepts with 3D maps largely remain confined to the closed-set setting: they can only reason about a finite set of concepts, pre-defined at training time. Further, these maps can only be queried using class labels, or in recent wor… ▽ More Building 3D maps of the environment is central to robot navigation, planning, and interaction with objects in a scene. Most existing approaches that integrate semantic concepts with 3D maps largely remain confined to the closed-set setting: they can only reason about a finite set of concepts, pre-defined at training time. Further, these maps can only be queried using class labels, or in recent work, using text prompts. We address both these issues with ConceptFusion, a scene representation that is (1) fundamentally open-set, enabling reasoning beyond a closed set of concepts and (ii) inherently multimodal, enabling a diverse range of possible queries to the 3D map, from language, to images, to audio, to 3D geometry, all working in concert. ConceptFusion leverages the open-set capabilities of today's foundation models pre-trained on internet-scale data to reason about concepts across modalities such as natural language, images, and audio. We demonstrate that pixel-aligned open-set features can be fused into 3D maps via traditional SLAM and multi-view fusion approaches. This enables effective zero-shot spatial reasoning, not needing any additional training or finetuning, and retains long-tailed concepts better than supervised approaches, outperforming them by more than 40% margin on 3D IoU. We extensively evaluate ConceptFusion on a number of real-world datasets, simulated home environments, a real-world tabletop manipulation task, and an autonomous driving platform. We showcase new avenues for blending foundation models with 3D open-set multimodal mapping. For more information, visit our project page https://concept-fusion.github.io or watch our 5-minute explainer video https://www.youtube.com/watch?v=rkXgws8fiDs △ Less

Submitted 23 October, 2023; v1 submitted 14 February, 2023; originally announced February 2023.

Comments: RSS 2023. Project page: https://concept-fusion.github.io Explainer video: https://www.youtube.com/watch?v=rkXgws8fiDs Code: https://github.com/concept-fusion/concept-fusion

arXiv:2301.07213 [pdf, other]

SCARP: 3D Shape Completion in ARbitrary Poses for Improved Grasping

Authors: Bipasha Sen, Aditya Agarwal, Gaurav Singh, Brojeshwar B., Srinath Sridhar, Madhava Krishna

Abstract: Recovering full 3D shapes from partial observations is a challenging task that has been extensively addressed in the computer vision community. Many deep learning methods tackle this problem by training 3D shape generation networks to learn a prior over the full 3D shapes. In this training regime, the methods expect the inputs to be in a fixed canonical form, without which they fail to learn a val… ▽ More Recovering full 3D shapes from partial observations is a challenging task that has been extensively addressed in the computer vision community. Many deep learning methods tackle this problem by training 3D shape generation networks to learn a prior over the full 3D shapes. In this training regime, the methods expect the inputs to be in a fixed canonical form, without which they fail to learn a valid prior over the 3D shapes. We propose SCARP, a model that performs Shape Completion in ARbitrary Poses. Given a partial pointcloud of an object, SCARP learns a disentangled feature representation of pose and shape by relying on rotationally equivariant pose features and geometric shape features trained using a multi-tasking objective. Unlike existing methods that depend on an external canonicalization, SCARP performs canonicalization, pose estimation, and shape completion in a single network, improving the performance by 45% over the existing baselines. In this work, we use SCARP for improving grasp proposals on tabletop objects. By completing partial tabletop objects directly in their observed poses, SCARP enables a SOTA grasp proposal network improve their proposals by 71.2% on partial shapes. Project page: https://bipashasen.github.io/scarp △ Less

Submitted 17 January, 2023; originally announced January 2023.

Comments: Accepted at ICRA 2023

arXiv:2212.02493 [pdf, other]

Canonical Fields: Self-Supervised Learning of Pose-Canonicalized Neural Fields

Authors: Rohith Agaram, Shaurya Dewan, Rahul Sajnani, Adrien Poulenard, Madhava Krishna, Srinath Sridhar

Abstract: Coordinate-based implicit neural networks, or neural fields, have emerged as useful representations of shape and appearance in 3D computer vision. Despite advances, however, it remains challenging to build neural fields for categories of objects without datasets like ShapeNet that provide "canonicalized" object instances that are consistently aligned for their 3D position and orientation (pose). W… ▽ More Coordinate-based implicit neural networks, or neural fields, have emerged as useful representations of shape and appearance in 3D computer vision. Despite advances, however, it remains challenging to build neural fields for categories of objects without datasets like ShapeNet that provide "canonicalized" object instances that are consistently aligned for their 3D position and orientation (pose). We present Canonical Field Network (CaFi-Net), a self-supervised method to canonicalize the 3D pose of instances from an object category represented as neural fields, specifically neural radiance fields (NeRFs). CaFi-Net directly learns from continuous and noisy radiance fields using a Siamese network architecture that is designed to extract equivariant field features for category-level canonicalization. During inference, our method takes pre-trained neural radiance fields of novel object instances at arbitrary 3D pose and estimates a canonical field with consistent 3D pose across the entire category. Extensive experiments on a new dataset of 1300 NeRF models across 13 object categories show that our method matches or exceeds the performance of 3D point cloud-based methods. △ Less

Submitted 17 May, 2023; v1 submitted 5 December, 2022; originally announced December 2022.

arXiv:2211.16882 [pdf, other]

MVRackLay: Monocular Multi-View Layout Estimation for Warehouse Racks and Shelves

Authors: Pranjali Pathre, Anurag Sahu, Ashwin Rao, Avinash Prabhu, Meher Shashwat Nigam, Tanvi Karandikar, Harit Pandya, K. Madhava Krishna

Abstract: In this paper, we propose and showcase, for the first time, monocular multi-view layout estimation for warehouse racks and shelves. Unlike typical layout estimation methods, MVRackLay estimates multi-layered layouts, wherein each layer corresponds to the layout of a shelf within a rack. Given a sequence of images of a warehouse scene, a dual-headed Convolutional-LSTM architecture outputs segmented… ▽ More In this paper, we propose and showcase, for the first time, monocular multi-view layout estimation for warehouse racks and shelves. Unlike typical layout estimation methods, MVRackLay estimates multi-layered layouts, wherein each layer corresponds to the layout of a shelf within a rack. Given a sequence of images of a warehouse scene, a dual-headed Convolutional-LSTM architecture outputs segmented racks, the front and the top view layout of each shelf within a rack. With minimal effort, such an output is transformed into a 3D rendering of all racks, shelves and objects on the shelves, giving an accurate 3D depiction of the entire warehouse scene in terms of racks, shelves and the number of objects on each shelf. MVRackLay generalizes to a diverse set of warehouse scenes with varying number of objects on each shelf, number of shelves and in the presence of other such racks in the background. Further, MVRackLay shows superior performance vis-a-vis its single view counterpart, RackLay, in layout accuracy, quantized in terms of the mean IoU and mAP metrics. We also showcase a multi-view stitching of the 3D layouts resulting in a representation of the warehouse scene with respect to a global reference frame akin to a rendering of the scene from a SLAM pipeline. To the best of our knowledge, this is the first such work to portray a 3D rendering of a warehouse scene in terms of its semantic components - Racks, Shelves and Objects - all from a single monocular camera. △ Less

Submitted 30 November, 2022; originally announced November 2022.

Journal ref: IEEE International Conference on Robotics and Biomimetics (ROBIO) 2022

arXiv:2210.10005 [pdf, other]

Otsu based Differential Evolution Method for Image Segmentation

Authors: Afreen Shaikh, Sharmila Botcha, Murali Krishna

Abstract: This paper proposes an OTSU based differential evolution method for satellite image segmentation and compares it with four other methods such as Modified Artificial Bee Colony Optimizer (MABC), Artificial Bee Colony (ABC), Genetic Algorithm (GA), and Particle Swarm Optimization (PSO) using the objective function proposed by Otsu for optimal multilevel thresholding. The experiments conducted and th… ▽ More This paper proposes an OTSU based differential evolution method for satellite image segmentation and compares it with four other methods such as Modified Artificial Bee Colony Optimizer (MABC), Artificial Bee Colony (ABC), Genetic Algorithm (GA), and Particle Swarm Optimization (PSO) using the objective function proposed by Otsu for optimal multilevel thresholding. The experiments conducted and their results illustrate that our proposed DE and OTSU algorithm segmentation can effectively and precisely segment the input image, close to results obtained by the other methods. In the proposed DE and OTSU algorithm, instead of passing the fitness function variables, the entire image is passed as an input to the DE algorithm after obtaining the threshold values for the input number of levels in the OTSU algorithm. The image segmentation results are obtained after learning about the image instead of learning about the fitness variables. In comparison to other segmentation methods examined, the proposed DE and OTSU algorithm yields promising results with minimized computational time compared to some algorithms. △ Less

Submitted 18 October, 2022; originally announced October 2022.

ACM Class: I.2.10; I.4.6

arXiv:2210.07062 [pdf, ps, other]

Non-Archimedean Welch Bounds and Non-Archimedean Zauner Conjecture

Authors: K. Mahesh Krishna

Abstract: Let $\mathbb{K}$ be a non-Archimedean (complete) valued field satisfying \begin{align*} \left|\sum_{j=1}^{n}λ_j^2\right|=\max_{1\leq j \leq n}|λ_j|^2, \quad \forall λ_j \in \mathbb{K}, 1\leq j \leq n, \forall n \in \mathbb{N}. \end{align*} For $d\in \mathbb{N}$, let $\mathbb{K}^d$ be the standard $d$-dimensional non-Archimedean Hilbert space. Let $m \in \mathbb{N}$ and… ▽ More Let $\mathbb{K}$ be a non-Archimedean (complete) valued field satisfying \begin{align*} \left|\sum_{j=1}^{n}λ_j^2\right|=\max_{1\leq j \leq n}|λ_j|^2, \quad \forall λ_j \in \mathbb{K}, 1\leq j \leq n, \forall n \in \mathbb{N}. \end{align*} For $d\in \mathbb{N}$, let $\mathbb{K}^d$ be the standard $d$-dimensional non-Archimedean Hilbert space. Let $m \in \mathbb{N}$ and $\text{Sym}^m(\mathbb{K}^d)$ be the non-Archimedean Hilbert space of symmetric m-tensors. We prove the following result. If $\{τ_j\}_{j=1}^n$ is a collection in $\mathbb{K}^d$ satisfying $\langle τ_j, τ_j\rangle =1$ for all $1\leq j \leq n$ and the operator $\text{Sym}^m(\mathbb{K}^d)\ni x \mapsto \sum_{j=1}^n\langle x, τ_j^{\otimes m}\rangle τ_j^{\otimes m} \in \text{Sym}^m(\mathbb{K}^d)$ is diagonalizable, then \begin{align} (1) \quad \quad \quad \max_{1\leq j,k \leq n, j \neq k}\{|n|, |\langle τ_j, τ_k\rangle|^{2m} \}\geq \frac{|n|^2}{\left|{d+m-1 \choose m}\right| }. \end{align} We call Inequality (1) as the non-Archimedean version of Welch bounds obtained by Welch [\textit{IEEE Transactions on Information Theory, 1974}]. We formulate non-Archimedean Zauner conjecture. △ Less

Submitted 28 August, 2022; originally announced October 2022.

Comments: 9 Pages, 0 Figures

MSC Class: 12J25; 46S10; 47S10

arXiv:2209.14922 [pdf, other]

GDIP: Gated Differentiable Image Processing for Object-Detection in Adverse Conditions

Authors: Sanket Kalwar, Dhruv Patel, Aakash Aanegola, Krishna Reddy Konda, Sourav Garg, K Madhava Krishna

Abstract: Detecting objects under adverse weather and lighting conditions is crucial for the safe and continuous operation of an autonomous vehicle, and remains an unsolved problem. We present a Gated Differentiable Image Processing (GDIP) block, a domain-agnostic network architecture, which can be plugged into existing object detection networks (e.g., Yolo) and trained end-to-end with adverse condition ima… ▽ More Detecting objects under adverse weather and lighting conditions is crucial for the safe and continuous operation of an autonomous vehicle, and remains an unsolved problem. We present a Gated Differentiable Image Processing (GDIP) block, a domain-agnostic network architecture, which can be plugged into existing object detection networks (e.g., Yolo) and trained end-to-end with adverse condition images such as those captured under fog and low lighting. Our proposed GDIP block learns to enhance images directly through the downstream object detection loss. This is achieved by learning parameters of multiple image pre-processing (IP) techniques that operate concurrently, with their outputs combined using weights learned through a novel gating mechanism. We further improve GDIP through a multi-stage guidance procedure for progressive image enhancement. Finally, trading off accuracy for speed, we propose a variant of GDIP that can be used as a regularizer for training Yolo, which eliminates the need for GDIP-based image enhancement during inference, resulting in higher throughput and plausible real-world deployment. We demonstrate significant improvement in detection performance over several state-of-the-art methods through quantitative and qualitative studies on synthetic datasets such as PascalVOC, and real-world foggy (RTTS) and low-lighting (ExDark) datasets. △ Less

Submitted 29 September, 2022; originally announced September 2022.

Comments: Submitted to ICRA2023. More information at https://gatedip.github.io

arXiv:2209.13418 [pdf, other]

UAV-based Visual Remote Sensing for Automated Building Inspection

Authors: Kushagra Srivastava, Dhruv Patel, Aditya Kumar Jha, Mohhit Kumar Jha, Jaskirat Singh, Ravi Kiran Sarvadevabhatla, Pradeep Kumar Ramancharla, Harikumar Kandath, K. Madhava Krishna

Abstract: Unmanned Aerial Vehicle (UAV) based remote sensing system incorporated with computer vision has demonstrated potential for assisting building construction and in disaster management like damage assessment during earthquakes. The vulnerability of a building to earthquake can be assessed through inspection that takes into account the expected damage progression of the associated component and the co… ▽ More Unmanned Aerial Vehicle (UAV) based remote sensing system incorporated with computer vision has demonstrated potential for assisting building construction and in disaster management like damage assessment during earthquakes. The vulnerability of a building to earthquake can be assessed through inspection that takes into account the expected damage progression of the associated component and the component's contribution to structural system performance. Most of these inspections are done manually, leading to high utilization of manpower, time, and cost. This paper proposes a methodology to automate these inspections through UAV-based image data collection and a software library for post-processing that helps in estimating the seismic structural parameters. The key parameters considered here are the distances between adjacent buildings, building plan-shape, building plan area, objects on the rooftop and rooftop layout. The accuracy of the proposed methodology in estimating the above-mentioned parameters is verified through field measurements taken using a distance measuring sensor and also from the data obtained through Google Earth. Additional details and code can be accessed from https://uvrsabi.github.io/ . △ Less

Submitted 27 September, 2022; originally announced September 2022.

Comments: Paper accepted at CVCIE Workshop at ECCV, 2022 and the project page is https://uvrsabi.github.io/

arXiv:2209.11972 [pdf, other]

Ground then Navigate: Language-guided Navigation in Dynamic Scenes

Authors: Kanishk Jain, Varun Chhangani, Amogh Tiwari, K. Madhava Krishna, Vineet Gandhi

Abstract: We investigate the Vision-and-Language Navigation (VLN) problem in the context of autonomous driving in outdoor settings. We solve the problem by explicitly grounding the navigable regions corresponding to the textual command. At each timestamp, the model predicts a segmentation mask corresponding to the intermediate or the final navigable region. Our work contrasts with existing efforts in VLN, w… ▽ More We investigate the Vision-and-Language Navigation (VLN) problem in the context of autonomous driving in outdoor settings. We solve the problem by explicitly grounding the navigable regions corresponding to the textual command. At each timestamp, the model predicts a segmentation mask corresponding to the intermediate or the final navigable region. Our work contrasts with existing efforts in VLN, which pose this task as a node selection problem, given a discrete connected graph corresponding to the environment. We do not assume the availability of such a discretised map. Our work moves towards continuity in action space, provides interpretability through visual feedback and allows VLN on commands requiring finer manoeuvres like "park between the two cars". Furthermore, we propose a novel meta-dataset CARLA-NAV to allow efficient training and validation. The dataset comprises pre-recorded training sequences and a live environment for validation and testing. We provide extensive qualitative and quantitive empirical results to validate the efficacy of the proposed approach. △ Less

Submitted 24 September, 2022; originally announced September 2022.

arXiv:2209.04805 [pdf, other]

Real-Time Heuristic Framework for Safe Landing of UAVs in Dynamic Scenarios

Authors: Jaskirat Singh, Neel Adwani, Harikumar Kandath, K. Madhava Krishna

Abstract: The world we live in is full of technology and with each passing day the advancement and usage of UAVs increases efficiently. As a result of the many application scenarios, there are some missions where the UAVs are vulnerable to external disruptions, such as a ground station's loss of connectivity, security missions, safety concerns, and delivery-related missions. Therefore, depending on the scen… ▽ More The world we live in is full of technology and with each passing day the advancement and usage of UAVs increases efficiently. As a result of the many application scenarios, there are some missions where the UAVs are vulnerable to external disruptions, such as a ground station's loss of connectivity, security missions, safety concerns, and delivery-related missions. Therefore, depending on the scenario, this could affect the operations and result in the safe landing of UAVs. Hence, this paper presents a heuristic approach towards safe landing of multi-rotor UAVs in the dynamic environments. The aim of this approach is to detect safe potential landing zones - PLZ, and find out the best one to land in. The PLZ is initially, detected by processing an image through the canny edge algorithm, and then the diameter-area estimation is applied for each region with minimal edges. The spots that have a higher area than the vehicle's clearance are labeled as safe PLZ. Onto the second phase of this approach, the velocities of dynamic obstacles that are moving towards the PLZs are calculated and their time to reach the zones are taken into consideration. The ETA of the UAV is calculated and during the descending of UAV, the dynamic obstacle avoidance is executed. The approach tested on the real-world environments have shown better results from existing work. △ Less

Submitted 11 September, 2022; originally announced September 2022.

Comments: 8 pages, 6 figures, 36 references

arXiv:2208.13031 [pdf, other]

Spatial Relation Graph and Graph Convolutional Network for Object Goal Navigation

Authors: D. A. Sasi Kiran, Kritika Anand, Chaitanya Kharyal, Gulshan Kumar, Nandiraju Gireesh, Snehasis Banerjee, Ruddra dev Roychoudhury, Mohan Sridharan, Brojeshwar Bhowmick, Madhava Krishna

Abstract: This paper describes a framework for the object-goal navigation task, which requires a robot to find and move to the closest instance of a target object class from a random starting position. The framework uses a history of robot trajectories to learn a Spatial Relational Graph (SRG) and Graph Convolutional Network (GCN)-based embeddings for the likelihood of proximity of different semantically-la… ▽ More This paper describes a framework for the object-goal navigation task, which requires a robot to find and move to the closest instance of a target object class from a random starting position. The framework uses a history of robot trajectories to learn a Spatial Relational Graph (SRG) and Graph Convolutional Network (GCN)-based embeddings for the likelihood of proximity of different semantically-labeled regions and the occurrence of different object classes in these regions. To locate a target object instance during evaluation, the robot uses Bayesian inference and the SRG to estimate the visible regions, and uses the learned GCN embeddings to rank visible regions and select the region to explore next. △ Less

Submitted 27 August, 2022; originally announced August 2022.

Comments: CASE 2022 paper

arXiv:2208.13009 [pdf, other]

Object Goal Navigation using Data Regularized Q-Learning

Authors: Nandiraju Gireesh, D. A. Sasi Kiran, Snehasis Banerjee, Mohan Sridharan, Brojeshwar Bhowmick, Madhava Krishna

Abstract: Object Goal Navigation requires a robot to find and navigate to an instance of a target object class in a previously unseen environment. Our framework incrementally builds a semantic map of the environment over time, and then repeatedly selects a long-term goal ('where to go') based on the semantic map to locate the target object instance. Long-term goal selection is formulated as a vision-based d… ▽ More Object Goal Navigation requires a robot to find and navigate to an instance of a target object class in a previously unseen environment. Our framework incrementally builds a semantic map of the environment over time, and then repeatedly selects a long-term goal ('where to go') based on the semantic map to locate the target object instance. Long-term goal selection is formulated as a vision-based deep reinforcement learning problem. Specifically, an Encoder Network is trained to extract high-level features from a semantic map and select a long-term goal. In addition, we incorporate data augmentation and Q-function regularization to make the long-term goal selection more effective. We report experimental results using the photo-realistic Gibson benchmark dataset in the AI Habitat 3D simulation environment to demonstrate substantial performance improvement on standard measures in comparison with a state of the art data-driven baseline. △ Less

Submitted 27 August, 2022; originally announced August 2022.

Comments: CASE 2022 paper

arXiv:2208.03038 [pdf, other]

Leveraging Distributional Bias for Reactive Collision Avoidance under Uncertainty: A Kernel Embedding Approach

Authors: Anish Gupta, Arun Kumar Singh, K. Madhava Krishna

Abstract: Many commodity sensors that measure the robot and dynamic obstacle's state have non-Gaussian noise characteristics. Yet, many current approaches treat the underlying-uncertainty in motion and perception as Gaussian, primarily to ensure computational tractability. On the other hand, existing planners working with non-Gaussian uncertainty do not shed light on leveraging distributional characteristic… ▽ More Many commodity sensors that measure the robot and dynamic obstacle's state have non-Gaussian noise characteristics. Yet, many current approaches treat the underlying-uncertainty in motion and perception as Gaussian, primarily to ensure computational tractability. On the other hand, existing planners working with non-Gaussian uncertainty do not shed light on leveraging distributional characteristics of motion and perception noise, such as bias for efficient collision avoidance. This paper fills this gap by interpreting reactive collision avoidance as a distribution matching problem between the collision constraint violations and Dirac Delta distribution. To ensure fast reactivity in the planner, we embed each distribution in Reproducing Kernel Hilbert Space and reformulate the distribution matching as minimizing the Maximum Mean Discrepancy (MMD) between the two distributions. We show that evaluating the MMD for a given control input boils down to just matrix-matrix products. We leverage this insight to develop a simple control sampling approach for reactive collision avoidance with dynamic and uncertain obstacles. We advance the state-of-the-art in two respects. First, we conduct an extensive empirical study to show that our planner can infer distributional bias from sample-level information. Consequently, it uses this insight to guide the robot to good homotopy. We also highlight how a Gaussian approximation of the underlying uncertainty can lose the bias estimate and guide the robot to unfavorable states with a high collision probability. Second, we show tangible comparative advantages of the proposed distribution matching approach for collision avoidance with previous non-parametric and Gaussian approximated methods of reactive collision avoidance. △ Less

Submitted 22 September, 2022; v1 submitted 5 August, 2022; originally announced August 2022.

arXiv:2207.03557 [pdf, other]

Flow Synthesis Based Visual Servoing Frameworks for Monocular Obstacle Avoidance Amidst High-Rises

Authors: Harshit K. Sankhla, M. Nomaan Qureshi, Shankara Narayanan V., Vedansh Mittal, Gunjan Gupta, Harit Pandya, K. Madhava Krishna

Abstract: We propose a novel flow synthesis based visual servoing framework enabling long-range obstacle avoidance for Micro Air Vehicles (MAV) flying amongst tall skyscrapers. Recent deep learning based frameworks use optical flow to do high-precision visual servoing. In this paper, we explore the question: can we design a surrogate flow for these high-precision visual-servoing methods, which leads to obst… ▽ More We propose a novel flow synthesis based visual servoing framework enabling long-range obstacle avoidance for Micro Air Vehicles (MAV) flying amongst tall skyscrapers. Recent deep learning based frameworks use optical flow to do high-precision visual servoing. In this paper, we explore the question: can we design a surrogate flow for these high-precision visual-servoing methods, which leads to obstacle avoidance? We revisit the concept of saliency for identifying high-rise structures in/close to the line of attack amongst other competing skyscrapers and buildings as a collision obstacle. A synthesised flow is used to displace the salient object segmentation mask. This flow is so computed that the visual servoing controller maneuvers the MAV safely around the obstacle. In this approach, we use a multi-step Cross-Entropy Method (CEM) based servo control to achieve flow convergence, resulting in obstacle avoidance. We use this novel pipeline to successfully and persistently maneuver high-rises and reach the goal in simulated and photo-realistic real-world scenes. We conduct extensive experimentation and compare our approach with optical flow and short-range depth-based obstacle avoidance methods to demonstrate the proposed framework's merit. Additional Visualisation can be found at https://sites.google.com/view/monocular-obstacle/home △ Less

Submitted 7 July, 2022; originally announced July 2022.

Comments: Accepted to IEEE International Conference on Automation Science and Engineering (CASE), 2022

arXiv:2205.04090 [pdf, other]

Approaches and Challenges in Robotic Perception for Table-top Rearrangement and Planning

Authors: Aditya Agarwal, Bipasha Sen, Shankara Narayanan V, Vishal Reddy Mandadi, Brojeshwar Bhowmick, K Madhava Krishna

Abstract: Table-top Rearrangement and Planning is a challenging problem that relies heavily on an excellent perception stack. The perception stack involves observing and registering the 3D scene on the table, detecting what objects are on the table, and how to manipulate them. Consequently, it greatly influences the system's task-planning and motion-planning stacks that follow. We present a comprehensive ov… ▽ More Table-top Rearrangement and Planning is a challenging problem that relies heavily on an excellent perception stack. The perception stack involves observing and registering the 3D scene on the table, detecting what objects are on the table, and how to manipulate them. Consequently, it greatly influences the system's task-planning and motion-planning stacks that follow. We present a comprehensive overview and discuss the different challenges associated with the perception module. This work is a result of our extensive involvement in the ICRA 2022 Open Cloud Robot Table Organization Challenge, in which we stood third in the final rankings. △ Less

Submitted 3 June, 2022; v1 submitted 9 May, 2022; originally announced May 2022.

Comments: 5 pages including references, 3 figures

arXiv:2204.00865 [pdf, other]

UrbanFly: Uncertainty-Aware Planning for Navigation Amongst High-Rises with Monocular Visual-Inertial SLAM Maps

Authors: Sudarshan S Harithas, Ayyappa Swamy Thatavarthy, Gurkirat Singh, Arun K Singh, K Madhava Krishna

Abstract: We present UrbanFly: an uncertainty-aware real-time planning framework for quadrotor navigation in urban high-rise environments. A core aspect of UrbanFly is its ability to robustly plan directly on the sparse point clouds generated by a Monocular Visual Inertial SLAM (VINS) backend. It achieves this by using the sparse point clouds to build an uncertainty-integrated cuboid representation of the e… ▽ More We present UrbanFly: an uncertainty-aware real-time planning framework for quadrotor navigation in urban high-rise environments. A core aspect of UrbanFly is its ability to robustly plan directly on the sparse point clouds generated by a Monocular Visual Inertial SLAM (VINS) backend. It achieves this by using the sparse point clouds to build an uncertainty-integrated cuboid representation of the environment through a data-driven monocular plane segmentation network. Our chosen world model provides faster distance queries than the more common voxel-grid representation, and UrbanFly leverages this capability in two different ways leading to two trajectory optimizers. The first optimizer uses a gradient-free cross-entropy method to compute trajectories that minimize collision probability and smoothness cost. Our second optimizer is a simplified version of the first and uses a sequential convex programming optimizer initialized based on probabilistic safety estimates on a set of randomly drawn trajectories. Both our trajectory optimizers are made computationally tractable and independent of the nature of underlying uncertainty by embedding the distribution of collision violations in Reproducing Kernel Hilbert Space. Empowered by the algorithmic innovation, UrbanFly outperforms competing baselines in metrics such as collision rate, trajectory length, etc., on a high-fidelity AirSim simulator augmented with synthetic and real-world dataset scenes. △ Less

Submitted 3 October, 2022; v1 submitted 2 April, 2022; originally announced April 2022.

Comments: Submitted to ACC 2023, Code available at https://github.com/sudarshan-s-harithas/UrbanFly

Showing 1–50 of 134 results for author: Krishna, M