Search | arXiv e-print repository

doi 10.1049/PBPC027E_ch5

SLA Conceptual Model for IoT Applications

Authors: Awatif Alqahtani, Ellis Solaiman, Rajiv Ranjan

Abstract: Since SLAs specify the contractual terms that are formally used between consumers and providers, there is a need to aggregate QoS requirements from the perspectives of Clouds, networks, and devices to deliver the promised IoT functionalities. Therefore, the main objective of this chapter is to provide a conceptual model of SLA for the IoT as well as rich vocabularies to describe the QoS and domain… ▽ More Since SLAs specify the contractual terms that are formally used between consumers and providers, there is a need to aggregate QoS requirements from the perspectives of Clouds, networks, and devices to deliver the promised IoT functionalities. Therefore, the main objective of this chapter is to provide a conceptual model of SLA for the IoT as well as rich vocabularies to describe the QoS and domain-specific configuration parameters of the IoT on an end-to-end basis. We first propose a conceptual model that identifies the main concepts that play a role in specifying end-to-end SLAs. Then, we identify some of the most common QoS metrics and configuration parameters related to each concept. We evaluated the proposed conceptual model using a goal-oriented approach, and the participants in the study reported a high level of satisfaction regarding the proposed conceptual model and its ability to capture main concepts in a general way. △ Less

Submitted 23 August, 2024; originally announced August 2024.

Journal ref: Managing Internet of Things Applications across Edge and Cloud Data Centres, IET, 2024

arXiv:2408.13102 [pdf, other]

Dynamic Label Adversarial Training for Deep Learning Robustness Against Adversarial Attacks

Authors: Zhenyu Liu, Haoran Duan, Huizhi Liang, Yang Long, Vaclav Snasel, Guiseppe Nicosia, Rajiv Ranjan, Varun Ojha

Abstract: Adversarial training is one of the most effective methods for enhancing model robustness. Recent approaches incorporate adversarial distillation in adversarial training architectures. However, we notice two scenarios of defense methods that limit their performance: (1) Previous methods primarily use static ground truth for adversarial training, but this often causes robust overfitting; (2) The los… ▽ More Adversarial training is one of the most effective methods for enhancing model robustness. Recent approaches incorporate adversarial distillation in adversarial training architectures. However, we notice two scenarios of defense methods that limit their performance: (1) Previous methods primarily use static ground truth for adversarial training, but this often causes robust overfitting; (2) The loss functions are either Mean Squared Error or KL-divergence leading to a sub-optimal performance on clean accuracy. To solve those problems, we propose a dynamic label adversarial training (DYNAT) algorithm that enables the target model to gradually and dynamically gain robustness from the guide model's decisions. Additionally, we found that a budgeted dimension of inner optimization for the target model may contribute to the trade-off between clean accuracy and robust accuracy. Therefore, we propose a novel inner optimization method to be incorporated into the adversarial training. This will enable the target model to adaptively search for adversarial examples based on dynamic labels from the guiding model, contributing to the robustness of the target model. Extensive experiments validate the superior performance of our approach. △ Less

Submitted 23 August, 2024; originally announced August 2024.

Journal ref: 31st International Conference on Neural Information Processing (ICONIP), 2024

arXiv:2408.10752 [pdf, other]

Security Assessment of Hierarchical Federated Deep Learning

Authors: D Alqattan, R Sun, H Liang, G Nicosia, V Snasel, R Ranjan, V Ojha

Abstract: Hierarchical federated learning (HFL) is a promising distributed deep learning model training paradigm, but it has crucial security concerns arising from adversarial attacks. This research investigates and assesses the security of HFL using a novel methodology by focusing on its resilience against adversarial attacks inference-time and training-time. Through a series of extensive experiments acros… ▽ More Hierarchical federated learning (HFL) is a promising distributed deep learning model training paradigm, but it has crucial security concerns arising from adversarial attacks. This research investigates and assesses the security of HFL using a novel methodology by focusing on its resilience against adversarial attacks inference-time and training-time. Through a series of extensive experiments across diverse datasets and attack scenarios, we uncover that HFL demonstrates robustness against untargeted training-time attacks due to its hierarchical structure. However, targeted attacks, particularly backdoor attacks, exploit this architecture, especially when malicious clients are positioned in the overlapping coverage areas of edge servers. Consequently, HFL shows a dual nature in its resilience, showcasing its capability to recover from attacks thanks to its hierarchical aggregation that strengthens its suitability for adversarial training, thereby reinforcing its resistance against inference-time attacks. These insights underscore the necessity for balanced security strategies in HFL systems, leveraging their inherent strengths while effectively mitigating vulnerabilities. △ Less

Submitted 20 August, 2024; originally announced August 2024.

Journal ref: 33rd International Conference on Artificial Neural Networks (ICANN) (2024)

arXiv:2407.20060 [pdf, other]

RelBench: A Benchmark for Deep Learning on Relational Databases

Authors: Joshua Robinson, Rishabh Ranjan, Weihua Hu, Kexin Huang, Jiaqi Han, Alejandro Dobles, Matthias Fey, Jan E. Lenssen, Yiwen Yuan, Zecheng Zhang, Xinwei He, Jure Leskovec

Abstract: We present RelBench, a public benchmark for solving predictive tasks over relational databases with graph neural networks. RelBench provides databases and tasks spanning diverse domains and scales, and is intended to be a foundational infrastructure for future research. We use RelBench to conduct the first comprehensive study of Relational Deep Learning (RDL) (Fey et al., 2024), which combines gra… ▽ More We present RelBench, a public benchmark for solving predictive tasks over relational databases with graph neural networks. RelBench provides databases and tasks spanning diverse domains and scales, and is intended to be a foundational infrastructure for future research. We use RelBench to conduct the first comprehensive study of Relational Deep Learning (RDL) (Fey et al., 2024), which combines graph neural network predictive models with (deep) tabular models that extract initial entity-level representations from raw tables. End-to-end learned RDL models fully exploit the predictive signal encoded in primary-foreign key links, marking a significant shift away from the dominant paradigm of manual feature engineering combined with tabular models. To thoroughly evaluate RDL against this prior gold-standard, we conduct an in-depth user study where an experienced data scientist manually engineers features for each task. In this study, RDL learns better models whilst reducing human work needed by more than an order of magnitude. This demonstrates the power of deep learning for solving predictive tasks over relational databases, opening up many new research opportunities enabled by RelBench. △ Less

Submitted 29 July, 2024; originally announced July 2024.

arXiv:2407.12801 [pdf]

Evaluation of LLMs Biases Towards Elite Universities: A Persona-Based Exploration

Authors: Shailja Gupta, Rajesh Ranjan

Abstract: This study investigates whether popular LLMs exhibit bias towards elite universities when generating personas for technology industry professionals. We employed a novel persona-based approach to compare the educational background predictions of GPT-3.5, Gemini, and Claude 3 Sonnet with actual data from LinkedIn. The study focused on various roles at Microsoft, Meta, and Google, including VP Produc… ▽ More This study investigates whether popular LLMs exhibit bias towards elite universities when generating personas for technology industry professionals. We employed a novel persona-based approach to compare the educational background predictions of GPT-3.5, Gemini, and Claude 3 Sonnet with actual data from LinkedIn. The study focused on various roles at Microsoft, Meta, and Google, including VP Product, Director of Engineering, and Software Engineer. We generated 432 personas across the three LLMs and analyzed the frequency of elite universities (Stanford, MIT, UC Berkeley, and Harvard) in these personas compared to LinkedIn data. Results showed that LLMs significantly overrepresented elite universities, featuring these universities 72.45% of the time, compared to only 8.56% in the actual LinkedIn data. ChatGPT 3.5 exhibited the highest bias, followed by Claude Sonnet 3, while Gemini performed best. This research highlights the need to address educational bias in LLMs and suggests strategies for mitigating such biases in AI-driven recruitment processes. △ Less

Submitted 27 July, 2024; v1 submitted 24 June, 2024; originally announced July 2024.

Comments: 10 pages, 4 Figures

arXiv:2407.07254 [pdf, other]

HAMIL-QA: Hierarchical Approach to Multiple Instance Learning for Atrial LGE MRI Quality Assessment

Authors: K M Arefeen Sultan, Md Hasibul Husain Hisham, Benjamin Orkild, Alan Morris, Eugene Kholmovski, Erik Bieging, Eugene Kwan, Ravi Ranjan, Ed DiBella, Shireen Elhabian

Abstract: The accurate evaluation of left atrial fibrosis via high-quality 3D Late Gadolinium Enhancement (LGE) MRI is crucial for atrial fibrillation management but is hindered by factors like patient movement and imaging variability. The pursuit of automated LGE MRI quality assessment is critical for enhancing diagnostic accuracy, standardizing evaluations, and improving patient outcomes. The deep learnin… ▽ More The accurate evaluation of left atrial fibrosis via high-quality 3D Late Gadolinium Enhancement (LGE) MRI is crucial for atrial fibrillation management but is hindered by factors like patient movement and imaging variability. The pursuit of automated LGE MRI quality assessment is critical for enhancing diagnostic accuracy, standardizing evaluations, and improving patient outcomes. The deep learning models aimed at automating this process face significant challenges due to the scarcity of expert annotations, high computational costs, and the need to capture subtle diagnostic details in highly variable images. This study introduces HAMIL-QA, a multiple instance learning (MIL) framework, designed to overcome these obstacles. HAMIL-QA employs a hierarchical bag and sub-bag structure that allows for targeted analysis within sub-bags and aggregates insights at the volume level. This hierarchical MIL approach reduces reliance on extensive annotations, lessens computational load, and ensures clinically relevant quality predictions by focusing on diagnostically critical image features. Our experiments show that HAMIL-QA surpasses existing MIL methods and traditional supervised approaches in accuracy, AUROC, and F1-Score on an LGE MRI scan dataset, demonstrating its potential as a scalable solution for LGE MRI quality assessment automation. The code is available at: $\href{https://github.com/arf111/HAMIL-QA}{\text{this https URL}}$ △ Less

Submitted 9 July, 2024; originally announced July 2024.

Comments: Accepted to MICCAI2024, 10 pages, 2 figures

arXiv:2407.04190 [pdf, other]

Computer Vision for Clinical Gait Analysis: A Gait Abnormality Video Dataset

Authors: Rahm Ranjan, David Ahmedt-Aristizabal, Mohammad Ali Armin, Juno Kim

Abstract: Clinical gait analysis (CGA) using computer vision is an emerging field in artificial intelligence that faces barriers of accessible, real-world data, and clear task objectives. This paper lays the foundation for current developments in CGA as well as vision-based methods and datasets suitable for gait analysis. We introduce The Gait Abnormality in Video Dataset (GAVD) in response to our review of… ▽ More Clinical gait analysis (CGA) using computer vision is an emerging field in artificial intelligence that faces barriers of accessible, real-world data, and clear task objectives. This paper lays the foundation for current developments in CGA as well as vision-based methods and datasets suitable for gait analysis. We introduce The Gait Abnormality in Video Dataset (GAVD) in response to our review of over 150 current gait-related computer vision datasets, which highlighted the need for a large and accessible gait dataset clinically annotated for CGA. GAVD stands out as the largest video gait dataset, comprising 1874 sequences of normal, abnormal and pathological gaits. Additionally, GAVD includes clinically annotated RGB data sourced from publicly available content on online platforms. It also encompasses over 400 subjects who have undergone clinical grade visual screening to represent a diverse range of abnormal gait patterns, captured in various settings, including hospital clinics and urban uncontrolled outdoor environments. We demonstrate the validity of the dataset and utility of action recognition models for CGA using pretrained models Temporal Segment Networks(TSN) and SlowFast network to achieve video abnormality detection of 94% and 92% respectively when tested on GAVD dataset. A GitHub repository https://github.com/Rahmyyy/GAVD consisting of convenient URL links, and clinically relevant annotation for CGA is provided for over 450 online videos, featuring diverse subjects performing a range of normal, pathological, and abnormal gait patterns. △ Less

Submitted 4 July, 2024; originally announced July 2024.

ACM Class: I.2.10

arXiv:2406.02535 [pdf, other]

Enhancing 2D Representation Learning with a 3D Prior

Authors: Mehmet Aygün, Prithviraj Dhar, Zhicheng Yan, Oisin Mac Aodha, Rakesh Ranjan

Abstract: Learning robust and effective representations of visual data is a fundamental task in computer vision. Traditionally, this is achieved by training models with labeled data which can be expensive to obtain. Self-supervised learning attempts to circumvent the requirement for labeled data by learning representations from raw unlabeled visual data alone. However, unlike humans who obtain rich 3D infor… ▽ More Learning robust and effective representations of visual data is a fundamental task in computer vision. Traditionally, this is achieved by training models with labeled data which can be expensive to obtain. Self-supervised learning attempts to circumvent the requirement for labeled data by learning representations from raw unlabeled visual data alone. However, unlike humans who obtain rich 3D information from their binocular vision and through motion, the majority of current self-supervised methods are tasked with learning from monocular 2D image collections. This is noteworthy as it has been demonstrated that shape-centric visual processing is more robust compared to texture-biased automated methods. Inspired by this, we propose a new approach for strengthening existing self-supervised methods by explicitly enforcing a strong 3D structural prior directly into the model during training. Through experiments, across a range of datasets, we demonstrate that our 3D aware representations are more robust compared to conventional self-supervised baselines. △ Less

Submitted 4 June, 2024; originally announced June 2024.

arXiv:2405.20008 [pdf, other]

Sharing Key Semantics in Transformer Makes Efficient Image Restoration

Authors: Bin Ren, Yawei Li, Jingyun Liang, Rakesh Ranjan, Mengyuan Liu, Rita Cucchiara, Luc Van Gool, Ming-Hsuan Yang, Nicu Sebe

Abstract: Image Restoration (IR), a classic low-level vision task, has witnessed significant advancements through deep models that effectively model global information. Notably, the Vision Transformers (ViTs) emergence has further propelled these advancements. When computing, the self-attention mechanism, a cornerstone of ViTs, tends to encompass all global cues, even those from semantically unrelated objec… ▽ More Image Restoration (IR), a classic low-level vision task, has witnessed significant advancements through deep models that effectively model global information. Notably, the Vision Transformers (ViTs) emergence has further propelled these advancements. When computing, the self-attention mechanism, a cornerstone of ViTs, tends to encompass all global cues, even those from semantically unrelated objects or regions. This inclusivity introduces computational inefficiencies, particularly noticeable with high input resolution, as it requires processing irrelevant information, thereby impeding efficiency. Additionally, for IR, it is commonly noted that small segments of a degraded image, particularly those closely aligned semantically, provide particularly relevant information to aid in the restoration process, as they contribute essential contextual cues crucial for accurate reconstruction. To address these challenges, we propose boosting IR's performance by sharing the key semantics via Transformer for IR (i.e., SemanIR) in this paper. Specifically, SemanIR initially constructs a sparse yet comprehensive key-semantic dictionary within each transformer stage by establishing essential semantic connections for every degraded patch. Subsequently, this dictionary is shared across all subsequent transformer blocks within the same stage. This strategy optimizes attention calculation within each block by focusing exclusively on semantically related components stored in the key-semantic dictionary. As a result, attention calculation achieves linear computational complexity within each window. Extensive experiments across 6 IR tasks confirm the proposed SemanIR's state-of-the-art performance, quantitatively and qualitatively showcasing advancements. △ Less

Submitted 30 May, 2024; originally announced May 2024.

Comments: 9 pages

arXiv:2405.15962 [pdf]

doi 10.1016/j.ins.2024.120393

Wearable-based behaviour interpolation for semi-supervised human activity recognition

Authors: Haoran Duan, Shidong Wang, Varun Ojha, Shizheng Wang, Yawen Huang, Yang Long, Rajiv Ranjan, Yefeng Zheng

Abstract: While traditional feature engineering for Human Activity Recognition (HAR) involves a trial-anderror process, deep learning has emerged as a preferred method for high-level representations of sensor-based human activities. However, most deep learning-based HAR requires a large amount of labelled data and extracting HAR features from unlabelled data for effective deep learning training remains chal… ▽ More While traditional feature engineering for Human Activity Recognition (HAR) involves a trial-anderror process, deep learning has emerged as a preferred method for high-level representations of sensor-based human activities. However, most deep learning-based HAR requires a large amount of labelled data and extracting HAR features from unlabelled data for effective deep learning training remains challenging. We, therefore, introduce a deep semi-supervised HAR approach, MixHAR, which concurrently uses labelled and unlabelled activities. Our MixHAR employs a linear interpolation mechanism to blend labelled and unlabelled activities while addressing both inter- and intra-activity variability. A unique challenge identified is the activityintrusion problem during mixing, for which we propose a mixing calibration mechanism to mitigate it in the feature embedding space. Additionally, we rigorously explored and evaluated the five conventional/popular deep semi-supervised technologies on HAR, acting as the benchmark of deep semi-supervised HAR. Our results demonstrate that MixHAR significantly improves performance, underscoring the potential of deep semi-supervised techniques in HAR. △ Less

Submitted 24 May, 2024; originally announced May 2024.

arXiv:2405.15914 [pdf, other]

ExactDreamer: High-Fidelity Text-to-3D Content Creation via Exact Score Matching

Authors: Yumin Zhang, Xingyu Miao, Haoran Duan, Bo Wei, Tejal Shah, Yang Long, Rajiv Ranjan

Abstract: Text-to-3D content creation is a rapidly evolving research area. Given the scarcity of 3D data, current approaches often adapt pre-trained 2D diffusion models for 3D synthesis. Among these approaches, Score Distillation Sampling (SDS) has been widely adopted. However, the issue of over-smoothing poses a significant limitation on the high-fidelity generation of 3D models. To address this challenge,… ▽ More Text-to-3D content creation is a rapidly evolving research area. Given the scarcity of 3D data, current approaches often adapt pre-trained 2D diffusion models for 3D synthesis. Among these approaches, Score Distillation Sampling (SDS) has been widely adopted. However, the issue of over-smoothing poses a significant limitation on the high-fidelity generation of 3D models. To address this challenge, LucidDreamer replaces the Denoising Diffusion Probabilistic Model (DDPM) in SDS with the Denoising Diffusion Implicit Model (DDIM) to construct Interval Score Matching (ISM). However, ISM inevitably inherits inconsistencies from DDIM, causing reconstruction errors during the DDIM inversion process. This results in poor performance in the detailed generation of 3D objects and loss of content. To alleviate these problems, we propose a novel method named Exact Score Matching (ESM). Specifically, ESM leverages auxiliary variables to mathematically guarantee exact recovery in the DDIM reverse process. Furthermore, to effectively capture the dynamic changes of the original and auxiliary variables, the LoRA of a pre-trained diffusion model implements these exact paths. Extensive experiments demonstrate the effectiveness of ESM in text-to-3D generation, particularly highlighting its superiority in detailed generation. △ Less

Submitted 24 May, 2024; originally announced May 2024.

arXiv:2405.13900 [pdf, other]

Rehearsal-free Federated Domain-incremental Learning

Authors: Rui Sun, Haoran Duan, Jiahua Dong, Varun Ojha, Tejal Shah, Rajiv Ranjan

Abstract: We introduce a rehearsal-free federated domain incremental learning framework, RefFiL, based on a global prompt-sharing paradigm to alleviate catastrophic forgetting challenges in federated domain-incremental learning, where unseen domains are continually learned. Typical methods for mitigating forgetting, such as the use of additional datasets and the retention of private data from earlier tasks,… ▽ More We introduce a rehearsal-free federated domain incremental learning framework, RefFiL, based on a global prompt-sharing paradigm to alleviate catastrophic forgetting challenges in federated domain-incremental learning, where unseen domains are continually learned. Typical methods for mitigating forgetting, such as the use of additional datasets and the retention of private data from earlier tasks, are not viable in federated learning (FL) due to devices' limited resources. Our method, RefFiL, addresses this by learning domain-invariant knowledge and incorporating various domain-specific prompts from the domains represented by different FL participants. A key feature of RefFiL is the generation of local fine-grained prompts by our domain adaptive prompt generator, which effectively learns from local domain knowledge while maintaining distinctive boundaries on a global scale. We also introduce a domain-specific prompt contrastive learning loss that differentiates between locally generated prompts and those from other domains, enhancing RefFiL's precision and effectiveness. Compared to existing methods, RefFiL significantly alleviates catastrophic forgetting without requiring extra memory space, making it ideal for privacy-sensitive and resource-constrained devices. △ Less

Submitted 22 May, 2024; originally announced May 2024.

arXiv:2405.13370 [pdf, other]

Low-Resolution Chest X-ray Classification via Knowledge Distillation and Multi-task Learning

Authors: Yasmeena Akhter, Rishabh Ranjan, Richa Singh, Mayank Vatsa

Abstract: This research addresses the challenges of diagnosing chest X-rays (CXRs) at low resolutions, a common limitation in resource-constrained healthcare settings. High-resolution CXR imaging is crucial for identifying small but critical anomalies, such as nodules or opacities. However, when images are downsized for processing in Computer-Aided Diagnosis (CAD) systems, vital spatial details and receptiv… ▽ More This research addresses the challenges of diagnosing chest X-rays (CXRs) at low resolutions, a common limitation in resource-constrained healthcare settings. High-resolution CXR imaging is crucial for identifying small but critical anomalies, such as nodules or opacities. However, when images are downsized for processing in Computer-Aided Diagnosis (CAD) systems, vital spatial details and receptive fields are lost, hampering diagnosis accuracy. To address this, this paper presents the Multilevel Collaborative Attention Knowledge (MLCAK) method. This approach leverages the self-attention mechanism of Vision Transformers (ViT) to transfer critical diagnostic knowledge from high-resolution images to enhance the diagnostic efficacy of low-resolution CXRs. MLCAK incorporates local pathological findings to boost model explainability, enabling more accurate global predictions in a multi-task framework tailored for low-resolution CXR analysis. Our research, utilizing the Vindr CXR dataset, shows a considerable enhancement in the ability to diagnose diseases from low-resolution images (e.g. 28 x 28), suggesting a critical transition from the traditional reliance on high-resolution imaging (e.g. 224 x 224). △ Less

Submitted 22 May, 2024; originally announced May 2024.

Comments: IEEE ISBI 2024

arXiv:2405.11252 [pdf, other]

Dreamer XL: Towards High-Resolution Text-to-3D Generation via Trajectory Score Matching

Authors: Xingyu Miao, Haoran Duan, Varun Ojha, Jun Song, Tejal Shah, Yang Long, Rajiv Ranjan

Abstract: In this work, we propose a novel Trajectory Score Matching (TSM) method that aims to solve the pseudo ground truth inconsistency problem caused by the accumulated error in Interval Score Matching (ISM) when using the Denoising Diffusion Implicit Models (DDIM) inversion process. Unlike ISM which adopts the inversion process of DDIM to calculate on a single path, our TSM method leverages the inversi… ▽ More In this work, we propose a novel Trajectory Score Matching (TSM) method that aims to solve the pseudo ground truth inconsistency problem caused by the accumulated error in Interval Score Matching (ISM) when using the Denoising Diffusion Implicit Models (DDIM) inversion process. Unlike ISM which adopts the inversion process of DDIM to calculate on a single path, our TSM method leverages the inversion process of DDIM to generate two paths from the same starting point for calculation. Since both paths start from the same starting point, TSM can reduce the accumulated error compared to ISM, thus alleviating the problem of pseudo ground truth inconsistency. TSM enhances the stability and consistency of the model's generated paths during the distillation process. We demonstrate this experimentally and further show that ISM is a special case of TSM. Furthermore, to optimize the current multi-stage optimization process from high-resolution text to 3D generation, we adopt Stable Diffusion XL for guidance. In response to the issues of abnormal replication and splitting caused by unstable gradients during the 3D Gaussian splatting process when using Stable Diffusion XL, we propose a pixel-by-pixel gradient clipping method. Extensive experiments show that our model significantly surpasses the state-of-the-art models in terms of visual quality and performance. Code: \url{https://github.com/xingy038/Dreamer-XL}. △ Less

Submitted 18 May, 2024; originally announced May 2024.

arXiv:2405.10674 [pdf, other]

From Sora What We Can See: A Survey of Text-to-Video Generation

Authors: Rui Sun, Yumin Zhang, Tejal Shah, Jiahao Sun, Shuoying Zhang, Wenqi Li, Haoran Duan, Bo Wei, Rajiv Ranjan

Abstract: With impressive achievements made, artificial intelligence is on the path forward to artificial general intelligence. Sora, developed by OpenAI, which is capable of minute-level world-simulative abilities can be considered as a milestone on this developmental path. However, despite its notable successes, Sora still encounters various obstacles that need to be resolved. In this survey, we embark fr… ▽ More With impressive achievements made, artificial intelligence is on the path forward to artificial general intelligence. Sora, developed by OpenAI, which is capable of minute-level world-simulative abilities can be considered as a milestone on this developmental path. However, despite its notable successes, Sora still encounters various obstacles that need to be resolved. In this survey, we embark from the perspective of disassembling Sora in text-to-video generation, and conducting a comprehensive review of literature, trying to answer the question, \textit{From Sora What We Can See}. Specifically, after basic preliminaries regarding the general algorithms are introduced, the literature is categorized from three mutually perpendicular dimensions: evolutionary generators, excellent pursuit, and realistic panorama. Subsequently, the widely used datasets and metrics are organized in detail. Last but more importantly, we identify several challenges and open problems in this domain and propose potential future directions for research and development. △ Less

Submitted 17 May, 2024; originally announced May 2024.

Comments: A comprehensive list of text-to-video generation studies in this survey is available at https://github.com/soraw-ai/Awesome-Text-to-Video-Generation

arXiv:2405.02608 [pdf, other]

UnSAMFlow: Unsupervised Optical Flow Guided by Segment Anything Model

Authors: Shuai Yuan, Lei Luo, Zhuo Hui, Can Pu, Xiaoyu Xiang, Rakesh Ranjan, Denis Demandolx

Abstract: Traditional unsupervised optical flow methods are vulnerable to occlusions and motion boundaries due to lack of object-level information. Therefore, we propose UnSAMFlow, an unsupervised flow network that also leverages object information from the latest foundation model Segment Anything Model (SAM). We first include a self-supervised semantic augmentation module tailored to SAM masks. We also ana… ▽ More Traditional unsupervised optical flow methods are vulnerable to occlusions and motion boundaries due to lack of object-level information. Therefore, we propose UnSAMFlow, an unsupervised flow network that also leverages object information from the latest foundation model Segment Anything Model (SAM). We first include a self-supervised semantic augmentation module tailored to SAM masks. We also analyze the poor gradient landscapes of traditional smoothness losses and propose a new smoothness definition based on homography instead. A simple yet effective mask feature module has also been added to further aggregate features on the object level. With all these adaptations, our method produces clear optical flow estimation with sharp boundaries around objects, which outperforms state-of-the-art methods on both KITTI and Sintel datasets. Our method also generalizes well across domains and runs very efficiently. △ Less

Submitted 4 May, 2024; originally announced May 2024.

Comments: Accepted by CVPR 2024. Code is available at https://github.com/facebookresearch/UnSAMFlow

arXiv:2404.11481 [pdf, other]

doi 10.1002/spe.3084

IoTSim-Osmosis-RES: Towards autonomic renewable energy-aware osmotic computing

Authors: Tomasz Szydlo, Amadeusz Szabala, Nazar Kordiumov, Konrad Siuzdak, Lukasz Wolski, Khaled Alwasel, Fawzy Habeeb, Rajiv Ranjan

Abstract: Internet of Things systems exists in various areas of our everyday life. For example, sensors installed in smart cities and homes are processed in edge and cloud computing centres providing several benefits that improve our lives. The place of data processing is related to the required system response times -- processing data closer to its source results in a shorter system response time. The Osmo… ▽ More Internet of Things systems exists in various areas of our everyday life. For example, sensors installed in smart cities and homes are processed in edge and cloud computing centres providing several benefits that improve our lives. The place of data processing is related to the required system response times -- processing data closer to its source results in a shorter system response time. The Osmotic Computing concept enables flexible deployment of data processing services and their possible movement, just like particles in the osmosis phenomenon move between regions of different densities. At the same time, the impact of complex computer architecture on the environment is increasingly being compensated by the use of renewable and low-carbon energy sources. However, the uncertainty of supplying green energy makes the management of Osmotic Computing demanding, and therefore their autonomy is desirable. In the paper, we present a framework enabling osmotic computing simulation based on renewable energy sources and autonomic osmotic agents, allowing the analysis of distributed management algorithms. We discuss the challenges posed to the framework and analyze various management algorithms for cooperating osmotic agents. In the evaluation we show that changing the adaptation logic of the osmotic agents, it is possible to increase the self-consumption of renewable energy sources or increase the usage of low emission ones. △ Less

Submitted 17 April, 2024; originally announced April 2024.

arXiv:2404.07815 [pdf, other]

Post-Hoc Reversal: Are We Selecting Models Prematurely?

Authors: Rishabh Ranjan, Saurabh Garg, Mrigank Raman, Carlos Guestrin, Zachary Chase Lipton

Abstract: Trained models are often composed with post-hoc transforms such as temperature scaling (TS), ensembling and stochastic weight averaging (SWA) to improve performance, robustness, uncertainty estimation, etc. However, such transforms are typically applied only after the base models have already been finalized by standard means. In this paper, we challenge this practice with an extensive empirical st… ▽ More Trained models are often composed with post-hoc transforms such as temperature scaling (TS), ensembling and stochastic weight averaging (SWA) to improve performance, robustness, uncertainty estimation, etc. However, such transforms are typically applied only after the base models have already been finalized by standard means. In this paper, we challenge this practice with an extensive empirical study. In particular, we demonstrate a phenomenon that we call post-hoc reversal, where performance trends are reversed after applying these post-hoc transforms. This phenomenon is especially prominent in high-noise settings. For example, while base models overfit badly early in training, both conventional ensembling and SWA favor base models trained for more epochs. Post-hoc reversal can also suppress the appearance of double descent and mitigate mismatches between test loss and test error seen in base models. Based on our findings, we propose post-hoc selection, a simple technique whereby post-hoc metrics inform model development decisions such as early stopping, checkpointing, and broader hyperparameter choices. Our experimental analyses span real-world vision, language, tabular and graph datasets from domains like satellite imaging, language modeling, census prediction and social network analysis. On an LLM instruction tuning dataset, post-hoc selection results in > 1.5x MMLU improvement compared to naive selection. Code is available at https://github.com/rishabh-ranjan/post-hoc-reversal. △ Less

Submitted 11 April, 2024; originally announced April 2024.

Comments: 9 pages + references + appendix, 7 figures

arXiv:2404.00013 [pdf, other]

Missing Data Imputation With Granular Semantics and AI-driven Pipeline for Bankruptcy Prediction

Authors: Debarati Chakraborty, Ravi Ranjan

Abstract: This work focuses on designing a pipeline for the prediction of bankruptcy. The presence of missing values, high dimensional data, and highly class-imbalance databases are the major challenges in the said task. A new method for missing data imputation with granular semantics has been introduced here. The merits of granular computing have been explored here to define this method. The missing values… ▽ More This work focuses on designing a pipeline for the prediction of bankruptcy. The presence of missing values, high dimensional data, and highly class-imbalance databases are the major challenges in the said task. A new method for missing data imputation with granular semantics has been introduced here. The merits of granular computing have been explored here to define this method. The missing values have been predicted using the feature semantics and reliable observations in a low-dimensional space, in the granular space. The granules are formed around every missing entry, considering a few of the highly correlated features and most reliable closest observations to preserve the relevance and reliability, the context, of the database against the missing entries. An intergranular prediction is then carried out for the imputation within those contextual granules. That is, the contextual granules enable a small relevant fraction of the huge database to be used for imputation and overcome the need to access the entire database repetitively for each missing value. This method is then implemented and tested for the prediction of bankruptcy with the Polish Bankruptcy dataset. It provides an efficient solution for big and high-dimensional datasets even with large imputation rates. Then an AI-driven pipeline for bankruptcy prediction has been designed using the proposed granular semantic-based data filling method followed by the solutions to the issues like high dimensional dataset and high class-imbalance in the dataset. The rest of the pipeline consists of feature selection with the random forest for reducing dimensionality, data balancing with SMOTE, and prediction with six different popular classifiers including deep NN. All methods defined here have been experimentally verified with suitable comparative studies and proven to be effective on all the data sets captured over the five years. △ Less

Submitted 15 March, 2024; originally announced April 2024.

Comments: 15 pages

arXiv:2403.19495 [pdf, other]

CoherentGS: Sparse Novel View Synthesis with Coherent 3D Gaussians

Authors: Avinash Paliwal, Wei Ye, Jinhui Xiong, Dmytro Kotovenko, Rakesh Ranjan, Vikas Chandra, Nima Khademi Kalantari

Abstract: The field of 3D reconstruction from images has rapidly evolved in the past few years, first with the introduction of Neural Radiance Field (NeRF) and more recently with 3D Gaussian Splatting (3DGS). The latter provides a significant edge over NeRF in terms of the training and inference speed, as well as the reconstruction quality. Although 3DGS works well for dense input images, the unstructured p… ▽ More The field of 3D reconstruction from images has rapidly evolved in the past few years, first with the introduction of Neural Radiance Field (NeRF) and more recently with 3D Gaussian Splatting (3DGS). The latter provides a significant edge over NeRF in terms of the training and inference speed, as well as the reconstruction quality. Although 3DGS works well for dense input images, the unstructured point-cloud like representation quickly overfits to the more challenging setup of extremely sparse input images (e.g., 3 images), creating a representation that appears as a jumble of needles from novel views. To address this issue, we propose regularized optimization and depth-based initialization. Our key idea is to introduce a structured Gaussian representation that can be controlled in 2D image space. We then constraint the Gaussians, in particular their position, and prevent them from moving independently during optimization. Specifically, we introduce single and multiview constraints through an implicit convolutional decoder and a total variation loss, respectively. With the coherency introduced to the Gaussians, we further constrain the optimization through a flow-based loss function. To support our regularized optimization, we propose an approach to initialize the Gaussians using monocular depth estimates at each input view. We demonstrate significant improvements compared to the state-of-the-art sparse-view NeRF-based approaches on a variety of scenes. △ Less

Submitted 28 March, 2024; originally announced March 2024.

Comments: Project page: https://people.engr.tamu.edu/nimak/Papers/CoherentGS

arXiv:2403.18816 [pdf, other]

Garment3DGen: 3D Garment Stylization and Texture Generation

Authors: Nikolaos Sarafianos, Tuur Stuyck, Xiaoyu Xiang, Yilei Li, Jovan Popovic, Rakesh Ranjan

Abstract: We introduce Garment3DGen a new method to synthesize 3D garment assets from a base mesh given a single input image as guidance. Our proposed approach allows users to generate 3D textured clothes based on both real and synthetic images, such as those generated by text prompts. The generated assets can be directly draped and simulated on human bodies. We leverage the recent progress of image-to-3D d… ▽ More We introduce Garment3DGen a new method to synthesize 3D garment assets from a base mesh given a single input image as guidance. Our proposed approach allows users to generate 3D textured clothes based on both real and synthetic images, such as those generated by text prompts. The generated assets can be directly draped and simulated on human bodies. We leverage the recent progress of image-to-3D diffusion methods to generate 3D garment geometries. However, since these geometries cannot be utilized directly for downstream tasks, we propose to use them as pseudo ground-truth and set up a mesh deformation optimization procedure that deforms a base template mesh to match the generated 3D target. Carefully designed losses allow the base mesh to freely deform towards the desired target, yet preserve mesh quality and topology such that they can be simulated. Finally, we generate high-fidelity texture maps that are globally and locally consistent and faithfully capture the input guidance, allowing us to render the generated 3D assets. With Garment3DGen users can generate the simulation-ready 3D garment of their choice without the need of artist intervention. We present a plethora of quantitative and qualitative comparisons on various assets and demonstrate that Garment3DGen unlocks key applications ranging from sketch-to-simulated garments or interacting with the garments in VR. Code is publicly available. △ Less

Submitted 13 August, 2024; v1 submitted 27 March, 2024; originally announced March 2024.

Comments: Project Page and Code: https://nsarafianos.github.io/garment3dgen

arXiv:2403.18730 [pdf, other]

Towards Image Ambient Lighting Normalization

Authors: Florin-Alexandru Vasluianu, Tim Seizinger, Zongwei Wu, Rakesh Ranjan, Radu Timofte

Abstract: Lighting normalization is a crucial but underexplored restoration task with broad applications. However, existing works often simplify this task within the context of shadow removal, limiting the light sources to one and oversimplifying the scene, thus excluding complex self-shadows and restricting surface classes to smooth ones. Although promising, such simplifications hinder generalizability to… ▽ More Lighting normalization is a crucial but underexplored restoration task with broad applications. However, existing works often simplify this task within the context of shadow removal, limiting the light sources to one and oversimplifying the scene, thus excluding complex self-shadows and restricting surface classes to smooth ones. Although promising, such simplifications hinder generalizability to more realistic settings encountered in daily use. In this paper, we propose a new challenging task termed Ambient Lighting Normalization (ALN), which enables the study of interactions between shadows, unifying image restoration and shadow removal in a broader context. To address the lack of appropriate datasets for ALN, we introduce the large-scale high-resolution dataset Ambient6K, comprising samples obtained from multiple light sources and including self-shadows resulting from complex geometries, which is the first of its kind. For benchmarking, we select various mainstream methods and rigorously evaluate them on Ambient6K. Additionally, we propose IFBlend, a novel strong baseline that maximizes Image-Frequency joint entropy to selectively restore local areas under different lighting conditions, without relying on shadow localization priors. Experiments show that IFBlend achieves SOTA scores on Ambient6K and exhibits competitive performance on conventional shadow removal benchmarks compared to shadow-specific models with mask priors. The dataset, benchmark, and code are available at https://github.com/fvasluianu97/IFBlend. △ Less

Submitted 27 March, 2024; originally announced March 2024.

arXiv:2402.12889 [pdf, other]

doi 10.1109/TC.2024.3365953

BFT-DSN: A Byzantine Fault Tolerant Decentralized Storage Network

Authors: Hechuan Guo, Minghui Xu, Jiahao Zhang, Chunchi Liu, Rajiv Ranjan, Dongxiao Yu, Xiuzhen Cheng

Abstract: With the rapid development of blockchain and its applications, the amount of data stored on decentralized storage networks (DSNs) has grown exponentially. DSNs bring together affordable storage resources from around the world to provide robust, decentralized storage services for tens of thousands of decentralized applications (dApps). However, existing DSNs do not offer verifiability when implemen… ▽ More With the rapid development of blockchain and its applications, the amount of data stored on decentralized storage networks (DSNs) has grown exponentially. DSNs bring together affordable storage resources from around the world to provide robust, decentralized storage services for tens of thousands of decentralized applications (dApps). However, existing DSNs do not offer verifiability when implementing erasure coding for redundant storage, making them vulnerable to Byzantine encoders. Additionally, there is a lack of Byzantine fault-tolerant consensus for optimal resilience in DSNs. This paper introduces BFT-DSN, a Byzantine fault-tolerant decentralized storage network designed to address these challenges. BFT-DSN combines storage-weighted BFT consensus with erasure coding and incorporates homomorphic fingerprints and weighted threshold signatures for decentralized verification. The implementation of BFT-DSN demonstrates its comparable performance in terms of storage cost and latency as well as superior performance in Byzantine resilience when compared to existing industrial decentralized storage networks. △ Less

Submitted 20 February, 2024; originally announced February 2024.

Comments: 11 pages, 8 figures

arXiv:2402.12712 [pdf, other]

MVDiffusion++: A Dense High-resolution Multi-view Diffusion Model for Single or Sparse-view 3D Object Reconstruction

Authors: Shitao Tang, Jiacheng Chen, Dilin Wang, Chengzhou Tang, Fuyang Zhang, Yuchen Fan, Vikas Chandra, Yasutaka Furukawa, Rakesh Ranjan

Abstract: This paper presents a neural architecture MVDiffusion++ for 3D object reconstruction that synthesizes dense and high-resolution views of an object given one or a few images without camera poses. MVDiffusion++ achieves superior flexibility and scalability with two surprisingly simple ideas: 1) A ``pose-free architecture'' where standard self-attention among 2D latent features learns 3D consistency… ▽ More This paper presents a neural architecture MVDiffusion++ for 3D object reconstruction that synthesizes dense and high-resolution views of an object given one or a few images without camera poses. MVDiffusion++ achieves superior flexibility and scalability with two surprisingly simple ideas: 1) A ``pose-free architecture'' where standard self-attention among 2D latent features learns 3D consistency across an arbitrary number of conditional and generation views without explicitly using camera pose information; and 2) A ``view dropout strategy'' that discards a substantial number of output views during training, which reduces the training-time memory footprint and enables dense and high-resolution view synthesis at test time. We use the Objaverse for training and the Google Scanned Objects for evaluation with standard novel view synthesis and 3D reconstruction metrics, where MVDiffusion++ significantly outperforms the current state of the arts. We also demonstrate a text-to-3D application example by combining MVDiffusion++ with a text-to-image generative model. The project page is at https://mvdiffusion-plusplus.github.io. △ Less

Submitted 30 April, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

Comments: 3D generation, project page: https://mvdiffusion-plusplus.github.io/

arXiv:2402.09233 [pdf, other]

Design and Realization of a Benchmarking Testbed for Evaluating Autonomous Platooning Algorithms

Authors: Michael Shaham, Risha Ranjan, Engin Kirda, Taskin Padir

Abstract: Autonomous vehicle platoons present near- and long-term opportunities to enhance operational efficiencies and save lives. The past 30 years have seen rapid development in the autonomous driving space, enabling new technologies that will alleviate the strain placed on human drivers and reduce vehicle emissions. This paper introduces a testbed for evaluating and benchmarking platooning algorithms on… ▽ More Autonomous vehicle platoons present near- and long-term opportunities to enhance operational efficiencies and save lives. The past 30 years have seen rapid development in the autonomous driving space, enabling new technologies that will alleviate the strain placed on human drivers and reduce vehicle emissions. This paper introduces a testbed for evaluating and benchmarking platooning algorithms on 1/10th scale vehicles with onboard sensors. To demonstrate the testbed's utility, we evaluate three algorithms, linear feedback and two variations of distributed model predictive control, and compare their results on a typical platooning scenario where the lead vehicle tracks a reference trajectory that changes speed multiple times. We validate our algorithms in simulation to analyze the performance as the platoon size increases, and find that the distributed model predictive control algorithms outperform linear feedback on hardware and in simulation. △ Less

Submitted 14 February, 2024; originally announced February 2024.

Comments: To be published in International Symposium on Experimental Robotics, 2023

arXiv:2402.02634 [pdf, other]

Key-Graph Transformer for Image Restoration

Authors: Bin Ren, Yawei Li, Jingyun Liang, Rakesh Ranjan, Mengyuan Liu, Rita Cucchiara, Luc Van Gool, Nicu Sebe

Abstract: While it is crucial to capture global information for effective image restoration (IR), integrating such cues into transformer-based methods becomes computationally expensive, especially with high input resolution. Furthermore, the self-attention mechanism in transformers is prone to considering unnecessary global cues from unrelated objects or regions, introducing computational inefficiencies. In… ▽ More While it is crucial to capture global information for effective image restoration (IR), integrating such cues into transformer-based methods becomes computationally expensive, especially with high input resolution. Furthermore, the self-attention mechanism in transformers is prone to considering unnecessary global cues from unrelated objects or regions, introducing computational inefficiencies. In response to these challenges, we introduce the Key-Graph Transformer (KGT) in this paper. Specifically, KGT views patch features as graph nodes. The proposed Key-Graph Constructor efficiently forms a sparse yet representative Key-Graph by selectively connecting essential nodes instead of all the nodes. Then the proposed Key-Graph Attention is conducted under the guidance of the Key-Graph only among selected nodes with linear computational complexity within each window. Extensive experiments across 6 IR tasks confirm the proposed KGT's state-of-the-art performance, showcasing advancements both quantitatively and qualitatively. △ Less

Submitted 4 February, 2024; originally announced February 2024.

Comments: 9 pages, 6 figures

arXiv:2402.00863 [pdf, other]

Geometry Transfer for Stylizing Radiance Fields

Authors: Hyunyoung Jung, Seonghyeon Nam, Nikolaos Sarafianos, Sungjoo Yoo, Alexander Sorkine-Hornung, Rakesh Ranjan

Abstract: Shape and geometric patterns are essential in defining stylistic identity. However, current 3D style transfer methods predominantly focus on transferring colors and textures, often overlooking geometric aspects. In this paper, we introduce Geometry Transfer, a novel method that leverages geometric deformation for 3D style transfer. This technique employs depth maps to extract a style guide, subseq… ▽ More Shape and geometric patterns are essential in defining stylistic identity. However, current 3D style transfer methods predominantly focus on transferring colors and textures, often overlooking geometric aspects. In this paper, we introduce Geometry Transfer, a novel method that leverages geometric deformation for 3D style transfer. This technique employs depth maps to extract a style guide, subsequently applied to stylize the geometry of radiance fields. Moreover, we propose new techniques that utilize geometric cues from the 3D scene, thereby enhancing aesthetic expressiveness and more accurately reflecting intended styles. Our extensive experiments show that Geometry Transfer enables a broader and more expressive range of stylizations, thereby significantly expanding the scope of 3D style transfer. △ Less

Submitted 6 April, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

Comments: CVPR 2024. Project page: https://hyblue.github.io/geo-srf/

arXiv:2401.00909 [pdf, other]

Taming Mode Collapse in Score Distillation for Text-to-3D Generation

Authors: Peihao Wang, Dejia Xu, Zhiwen Fan, Dilin Wang, Sreyas Mohan, Forrest Iandola, Rakesh Ranjan, Yilei Li, Qiang Liu, Zhangyang Wang, Vikas Chandra

Abstract: Despite the remarkable performance of score distillation in text-to-3D generation, such techniques notoriously suffer from view inconsistency issues, also known as "Janus" artifact, where the generated objects fake each view with multiple front faces. Although empirically effective methods have approached this problem via score debiasing or prompt engineering, a more rigorous perspective to explai… ▽ More Despite the remarkable performance of score distillation in text-to-3D generation, such techniques notoriously suffer from view inconsistency issues, also known as "Janus" artifact, where the generated objects fake each view with multiple front faces. Although empirically effective methods have approached this problem via score debiasing or prompt engineering, a more rigorous perspective to explain and tackle this problem remains elusive. In this paper, we reveal that the existing score distillation-based text-to-3D generation frameworks degenerate to maximal likelihood seeking on each view independently and thus suffer from the mode collapse problem, manifesting as the Janus artifact in practice. To tame mode collapse, we improve score distillation by re-establishing the entropy term in the corresponding variational objective, which is applied to the distribution of rendered images. Maximizing the entropy encourages diversity among different views in generated 3D assets, thereby mitigating the Janus problem. Based on this new objective, we derive a new update rule for 3D score distillation, dubbed Entropic Score Distillation (ESD). We theoretically reveal that ESD can be simplified and implemented by just adopting the classifier-free guidance trick upon variational score distillation. Although embarrassingly straightforward, our extensive experiments successfully demonstrate that ESD can be an effective treatment for Janus artifacts in score distillation. △ Less

Submitted 29 March, 2024; v1 submitted 31 December, 2023; originally announced January 2024.

Comments: Project page: https://vita-group.github.io/3D-Mode-Collapse/

arXiv:2401.00604 [pdf, other]

SteinDreamer: Variance Reduction for Text-to-3D Score Distillation via Stein Identity

Authors: Peihao Wang, Zhiwen Fan, Dejia Xu, Dilin Wang, Sreyas Mohan, Forrest Iandola, Rakesh Ranjan, Yilei Li, Qiang Liu, Zhangyang Wang, Vikas Chandra

Abstract: Score distillation has emerged as one of the most prevalent approaches for text-to-3D asset synthesis. Essentially, score distillation updates 3D parameters by lifting and back-propagating scores averaged over different views. In this paper, we reveal that the gradient estimation in score distillation is inherent to high variance. Through the lens of variance reduction, the effectiveness of SDS an… ▽ More Score distillation has emerged as one of the most prevalent approaches for text-to-3D asset synthesis. Essentially, score distillation updates 3D parameters by lifting and back-propagating scores averaged over different views. In this paper, we reveal that the gradient estimation in score distillation is inherent to high variance. Through the lens of variance reduction, the effectiveness of SDS and VSD can be interpreted as applications of various control variates to the Monte Carlo estimator of the distilled score. Motivated by this rethinking and based on Stein's identity, we propose a more general solution to reduce variance for score distillation, termed Stein Score Distillation (SSD). SSD incorporates control variates constructed by Stein identity, allowing for arbitrary baseline functions. This enables us to include flexible guidance priors and network architectures to explicitly optimize for variance reduction. In our experiments, the overall pipeline, dubbed SteinDreamer, is implemented by instantiating the control variate with a monocular depth estimator. The results suggest that SSD can effectively reduce the distillation variance and consistently improve visual quality for both object- and scene-level generation. Moreover, we demonstrate that SteinDreamer achieves faster convergence than existing methods due to more stable gradient updates. △ Less

Submitted 29 March, 2024; v1 submitted 31 December, 2023; originally announced January 2024.

Comments: Project page: https://vita-group.github.io/SteinDreamer/

arXiv:2312.15627 [pdf]

Ultrahigh electrostrain in Pb-free piezoceramics: Effect of bending

Authors: Gobinda Das Adhikary, John Daniels, Luke Giles, Rajeev Ranjan

Abstract: Recently several reports showing ultra-high electrostrain (> 1 %) have appeared in Pb-free piezoceramics. However, there is lack of clarity on the nature of the ultrahigh strain. Here, we demonsrate that the ultrahigh strain is a consequence of bending of the disc. We show that the propensity for bending arises from the difference in the response magnitude of the grains at the positive and negativ… ▽ More Recently several reports showing ultra-high electrostrain (> 1 %) have appeared in Pb-free piezoceramics. However, there is lack of clarity on the nature of the ultrahigh strain. Here, we demonsrate that the ultrahigh strain is a consequence of bending of the disc. We show that the propensity for bending arises from the difference in the response magnitude of the grains at the positive and negative surfaces of the piezoceramic when the field is applied. △ Less

Submitted 25 December, 2023; originally announced December 2023.

Comments: 8 pages 4 figures

arXiv:2312.14239 [pdf, other]

PlatoNeRF: 3D Reconstruction in Plato's Cave via Single-View Two-Bounce Lidar

Authors: Tzofi Klinghoffer, Xiaoyu Xiang, Siddharth Somasundaram, Yuchen Fan, Christian Richardt, Ramesh Raskar, Rakesh Ranjan

Abstract: 3D reconstruction from a single-view is challenging because of the ambiguity from monocular cues and lack of information about occluded regions. Neural radiance fields (NeRF), while popular for view synthesis and 3D reconstruction, are typically reliant on multi-view images. Existing methods for single-view 3D reconstruction with NeRF rely on either data priors to hallucinate views of occluded reg… ▽ More 3D reconstruction from a single-view is challenging because of the ambiguity from monocular cues and lack of information about occluded regions. Neural radiance fields (NeRF), while popular for view synthesis and 3D reconstruction, are typically reliant on multi-view images. Existing methods for single-view 3D reconstruction with NeRF rely on either data priors to hallucinate views of occluded regions, which may not be physically accurate, or shadows observed by RGB cameras, which are difficult to detect in ambient light and low albedo backgrounds. We propose using time-of-flight data captured by a single-photon avalanche diode to overcome these limitations. Our method models two-bounce optical paths with NeRF, using lidar transient data for supervision. By leveraging the advantages of both NeRF and two-bounce light measured by lidar, we demonstrate that we can reconstruct visible and occluded geometry without data priors or reliance on controlled ambient lighting or scene albedo. In addition, we demonstrate improved generalization under practical constraints on sensor spatial- and temporal-resolution. We believe our method is a promising direction as single-photon lidars become ubiquitous on consumer devices, such as phones, tablets, and headsets. △ Less

Submitted 5 April, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

Comments: CVPR 2024. Project Page: https://platonerf.github.io/

arXiv:2312.04615 [pdf, other]

Relational Deep Learning: Graph Representation Learning on Relational Databases

Authors: Matthias Fey, Weihua Hu, Kexin Huang, Jan Eric Lenssen, Rishabh Ranjan, Joshua Robinson, Rex Ying, Jiaxuan You, Jure Leskovec

Abstract: Much of the world's most valued data is stored in relational databases and data warehouses, where the data is organized into many tables connected by primary-foreign key relations. However, building machine learning models using this data is both challenging and time consuming. The core problem is that no machine learning method is capable of learning on multiple tables interconnected by primary-f… ▽ More Much of the world's most valued data is stored in relational databases and data warehouses, where the data is organized into many tables connected by primary-foreign key relations. However, building machine learning models using this data is both challenging and time consuming. The core problem is that no machine learning method is capable of learning on multiple tables interconnected by primary-foreign key relations. Current methods can only learn from a single table, so the data must first be manually joined and aggregated into a single training table, the process known as feature engineering. Feature engineering is slow, error prone and leads to suboptimal models. Here we introduce an end-to-end deep representation learning approach to directly learn on data laid out across multiple tables. We name our approach Relational Deep Learning (RDL). The core idea is to view relational databases as a temporal, heterogeneous graph, with a node for each row in each table, and edges specified by primary-foreign key links. Message Passing Graph Neural Networks can then automatically learn across the graph to extract representations that leverage all input data, without any manual feature engineering. Relational Deep Learning leads to more accurate models that can be built much faster. To facilitate research in this area, we develop RelBench, a set of benchmark datasets and an implementation of Relational Deep Learning. The data covers a wide spectrum, from discussions on Stack Exchange to book reviews on the Amazon Product Catalog. Overall, we define a new research area that generalizes graph machine learning and broadens its applicability to a wide set of AI use cases. △ Less

Submitted 7 December, 2023; originally announced December 2023.

Comments: https://relbench.stanford.edu

arXiv:2312.03640 [pdf, other]

Training Neural Networks on RAW and HDR Images for Restoration Tasks

Authors: Lei Luo, Alexandre Chapiro, Xiaoyu Xiang, Yuchen Fan, Rakesh Ranjan, Rafal Mantiuk

Abstract: The vast majority of standard image and video content available online is represented in display-encoded color spaces, in which pixel values are conveniently scaled to a limited range (0-1) and the color distribution is approximately perceptually uniform. In contrast, both camera RAW and high dynamic range (HDR) images are often represented in linear color spaces, in which color values are linearl… ▽ More The vast majority of standard image and video content available online is represented in display-encoded color spaces, in which pixel values are conveniently scaled to a limited range (0-1) and the color distribution is approximately perceptually uniform. In contrast, both camera RAW and high dynamic range (HDR) images are often represented in linear color spaces, in which color values are linearly related to colorimetric quantities of light. While training on commonly available display-encoded images is a well-established practice, there is no consensus on how neural networks should be trained for tasks on RAW and HDR images in linear color spaces. In this work, we test several approaches on three popular image restoration applications: denoising, deblurring, and single-image super-resolution. We examine whether HDR/RAW images need to be display-encoded using popular transfer functions (PQ, PU21, mu-law), or whether it is better to train in linear color spaces, but use loss functions that correct for perceptual non-uniformity. Our results indicate that neural networks train significantly better on HDR and RAW images represented in display-encoded color spaces, which offer better perceptual uniformity than linear spaces. This small change to the training strategy can bring a very substantial gain in performance, up to 10-15 dB. △ Less

Submitted 6 December, 2023; originally announced December 2023.

arXiv:2311.11325 [pdf, other]

MoVideo: Motion-Aware Video Generation with Diffusion Models

Authors: Jingyun Liang, Yuchen Fan, Kai Zhang, Radu Timofte, Luc Van Gool, Rakesh Ranjan

Abstract: While recent years have witnessed great progress on using diffusion models for video generation, most of them are simple extensions of image generation frameworks, which fail to explicitly consider one of the key differences between videos and images, i.e., motion. In this paper, we propose a novel motion-aware video generation (MoVideo) framework that takes motion into consideration from two aspe… ▽ More While recent years have witnessed great progress on using diffusion models for video generation, most of them are simple extensions of image generation frameworks, which fail to explicitly consider one of the key differences between videos and images, i.e., motion. In this paper, we propose a novel motion-aware video generation (MoVideo) framework that takes motion into consideration from two aspects: video depth and optical flow. The former regulates motion by per-frame object distances and spatial layouts, while the later describes motion by cross-frame correspondences that help in preserving fine details and improving temporal consistency. More specifically, given a key frame that exists or generated from text prompts, we first design a diffusion model with spatio-temporal modules to generate the video depth and the corresponding optical flows. Then, the video is generated in the latent space by another spatio-temporal diffusion model under the guidance of depth, optical flow-based warped latent video and the calculated occlusion mask. Lastly, we use optical flows again to align and refine different frames for better video decoding from the latent space to the pixel space. In experiments, MoVideo achieves state-of-the-art results in both text-to-video and image-to-video generation, showing promising prompt consistency, frame consistency and visual quality. △ Less

Submitted 29 July, 2024; v1 submitted 19 November, 2023; originally announced November 2023.

Comments: Accepted by ECCV2024. Project page: https://jingyunliang.github.io/MoVideo

arXiv:2310.09653 [pdf, other]

SelfVC: Voice Conversion With Iterative Refinement using Self Transformations

Authors: Paarth Neekhara, Shehzeen Hussain, Rafael Valle, Boris Ginsburg, Rishabh Ranjan, Shlomo Dubnov, Farinaz Koushanfar, Julian McAuley

Abstract: We propose SelfVC, a training strategy to iteratively improve a voice conversion model with self-synthesized examples. Previous efforts on voice conversion focus on factorizing speech into explicitly disentangled representations that separately encode speaker characteristics and linguistic content. However, disentangling speech representations to capture such attributes using task-specific loss te… ▽ More We propose SelfVC, a training strategy to iteratively improve a voice conversion model with self-synthesized examples. Previous efforts on voice conversion focus on factorizing speech into explicitly disentangled representations that separately encode speaker characteristics and linguistic content. However, disentangling speech representations to capture such attributes using task-specific loss terms can lead to information loss. In this work, instead of explicitly disentangling attributes with loss terms, we present a framework to train a controllable voice conversion model on entangled speech representations derived from self-supervised learning (SSL) and speaker verification models. First, we develop techniques to derive prosodic information from the audio signal and SSL representations to train predictive submodules in the synthesis model. Next, we propose a training strategy to iteratively improve the synthesis model for voice conversion, by creating a challenging training objective using self-synthesized examples. We demonstrate that incorporating such self-synthesized examples during training improves the speaker similarity of generated speech as compared to a baseline voice conversion model trained solely on heuristically perturbed inputs. Our framework is trained without any text and achieves state-of-the-art results in zero-shot voice conversion on metrics evaluating naturalness, speaker similarity, and intelligibility of synthesized audio. △ Less

Submitted 3 May, 2024; v1 submitted 14 October, 2023; originally announced October 2023.

Comments: Accepted at ICML 2024

arXiv:2310.08805 [pdf, other]

doi 10.1007/978-3-031-52448-6_22

Two-Stage Deep Learning Framework for Quality Assessment of Left Atrial Late Gadolinium Enhanced MRI Images

Authors: K M Arefeen Sultan, Benjamin Orkild, Alan Morris, Eugene Kholmovski, Erik Bieging, Eugene Kwan, Ravi Ranjan, Ed DiBella, Shireen Elhabian

Abstract: Accurate assessment of left atrial fibrosis in patients with atrial fibrillation relies on high-quality 3D late gadolinium enhancement (LGE) MRI images. However, obtaining such images is challenging due to patient motion, changing breathing patterns, or sub-optimal choice of pulse sequence parameters. Automated assessment of LGE-MRI image diagnostic quality is clinically significant as it would en… ▽ More Accurate assessment of left atrial fibrosis in patients with atrial fibrillation relies on high-quality 3D late gadolinium enhancement (LGE) MRI images. However, obtaining such images is challenging due to patient motion, changing breathing patterns, or sub-optimal choice of pulse sequence parameters. Automated assessment of LGE-MRI image diagnostic quality is clinically significant as it would enhance diagnostic accuracy, improve efficiency, ensure standardization, and contributes to better patient outcomes by providing reliable and high-quality LGE-MRI scans for fibrosis quantification and treatment planning. To address this, we propose a two-stage deep-learning approach for automated LGE-MRI image diagnostic quality assessment. The method includes a left atrium detector to focus on relevant regions and a deep network to evaluate diagnostic quality. We explore two training strategies, multi-task learning, and pretraining using contrastive learning, to overcome limited annotated data in medical imaging. Contrastive Learning result shows about $4\%$, and $9\%$ improvement in F1-Score and Specificity compared to Multi-Task learning when there's limited data. △ Less

Submitted 12 October, 2023; originally announced October 2023.

Comments: Accepted to STACOM 2023. 11 pages, 3 figures

arXiv:2309.13433 [pdf]

Giant electrostriction in bulk RE (III) substituted CeO2: effect of RE 3+ and its concentration

Authors: Soumyajyoti Mondal, Pooja Punetha, Rajeev Ranjan, Pavan Nukala

Abstract: Recent discovery of giant electrostriction in rare earth (RE (III)) substituted ceria (CeO2) thin films driven by electroactive defect complexes and their coordinated elastic response, expands the material spectrum for electrostrain applications beyond the conventional piezoelectric materials. Especially Gd substituted CeO2, with Gd concentration >10% seems to be an ideal material to obtain such l… ▽ More Recent discovery of giant electrostriction in rare earth (RE (III)) substituted ceria (CeO2) thin films driven by electroactive defect complexes and their coordinated elastic response, expands the material spectrum for electrostrain applications beyond the conventional piezoelectric materials. Especially Gd substituted CeO2, with Gd concentration >10% seems to be an ideal material to obtain such large electrostrain response. However, there are not many experimental studies that systematically investigate the effect of RE (III) ion-defect interaction and RE concentration on electrostriction. Here we perform structure-property correlation studies in bulk ceramics of RE3+ substituted ceria doped with RE=Y, La and Gd at various concentrations upto a maximum of 20%, to understand the features responsible for giant electrostriction. Our results show that Y substituted ceria, with atleast 20% Y substitution, is clearly both a giant M and a Q electrostrictor at low frequencies (<20 Hz), and this correlates with the unique attractive defect-dopant interaction of Y with oxygen vacancies. La has a repulsive interaction with oxygen vacancies, and La doped ceria at all the studied compositions (upto 20%) does not show giant electrostiction. Gd has a neutral interaction, and only 20% Gd doped ceria at best falls at the border of classification between giant and non-giant electrostrictors at frequencies <0.05 Hz. Our work takes a step back from thin-films and assesses the fundamental defect features required in the design of giant electrostrictors. △ Less

Submitted 23 September, 2023; originally announced September 2023.

Comments: 11 pages, 8 figures

arXiv:2308.15266 [pdf, other]

NOVIS: A Case for End-to-End Near-Online Video Instance Segmentation

Authors: Tim Meinhardt, Matt Feiszli, Yuchen Fan, Laura Leal-Taixe, Rakesh Ranjan

Abstract: Until recently, the Video Instance Segmentation (VIS) community operated under the common belief that offline methods are generally superior to a frame by frame online processing. However, the recent success of online methods questions this belief, in particular, for challenging and long video sequences. We understand this work as a rebuttal of those recent observations and an appeal to the commun… ▽ More Until recently, the Video Instance Segmentation (VIS) community operated under the common belief that offline methods are generally superior to a frame by frame online processing. However, the recent success of online methods questions this belief, in particular, for challenging and long video sequences. We understand this work as a rebuttal of those recent observations and an appeal to the community to focus on dedicated near-online VIS approaches. To support our argument, we present a detailed analysis on different processing paradigms and the new end-to-end trainable NOVIS (Near-Online Video Instance Segmentation) method. Our transformer-based model directly predicts spatio-temporal mask volumes for clips of frames and performs instance tracking between clips via overlap embeddings. NOVIS represents the first near-online VIS approach which avoids any handcrafted tracking heuristics. We outperform all existing VIS methods by large margins and provide new state-of-the-art results on both YouTube-VIS (2019/2021) and the OVIS benchmarks. △ Less

Submitted 18 September, 2023; v1 submitted 29 August, 2023; originally announced August 2023.

arXiv:2307.06669 [pdf, other]

Uncovering the Deceptions: An Analysis on Audio Spoofing Detection and Future Prospects

Authors: Rishabh Ranjan, Mayank Vatsa, Richa Singh

Abstract: Audio has become an increasingly crucial biometric modality due to its ability to provide an intuitive way for humans to interact with machines. It is currently being used for a range of applications, including person authentication to banking to virtual assistants. Research has shown that these systems are also susceptible to spoofing and attacks. Therefore, protecting audio processing systems ag… ▽ More Audio has become an increasingly crucial biometric modality due to its ability to provide an intuitive way for humans to interact with machines. It is currently being used for a range of applications, including person authentication to banking to virtual assistants. Research has shown that these systems are also susceptible to spoofing and attacks. Therefore, protecting audio processing systems against fraudulent activities, such as identity theft, financial fraud, and spreading misinformation, is of paramount importance. This paper reviews the current state-of-the-art techniques for detecting audio spoofing and discusses the current challenges along with open research problems. The paper further highlights the importance of considering the ethical and privacy implications of audio spoofing detection systems. Lastly, the work aims to accentuate the need for building more robust and generalizable methods, the integration of automatic speaker verification and countermeasure systems, and better evaluation protocols. △ Less

Submitted 13 July, 2023; originally announced July 2023.

Comments: Accepted in IJCAI 2023

arXiv:2306.03343 [pdf]

doi 10.1063/5.0163502

Ultrahigh electrostrain > 1% in lead-free piezoceramics: A critical review

Authors: Gobinda Das Adhikary, Digivijay Narayan Singh, Getaw Abebe Tina, Gudeta Jafo Muleta, Rajeev Ranjan

Abstract: Recently, a series of reports showing ultra-high electrostrain (> 1 %) have appeared in several Pb-free piezoceramics. The ultrahigh electrostrain has been attributed exclusively to the defect dipoles created in these systems. We examine these claims based on another report arXiv:2208.07134 which demonstrated that the measured electric field driven strain increased dramatically simply by reducing… ▽ More Recently, a series of reports showing ultra-high electrostrain (> 1 %) have appeared in several Pb-free piezoceramics. The ultrahigh electrostrain has been attributed exclusively to the defect dipoles created in these systems. We examine these claims based on another report arXiv:2208.07134 which demonstrated that the measured electric field driven strain increased dramatically simply by reducing the thickness of the ceramic discs. We prepared some representative Pb-free compositions reported to exhibit ultrahigh strain and performed electrostrain measurements. We found that these compositions do not show ultrahigh electrostrain if the thickness of the discs is above 0.3 mm (the disc diameters were in the range 10- 12 mm diameter). The ultrahigh strain values were obtained when the thickness was below 0.3 mm. We compare the electrostrain obtained from specimens designed to exhibit defect dipoles with specimens that were not designed to have defect dipoles in Na0.5Bi0.5TiO3 (NBT) and K0.5Na0.5NbO3 (KNN) -based lead-free systems and could obtain much higher strain levels (4- 5 %) in the defect dipole free piezoceramics in the small thickness regime. Our results do not favor the defect dipole theory as the exclusive factor for causing ultrahigh strain in piezoceramics. A new approach is called for to understand the phenomenon of ultrahigh electrostrain caused by the thickness reduction of piezoceramic discs. △ Less

Submitted 5 June, 2023; originally announced June 2023.

Comments: 9 pages, 5 figures, lab report

arXiv:2305.07214 [pdf, other]

MMG-Ego4D: Multi-Modal Generalization in Egocentric Action Recognition

Authors: Xinyu Gong, Sreyas Mohan, Naina Dhingra, Jean-Charles Bazin, Yilei Li, Zhangyang Wang, Rakesh Ranjan

Abstract: In this paper, we study a novel problem in egocentric action recognition, which we term as "Multimodal Generalization" (MMG). MMG aims to study how systems can generalize when data from certain modalities is limited or even completely missing. We thoroughly investigate MMG in the context of standard supervised action recognition and the more challenging few-shot setting for learning new action cat… ▽ More In this paper, we study a novel problem in egocentric action recognition, which we term as "Multimodal Generalization" (MMG). MMG aims to study how systems can generalize when data from certain modalities is limited or even completely missing. We thoroughly investigate MMG in the context of standard supervised action recognition and the more challenging few-shot setting for learning new action categories. MMG consists of two novel scenarios, designed to support security, and efficiency considerations in real-world applications: (1) missing modality generalization where some modalities that were present during the train time are missing during the inference time, and (2) cross-modal zero-shot generalization, where the modalities present during the inference time and the training time are disjoint. To enable this investigation, we construct a new dataset MMG-Ego4D containing data points with video, audio, and inertial motion sensor (IMU) modalities. Our dataset is derived from Ego4D dataset, but processed and thoroughly re-annotated by human experts to facilitate research in the MMG problem. We evaluate a diverse array of models on MMG-Ego4D and propose new methods with improved generalization ability. In particular, we introduce a new fusion module with modality dropout training, contrastive-based alignment training, and a novel cross-modal prototypical loss for better few-shot performance. We hope this study will serve as a benchmark and guide future research in multimodal generalization problems. The benchmark and code will be available at https://github.com/facebookresearch/MMG_Ego4D. △ Less

Submitted 11 May, 2023; originally announced May 2023.

Comments: Accepted to CVPR 2023

arXiv:2304.10537 [pdf, other]

Learning Neural Duplex Radiance Fields for Real-Time View Synthesis

Authors: Ziyu Wan, Christian Richardt, Aljaž Božič, Chao Li, Vijay Rengarajan, Seonghyeon Nam, Xiaoyu Xiang, Tuotuo Li, Bo Zhu, Rakesh Ranjan, Jing Liao

Abstract: Neural radiance fields (NeRFs) enable novel view synthesis with unprecedented visual quality. However, to render photorealistic images, NeRFs require hundreds of deep multilayer perceptron (MLP) evaluations - for each pixel. This is prohibitively expensive and makes real-time rendering infeasible, even on powerful modern GPUs. In this paper, we propose a novel approach to distill and bake NeRFs in… ▽ More Neural radiance fields (NeRFs) enable novel view synthesis with unprecedented visual quality. However, to render photorealistic images, NeRFs require hundreds of deep multilayer perceptron (MLP) evaluations - for each pixel. This is prohibitively expensive and makes real-time rendering infeasible, even on powerful modern GPUs. In this paper, we propose a novel approach to distill and bake NeRFs into highly efficient mesh-based neural representations that are fully compatible with the massively parallel graphics rendering pipeline. We represent scenes as neural radiance features encoded on a two-layer duplex mesh, which effectively overcomes the inherent inaccuracies in 3D surface reconstruction by learning the aggregated radiance information from a reliable interval of ray-surface intersections. To exploit local geometric relationships of nearby pixels, we leverage screen-space convolutions instead of the MLPs used in NeRFs to achieve high-quality appearance. Finally, the performance of the whole framework is further boosted by a novel multi-view distillation optimization strategy. We demonstrate the effectiveness and superiority of our approach via extensive experiments on a range of standard datasets. △ Less

Submitted 20 April, 2023; originally announced April 2023.

Comments: CVPR 2023. Project page: http://raywzy.com/NDRF

arXiv:2303.16493 [pdf, other]

AnyFlow: Arbitrary Scale Optical Flow with Implicit Neural Representation

Authors: Hyunyoung Jung, Zhuo Hui, Lei Luo, Haitao Yang, Feng Liu, Sungjoo Yoo, Rakesh Ranjan, Denis Demandolx

Abstract: To apply optical flow in practice, it is often necessary to resize the input to smaller dimensions in order to reduce computational costs. However, downsizing inputs makes the estimation more challenging because objects and motion ranges become smaller. Even though recent approaches have demonstrated high-quality flow estimation, they tend to fail to accurately model small objects and precise boun… ▽ More To apply optical flow in practice, it is often necessary to resize the input to smaller dimensions in order to reduce computational costs. However, downsizing inputs makes the estimation more challenging because objects and motion ranges become smaller. Even though recent approaches have demonstrated high-quality flow estimation, they tend to fail to accurately model small objects and precise boundaries when the input resolution is lowered, restricting their applicability to high-resolution inputs. In this paper, we introduce AnyFlow, a robust network that estimates accurate flow from images of various resolutions. By representing optical flow as a continuous coordinate-based representation, AnyFlow generates outputs at arbitrary scales from low-resolution inputs, demonstrating superior performance over prior works in capturing tiny objects with detail preservation on a wide range of scenes. We establish a new state-of-the-art performance of cross-dataset generalization on the KITTI dataset, while achieving comparable accuracy on the online benchmarks to other SOTA methods. △ Less

Submitted 29 March, 2023; originally announced March 2023.

Comments: CVPR 2023 (Highlight)

arXiv:2303.03286 [pdf]

Giant electromechanical response from defective non-ferroelectric epitaxial BaTiO3 integrated on Si 100

Authors: Sandeep Vura, Shubham Kumar Parate, Subhajit Pal, Upanya Khandelwal, Rajeev Kumar Rai, Sri Harsha Molleti, Vishnu Kumar, Rama Satya Sandilya Ventrapragada, Girish Patil, Mudit Jain, Ambresh Mallya, Majid Ahmadi, Bart Kooi, Sushobhan Avasthi, Rajeev Ranjan, Srinivasan Raghavan, Saurabh Chandorkar, Pavan Nukala

Abstract: Lead free, silicon compatible materials showing large electromechanical responses comparable to, or better than conventional relaxor ferroelectrics, are desirable for various nanoelectromechanical devices and applications. Defect-engineered electrostriction has recently been gaining popularity to obtain enhanced electromechanical responses at sub 100 Hz frequencies. Here, we report record values o… ▽ More Lead free, silicon compatible materials showing large electromechanical responses comparable to, or better than conventional relaxor ferroelectrics, are desirable for various nanoelectromechanical devices and applications. Defect-engineered electrostriction has recently been gaining popularity to obtain enhanced electromechanical responses at sub 100 Hz frequencies. Here, we report record values of electrostrictive strain coefficients (M31) at frequencies as large as 5 kHz (1.04 x 10-14 m2 per V2 at 1 kHz, and 3.87 x 10-15 m2 per V2 at 5 kHz) using A-site and oxygen-deficient barium titanate thin-films, epitaxially integrated onto Si. The effect is robust and retained even after cycling the devices >5000 times. Our perovskite films are non-ferroelectric, exhibit a different symmetry compared to stoichiometric BaTiO3 and are characterized by twin boundaries and nano polar-like regions. We show that the dielectric relaxation arising from the defect-induced features correlates very well with the observed giant electrostrictive response. These films show large coefficient of thermal expansion (2.36 x 10-5/K), which along with the giant M31 implies a considerable increase in the lattice anharmonicity induced by the defects. Our work provides a crucial step forward towards formulating guidelines to engineer large electromechanical responses even at higher frequencies in lead-free thin films. △ Less

Submitted 6 March, 2023; originally announced March 2023.

Comments: 26 pages, 4 figures, 8 supplementary figures

arXiv:2303.00748 [pdf, other]

Efficient and Explicit Modelling of Image Hierarchies for Image Restoration

Authors: Yawei Li, Yuchen Fan, Xiaoyu Xiang, Denis Demandolx, Rakesh Ranjan, Radu Timofte, Luc Van Gool

Abstract: The aim of this paper is to propose a mechanism to efficiently and explicitly model image hierarchies in the global, regional, and local range for image restoration. To achieve that, we start by analyzing two important properties of natural images including cross-scale similarity and anisotropic image features. Inspired by that, we propose the anchored stripe self-attention which achieves a good b… ▽ More The aim of this paper is to propose a mechanism to efficiently and explicitly model image hierarchies in the global, regional, and local range for image restoration. To achieve that, we start by analyzing two important properties of natural images including cross-scale similarity and anisotropic image features. Inspired by that, we propose the anchored stripe self-attention which achieves a good balance between the space and time complexity of self-attention and the modelling capacity beyond the regional range. Then we propose a new network architecture dubbed GRL to explicitly model image hierarchies in the Global, Regional, and Local range via anchored stripe self-attention, window self-attention, and channel attention enhanced convolution. Finally, the proposed network is applied to 7 image restoration types, covering both real and synthetic settings. The proposed method sets the new state-of-the-art for several of those. Code will be available at https://github.com/ofsoundof/GRL-Image-Restoration.git. △ Less

Submitted 25 May, 2023; v1 submitted 1 March, 2023; originally announced March 2023.

Comments: Accepted by CVPR 2023. 12 pages, 7 figures, 11 tables

arXiv:2212.07265 [pdf, other]

Cross-Channel: Scalable Off-Chain Channels Supporting Fair and Atomic Cross-Chain Operations

Authors: Yihao Guo, Minghui Xu, Dongxiao Yu, Yong Yu, Rajiv Ranjan, Xiuzhen Cheng

Abstract: Cross-chain technology facilitates the interoperability among isolated blockchains on which users can freely communicate and transfer values. Existing cross-chain protocols suffer from the scalability problem when processing on-chain transactions. Off-chain channels, as a promising blockchain scaling technique, can enable micro-payment transactions without involving on-chain transaction settlement… ▽ More Cross-chain technology facilitates the interoperability among isolated blockchains on which users can freely communicate and transfer values. Existing cross-chain protocols suffer from the scalability problem when processing on-chain transactions. Off-chain channels, as a promising blockchain scaling technique, can enable micro-payment transactions without involving on-chain transaction settlement. However, existing channel schemes can only be applied to operations within a single blockchain, failing to support cross-chain services. Therefore in this paper, we propose Cross-Channel, the first off-chain channel to support cross-chain services. We introduce a novel hierarchical channel structure, a new hierarchical settlement protocol, and a smart general fair exchange protocol, to ensure scalability, fairness, and atomicity of cross-chain interactions. Besides, Cross-Channel provides strong security and practicality by avoiding high latency in asynchronous networks. Through a 50-instance deployment of Cross-Channel on AliCloud, we demonstrate that Cross-Channel is well-suited for processing cross-chain transactions in high-frequency and large-scale, and brings a significantly enhanced throughput with a small amount of gas and delay overhead. △ Less

Submitted 14 December, 2022; originally announced December 2022.

arXiv:2212.03961 [pdf, other]

FSID: Fully Synthetic Image Denoising via Procedural Scene Generation

Authors: Gyeongmin Choe, Beibei Du, Seonghyeon Nam, Xiaoyu Xiang, Bo Zhu, Rakesh Ranjan

Abstract: For low-level computer vision and image processing ML tasks, training on large datasets is critical for generalization. However, the standard practice of relying on real-world images primarily from the Internet comes with image quality, scalability, and privacy issues, especially in commercial contexts. To address this, we have developed a procedural synthetic data generation pipeline and dataset… ▽ More For low-level computer vision and image processing ML tasks, training on large datasets is critical for generalization. However, the standard practice of relying on real-world images primarily from the Internet comes with image quality, scalability, and privacy issues, especially in commercial contexts. To address this, we have developed a procedural synthetic data generation pipeline and dataset tailored to low-level vision tasks. Our Unreal engine-based synthetic data pipeline populates large scenes algorithmically with a combination of random 3D objects, materials, and geometric transformations. Then, we calibrate the camera noise profiles to synthesize the noisy images. From this pipeline, we generated a fully synthetic image denoising dataset (FSID) which consists of 175,000 noisy/clean image pairs. We then trained and validated a CNN-based denoising model, and demonstrated that the model trained on this synthetic data alone can achieve competitive denoising results when evaluated on real-world noisy images captured with smartphone cameras. △ Less

Submitted 7 December, 2022; originally announced December 2022.

arXiv:2212.01747 [pdf, other]

Fast Point Cloud Generation with Straight Flows

Authors: Lemeng Wu, Dilin Wang, Chengyue Gong, Xingchao Liu, Yunyang Xiong, Rakesh Ranjan, Raghuraman Krishnamoorthi, Vikas Chandra, Qiang Liu

Abstract: Diffusion models have emerged as a powerful tool for point cloud generation. A key component that drives the impressive performance for generating high-quality samples from noise is iteratively denoise for thousands of steps. While beneficial, the complexity of learning steps has limited its applications to many 3D real-world. To address this limitation, we propose Point Straight Flow (PSF), a mod… ▽ More Diffusion models have emerged as a powerful tool for point cloud generation. A key component that drives the impressive performance for generating high-quality samples from noise is iteratively denoise for thousands of steps. While beneficial, the complexity of learning steps has limited its applications to many 3D real-world. To address this limitation, we propose Point Straight Flow (PSF), a model that exhibits impressive performance using one step. Our idea is based on the reformulation of the standard diffusion model, which optimizes the curvy learning trajectory into a straight path. Further, we develop a distillation strategy to shorten the straight path into one step without a performance loss, enabling applications to 3D real-world with latency constraints. We perform evaluations on multiple 3D tasks and find that our PSF performs comparably to the standard diffusion model, outperforming other efficient 3D point cloud generation methods. On real-world applications such as point cloud completion and training-free text-guided generation in a low-latency setup, PSF performs favorably. △ Less

Submitted 4 December, 2022; originally announced December 2022.

arXiv:2211.08658 [pdf, other]

Consistent Direct Time-of-Flight Video Depth Super-Resolution

Authors: Zhanghao Sun, Wei Ye, Jinhui Xiong, Gyeongmin Choe, Jialiang Wang, Shuochen Su, Rakesh Ranjan

Abstract: Direct time-of-flight (dToF) sensors are promising for next-generation on-device 3D sensing. However, limited by manufacturing capabilities in a compact module, the dToF data has a low spatial resolution (e.g., $\sim 20\times30$ for iPhone dToF), and it requires a super-resolution step before being passed to downstream tasks. In this paper, we solve this super-resolution problem by fusing the low-… ▽ More Direct time-of-flight (dToF) sensors are promising for next-generation on-device 3D sensing. However, limited by manufacturing capabilities in a compact module, the dToF data has a low spatial resolution (e.g., $\sim 20\times30$ for iPhone dToF), and it requires a super-resolution step before being passed to downstream tasks. In this paper, we solve this super-resolution problem by fusing the low-resolution dToF data with the corresponding high-resolution RGB guidance. Unlike the conventional RGB-guided depth enhancement approaches, which perform the fusion in a per-frame manner, we propose the first multi-frame fusion scheme to mitigate the spatial ambiguity resulting from the low-resolution dToF imaging. In addition, dToF sensors provide unique depth histogram information for each local patch, and we incorporate this dToF-specific feature in our network design to further alleviate spatial ambiguity. To evaluate our models on complex dynamic indoor environments and to provide a large-scale dToF sensor dataset, we introduce DyDToF, the first synthetic RGB-dToF video dataset that features dynamic objects and a realistic dToF simulator following the physical imaging process. We believe the methods and dataset are beneficial to a broad community as dToF depth sensing is becoming mainstream on mobile devices. Our code and data are publicly available: https://github.com/facebookresearch/DVSR/ △ Less

Submitted 3 May, 2023; v1 submitted 15 November, 2022; originally announced November 2022.

arXiv:2210.13994 [pdf, other]

Minutiae-Guided Fingerprint Embeddings via Vision Transformers

Authors: Steven A. Grosz, Joshua J. Engelsma, Rajeev Ranjan, Naveen Ramakrishnan, Manoj Aggarwal, Gerard G. Medioni, Anil K. Jain

Abstract: Minutiae matching has long dominated the field of fingerprint recognition. However, deep networks can be used to extract fixed-length embeddings from fingerprints. To date, the few studies that have explored the use of CNN architectures to extract such embeddings have shown extreme promise. Inspired by these early works, we propose the first use of a Vision Transformer (ViT) to learn a discriminat… ▽ More Minutiae matching has long dominated the field of fingerprint recognition. However, deep networks can be used to extract fixed-length embeddings from fingerprints. To date, the few studies that have explored the use of CNN architectures to extract such embeddings have shown extreme promise. Inspired by these early works, we propose the first use of a Vision Transformer (ViT) to learn a discriminative fixed-length fingerprint embedding. We further demonstrate that by guiding the ViT to focus in on local, minutiae related features, we can boost the recognition performance. Finally, we show that by fusing embeddings learned by CNNs and ViTs we can reach near parity with a commercial state-of-the-art (SOTA) matcher. In particular, we obtain a TAR=94.23% @ FAR=0.1% on the NIST SD 302 public-domain dataset, compared to a SOTA commercial matcher which obtains TAR=96.71% @ FAR=0.1%. Additionally, our fixed-length embeddings can be matched orders of magnitude faster than the commercial system (2.5 million matches/second compared to 50K matches/second). We make our code and models publicly available to encourage further research on this topic: https://github.com/tba. △ Less

Submitted 25 October, 2022; v1 submitted 25 October, 2022; originally announced October 2022.

Showing 1–50 of 168 results for author: Ranjan, R