Is One GPU Enough? Pushing Image Generation at Higher-Resolutions with Foundation Models

Authors: Athanasios Tragakis, Marco Aversa, Chaitanya Kaul, Roderick Murray-Smith, Daniele Faccio

Abstract: In this work, we introduce Pixelsmith, a zero-shot text-to-image generative framework to sample images at higher resolutions with a single GPU. We are the first to show that it is possible to scale the output of a pre-trained diffusion model by a factor of 1000, opening the road for gigapixel image generation at no additional cost. Our cascading method uses the image generated at the lowest resolution as a baseline to sample at higher resolutions. For the guidance, we introduce the Slider, a tunable mechanism that fuses the overall structure contained in the first-generated image with enhanced fine details. At each inference step, we denoise patches rather than the entire latent space, minimizing memory demands such that a single GPU can handle the process, regardless of the image's resolution. Our experimental results show that Pixelsmith not only achieves higher quality and diversity compared to existing techniques, but also reduces sampling time and artifacts. The code for our work is available at https://github.com/Thanos-DB/Pixelsmith.

Submitted 12 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

arXiv:2403.00396 [pdf, other]

doi 10.1109/ISBI56570.2024.10635344

GLFNET: Global-Local (frequency) Filter Networks for efficient medical image segmentation

Authors: Athanasios Tragakis, Qianying Liu, Chaitanya Kaul, Swalpa Kumar Roy, Hang Dai, Fani Deligianni, Roderick Murray-Smith, Daniele Faccio

Abstract: We propose a novel transformer-style architecture called Global-Local Filter Network (GLFNet) for medical image segmentation and demonstrate its state-of-the-art performance. We replace the self-attention mechanism with a combination of global-local filter blocks to optimize model efficiency. The global filters extract features from the whole feature map, whereas the local filters are adaptively created as 4x4 patches of the same feature map and add restricted scale information. In particular, the feature extraction takes place in the frequency domain rather than the commonly used spatial (image) domain to facilitate faster computations. The fusion of information from both spatial and frequency spaces creates an efficient model with regard to complexity, required data and performance. We test GLFNet on three benchmark datasets, achieving state-of-the-art performance on all of them while being almost twice as efficient in terms of GFLOPs.

Submitted 1 March, 2024; originally announced March 2024.

Journal ref: 2024 IEEE International Symposium on Biomedical Imaging (ISBI)

arXiv:2310.20168 [pdf, other]

Understanding and Visualizing Droplet Distributions in Simulations of Shallow Clouds

Authors: Justus C. Will, Andrea M. Jenney, Kara D. Lamb, Michael S. Pritchard, Colleen Kaul, Po-Lun Ma, Kyle Pressel, Jacob Shpund, Marcus van Lier-Walqui, Stephan Mandt

Abstract: Thorough analysis of local droplet-level interactions is crucial to better understand the microphysical processes in clouds and their effect on the global climate. High-accuracy simulations of relevant droplet size distributions from Large Eddy Simulations (LES) of bin microphysics challenge current analysis techniques due to their high dimensionality involving three spatial dimensions, time, and a continuous range of droplet sizes. Utilizing the compact latent representations from Variational Autoencoders (VAEs), we produce novel and intuitive visualizations for the organization of droplet sizes and their evolution over time beyond what is possible with clustering techniques. This greatly improves interpretation and allows us to examine aerosol-cloud interactions by contrasting simulations with different aerosol concentrations. We find that the evolution of the droplet spectrum is similar across aerosol levels but occurs at different paces. This similarity suggests that precipitation initiation processes are alike despite variations in onset times.

Submitted 31 October, 2023; originally announced October 2023.

Comments: 4 pages, 3 figures, accepted at NeurIPS 2023 (Machine Learning and the Physical Sciences Workshop)

arXiv:2309.07096 [pdf]

doi 10.1016/j.neuroimage.2024.120600

Computational limits to the legibility of the imaged human brain

Authors: James K Ruffle, Robert J Gray, Samia Mohinta, Guilherme Pombo, Chaitanya Kaul, Harpreet Hyare, Geraint Rees, Parashkev Nachev

Abstract: Our knowledge of the organisation of the human brain at the population-level is yet to translate into power to predict functional differences at the individual-level, limiting clinical applications, and casting doubt on the generalisability of inferred mechanisms. It remains unknown whether the difficulty arises from the absence of individuating biological patterns within the brain, or from limited power to access them with the models and compute at our disposal. Here we comprehensively investigate the resolvability of such patterns with data and compute at unprecedented scale. Across 23 810 unique participants from UK Biobank, we systematically evaluate the predictability of 25 individual biological characteristics, from all available combinations of structural and functional neuroimaging data. Over 4526 GPU hours of computation, we train, optimize, and evaluate out-of-sample 700 individual predictive models, including fully-connected feed-forward neural networks of demographic, psychological, serological, chronic disease, and functional connectivity characteristics, and both uni- and multi-modal 3D convolutional neural network models of macro- and micro-structural brain imaging. We find a marked discrepancy between the high predictability of sex (balanced accuracy 99.7%), age (mean absolute error 2.048 years, $R^2$ 0.859), and weight (mean absolute error 2.609 kg, $R^2$ 0.625), for which we set new state-of-the-art performance, and the surprisingly low predictability of other characteristics. Neither structural nor functional imaging predicted psychology better than the coincidence of chronic disease (p<0.05). Serology predicted chronic disease (p<0.05) and was best predicted by it (p<0.001), followed by structural neuroimaging (p<0.05). Our findings suggest either more informative imaging or more powerful models are needed to decipher individual level characteristics from the human brain.

Submitted 2 April, 2024; v1 submitted 23 August, 2023; originally announced September 2023.

Comments: 38 pages, 6 figures, 1 table, 2 supplementary figures, 1 supplementary table

arXiv:2302.14625 [pdf, other]

mmSense: Detecting Concealed Weapons with a Miniature Radar Sensor

Authors: Kevin Mitchell, Khaled Kassem, Chaitanya Kaul, Valentin Kapitany, Philip Binner, Andrew Ramsay, Roderick Murray-Smith, Daniele Faccio

Abstract: For widespread adoption, public security and surveillance systems must be accurate, portable, compact, and real-time, without impeding the privacy of the individuals being observed. Current systems broadly fall into two categories -- image-based, which are accurate but lack privacy, and RF signal-based, which preserve privacy but lack portability, compactness and accuracy. Our paper proposes mmSense, an end-to-end portable miniaturised real-time system that can accurately detect the presence of concealed metallic objects on persons in a discreet, privacy-preserving modality. mmSense features millimeter wave radar technology, provided by Google's Soli sensor for its data acquisition, and TransDope, our real-time neural network, capable of processing a single radar data frame in 19 ms. mmSense achieves high recognition rates on a diverse set of challenging scenes while running on standard laptop hardware, demonstrating a significant advancement towards creating portable, cost-effective real-time radar based surveillance systems.

Submitted 28 February, 2023; originally announced February 2023.

Comments: Accepted by ICASSP 2023

arXiv:2302.14566 [pdf, other]

Continuous interaction with a smart speaker via low-dimensional embeddings of dynamic hand pose

Authors: Songpei Xu, Chaitanya Kaul, Xuri Ge, Roderick Murray-Smith

Abstract: This paper presents a new continuous interaction strategy with visual feedback of hand pose and mid-air gesture recognition and control for a smart music speaker, which utilizes only 2 video frames to recognize gestures. Frame-based hand pose features from MediaPipe Hands, containing 21 landmarks, are embedded into a two-dimensional pose space by an autoencoder. The corresponding space for interaction with the music content is created by embedding high-dimensional music track profiles to a compatible two-dimensional embedding. A PointNet-based model is then applied to classify gestures which are used to control the device interaction or explore music spaces. By jointly optimising the autoencoder with the classifier, we manage to learn a more useful embedding space for discriminating gestures. We demonstrate the functionality of the system with experienced users selecting different musical moods by varying their hand pose.

Submitted 28 February, 2023; originally announced February 2023.

Comments: Accepted at ICASSP 2023

arXiv:2210.08066 [pdf, other]

doi 10.1109/ICASSP49357.2023.10096379

Optimizing Vision Transformers for Medical Image Segmentation

Authors: Qianying Liu, Chaitanya Kaul, Jun Wang, Christos Anagnostopoulos, Roderick Murray-Smith, Fani Deligianni

Abstract: For medical image semantic segmentation (MISS), Vision Transformers have emerged as strong alternatives to convolutional neural networks thanks to their inherent ability to capture long-range correlations. However, existing research uses off-the-shelf vision Transformer blocks based on linear projections and feature processing which lack spatial and local context to refine organ boundaries. Furthermore, Transformers do not generalize well on small medical imaging datasets and rely on large-scale pre-training due to limited inductive biases. To address these problems, we demonstrate the design of a compact and accurate Transformer network for MISS, CS-Unet, which introduces convolutions in a multi-stage design for hierarchically enhancing spatial and local modeling ability of Transformers. This is mainly achieved by our well-designed Convolutional Swin Transformer (CST) block which merges convolutions with Multi-Head Self-Attention and Feed-Forward Networks for providing inherent localized spatial context and inductive biases. Experiments demonstrate CS-Unet without pre-training outperforms other counterparts by large margins on multi-organ and cardiac datasets with fewer parameters and achieves state-of-the-art performance. Our code is available at GitHub.

Submitted 26 October, 2022; v1 submitted 14 October, 2022; originally announced October 2022.

arXiv:2206.00566 [pdf]

The Fully Convolutional Transformer for Medical Image Segmentation

Authors: Athanasios Tragakis, Chaitanya Kaul, Roderick Murray-Smith, Dirk Husmeier

Abstract: We propose a novel transformer model, capable of segmenting medical images of varying modalities. Challenges posed by the fine-grained nature of medical image analysis mean that the adaptation of the transformer for their analysis is still at nascent stages. The overwhelming success of the UNet lay in its ability to appreciate the fine-grained nature of the segmentation task, an ability which existing transformer-based models do not currently possess. To address this shortcoming, we propose The Fully Convolutional Transformer (FCT), which builds on the proven ability of Convolutional Neural Networks to learn effective image representations, and combines them with the ability of Transformers to effectively capture long-term dependencies in its inputs. The FCT is the first fully convolutional Transformer model in medical imaging literature. It processes its input in two stages, where first, it learns to extract long range semantic dependencies from the input image, and then learns to capture hierarchical global attributes from the features. FCT is compact, accurate and robust. Our results show that it outperforms all existing transformer architectures by large margins across multiple medical image segmentation datasets of varying data modalities without the need for any pre-training. FCT outperforms its immediate competitor on the ACDC dataset by 1.3%, on the Synapse dataset by 4.4%, on the Spleen dataset by 1.2% and on the ISIC 2017 dataset by 1.1% on the Dice metric, with up to five times fewer parameters. Our code, environments and models will be available via GitHub.

Submitted 29 January, 2023; v1 submitted 1 June, 2022; originally announced June 2022.

Journal ref: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023, pp. 3660-3669

arXiv:2111.13023 [pdf, other]

Rotation Equivariant 3D Hand Mesh Generation from a Single RGB Image

Authors: Joshua Mitton, Chaitanya Kaul, Roderick Murray-Smith

Abstract: We develop a rotation equivariant model for generating 3D hand meshes from 2D RGB images. This guarantees that as the input image of a hand is rotated the generated mesh undergoes a corresponding rotation. Furthermore, this removes undesirable deformations in the meshes often generated by methods without rotation equivariance. By building a rotation equivariant model, through considering symmetries in the problem, we reduce the need for training on very large datasets to achieve good mesh reconstruction. The encoder takes images defined on $\mathbb{Z}^{2}$ and maps these to latent functions defined on the group $C_{8}$. We introduce a novel vector mapping function to map the function defined on $C_{8}$ to a latent point cloud space defined on the group $\mathrm{SO}(2)$. Further, we introduce a 3D projection function that learns a 3D function from the $\mathrm{SO}(2)$ latent space. Finally, we use an $\mathrm{SO}(3)$ equivariant decoder to ensure rotation equivariance. Our rotation equivariant model outperforms state-of-the-art methods on a real-world dataset and we demonstrate that it accurately captures the shape and pose in the generated meshes under rotation of the input hand.

Submitted 25 November, 2021; originally announced November 2021.

arXiv:2111.10866 [pdf, other]

CpT: Convolutional Point Transformer for 3D Point Cloud Processing

Authors: Chaitanya Kaul, Joshua Mitton, Hang Dai, Roderick Murray-Smith

Abstract: We present CpT: Convolutional point Transformer - a novel deep learning architecture for dealing with the unstructured nature of 3D point cloud data. CpT is an improvement over existing attention-based Convolutional Neural Networks as well as previous 3D point cloud processing transformers. It achieves this feat due to its effectiveness in creating a novel and robust attention-based point set embedding through a convolutional projection layer crafted for processing dynamically local point set neighbourhoods. The resultant point set embedding is robust to the permutations of the input points. Our novel CpT block builds over local neighbourhoods of points obtained via a dynamic graph computation at each layer of the network's structure. It is fully differentiable and can be stacked just like convolutional layers to learn global properties of the points. We evaluate our model on standard benchmark datasets such as ModelNet40, ShapeNet Part Segmentation, and the S3DIS 3D indoor scene semantic segmentation dataset to show that our model can serve as an effective backbone for various point cloud processing tasks when compared to the existing state-of-the-art approaches.

Submitted 21 November, 2021; originally announced November 2021.

arXiv:2107.01614 [pdf, other]

Survey: Leakage and Privacy at Inference Time

Authors: Marija Jegorova, Chaitanya Kaul, Charlie Mayor, Alison Q. O'Neil, Alexander Weir, Roderick Murray-Smith, Sotirios A. Tsaftaris

Abstract: Leakage of data from publicly available Machine Learning (ML) models is an area of growing significance as commercial and government applications of ML can draw on multiple sources of data, potentially including users' and clients' sensitive data. We provide a comprehensive survey of contemporary advances on several fronts, covering involuntary data leakage which is natural to ML models, potential malevolent leakage which is caused by privacy attacks, and currently available defence mechanisms. We focus on inference-time leakage, as the most likely scenario for publicly available models. We first discuss what leakage is in the context of different data, tasks, and model architectures. We then propose a taxonomy across involuntary and malevolent leakage, available defences, followed by the currently available assessment metrics and applications. We conclude with outstanding challenges and open questions, outlining some promising directions for future research.

Submitted 9 September, 2022; v1 submitted 4 July, 2021; originally announced July 2021.

arXiv:2104.03427 [pdf, other]

FatNet: A Feature-attentive Network for 3D Point Cloud Processing

Authors: Chaitanya Kaul, Nick Pears, Suresh Manandhar

Abstract: The application of deep learning to 3D point clouds is challenging due to their lack of order. Inspired by the point embeddings of PointNet and the edge embeddings of DGCNNs, we propose three improvements to the task of point cloud analysis. First, we introduce a novel feature-attentive neural network layer, a FAT layer, that combines both global point-based features and local edge-based features in order to generate better embeddings. Second, we find that applying the same attention mechanism across two different forms of feature map aggregation, max pooling and average pooling, gives better performance than either alone. Third, we observe that residual feature reuse in this setting propagates information more effectively between the layers, and makes the network easier to train. Our architecture achieves state-of-the-art results on the task of point cloud classification, as demonstrated on the ModelNet40 dataset, and an extremely competitive performance on the ShapeNet part segmentation challenge.

Submitted 7 April, 2021; originally announced April 2021.

Comments: Published at ICPR 2020 (Oral). arXiv admin note: substantial text overlap with arXiv:1905.07650

arXiv:1912.02079 [pdf, other]

FocusNet++: Attentive Aggregated Transformations for Efficient and Accurate Medical Image Segmentation

Authors: Chaitanya Kaul, Nick Pears, Hang Dai, Roderick Murray-Smith, Suresh Manandhar

Abstract: We propose a new residual block for convolutional neural networks and demonstrate its state-of-the-art performance in medical image segmentation. We combine attention mechanisms with group convolutions to create our group attention mechanism, which forms the fundamental building block of our network, FocusNet++. We employ a hybrid loss based on balanced cross entropy, Tversky loss and the adaptive logarithmic loss to enhance the performance along with fast convergence. Our results show that FocusNet++ achieves state-of-the-art results across various benchmark metrics for the ISIC 2018 melanoma segmentation and the cell nuclei segmentation datasets with fewer parameters and FLOPs.

Submitted 7 April, 2021; v1 submitted 4 December, 2019; originally announced December 2019.

Comments: Published at ISBI 2021

arXiv:1910.09717 [pdf, other]

Penalizing small errors using an Adaptive Logarithmic Loss

Authors: Chaitanya Kaul, Nick Pears, Hang Dai, Roderick Murray-Smith, Suresh Manandhar

Abstract: Loss functions are error metrics that quantify the difference between a prediction and its corresponding ground truth. Fundamentally, they define a functional landscape for traversal by gradient descent. Although numerous loss functions have been proposed to date in order to handle various machine learning problems, little attention has been given to enhancing these functions to better traverse the loss landscape. In this paper, we simultaneously and significantly mitigate two prominent problems in medical image segmentation, namely: i) class imbalance between foreground and background pixels and ii) poor loss function convergence. To this end, we propose an adaptive logarithmic loss function. We compare this loss function with the existing state-of-the-art on the ISIC 2018 dataset, the nuclei segmentation dataset as well as the DRIVE retinal vessel segmentation dataset. We measure the performance of our methodology on benchmark metrics and demonstrate state-of-the-art performance. More generally, we show that our system can be used as a framework for better training of deep neural networks.

Submitted 7 April, 2021; v1 submitted 21 October, 2019; originally announced October 2019.

Comments: Published at AIHA 2020 (ICPR 2020 Workshop)

arXiv:1905.07650 [pdf, other]

SAWNet: A Spatially Aware Deep Neural Network for 3D Point Cloud Processing

Authors: Chaitanya Kaul, Nick Pears, Suresh Manandhar

Abstract: Deep neural networks have established themselves as the state-of-the-art methodology in almost all computer vision tasks to date. But their application to processing data lying on non-Euclidean domains is still a very active area of research. One such area is the analysis of point cloud data which poses a challenge due to its lack of order. Many recent techniques have been proposed, spearheaded by the PointNet architecture. These techniques use either global or local information from the point clouds to extract a latent representation for the points, which is then used for the task at hand (classification/segmentation). In our work, we introduce a neural network layer that combines both global and local information to produce better embeddings of these points. We enhance our architecture with residual connections, to pass information between the layers, which also makes the network easier to train. We achieve state-of-the-art results on the ModelNet40 dataset with our architecture, and our results are also highly competitive with the state-of-the-art on the ShapeNet part segmentation dataset and the indoor scene segmentation dataset. We plan to open source our pre-trained models on GitHub to encourage the research community to test our networks on their data, or simply use them for benchmarking purposes.

Submitted 18 May, 2019; originally announced May 2019.

arXiv:1902.03091 [pdf, other]

FocusNet: An attention-based Fully Convolutional Network for Medical Image Segmentation

Authors: Chaitanya Kaul, Suresh Manandhar, Nick Pears

Abstract: We propose a novel technique to incorporate attention within convolutional neural networks using feature maps generated by a separate convolutional autoencoder. Our attention architecture is well suited for incorporation with deep convolutional networks. We evaluate our model on benchmark segmentation datasets in skin cancer segmentation and lung lesion segmentation. Results show highly competitive performance when compared with U-Net and its residual variant.

Submitted 8 February, 2019; originally announced February 2019.

Showing 1–16 of 16 results for author: Kaul, C