-
RG-CAT: Detection Pipeline and Catalogue of Radio Galaxies in the EMU Pilot Survey
Authors:
Nikhel Gupta,
Ray P. Norris,
Zeeshan Hayder,
Minh Huynh,
Lars Petersson,
X. Rosalind Wang,
Andrew M. Hopkins,
Heinz Andernach,
Yjan Gordon,
Simone Riggi,
Miranda Yew,
Evan J. Crawford,
Bärbel Koribalski,
Miroslav D. Filipović,
Anna D. Kapinśka,
Stanislav Shabala,
Tessa Vernstrom,
Joshua R. Marvil
Abstract:
We present source detection and catalogue construction pipelines to build the first catalogue of radio galaxies from the 270 $\rm deg^2$ pilot survey of the Evolutionary Map of the Universe (EMU-PS) conducted with the Australian Square Kilometre Array Pathfinder (ASKAP) telescope. The detection pipeline uses Gal-DINO computer-vision networks (Gupta et al., 2024) to predict the categories of radio…
▽ More
We present source detection and catalogue construction pipelines to build the first catalogue of radio galaxies from the 270 $\rm deg^2$ pilot survey of the Evolutionary Map of the Universe (EMU-PS) conducted with the Australian Square Kilometre Array Pathfinder (ASKAP) telescope. The detection pipeline uses Gal-DINO computer-vision networks (Gupta et al., 2024) to predict the categories of radio morphology and bounding boxes for radio sources, as well as their potential infrared host positions. The Gal-DINO network is trained and evaluated on approximately 5,000 visually inspected radio galaxies and their infrared hosts, encompassing both compact and extended radio morphologies. We find that the Intersection over Union (IoU) for the predicted and ground truth bounding boxes is larger than 0.5 for 99% of the radio sources, and 98% of predicted host positions are within $3^{\prime \prime}$ of the ground truth infrared host in the evaluation set. The catalogue construction pipeline uses the predictions of the trained network on the radio and infrared image cutouts based on the catalogue of radio components identified using the Selavy source finder algorithm. Confidence scores of the predictions are then used to prioritize Selavy components with higher scores and incorporate them first into the catalogue. This results in identifications for a total of 211,625 radio sources, with 201,211 classified as compact and unresolved. The remaining 10,414 are categorized as extended radio morphologies, including 582 FR-I, 5,602 FR-II, 1,494 FR-x (uncertain whether FR-I or FR-II), 2,375 R (single-peak resolved) radio galaxies, and 361 with peculiar and other rare morphologies. We cross-match the radio sources in the catalogue with the infrared and optical catalogues, finding infrared cross-matches for 73% and photometric redshifts for 36% of the radio galaxies.
△ Less
Submitted 21 March, 2024;
originally announced March 2024.
-
On the Societal Impact of Open Foundation Models
Authors:
Sayash Kapoor,
Rishi Bommasani,
Kevin Klyman,
Shayne Longpre,
Ashwin Ramaswami,
Peter Cihon,
Aspen Hopkins,
Kevin Bankston,
Stella Biderman,
Miranda Bogen,
Rumman Chowdhury,
Alex Engler,
Peter Henderson,
Yacine Jernite,
Seth Lazar,
Stefano Maffulli,
Alondra Nelson,
Joelle Pineau,
Aviya Skowron,
Dawn Song,
Victor Storchan,
Daniel Zhang,
Daniel E. Ho,
Percy Liang,
Arvind Narayanan
Abstract:
Foundation models are powerful technologies: how they are released publicly directly shapes their societal impact. In this position paper, we focus on open foundation models, defined here as those with broadly available model weights (e.g. Llama 2, Stable Diffusion XL). We identify five distinctive properties (e.g. greater customizability, poor monitoring) of open foundation models that lead to bo…
▽ More
Foundation models are powerful technologies: how they are released publicly directly shapes their societal impact. In this position paper, we focus on open foundation models, defined here as those with broadly available model weights (e.g. Llama 2, Stable Diffusion XL). We identify five distinctive properties (e.g. greater customizability, poor monitoring) of open foundation models that lead to both their benefits and risks. Open foundation models present significant benefits, with some caveats, that span innovation, competition, the distribution of decision-making power, and transparency. To understand their risks of misuse, we design a risk assessment framework for analyzing their marginal risk. Across several misuse vectors (e.g. cyberattacks, bioweapons), we find that current research is insufficient to effectively characterize the marginal risk of open foundation models relative to pre-existing technologies. The framework helps explain why the marginal risk is low in some cases, clarifies disagreements about misuse risks by revealing that past work has focused on different subsets of the framework with different assumptions, and articulates a way forward for more constructive debate. Overall, our work helps support a more grounded assessment of the societal impact of open foundation models by outlining what research is needed to empirically validate their theoretical benefits and risks.
△ Less
Submitted 27 February, 2024;
originally announced March 2024.
-
Classification of compact radio sources in the Galactic plane with supervised machine learning
Authors:
S. Riggi,
G. Umana,
C. Trigilio,
C. Bordiu,
F. Bufano,
A. Ingallinera,
F. Cavallaro,
Y. Gordon,
R. P. Norris,
G. Gürkan,
P. Leto,
C. Buemi,
S. Loru,
A. M. Hopkins,
M. D. Filipović,
T. Cecconello
Abstract:
Generation of science-ready data from processed data products is one of the major challenges in next-generation radio continuum surveys with the Square Kilometre Array (SKA) and its precursors, due to the expected data volume and the need to achieve a high degree of automated processing. Source extraction, characterization, and classification are the major stages involved in this process. In this…
▽ More
Generation of science-ready data from processed data products is one of the major challenges in next-generation radio continuum surveys with the Square Kilometre Array (SKA) and its precursors, due to the expected data volume and the need to achieve a high degree of automated processing. Source extraction, characterization, and classification are the major stages involved in this process. In this work we focus on the classification of compact radio sources in the Galactic plane using both radio and infrared images as inputs. To this aim, we produced a curated dataset of ~20,000 images of compact sources of different astronomical classes, obtained from past radio and infrared surveys, and novel radio data from pilot surveys carried out with the Australian SKA Pathfinder (ASKAP). Radio spectral index information was also obtained for a subset of the data. We then trained two different classifiers on the produced dataset. The first model uses gradient-boosted decision trees and is trained on a set of pre-computed features derived from the data, which include radio-infrared colour indices and the radio spectral index. The second model is trained directly on multi-channel images, employing convolutional neural networks. Using a completely supervised procedure, we obtained a high classification accuracy (F1-score>90%) for separating Galactic objects from the extragalactic background. Individual class discrimination performances, ranging from 60% to 75%, increased by 10% when adding far-infrared and spectral index information, with extragalactic objects, PNe and HII regions identified with higher accuracies. The implemented tools and trained models were publicly released, and made available to the radioastronomical community for future application on new radio data.
△ Less
Submitted 23 February, 2024;
originally announced February 2024.
-
RADiff: Controllable Diffusion Models for Radio Astronomical Maps Generation
Authors:
Renato Sortino,
Thomas Cecconello,
Andrea DeMarco,
Giuseppe Fiameni,
Andrea Pilzer,
Andrew M. Hopkins,
Daniel Magro,
Simone Riggi,
Eva Sciacca,
Adriano Ingallinera,
Cristobal Bordiu,
Filomena Bufano,
Concetto Spampinato
Abstract:
Along with the nearing completion of the Square Kilometre Array (SKA), comes an increasing demand for accurate and reliable automated solutions to extract valuable information from the vast amount of data it will allow acquiring. Automated source finding is a particularly important task in this context, as it enables the detection and classification of astronomical objects. Deep-learning-based obj…
▽ More
Along with the nearing completion of the Square Kilometre Array (SKA), comes an increasing demand for accurate and reliable automated solutions to extract valuable information from the vast amount of data it will allow acquiring. Automated source finding is a particularly important task in this context, as it enables the detection and classification of astronomical objects. Deep-learning-based object detection and semantic segmentation models have proven to be suitable for this purpose. However, training such deep networks requires a high volume of labeled data, which is not trivial to obtain in the context of radio astronomy. Since data needs to be manually labeled by experts, this process is not scalable to large dataset sizes, limiting the possibilities of leveraging deep networks to address several tasks. In this work, we propose RADiff, a generative approach based on conditional diffusion models trained over an annotated radio dataset to generate synthetic images, containing radio sources of different morphologies, to augment existing datasets and reduce the problems caused by class imbalances. We also show that it is possible to generate fully-synthetic image-annotation pairs to automatically augment any annotated dataset. We evaluate the effectiveness of this approach by training a semantic segmentation model on a real dataset augmented in two ways: 1) using synthetic images obtained from real masks, and 2) generating images from synthetic semantic masks. We show an improvement in performance when applying augmentation, gaining up to 18% in performance when using real masks and 4% when augmenting with synthetic masks. Finally, we employ this model to generate large-scale radio maps with the objective of simulating Data Challenges.
△ Less
Submitted 5 July, 2023;
originally announced July 2023.
-
Radio astronomical images object detection and segmentation: A benchmark on deep learning methods
Authors:
Renato Sortino,
Daniel Magro,
Giuseppe Fiameni,
Eva Sciacca,
Simone Riggi,
Andrea DeMarco,
Concetto Spampinato,
Andrew M. Hopkins,
Filomena Bufano,
Francesco Schillirò,
Cristobal Bordiu,
Carmelo Pino
Abstract:
In recent years, deep learning has been successfully applied in various scientific domains. Following these promising results and performances, it has recently also started being evaluated in the domain of radio astronomy. In particular, since radio astronomy is entering the Big Data era, with the advent of the largest telescope in the world - the Square Kilometre Array (SKA), the task of automati…
▽ More
In recent years, deep learning has been successfully applied in various scientific domains. Following these promising results and performances, it has recently also started being evaluated in the domain of radio astronomy. In particular, since radio astronomy is entering the Big Data era, with the advent of the largest telescope in the world - the Square Kilometre Array (SKA), the task of automatic object detection and instance segmentation is crucial for source finding and analysis. In this work, we explore the performance of the most affirmed deep learning approaches, applied to astronomical images obtained by radio interferometric instrumentation, to solve the task of automatic source detection. This is carried out by applying models designed to accomplish two different kinds of tasks: object detection and semantic segmentation. The goal is to provide an overview of existing techniques, in terms of prediction performance and computational efficiency, to scientists in the astrophysics community who would like to employ machine learning in their research.
△ Less
Submitted 25 May, 2023; v1 submitted 8 March, 2023;
originally announced March 2023.
-
Designing Data: Proactive Data Collection and Iteration for Machine Learning
Authors:
Aspen Hopkins,
Fred Hohman,
Luca Zappella,
Xavier Suau Cuadros,
Dominik Moritz
Abstract:
Lack of diversity in data collection has caused significant failures in machine learning (ML) applications. While ML developers perform post-collection interventions, these are time intensive and rarely comprehensive. Thus, new methods to track & manage data collection, iteration, and model training are necessary for evaluating whether datasets reflect real world variability. We present designing…
▽ More
Lack of diversity in data collection has caused significant failures in machine learning (ML) applications. While ML developers perform post-collection interventions, these are time intensive and rarely comprehensive. Thus, new methods to track & manage data collection, iteration, and model training are necessary for evaluating whether datasets reflect real world variability. We present designing data, an iterative approach to data collection connecting HCI concepts with ML techniques. Our process includes (1) Pre-Collection Planning, to reflexively prompt and document expected data distributions; (2) Collection Monitoring, to systematically encourage sampling diversity; and (3) Data Familiarity, to identify samples that are unfamiliar to a model using density estimation. We apply designing data to a data collection and modeling task. We find models trained on ''designed'' datasets generalize better across intersectional groups than those trained on similarly sized but less targeted datasets, and that data familiarity is effective for debugging datasets.
△ Less
Submitted 28 July, 2023; v1 submitted 24 January, 2023;
originally announced January 2023.
-
Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task
Authors:
Kenneth Li,
Aspen K. Hopkins,
David Bau,
Fernanda Viégas,
Hanspeter Pfister,
Martin Wattenberg
Abstract:
Language models show a surprising range of capabilities, but the source of their apparent competence is unclear. Do these networks just memorize a collection of surface statistics, or do they rely on internal representations of the process that generates the sequences they see? We investigate this question by applying a variant of the GPT model to the task of predicting legal moves in a simple boa…
▽ More
Language models show a surprising range of capabilities, but the source of their apparent competence is unclear. Do these networks just memorize a collection of surface statistics, or do they rely on internal representations of the process that generates the sequences they see? We investigate this question by applying a variant of the GPT model to the task of predicting legal moves in a simple board game, Othello. Although the network has no a priori knowledge of the game or its rules, we uncover evidence of an emergent nonlinear internal representation of the board state. Interventional experiments indicate this representation can be used to control the output of the network and create "latent saliency maps" that can help explain predictions in human terms.
△ Less
Submitted 26 June, 2024; v1 submitted 24 October, 2022;
originally announced October 2022.
-
Machine Learning Practices Outside Big Tech: How Resource Constraints Challenge Responsible Development
Authors:
Aspen Hopkins,
Serena Booth
Abstract:
Practitioners from diverse occupations and backgrounds are increasingly using machine learning (ML) methods. Nonetheless, studies on ML Practitioners typically draw populations from Big Tech and academia, as researchers have easier access to these communities. Through this selection bias, past research often excludes the broader, lesser-resourced ML community -- for example, practitioners working…
▽ More
Practitioners from diverse occupations and backgrounds are increasingly using machine learning (ML) methods. Nonetheless, studies on ML Practitioners typically draw populations from Big Tech and academia, as researchers have easier access to these communities. Through this selection bias, past research often excludes the broader, lesser-resourced ML community -- for example, practitioners working at startups, at non-tech companies, and in the public sector. These practitioners share many of the same ML development difficulties and ethical conundrums as their Big Tech counterparts; however, their experiences are subject to additional under-studied challenges stemming from deploying ML with limited resources, increased existential risk, and absent access to in-house research teams. We contribute a qualitative analysis of 17 interviews with stakeholders from organizations which are less represented in prior studies. We uncover a number of tensions which are introduced or exacerbated by these organizations' resource constraints -- tensions between privacy and ubiquity, resource management and performance optimization, and access and monopolization. Increased academic focus on these practitioners can facilitate a more holistic understanding of ML limitations, and so is useful for prescribing a research agenda to facilitate responsible ML development for all.
△ Less
Submitted 6 October, 2021;
originally announced October 2021.