
Pearls from Pebbles: Improved Confidence Functions for Auto-labeling

Authors: Harit Vishwakarma, Reid Chen, Sui Jiet Tay, Satya Sai Srinath Namburi, Frederic Sala, Ramya Korlakai Vinayak

Abstract: Auto-labeling is an important family of techniques that produce labeled training sets with minimal manual labeling. A prominent variant, threshold-based auto-labeling (TBAL), works by finding a threshold on a model's confidence scores above which it can accurately label unlabeled data points. However, many models are known to produce overconfident scores, leading to poor TBAL performance. While a natural idea is to apply off-the-shelf calibration methods to alleviate the overconfidence issue, such methods still fall short. Rather than experimenting with ad-hoc choices of confidence functions, we propose a framework for studying the optimal TBAL confidence function. We develop a tractable version of the framework to obtain Colander (Confidence functions for Efficient and Reliable Auto-labeling), a new post-hoc method specifically designed to maximize performance in TBAL systems. We perform an extensive empirical evaluation of Colander and compare it against methods designed for calibration. Colander achieves up to a 60% improvement in coverage over the baselines while maintaining auto-labeling error below 5% and using the same amount of labeled data as the baselines.

Submitted 24 April, 2024; originally announced April 2024.
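
The thresholding idea described in the abstract above can be illustrated with a minimal sketch. It is not the paper's procedure and not Colander: the function names, the use of raw empirical error on a labeled validation set, and the toy numbers are all assumptions made for illustration. A practical TBAL system would likely use a more conservative error estimate.

```python
def find_threshold(scores, correct, max_error=0.05):
    """Return the smallest confidence threshold t such that the empirical
    error among validation points with score >= t is at most max_error.
    Returns None if no threshold qualifies. (Hypothetical helper.)"""
    # Walk through validation points from most to least confident.
    pairs = sorted(zip(scores, correct), key=lambda p: -p[0])
    best = None
    errors = 0
    for i, (score, is_correct) in enumerate(pairs, start=1):
        errors += 0 if is_correct else 1
        # Evaluate only once per run of tied scores, so every point with
        # score >= candidate threshold has been counted.
        last_of_tie = i == len(pairs) or pairs[i][0] != score
        if last_of_tie and errors / i <= max_error:
            best = score
    return best


def coverage(unlabeled_scores, threshold):
    """Fraction of unlabeled points that would be auto-labeled."""
    if threshold is None or not unlabeled_scores:
        return 0.0
    return sum(s >= threshold for s in unlabeled_scores) / len(unlabeled_scores)


# Toy validation data (invented for illustration).
val_scores = [0.99, 0.95, 0.90, 0.80, 0.60, 0.55]
val_correct = [True, True, True, False, True, False]

t = find_threshold(val_scores, val_correct, max_error=0.25)
cov = coverage([0.97, 0.70, 0.50, 0.30], t)
```

A lower threshold raises coverage but admits more errors, which is why the choice of confidence function matters: overconfident scores push mislabeled points above the threshold.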

arXiv:2401.12225 [pdf, other]

Multimodal Data Curation via Object Detection and Filter Ensembles

Authors: Tzu-Heng Huang, Changho Shin, Sui Jiet Tay, Dyah Adila, Frederic Sala

Abstract: We propose an approach for curating multimodal data that we used for our entry in the 2023 DataComp competition filtering track. Our technique combines object detection and weak supervision-based ensembling. In the first of two steps in our approach, we employ an out-of-the-box zero-shot object detection model to extract granular information and produce a variety of filter designs. In the second step, we employ weak supervision to ensemble filtering rules. This approach results in a 4% performance improvement when compared to the best-performing baseline, producing the top-ranking position in the small scale track at the time of writing. Furthermore, in the medium scale track, we achieve a noteworthy 4.2% improvement over the baseline by simply ensembling existing baselines with weak supervision.

Submitted 5 January, 2024; originally announced January 2024.

Comments: Appeared in the Towards the Next Generation of Computer Vision Datasets (TNGCV) workshop at ICCV 2023

arXiv:2311.01195 [pdf, other]

Batch Bayesian Optimization for Replicable Experimental Design

Authors: Zhongxiang Dai, Quoc Phong Nguyen, Sebastian Shenghong Tay, Daisuke Urano, Richalynn Leong, Bryan Kian Hsiang Low, Patrick Jaillet

Abstract: Many real-world experimental design problems (a) evaluate multiple experimental conditions in parallel and (b) replicate each condition multiple times due to large and heteroscedastic observation noise. Given a fixed total budget, this naturally induces a trade-off between evaluating more unique conditions while replicating each of them fewer times vs. evaluating fewer unique conditions and replicating each more times. Moreover, in these problems, practitioners may be risk-averse and hence prefer an input with both good average performance and small variability. To tackle both challenges, we propose the Batch Thompson Sampling for Replicable Experimental Design (BTS-RED) framework, which encompasses three algorithms. Our BTS-RED-Known and BTS-RED-Unknown algorithms, for, respectively, known and unknown noise variance, choose the number of replications adaptively rather than deterministically such that an input with a larger noise variance is replicated more times. As a result, despite the noise heteroscedasticity, both algorithms enjoy a theoretical guarantee and are asymptotically no-regret. Our Mean-Var-BTS-RED algorithm aims at risk-averse optimization and is also asymptotically no-regret. We also show the effectiveness of our algorithms in two practical real-world applications: precision agriculture and AutoML.

Submitted 2 November, 2023; originally announced November 2023.

Comments: Accepted to NeurIPS 2023

arXiv:2307.03291 [pdf]

A Multi-Factor Homomorphic Encryption based Method for Authenticated Access to IoT Devices

Authors: Salem AlJanah, Ning Zhang, Siok Wah Tay

Abstract: Authentication is the first defence mechanism in many electronic systems, including Internet of Things (IoT) applications, as it is essential for other security services such as intrusion detection. As existing authentication solutions proposed for IoT environments do not provide multi-level authentication assurance, particularly for device-to-device authentication scenarios, we recently proposed the M2I (Multi-Factor Multi-Level and Interaction based Authentication) framework to facilitate multi-factor authentication of devices in device-to-device and device-to-multiDevice interactions. In this paper, we extend the framework to address group authentication. Two Many-to-One (M2O) protocols are proposed, the Hybrid Group Authentication and Key Acquisition (HGAKA) protocol and the Hybrid Group Access (HGA) protocol. The protocols use a combination of symmetric and asymmetric cryptographic primitives to facilitate multifactor group authentication. The informal analysis and formal security verification show that the protocols satisfy the desirable security requirements and are secure against authentication attacks.

Submitted 6 July, 2023; originally announced July 2023.

arXiv:2201.07323 [pdf]

doi 10.1109/ACCESS.2022.3170844

A Multi-factor Multi-level and Interaction based (M2I) Authentication Framework for Internet of Things (IoT) Applications

Authors: Salem AlJanah, Ning Zhang, Siok Wah Tay

Abstract: Existing authentication solutions proposed for Internet of Things (IoT) provide a single Level of Assurance (LoA) regardless of the sensitivity levels of the resources or interactions between IoT devices being protected. For effective (with adequate level of protection) and efficient (with as low overhead costs as possible) protections, it may be desirable to tailor the protection level in response to the sensitivity level of the resources, as a stronger protection level typically imposes higher overhead costs. In this paper, we investigate how to facilitate multi-LoA authentication for IoT by proposing a multi-factor multi-level and interaction based (M2I) authentication framework. The framework implements LoA linked and interaction based authentication. Two interaction modes are investigated, P2P (Peer-to-Peer) and O2M (One-to-Many), via the design of two corresponding protocols. Evaluation results show that adopting the O2M interaction mode in authentication can cut communication cost significantly; compared with that of the Kerberos protocol, the O2M protocol reduces the communication cost by 42%–45%. The protocols also introduce less computational cost. The P2P and O2M protocols reduce the computational cost by 70%–72% and 81%–82%, respectively, in comparison with that of Kerberos. Evaluation results also show that the two-factor authentication option costs twice as much as the one-factor option.

Submitted 17 March, 2022; v1 submitted 18 January, 2022; originally announced January 2022.

Journal ref: IEEE Access, vol. 10, 2022

arXiv:2112.09327 [pdf, other]

Incentivizing Collaboration in Machine Learning via Synthetic Data Rewards

Authors: Sebastian Shenghong Tay, Xinyi Xu, Chuan Sheng Foo, Bryan Kian Hsiang Low

Abstract: This paper presents a novel collaborative generative modeling (CGM) framework that incentivizes collaboration among self-interested parties to contribute data to a pool for training a generative model (e.g., GAN), from which synthetic data are drawn and distributed to the parties as rewards commensurate with their contributions. Distributing synthetic data as rewards (instead of trained models or money) offers task- and model-agnostic benefits for downstream learning tasks and is less likely to violate data privacy regulation. To realize the framework, we first propose a data valuation function using maximum mean discrepancy (MMD) that values data based on its quantity and quality in terms of its closeness to the true data distribution, and provide theoretical results guiding the kernel choice in our MMD-based data valuation function. Then, we formulate the reward scheme as a linear optimization problem that, when solved, guarantees certain incentives such as fairness in the CGM framework. We devise a weighted sampling algorithm for generating synthetic data to be distributed to each party as reward such that the value of its data and the synthetic data combined matches the reward value assigned to it by the reward scheme. We empirically show using simulated and real-world datasets that the parties' synthetic data rewards are commensurate with their contributions.

Submitted 17 December, 2021; originally announced December 2021.

Comments: 36th AAAI Conference on Artificial Intelligence (AAAI 2022), Extended version with derivations, 42 pages
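
As background for the MMD-based valuation mentioned in the abstract above, the following is a minimal sketch of the standard biased estimate of squared maximum mean discrepancy between two samples. It does not reproduce the paper's valuation function, whose exact form is not given here. The RBF kernel, the `lengthscale` parameter, and the function names are assumptions chosen for illustration.

```python
import math


def rbf(x, y, lengthscale=1.0):
    """RBF (Gaussian) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * lengthscale ** 2))


def mmd2_biased(xs, ys, lengthscale=1.0):
    """Biased estimate of squared MMD between samples xs and ys:
    mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    def mean_kernel(a, b):
        total = sum(rbf(p, q, lengthscale) for p in a for q in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)


# Toy 1-D samples (invented for illustration).
reference = [(0.0,), (1.0,)]
near = [(0.1,), (0.9,)]
far = [(5.0,), (6.0,)]

d_near = mmd2_biased(reference, near)
d_far = mmd2_biased(reference, far)
```

A smaller value indicates that a sample is closer to the reference distribution, which is the sense in which an MMD-based score can reflect data quality.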

arXiv:2012.10688 [pdf, other]

Top-$k$ Ranking Bayesian Optimization

Authors: Quoc Phong Nguyen, Sebastian Tay, Bryan Kian Hsiang Low, Patrick Jaillet

Abstract: This paper presents a novel approach to top-$k$ ranking Bayesian optimization (top-$k$ ranking BO), which is a practical and significant generalization of preferential BO to handle top-$k$ ranking and tie/indifference observations. We first design a surrogate model that is not only capable of catering to the above observations, but is also supported by a classic random utility model. Another equally important contribution is the introduction of the first information-theoretic acquisition function in BO with preferential observations, called multinomial predictive entropy search (MPES), which is flexible in handling these observations and optimized for all inputs of a query jointly. MPES possesses superior performance compared with existing acquisition functions that select the inputs of a query one at a time greedily. We empirically evaluate the performance of MPES using several synthetic benchmark functions, the CIFAR-10 dataset, and the SUSHI preference dataset.

Submitted 19 December, 2020; originally announced December 2020.

Comments: 35th AAAI Conference on Artificial Intelligence (AAAI 2021), Extended version with derivations, 13 pages

arXiv:2008.07030 [pdf, other]

Training CNN Classifiers for Semantic Segmentation using Partially Annotated Images: with Application on Human Thigh and Calf MRI

Authors: Chun Kit Wong, Stephanie Marchesseau, Maria Kalimeri, Tiang Siew Yap, Serena S. H. Teo, Lingaraj Krishna, Alfredo Franco-Obregón, Stacey K. H. Tay, Chin Meng Khoo, Philip T. H. Lee, Melvin K. S. Leow, John J. Totman, Mary C. Stephenson

Abstract: Objective: Medical image datasets with pixel-level labels tend to have a limited number of organ or tissue label classes annotated, even when the images have wide anatomical coverage. With supervised learning, multiple classifiers are usually needed given these partially annotated datasets. In this work, we propose a set of strategies to train one single classifier in segmenting all label classes that are heterogeneously annotated across multiple datasets without moving into semi-supervised learning. Methods: Masks were first created from each label image through a process we termed presence masking. Three presence masking modes were evaluated, differing mainly in weightage assigned to the annotated and unannotated classes. These masks were then applied to the loss function during training to remove the influence of unannotated classes. Results: Evaluation against publicly available CT datasets shows that presence masking is a viable method for training class-generic classifiers. Our class-generic classifier can perform as well as multiple class-specific classifiers combined, while the training duration is similar to that required for one class-specific classifier. Furthermore, the class-generic classifier can outperform the class-specific classifiers when trained on smaller datasets. Finally, consistent results are observed from evaluations against human thigh and calf MRI datasets collected in-house. Conclusion: The evaluation outcomes show that presence masking is capable of significantly improving both training and inference efficiency across imaging modalities and anatomical regions. Improved performance may even be observed on small datasets. Significance: Presence masking strategies can reduce the computational resources and costs involved in manual medical image annotations. All code is publicly available at https://github.com/wong-ck/DeepSegment.

Submitted 16 August, 2020; originally announced August 2020.

Comments: Submitted to IEEE Transactions on Medical Imaging (Special Issue on Annotation-Efficient Deep Learning for Medical Imaging)
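
The general idea of masking a loss so that unannotated classes have no influence, as described in the abstract above, can be sketched as follows. This is not the authors' implementation and does not correspond to any of their three masking modes. The per-class binary cross-entropy, the boolean `present` flags, and the function name are assumptions for illustration; the authors' actual code is in the repository linked in the abstract.

```python
import math


def masked_bce(probs, targets, present):
    """Mean binary cross-entropy over classes flagged as annotated.

    probs:   predicted probability per class for one pixel
    targets: ground-truth 0/1 label per class
    present: True where the class is annotated in this dataset
    (All three arguments are hypothetical inputs for this sketch.)"""
    eps = 1e-12
    total, count = 0.0, 0
    for p, t, is_present in zip(probs, targets, present):
        if not is_present:
            continue  # unannotated class: contributes nothing to the loss
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
        count += 1
    return total / count if count else 0.0


# Third class is unannotated, so its prediction is ignored.
loss_a = masked_bce([0.9, 0.1, 0.5], [1, 0, 1], [True, True, False])
loss_b = masked_bce([0.9, 0.1, 0.01], [1, 0, 1], [True, True, False])
```

Because the masked class is skipped entirely, changing its predicted probability leaves the loss unchanged, so a single classifier can be trained on datasets that annotate different subsets of classes.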

arXiv:1907.09052 [pdf, ps, other]

Hardware-In-the-Loop for Connected Automated Vehicles Testing in Real Traffic

Authors: Yeojun Kim, Samuel Tay, Jacopo Guanetti, Francesco Borrelli, Ryan Miller

Abstract: We present a hardware-in-the-loop (HIL) simulation setup for repeatable testing of Connected Automated Vehicles (CAVs) in dynamic, real-world scenarios. Our goal is to test control and planning algorithms and their distributed implementation on the vehicle hardware and, possibly, in the cloud. The HIL setup combines PreScan for perception sensors, road topography, and signalized intersections; Vissim for traffic micro-simulation; ETAS DESK-LABCAR/a dynamometer for vehicle and powertrain dynamics; and on-board electronic control units for CAV real time control. Models of traffic and signalized intersections are driven by real-world measurements. To demonstrate this HIL simulation setup, we test a Model Predictive Control approach for maximizing energy efficiency of CAVs in urban environments.

Submitted 21 July, 2019; originally announced July 2019.

Comments: This work was presented at the 14th International Symposium in Advanced Vehicle Control (AVEC '18)

Showing 1–9 of 9 results for author: Tay, S