-
ActionReasoningBench: Reasoning about Actions with and without Ramification Constraints
Authors:
Divij Handa,
Pavel Dolin,
Shrinidhi Kumbhar,
Chitta Baral,
Tran Cao Son
Abstract:
Reasoning about actions and change (RAC) has historically driven the development of many early AI challenges, such as the frame problem, and many AI disciplines, including non-monotonic and commonsense reasoning. The role of RAC remains important even now, particularly for tasks involving dynamic environments, interactive scenarios, and commonsense reasoning. Despite the progress of Large Language Models (LLMs) in various AI domains, their performance on RAC is underexplored. To address this gap, we introduce a new benchmark, ActionReasoningBench, encompassing 13 domains and rigorously evaluating LLMs across eight different areas of RAC: Object Tracking, Fluent Tracking, State Tracking, Action Executability, Effects of Actions, Numerical RAC, Hallucination Detection, and Composite Questions. Furthermore, we investigate the indirect effects of actions due to ramification constraints for every domain. Finally, we evaluate our benchmark using open-source and commercial state-of-the-art LLMs, including GPT-4o, Gemini-1.0-Pro, Llama2-7b-chat, Llama2-13b-chat, Llama3-8b-instruct, Gemma-2b-instruct, and Gemma-7b-instruct. Our findings indicate that these models face significant challenges across all categories included in our benchmark.
Submitted 6 June, 2024;
originally announced June 2024.
-
The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness
Authors:
Neeraj Varshney,
Pavel Dolin,
Agastya Seth,
Chitta Baral
Abstract:
As Large Language Models (LLMs) play an increasingly pivotal role in natural language processing applications, their safety concerns become critical areas of NLP research. This paper presents the Safety and Over-Defensiveness Evaluation (SODE) benchmark: a collection of diverse safe and unsafe prompts with carefully designed evaluation methods that facilitate systematic evaluation, comparison, and analysis over 'safety' and 'over-defensiveness.' With SODE, we study a variety of LLM defense strategies over multiple state-of-the-art LLMs, which reveals several interesting and important findings, such as (a) the widely popular 'self-checking' techniques indeed improve the safety against unsafe inputs, but this comes at the cost of extreme over-defensiveness on the safe inputs, (b) providing a safety instruction along with in-context exemplars (of both safe and unsafe inputs) consistently improves safety and also mitigates undue over-defensiveness of the models, and (c) providing contextual knowledge easily breaks the safety guardrails and makes the models more vulnerable to generating unsafe responses. Overall, our work reveals numerous such critical findings that we believe will pave the way and facilitate further research in improving the safety of LLMs.
Submitted 30 December, 2023;
originally announced January 2024.
-
FeelsGoodMan: Inferring Semantics of Twitch Neologisms
Authors:
Pavel Dolin,
Luc d'Hauthuille,
Andrea Vattani
Abstract:
Twitch chats pose a unique problem in natural language understanding due to a large presence of neologisms, specifically emotes. There are a total of 8.06 million emotes, over 400k of which were used in the week studied. There is virtually no information on the meaning or sentiment of emotes, and with a constant influx of new emotes and drift in their frequencies, it becomes impossible to maintain an updated manually-labeled dataset. Our paper makes a twofold contribution. First, we establish a new baseline for sentiment analysis on Twitch data, outperforming the previous supervised benchmark by 7.9 percentage points. Second, we introduce a simple but powerful unsupervised framework based on word embeddings and k-NN to enrich existing models with out-of-vocabulary knowledge. This framework allows us to auto-generate a pseudo-dictionary of emotes, and we show that we can nearly match the supervised benchmark above even when injecting such emote knowledge into sentiment classifiers trained on extraneous datasets such as movie reviews or Twitter.
Submitted 17 November, 2021; v1 submitted 18 August, 2021;
originally announced August 2021.
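The abstract above mentions building an emote pseudo-dictionary from word embeddings and k-NN. The following is only a minimal sketch of that general idea, not the authors' implementation: the two-dimensional vectors, the sentiment labels, the `FeelsBadMan` entry, and the function names `cosine`, `knn_label`, and `build_pseudo_dictionary` are all hypothetical and chosen for illustration. It assigns each emote the majority label of its k nearest labeled words under cosine similarity.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_label(vector, labeled, k=3):
    """Return the majority label among the k labeled words most similar to `vector`.

    `labeled` maps a word to a pair (embedding, label).
    """
    ranked = sorted(
        labeled.items(),
        key=lambda item: cosine(vector, item[1][0]),
        reverse=True,
    )
    votes = Counter(label for _, (_, label) in ranked[:k])
    return votes.most_common(1)[0][0]


def build_pseudo_dictionary(emote_vectors, labeled, k=3):
    """Map every emote to the label inferred from its nearest labeled neighbours."""
    return {emote: knn_label(vec, labeled, k) for emote, vec in emote_vectors.items()}


# Toy, hand-made embeddings (assumption: real embeddings would be learned from chat logs).
LABELED = {
    "great": ([0.9, 0.1], "positive"),
    "love": ([0.8, 0.2], "positive"),
    "nice": ([0.7, 0.3], "positive"),
    "awful": ([0.1, 0.9], "negative"),
    "hate": ([0.2, 0.8], "negative"),
    "bad": ([0.3, 0.7], "negative"),
}
EMOTES = {"FeelsGoodMan": [0.85, 0.15], "FeelsBadMan": [0.15, 0.85]}

pseudo_dictionary = build_pseudo_dictionary(EMOTES, LABELED)
print(pseudo_dictionary)
```

In this toy setup the emote vectors sit close to one cluster of labeled words, so the vote is unanimous; with real embeddings the choice of k and of the similarity measure would need tuning.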
-
Refractometric sensing of Li salt with visible-light Si3N4 microdisk resonators
Authors:
C. Doolin,
P. Doolin,
B. C. Lewis,
J. P. Davis
Abstract:
We demonstrate aqueous refractive index sensing with 15 to 30 μm diameter silicon nitride microdisk resonators to detect small concentrations of Li salt. A dimpled-tapered fiber is used to couple 780 nm visible light to the microdisks, in order to perform spectroscopy of their optical resonances. The dimpled fiber probe allows testing of multiple devices on a chip in a single experiment. This sensing system is versatile and easy to use, while remaining competitive with other refractometric sensors. For example, from a 20 μm diameter device we measure a sensitivity of 200 $\pm$ 30 nm/RIU with a loaded quality factor of 1.5 $\times$ 10$^4$, and a limit of detection down to (1.3 $\pm$ 0.1) $\times$ 10$^{-6}$ RIU.
Submitted 22 December, 2014;
originally announced December 2014.
-
Optical microscope and tapered fiber coupling apparatus for a dilution refrigerator
Authors:
A. J. R. MacDonald,
G. G. Popowich,
B. D. Hauer,
P. H. Kim,
A. Fredrick,
X. Rojas,
P. Doolin,
J. P. Davis
Abstract:
We have developed a system for tapered fiber measurements of optomechanical resonators inside a dilution refrigerator, which is compatible with both on- and off-chip devices. Our apparatus features full three-dimensional control of the taper-resonator coupling conditions enabling critical coupling, with an overall fiber transmission efficiency of up to 70%. Notably, our design incorporates an optical microscope system consisting of a coherent bundle of 37,000 optical fibers for real-time imaging of the experiment at a resolution of $\sim$1 μm. We present cryogenic optical and optomechanical measurements of resonators coupled to tapered fibers at temperatures as low as 9 mK.
Submitted 27 November, 2014;
originally announced November 2014.