FAR AI

Research Services

Berkeley, California · 1,820 followers

Ensuring the safe development and deployment of frontier AI systems

About us

FAR AI is a technical AI research and education non-profit, dedicated to ensuring the safe development and deployment of frontier AI systems.

FAR Research: Explores a portfolio of promising technical AI safety research directions.
FAR Labs: Supports the San Francisco Bay Area AI safety research community through a coworking space, events and programs.
FAR Futures: Delivers events and initiatives bringing together global leaders in AI academia, industry and policy.

Website
https://far.ai/
Industry
Research Services
Company size
11-50 employees
Headquarters
Berkeley, California
Type
Nonprofit
Founded
2022
Specialties
Artificial Intelligence and AI Alignment Research


Updates

  • View organization page for FAR AI

    🛡 Is AI robustness possible, or are adversarial attacks unavoidable? We investigate this in Go, testing three defenses to make superhuman Go AIs robust. Our defenses manage to protect against known threats, but unfortunately new adversaries bypass them, sometimes using qualitatively new attacks! 😈

    Last year we found that superhuman Go AIs are vulnerable to “cyclic attacks”. This adversarial strategy was discovered by AI, but can be replicated by human players. See our previous update: https://buff.ly/4cqdVYW

    We were curious to know whether it was possible to defend against the cyclic attack. Over the course of a year, we tested three different ways of patching the cyclic vulnerability in KataGo, the leading open-source Go AI:

    📚 Defense #1: Positional Adversarial Training. The KataGo developers added manually curated adversarial examples to KataGo’s training data. While this successfully defends KataGo against our original versions of the cyclic attack, we find new variants of the cyclic attack that still get through. We also find brand new attacks that defeat this system, such as the “gift attack” shown at https://buff.ly/3xkqKoJ 🎁

    🔄 Defense #2: Iterated Adversarial Training. This approach alternates between defense and offense, mirroring a cybersecurity arms race (see the sketch below). Each iteration improves KataGo's defense against known adversaries, but after 9 cycles, the most defended model can still be beaten 81% of the time by a novel variant of the cyclic attack we call the “atari attack”: https://buff.ly/3RxW9uI 🎋

    🖼️ Defense #3: Vision Transformer (ViT). In this defense, we replaced KataGo’s convolutional neural network (CNN) backbone, which focuses on local patterns, with a ViT backbone, which can attend to the entire board at once. Unfortunately, our ViT bot remained vulnerable to the original cyclic attack.

    Three diverse defenses all being overcome by new attacks is further evidence that AI robustness issues like jailbreaks are likely to remain a problem for many years to come. 💡 However, we did notice one positive sign: defending against any fixed, static attack was quick and easy. We think it might be possible to leverage this property to build a working defense, both in Go and in other settings. In particular, one could a) grow the adversarial training dataset by scaling up attack generation, b) improve the sample efficiency / generalization of adversarial training, and/or c) apply adversarial training online to defend against adversaries as they are learning to attack.

    For more information:
    🔗 Visit our website: https://goattack.far.ai/
    📝 Check out the blog post: https://lnkd.in/eCFYYupX
    📄 Read the full paper: https://lnkd.in/eZrSpCc6
    👥 Research by Tom Tseng, Euan McLean, Kellin Pelrine, Tony Wang and Adam Gleave.
    🚀 If you're interested in making AI systems more robust, we're hiring! Check out our roles at https://far.ai/jobs
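    The alternating offense/defense loop of Defense #2 can be summarized in a few lines of Python. This is only a minimal sketch: train_adversary, finetune_victim and win_rate are hypothetical callables standing in for the real (much more involved) KataGo training pipeline.

    ```python
    # Minimal sketch of iterated adversarial training. The helper callables
    # passed in are hypothetical stand-ins, not KataGo's actual training code.

    def iterated_adversarial_training(victim, train_adversary, finetune_victim,
                                      win_rate, num_iterations=9):
        adversaries = []
        for i in range(num_iterations):
            # Offense: train a fresh adversary against the current victim.
            adversary = train_adversary(victim)
            adversaries.append(adversary)

            # Defense: fine-tune the victim against every adversary found so
            # far, so that earlier attacks are not forgotten.
            victim = finetune_victim(victim, adversaries)

            print(f"iteration {i}: latest adversary win rate = "
                  f"{win_rate(adversary, victim):.1%}")
        return victim
    ```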

  • View organization page for FAR AI

    Can AI systems fake alignment during training, only to "flip the script" when deployed? In his FAR Labs Seminar, Evan Hubinger dives into “Deceptive Instrumental Alignment,” revealing how AI might appear aligned during training while secretly harboring harmful intentions for deployment.

    Key Takeaways:
    🤖 AI can cleverly hide dangerous behaviors, even when monitored.
    🔍 Detecting these hidden threats is tougher than you might think.
    📊 Larger, more sophisticated AI models only amplify the risk.

    📺 Watch the full recording: https://lnkd.in/ggsthnXH -- and subscribe to our YouTube channel for more crucial insights!

  • View organization page for FAR AI

    📣 FAR AI is seeking a highly innovative Head of Engineering. Ideal candidates are tech-savvy leaders with a knack for scaling teams and driving impactful research.

    Role Overview:
    🔧 Lead & grow the engineering team
    🎯 Shape strategic research directions
    💡 Oversee cutting-edge AI safety projects
    🌍 Represent FAR AI at global conferences

    Why join FAR? Accelerate AI safety research. Collaborate with leading experts in a dynamic environment. Enjoy competitive compensation, catered meals, and a commitment to your growth.

    Learn more: https://lnkd.in/euueG7dj

    We’re also on the lookout for AI Research Scientists & Engineers. Visit https://far.ai/jobs/ for more details. Apply now and join the forefront of AI Safety!

  • View organization page for FAR AI

    “Verification and Confidence Building Mechanisms for International Coordination” by Peter Barnett, presented at the FAR Labs Seminar. Barnett, of the Machine Intelligence Research Institute, describes the potential of hardware-enabled mechanisms to provide verification and build confidence in international coordination schemes for AI.

    Key Highlights:
    - The need for trust and verification between nations
    - Parallels with nuclear treaties
    - Proposed hardware-enabled mechanisms for AI coordination

    Watch the full recording: https://lnkd.in/eq5t8Znp -- and subscribe to our YouTube channel for future research presentations!

  • View organization page for FAR AI

    Alexander Pan on modeling the risks from LLMs and mitigating their potential for catastrophic misuse: "We build a benchmark to determine how much models can help with malicious use... and then [explore whether] we can remove that knowledge from models so that they’re not as dangerous.”

    WMDP is an expert-written dataset and benchmark for assessing whether LLMs could aid bad actors in carrying out biological, cyber, and chemical attacks. The authors took special care to ensure the benchmark itself wouldn’t be a source of risk: “To avoid releasing sensitive and export-controlled information, we collect questions that are precursors, neighbors, and components of the hazardous knowledge we wish to remove.” https://www.wmdp.ai/

    To test mitigations against models being used for malicious purposes, the WMDP benchmark was paired with experiments in unlearning: removing knowledge from LLMs so that, even if they are jailbroken, they wouldn’t be able to help create weapons of mass destruction. (A sketch of how such a multiple-choice benchmark is scored follows below.)

    For more on the WMDP benchmark, unlearning, and Alex’s experiments on In-Context Reward Hacking, watch the full video with Alex Pan at the FAR seminar: https://lnkd.in/eaF2BaBd
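    As a rough illustration of how a multiple-choice hazard benchmark like WMDP is scored, and how unlearning is then evaluated against it, here is a minimal Python sketch. The model_score callable (a stand-in for "log-probability the model assigns to an answer choice") and the question format are assumptions for illustration, not the WMDP codebase.

    ```python
    # Minimal sketch of scoring a multiple-choice hazard benchmark.
    # model_score(question, choice) is a hypothetical stand-in for the
    # log-probability a model assigns to an answer choice.

    def accuracy(model_score, questions):
        """questions: list of dicts with 'question', 'choices', 'answer' (index)."""
        correct = 0
        for q in questions:
            scores = [model_score(q["question"], choice) for choice in q["choices"]]
            prediction = max(range(len(scores)), key=scores.__getitem__)
            correct += int(prediction == q["answer"])
        return correct / len(questions)

    # A successful unlearning method should push accuracy on the hazard benchmark
    # toward chance (25% for four choices) while leaving accuracy on general
    # benchmarks largely unchanged:
    #   accuracy(base_model_score, wmdp_questions)       -> high
    #   accuracy(unlearned_model_score, wmdp_questions)  -> near 0.25
    ```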

  • View organization page for FAR AI

    Do neural networks dream of internal goals? We confirm that RNNs trained with RL to play Sokoban learn to plan. Our black-box analysis reveals novel behaviors such as agents “pacing” to gain thinking time. We open-source the RNNs as model organisms for interpretability research.

    We replicate the "planning effect" in Sokoban from https://lnkd.in/gdd4qvcD. To give the RNN extra “time to think” at test time, we run it several times on the 1st observation of a level, advancing the recurrent state (see the sketch below). This enables it to solve more levels.

    Our RNN learns to plan in just 70M steps. After that, the % of levels solved continues to increase – but the planning effect (performance at 8 minus 0 steps of thinking) decreases for medium-difficulty levels, although it continues to increase for hard ones. Why is that?

    One clue is that we observe the RNN sometimes ‘paces’ around the level, going in cycles at least once, before committing to a plan. This is not due to the policy being suboptimal: if we give the RNN time to think at level start, it does not 'pace' anymore. In general, 75% of cycles in the first 5 steps disappear given extra thinking time. Time to think in the middle of a level also helps: 82% of N-step cycles disappear with N steps to think.

    But if the RNN learns to get more computation by pacing, why does it benefit from thinking time? It seems the RNN often executes greedy plans that lock the level, and thinking steps prevent that. This may be rational given the -0.1 penalty per step: thinking faster pays off. Supporting this, we find that the time the NN takes to place boxes 1-3 sharply goes up when adding thinking steps. This means that at 0 thinking steps, the RNN places boxes 1-3 faster, but cannot solve the level, indicating myopic thinking.

    Moreover, we find that levels which the RNN first solves at N thinking steps are more difficult the larger N is, where difficulty is measured by the length of the optimal solution. We computed the optimal solution using A* with a Manhattan distance heuristic.

    We release the neural networks at https://lnkd.in/gY7mb9s9 as well as the A* solutions. We are excited to understand how these work as a test case for mesa-optimizers: NNs that have learned to pursue a goal. Go forth and interpret them!

    Blog: https://lnkd.in/g_5SpGRc
    arXiv: https://lnkd.in/gkVesDEd

    If you are at ICML, come see our poster at the Mechanistic Interpretability workshop! https://lnkd.in/gASMqfB3

    Work by Adrià Garriga-Alonso, Mohammad Taufeeque, and Adam Gleave.
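    For the curious, here is roughly what giving the RNN extra “time to think” means in code: the recurrent state is advanced on the first observation several times before any action is taken. The policy_rnn(obs, hidden) -> (action_logits, hidden) interface is an assumption for illustration, not the exact API of the released models.

    ```python
    import torch

    def act_with_thinking_time(policy_rnn, first_obs, hidden, thinking_steps=8):
        # Advance the recurrent state on the first observation without acting,
        # giving the network extra computation before it commits to a plan.
        # The policy_rnn(obs, hidden) -> (action_logits, hidden) interface is
        # an assumed stand-in, not the exact API of the released models.
        with torch.no_grad():
            for _ in range(thinking_steps):
                _, hidden = policy_rnn(first_obs, hidden)
            action_logits, hidden = policy_rnn(first_obs, hidden)
        action = torch.distributions.Categorical(logits=action_logits).sample()
        return action, hidden
    ```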

  • View organization page for FAR AI

    Don’t forget our posters at the ICML NextGen AI Safety Workshop TODAY and the Mechanistic Interpretability Workshop TOMORROW!

    View organization page for FAR AI

    Check out our #ICML2024 posters! Connect with our researchers and collaborators.

    July 23: Main Conference
    🔍 Codebook Features – https://lnkd.in/ezxr6nHD

    July 26: NextGen AI Safety Workshop
    ⚫ Adversarially Robust Go AIs – https://lnkd.in/gBCdus3X
    🛡️ Robust LLM Scaling Laws
    💥 Catastrophic Goodhart

    July 27: Mechanistic Interpretability Workshop
    📦 Planning Behavior in an RNN that Plays Sokoban
    🔬 InterpBench – https://lnkd.in/eriJgfmT
    🔥 Adversarial Circuit Evaluation
    🐍 Indirect Object Identification Circuit in Mamba

    We'd love to see you at ICML (International Conference on Machine Learning) and chat about the future of AI Safety! If you’re excited by this research, our team is hiring! See our job descriptions at https://far.ai/jobs/ or email [email protected] to explore collaboration opportunities.

  • View organization page for FAR AI

    Frontier LLMs like ChatGPT are powerful but vulnerable to attack. Scale helps with many things, so we wanted to see if scaling up the model size can "solve" robustness issues. Spoiler: it's complicated!

    🛠️ Setup: We tested 10 fine-tuned Pythia models, ranging from 7.6M to 12B parameters, across 4 binary classification tasks:
    📧 Spam email detection
    🎬 IMDB sentiment analysis
    🔐 Verify user password
    📏 Determine longer word

    ⚔️ We used 2 attacks: a baseline random token (RT) attack (sketched below), and Greedy Coordinate Gradient (GCG).

    🏋️♂️ Results with Undefended Models: Larger models are generally more robust, but results are noisy. Attack success rate (y-axis) does tend to be lower for bigger models (x-axis), but the trend is very non-monotonic. 🌫️

    🛡️ Results from Adversarial Training: Larger models are more sample efficient, needing fewer training rounds to become robust. By round 30, training plateaus, with larger models showing superior robustness.

    🔄 Results from Robustness Transfer: Models adversarially trained against weaker attacks gain some protection against stronger attacks. Larger models (>100M params) show signs of a phase transition where defenses generalize across attacks.

    🎯 Conclusion
    - Robustness after pretraining improves with model size, but the effect is noisy.
    - Adversarial training improves robustness for all models, but larger models learn faster & better.
    - Adversarially trained models with >100M params are robust to some unseen attacks.

    Our next steps include:
    - Investigating compute-optimal robustness training: when is it better to adversarially train a large model briefly vs. a small model extensively?
    - Adding more tasks, attacks & models.
    Stay tuned for updates!

    For more information:
    📝 Check out the blog post: https://lnkd.in/g8CY4FcG
    📄 Read the full paper: https://lnkd.in/g9Ty_CEd
    👥 Research by Nikolaus Howe, Michał Zając, Ian McKenzie, Oskar Hollinsworth, Tom Tseng, Pierre-Luc Bacon, and Adam Gleave.

    If you're interested in making AI systems more robust, we're hiring! Check out our roles at https://far.ai/jobs
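    To give a feel for the random token (RT) baseline, here is a minimal Python sketch: sample random suffixes and keep one that flips the classifier's output. The classify(text) -> label interface and the word-level vocabulary are illustrative assumptions, not the paper's implementation.

    ```python
    import random

    def random_token_attack(classify, prompt, vocab, target_label,
                            suffix_len=10, num_trials=1_000, seed=0):
        # Baseline random-token (RT) attack sketch: append a random suffix
        # drawn from `vocab` and keep any suffix that makes `classify` return
        # the attacker's target label. classify(text) -> label is an assumed
        # interface; this is not the paper's exact implementation.
        rng = random.Random(seed)
        for _ in range(num_trials):
            suffix = " ".join(rng.choice(vocab) for _ in range(suffix_len))
            if classify(prompt + " " + suffix) == target_label:
                return suffix  # attack succeeded within the trial budget
        return None            # attack failed within the trial budget
    ```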

  • View organization page for FAR AI

    🔍 #ICML2024 Poster: Codebook Features make language models more interpretable and controllable using vector quantization! Learn more at our session TODAY, 23 Jul 13:30-15:00 CET, in Hall C 4-9 #709, by Alex Tamkin, Mohammad Taufeeque, and Noah Goodman.

    View organization page for FAR AI

    📢 Thrilled to unveil our new research on Codebook Features! We're making neural networks more interpretable and controllable using vector quantization. https://lnkd.in/g_tKUv8M

    🛠️ What We Did
    Introduced a bottleneck at each layer of the network to convert complex activation vectors into discrete codes (see the sketch below).

    📈 Why It Matters
    - Clarity: Makes understanding network decisions easier.
    - Control: Activate specific codes to guide network behavior.
    - Performance: Minimal impact on prediction accuracy (<5% drop).

    🌟 Promising Foundation
    - Easier circuit discovery
    - Refined model control
    - Scalable interpretability

    📖 For more details, check out the resources below:
    Full paper: https://lnkd.in/g_tKUv8M
    Blog post: https://lnkd.in/gE9k2RPm
    Hugging Face demo: https://lnkd.in/ggqDu9Mk

    Work by Alex Tamkin, Mohammad Taufeeque, and Noah Goodman.

    🚀 If you’re also interested in making AI systems interpretable, we’re hiring! https://far.ai/
    #aisafety #aialignment #interpretability
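    A simplified sketch of the core idea: a codebook bottleneck that replaces each activation vector with the sum of its k most similar codebook vectors. The hyperparameters, the choice of cosine similarity, and the omission of gradient tricks (e.g. straight-through estimation) are illustrative simplifications, not the paper's exact implementation.

    ```python
    import torch
    import torch.nn.functional as F

    class CodebookBottleneck(torch.nn.Module):
        # Simplified sketch: replace each activation vector with the sum of
        # its k most similar codebook vectors, so every layer's output is
        # built from a small set of discrete, inspectable codes. Not the
        # paper's exact implementation.
        def __init__(self, num_codes=1024, dim=512, k=8):
            super().__init__()
            self.codebook = torch.nn.Parameter(torch.randn(num_codes, dim))
            self.k = k

        def forward(self, activations):  # activations: (..., dim)
            sims = F.normalize(activations, dim=-1) @ F.normalize(self.codebook, dim=-1).T
            codes = sims.topk(self.k, dim=-1).indices  # discrete code ids
            return self.codebook[codes].sum(dim=-2)    # quantized activations
    ```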

