FAR AI

Research Services

Berkeley, California · 1,820 followers

Ensuring the safe development and deployment of frontier AI systems

About us

FAR AI is a technical AI research and education non-profit, dedicated to ensuring the safe development and deployment of frontier AI systems.

FAR Research: Explores a portfolio of promising technical AI safety research directions.
FAR Labs: Supports the San Francisco Bay Area AI safety research community through a coworking space, events and programs.
FAR Futures: Delivers events and initiatives bringing together global leaders in AI academia, industry and policy.

Website
https://far.ai/
Industry
Research Services
Company size
11-50 employees
Headquarters
Berkeley, California
Type
Nonprofit
Founded
2022
Specialties
Artificial Intelligence and AI Alignment Research


Updates

  • View organization page for FAR AI

    🛡 Is AI robustness possible, or are adversarial attacks unavoidable? We investigate this in Go, testing three defenses to make superhuman Go AIs robust. Our defenses manage to protect against known threats, but unfortunately new adversaries bypass them, sometimes using qualitatively new attacks! 😈

    Last year we found that superhuman Go AIs are vulnerable to “cyclic attacks”. This adversarial strategy was discovered by AI, but can be replicated by human players. See our previous update: https://buff.ly/4cqdVYW

    We were curious to know whether it was possible to defend against the cyclic attack. Over the course of a year, we tested three different ways of patching the cyclic vulnerability in KataGo, the leading open-source Go AI:

    📚 Defense #1: Positional Adversarial Training. The KataGo developers added manually curated adversarial examples to KataGo’s training data. While this successfully defends KataGo against our original versions of the cyclic attack, we find new variants of the cyclic attack that still get through. We also find brand new attacks that defeat this system, such as the “gift attack” shown at https://buff.ly/3xkqKoJ 🎁

    🔄 Defense #2: Iterated Adversarial Training. This approach alternates between defense and offense, mirroring a cybersecurity arms race (see the sketch below). Each iteration improves KataGo's defense against known adversaries, but after 9 cycles, the most defended model can still be beaten 81% of the time by a novel variant of the cyclic attack we call the “atari attack”: https://buff.ly/3RxW9uI 🎋

    🖼️ Defense #3: Vision Transformer (ViT). In this defense, we replaced KataGo’s convolutional neural network (CNN) backbone, which focuses on local patterns, with a ViT backbone, which can attend to the entire board at once. Unfortunately, our ViT bot remained vulnerable to the original cyclic attack.

    Three diverse defenses all being overcome by new attacks is further evidence that AI robustness issues like jailbreaks are likely to remain a problem for many years to come. 💡 However, we did notice one positive sign: defending against any fixed, static attack was quick and easy. We think it might be possible to leverage this property to build a working defense, both in Go and in other settings. In particular, one could a) grow the adversarial training dataset by scaling up attack generation, b) improve the sample efficiency / generalization of adversarial training, and/or c) apply adversarial training online to defend against adversaries as they are learning to attack.

    For more information:
    🔗 Visit our website: https://goattack.far.ai/
    📝 Check out the blog post: https://lnkd.in/eCFYYupX
    📄 Read the full paper: https://lnkd.in/eZrSpCc6
    👥 Research by Tom Tseng, Euan McLean, Kellin Pelrine, Tony Wang and Adam Gleave.
    🚀 If you're interested in making AI systems more robust, we're hiring! Check out our roles at https://far.ai/jobs
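    The alternating offense/defense loop of Defense #2 can be summarized in a few lines of Python. This is only a minimal sketch: train_adversary, finetune_victim and win_rate are hypothetical callables standing in for the real (much more involved) KataGo training pipeline.

    ```python
    # Minimal sketch of iterated adversarial training. The helper callables
    # passed in are hypothetical stand-ins, not KataGo's actual training code.

    def iterated_adversarial_training(victim, train_adversary, finetune_victim,
                                      win_rate, num_iterations=9):
        adversaries = []
        for i in range(num_iterations):
            # Offense: train a fresh adversary against the current victim.
            adversary = train_adversary(victim)
            adversaries.append(adversary)

            # Defense: fine-tune the victim against every adversary found so
            # far, so that earlier attacks are not forgotten.
            victim = finetune_victim(victim, adversaries)

            print(f"iteration {i}: latest adversary win rate = "
                  f"{win_rate(adversary, victim):.1%}")
        return victim
    ```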

  • View organization page for FAR AI

    Can AI systems fake alignment during training, only to "flip the script" when deployed? In his FAR Labs Seminar, Evan Hubinger dives into “Deceptive Instrumental Alignment,” revealing how AI might appear aligned during training while secretly harboring harmful intentions for deployment.

    Key Takeaways:
    🤖 AI can cleverly hide dangerous behaviors, even when monitored.
    🔍 Detecting these hidden threats is tougher than you might think.
    📊 Larger, more sophisticated AI models only amplify the risk.

    📺 Watch the full recording: https://lnkd.in/ggsthnXH -- and subscribe to our YouTube channel for more crucial insights!

  • View organization page for FAR AI

    📣 FAR AI is seeking a highly innovative Head of Engineering. Ideal candidates are tech-savvy leaders with a knack for scaling teams and driving impactful research.

    Role Overview:
    🔧 Lead & grow the engineering team
    🎯 Shape strategic research directions
    💡 Oversee cutting-edge AI safety projects
    🌍 Represent FAR AI at global conferences

    Why join FAR? Accelerate AI safety research. Collaborate with leading experts in a dynamic environment. Enjoy competitive compensation, catered meals, and a commitment to your growth.

    Learn more: https://lnkd.in/euueG7dj

    We’re also on the lookout for AI Research Scientists & Engineers. Visit https://far.ai/jobs/ for more details. Apply now and join the forefront of AI Safety!

  • View organization page for FAR AI

    “Verification and Confidence Building Mechanisms for International Coordination” by Peter Barnett, presented at the FAR Labs Seminar. Barnett, of the Machine Intelligence Research Institute, describes the potential of hardware-enabled mechanisms to provide verification and build confidence in international coordination schemes for AI.

    Key Highlights:
    - The need for trust and verification between nations
    - Parallels with nuclear treaties
    - Proposed hardware-enabled mechanisms for AI coordination

    Watch the full recording: https://lnkd.in/eq5t8Znp -- and subscribe to our YouTube channel for future research presentations!

  • View organization page for FAR AI

    Alexander Pan on modeling the risks from LLMs and mitigating their potential for catastrophic misuse: "We build a benchmark to determine how much models can help with malicious use... and then [explore whether] we can remove that knowledge from models so that they’re not as dangerous.”

    WMDP is an expert-written dataset and benchmark for assessing whether LLMs could aid bad actors in carrying out biological, cyber, and chemical attacks. The authors took special care to ensure the benchmark itself wouldn’t be a source of risk: “To avoid releasing sensitive and export-controlled information, we collect questions that are precursors, neighbors, and components of the hazardous knowledge we wish to remove.” https://www.wmdp.ai/

    To test mitigations against models being used for malicious purposes, the WMDP benchmark was paired with experiments in unlearning: removing knowledge from LLMs so that, even if they are jailbroken, they wouldn’t be able to help create weapons of mass destruction. (A sketch of how such a multiple-choice benchmark is scored follows below.)

    For more on the WMDP benchmark, unlearning, and Alex’s experiments on In-Context Reward Hacking, watch the full video with Alex Pan at the FAR seminar: https://lnkd.in/eaF2BaBd
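    As a rough illustration of how a multiple-choice hazard benchmark like WMDP is scored, and how unlearning is then evaluated against it, here is a minimal Python sketch. The model_score callable (a stand-in for "log-probability the model assigns to an answer choice") and the question format are assumptions for illustration, not the WMDP codebase.

    ```python
    # Minimal sketch of scoring a multiple-choice hazard benchmark.
    # model_score(question, choice) is a hypothetical stand-in for the
    # log-probability a model assigns to an answer choice.

    def accuracy(model_score, questions):
        """questions: list of dicts with 'question', 'choices', 'answer' (index)."""
        correct = 0
        for q in questions:
            scores = [model_score(q["question"], choice) for choice in q["choices"]]
            prediction = max(range(len(scores)), key=scores.__getitem__)
            correct += int(prediction == q["answer"])
        return correct / len(questions)

    # A successful unlearning method should push accuracy on the hazard benchmark
    # toward chance (25% for four choices) while leaving accuracy on general
    # benchmarks largely unchanged:
    #   accuracy(base_model_score, wmdp_questions)       -> high
    #   accuracy(unlearned_model_score, wmdp_questions)  -> near 0.25
    ```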

  • View organization page for FAR AI

    Do neural networks dream of internal goals? We confirm that RNNs trained with RL to play Sokoban learn to plan. Our black-box analysis reveals novel behaviors such as agents “pacing” to gain thinking time. We open-source the RNNs as model organisms for interpretability research.

    We replicate the "planning effect" in Sokoban from https://lnkd.in/gdd4qvcD. To give the RNN extra “time to think” at test time, we run it several times on the 1st observation of a level, advancing the recurrent state (see the sketch below). This enables it to solve more levels.

    Our RNN learns to plan in just 70M steps. After that, the % of levels solved continues to increase – but the planning effect (performance at 8 minus 0 steps of thinking) decreases for medium-difficulty levels, although it continues to increase for hard ones. Why is that?

    One clue is that we observe the RNN sometimes ‘paces’ around the level, going in cycles at least once, before committing to a plan. This is not due to the policy being suboptimal: if we give the RNN time to think at level start, it does not 'pace' anymore. In general, 75% of cycles in the first 5 steps disappear given extra thinking time. Time to think in the middle of a level also helps: 82% of N-step cycles disappear with N steps to think.

    But if the RNN learns to get more computation by pacing, why does it benefit from thinking time? It seems the RNN often executes greedy plans that lock the level, and thinking steps prevent that. This may be rational given the -0.1 penalty per step: thinking faster pays off. Supporting this, we find that the time the NN takes to place boxes 1-3 sharply goes up when adding thinking steps. This means that at 0 thinking steps, the RNN places boxes 1-3 faster, but cannot solve the level, indicating myopic thinking.

    Moreover, we find that levels which the RNN first solves at N thinking steps are more difficult the larger N is, where difficulty is measured by the length of the optimal solution. We computed the optimal solution using A* with a Manhattan distance heuristic.

    We release the neural networks at https://lnkd.in/gY7mb9s9 as well as the A* solutions. We are excited to understand how these work as a test case for mesa-optimizers: NNs that have learned to pursue a goal. Go forth and interpret them!

    Blog: https://lnkd.in/g_5SpGRc
    arXiv: https://lnkd.in/gkVesDEd

    If you are at ICML, come see our poster at the Mechanistic Interpretability workshop! https://lnkd.in/gASMqfB3

    Work by Adrià Garriga-Alonso, Mohammad Taufeeque, and Adam Gleave.
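    For the curious, here is roughly what giving the RNN extra “time to think” means in code: the recurrent state is advanced on the first observation several times before any action is taken. The policy_rnn(obs, hidden) -> (action_logits, hidden) interface is an assumption for illustration, not the exact API of the released models.

    ```python
    import torch

    def act_with_thinking_time(policy_rnn, first_obs, hidden, thinking_steps=8):
        # Advance the recurrent state on the first observation without acting,
        # giving the network extra computation before it commits to a plan.
        # The policy_rnn(obs, hidden) -> (action_logits, hidden) interface is
        # an assumed stand-in, not the exact API of the released models.
        with torch.no_grad():
            for _ in range(thinking_steps):
                _, hidden = policy_rnn(first_obs, hidden)
            action_logits, hidden = policy_rnn(first_obs, hidden)
        action = torch.distributions.Categorical(logits=action_logits).sample()
        return action, hidden
    ```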

  • View organization page for FAR AI

    Don’t forget our posters at the ICML NextGen AI Safety Workshop TODAY and the Mechanistic Interpretability Workshop TOMORROW!

    View organization page for FAR AI

    Check out our #ICML2024 posters! Connect with our researchers and collaborators.

    July 23: Main Conference
    🔍 Codebook Features – https://lnkd.in/ezxr6nHD

    July 26: NextGen AI Safety Workshop
    ⚫ Adversarially Robust Go AIs – https://lnkd.in/gBCdus3X
    🛡️ Robust LLM Scaling Laws
    💥 Catastrophic Goodhart

    July 27: Mechanistic Interpretability Workshop
    📦 Planning Behavior in an RNN that Plays Sokoban
    🔬 InterpBench – https://lnkd.in/eriJgfmT
    🔥 Adversarial Circuit Evaluation
    🐍 Indirect Object Identification Circuit in Mamba

    We'd love to see you at ICML (International Conference on Machine Learning) and chat about the future of AI Safety! If you’re excited by this research, our team is hiring! See our job descriptions at https://far.ai/jobs/ or email [email protected] to explore collaboration opportunities.

  • View organization page for FAR AI

    Frontier LLMs like ChatGPT are powerful but vulnerable to attack. Scale helps with many things, so we wanted to see if scaling up the model size can "solve" robustness issues. Spoiler: it's complicated!

    🛠️ Setup: We tested 10 fine-tuned Pythia models, ranging from 7.6M to 12B parameters, across 4 binary classification tasks:
    📧 Spam email detection
    🎬 IMDB sentiment analysis
    🔐 Verify user password
    📏 Determine longer word

    ⚔️ We used 2 attacks: a baseline random token (RT) attack (sketched below), and Greedy Coordinate Gradient (GCG).

    🏋️♂️ Results with Undefended Models: Larger models are generally more robust, but results are noisy. Attack success rate (y-axis) does tend to be lower for bigger models (x-axis), but the trend is very non-monotonic. 🌫️

    🛡️ Results from Adversarial Training: Larger models are more sample efficient, needing fewer training rounds to become robust. By round 30, training plateaus, with larger models showing superior robustness.

    🔄 Results from Robustness Transfer: Models adversarially trained against weaker attacks gain some protection against stronger attacks. Larger models (>100M params) show signs of a phase transition where defenses generalize across attacks.

    🎯 Conclusion
    - Robustness after pretraining improves with model size, but the effect is noisy.
    - Adversarial training improves robustness for all models, but larger models learn faster & better.
    - Adversarially trained models with >100M params are robust to some unseen attacks.

    Our next steps include:
    - Investigating compute-optimal robustness training: when is it better to adversarially train a large model briefly vs. a small model extensively?
    - Adding more tasks, attacks & models.
    Stay tuned for updates!

    For more information:
    📝 Check out the blog post: https://lnkd.in/g8CY4FcG
    📄 Read the full paper: https://lnkd.in/g9Ty_CEd
    👥 Research by Nikolaus Howe, Michał Zając, Ian McKenzie, Oskar Hollinsworth, Tom Tseng, Pierre-Luc Bacon, and Adam Gleave.

    If you're interested in making AI systems more robust, we're hiring! Check out our roles at https://far.ai/jobs
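    To give a feel for the random token (RT) baseline, here is a minimal Python sketch: sample random suffixes and keep one that flips the classifier's output. The classify(text) -> label interface and the word-level vocabulary are illustrative assumptions, not the paper's implementation.

    ```python
    import random

    def random_token_attack(classify, prompt, vocab, target_label,
                            suffix_len=10, num_trials=1_000, seed=0):
        # Baseline random-token (RT) attack sketch: append a random suffix
        # drawn from `vocab` and keep any suffix that makes `classify` return
        # the attacker's target label. classify(text) -> label is an assumed
        # interface; this is not the paper's exact implementation.
        rng = random.Random(seed)
        for _ in range(num_trials):
            suffix = " ".join(rng.choice(vocab) for _ in range(suffix_len))
            if classify(prompt + " " + suffix) == target_label:
                return suffix  # attack succeeded within the trial budget
        return None            # attack failed within the trial budget
    ```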

  • View organization page for FAR AI

    🔍 #ICML2024 Poster: Codebook Features make language models more interpretable and controllable using vector quantization! Learn more at our session TODAY, 23 Jul 13:30-15:00 CET, in Hall C 4-9 #709, by Alex Tamkin, Mohammad Taufeeque, and Noah Goodman.

    View organization page for FAR AI

    📢 Thrilled to unveil our new research on Codebook Features! We're making neural networks more interpretable and controllable using vector quantization. https://lnkd.in/g_tKUv8M

    🛠️ What We Did
    Introduced a bottleneck at each layer of the network to convert complex activation vectors into discrete codes (see the sketch below).

    📈 Why It Matters
    - Clarity: Makes understanding network decisions easier.
    - Control: Activate specific codes to guide network behavior.
    - Performance: Minimal impact on prediction accuracy (<5% drop).

    🌟 Promising Foundation
    - Easier circuit discovery
    - Refined model control
    - Scalable interpretability

    📖 For more details, check out the resources below:
    Full paper: https://lnkd.in/g_tKUv8M
    Blog post: https://lnkd.in/gE9k2RPm
    Hugging Face demo: https://lnkd.in/ggqDu9Mk

    Work by Alex Tamkin, Mohammad Taufeeque, and Noah Goodman.

    🚀 If you’re also interested in making AI systems interpretable, we’re hiring! https://far.ai/
    #aisafety #aialignment #interpretability
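    A simplified sketch of the core idea: a codebook bottleneck that replaces each activation vector with the sum of its k most similar codebook vectors. The hyperparameters, the choice of cosine similarity, and the omission of gradient tricks (e.g. straight-through estimation) are illustrative simplifications, not the paper's exact implementation.

    ```python
    import torch
    import torch.nn.functional as F

    class CodebookBottleneck(torch.nn.Module):
        # Simplified sketch: replace each activation vector with the sum of
        # its k most similar codebook vectors, so every layer's output is
        # built from a small set of discrete, inspectable codes. Not the
        # paper's exact implementation.
        def __init__(self, num_codes=1024, dim=512, k=8):
            super().__init__()
            self.codebook = torch.nn.Parameter(torch.randn(num_codes, dim))
            self.k = k

        def forward(self, activations):  # activations: (..., dim)
            sims = F.normalize(activations, dim=-1) @ F.normalize(self.codebook, dim=-1).T
            codes = sims.topk(self.k, dim=-1).indices  # discrete code ids
            return self.codebook[codes].sum(dim=-2)    # quantized activations
    ```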

