Vision-Language Models as a Source of Rewards

Baumli, Kate; Baveja, Satinder; Behbahani, Feryal; Chan, Harris; Comanici, Gheorghe; Flennerhag, Sebastian; Gazeau, Maxime; Holsheimer, Kristian; Horgan, Dan; Laskin, Michael; Lyle, Clare; Masoom, Hussain; McKinney, Kay; Mnih, Volodymyr; Neitz, Alexander; Nikulin, Dmitry; Pardo, Fabio; Parker-Holder, Jack; Quan, John; Rocktäschel, Tim; Sahni, Himanshu; Schaul, Tom; Schroecker, Yannick; Spencer, Stephen; Steigerwald, Richie; Wang, Luyu; Zhang, Lei

arXiv:2312.09187 (cs)

[Submitted on 14 Dec 2023 (v1), last revised 12 Jul 2024 (this version, v3)]

Abstract: Building generalist agents that can accomplish many goals in rich open-ended environments is one of the research frontiers for reinforcement learning. A key limiting factor for building generalist agents with RL has been the need for a large number of reward functions for achieving different goals. We investigate the feasibility of using off-the-shelf vision-language models, or VLMs, as sources of rewards for reinforcement learning agents. We show how rewards for visual achievement of a variety of language goals can be derived from the CLIP family of models, and used to train RL agents that can achieve a variety of language goals. We showcase this approach in two distinct visual domains and present a scaling trend showing how larger VLMs lead to more accurate rewards for visual goal achievement, which in turn produces more capable RL agents.
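
As a rough illustration of how a reward might be derived from a CLIP-style model, the sketch below scores an image embedding against a goal-text embedding and a few distractor ("negative") text embeddings, then thresholds the softmax probability of the goal into a binary reward. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the reward formula, and the function names, the use of negative prompts, the temperature, and the threshold are all hypothetical choices made for illustration. The embeddings are plain lists standing in for the outputs of a real image and text encoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def goal_probability(image_emb, goal_emb, negative_embs, temperature=0.1):
    """Softmax probability that the image matches the goal text rather
    than any of the negative texts. (Hypothetical formulation.)"""
    text_embs = [goal_emb] + list(negative_embs)
    logits = [cosine_similarity(image_emb, t) / temperature for t in text_embs]
    # Subtract the max logit for numerical stability before exponentiating.
    max_logit = max(logits)
    exps = [math.exp(l - max_logit) for l in logits]
    return exps[0] / sum(exps)


def binary_reward(image_emb, goal_emb, negative_embs,
                  threshold=0.5, temperature=0.1):
    """Return 1.0 if the goal probability exceeds the threshold, else 0.0.
    The threshold value is an assumption chosen for illustration."""
    p = goal_probability(image_emb, goal_emb, negative_embs, temperature)
    return 1.0 if p > threshold else 0.0


# Toy embeddings standing in for encoder outputs (assumed, not real CLIP features).
goal = [1.0, 0.0, 0.0]
negatives = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(binary_reward([1.0, 0.0, 0.0], goal, negatives))  # image matches the goal -> 1.0
print(binary_reward([0.0, 1.0, 0.0], goal, negatives))  # image matches a negative -> 0.0
```

Thresholding into a sparse binary signal is one possible design choice; a continuous reward (the probability itself) is another, and which works better for a given agent and environment is an empirical question.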

Comments:	10 pages, 5 figures
Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2312.09187 [cs.LG]
	(or arXiv:2312.09187v3 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2312.09187

Submission history

From: Harris Chan
[v1] Thu, 14 Dec 2023 18:06:17 UTC (938 KB)
[v2] Thu, 22 Feb 2024 01:44:52 UTC (940 KB)
[v3] Fri, 12 Jul 2024 21:14:32 UTC (940 KB)
