{{short description|Artificial intelligence model geared towards programming}}
 
'''OpenAI Codex''' is an [[artificial intelligence]] model developed by [[OpenAI]]. It parses natural language and generates [[computer program|code]] in response. It powers [[GitHub Copilot]], a programming [[autocompletion]] tool developed for select [[Integrated development environment|IDEs]], like [[Visual Studio Code]] and [[Vim (text editor)|Neovim]].<ref name="OAI">{{cite web|last=Zaremba|first=Wojciech|author-link=Wojciech Zaremba|date=August 10, 2021|title=OpenAI Codex|url=https://openai.com/blog/openai-codex/|access-date=2021-09-03|website=[[OpenAI]]|archive-date=2023-02-03|archive-url=https://web.archive.org/web/20230203201912/https://openai.com/blog/openai-codex/|url-status=live}}</ref> Codex is a descendant of OpenAI's [[GPT-3]] model, [[fine-tuning (machine learning)|fine-tuned]] for use in programming applications.
 
OpenAI released an [[API]] for Codex in [[closed beta]].<ref name="OAI" /> In March 2023, OpenAI shut down access to Codex.<ref>{{Cite web |last=Kemper |first=Jonathan |date=2023-03-22 |title=OpenAI kills its Codex code model, recommends GPT3.5 instead |url=https://the-decoder.com/openai-kills-code-model-codex/ |access-date=2023-03-29 |website=THE DECODER |language=en-US |archive-date=2023-06-01 |archive-url=https://web.archive.org/web/20230601195835/https://the-decoder.com/openai-kills-code-model-codex/ |url-status=live }}</ref> Due to public appeals from researchers, OpenAI reversed course.<ref>{{Cite tweet |user=OfficialLoganK |author=Logan Kilpatrick |number=1638336152800206858 |title=Hey Carolyn, we will continue to support Codex access via our Researcher Access Program. Sorry for any confusion and hopefully the research is going well! |access-date=2023-04-08}}</ref> The Codex model can still be used by researchers of the OpenAI Research Access Program.<ref>{{Cite web |title=Researcher Access Program application |url=https://openai.com/form/researcher-access-program |access-date=2023-04-08 |website=openai.com |language=en-US |archive-date=2023-10-10 |archive-url=https://web.archive.org/web/20231010073704/https://openai.com/form/researcher-access-program |url-status=live }}</ref>
 
== Capabilities ==
Based on GPT-3, a [[neural network]] trained on text, Codex was additionally trained on 159 gigabytes of [[Python (programming language)|Python]] code from 54 million [[GitHub]] repositories.<ref name="VB-bias">{{Cite news|last=Wiggers|first=Kyle|date=July 8, 2021|title=OpenAI warns AI behind GitHub's Copilot may be susceptible to bias|work=[[VentureBeat]]|url=https://venturebeat.com/2021/07/08/openai-warns-ai-behind-githubs-copilot-may-be-susceptible-to-bias/|access-date=2021-09-03|archive-date=2023-02-03|archive-url=https://web.archive.org/web/20230203201912/https://venturebeat.com/business/openai-warns-ai-behind-githubs-copilot-may-be-susceptible-to-bias/|url-status=live}}</ref><ref name="IQ">{{Cite news|last=Alford|first=Anthony|date=August 31, 2021|title=OpenAI Announces 12 Billion Parameter Code-Generation AI Codex|work=InfoQ|url=https://www.infoq.com/news/2021/08/openai-codex/|access-date=2021-09-03|archive-date=2022-07-09|archive-url=https://web.archive.org/web/20220709221205/https://www.infoq.com/news/2021/08/openai-codex/|url-status=live}}</ref> A typical use case of Codex is for a user to type a comment, such as "<code>//compute the moving average of an array for a given window size</code>", then use the AI to suggest a block of code that satisfies that comment.<ref name="RegTA">{{Cite news|last1=Anderson|first1=Tim|last2=Quach|first2=Katyanna|date=July 6, 2021|title=GitHub Copilot auto-coder snags emerge, from seemingly spilled secrets to bad code, but some love it|work=[[The Register]]|url=https://www.theregister.com/2021/07/06/github_copilot_autocoder_caught_spilling/|access-date=2021-09-04|archive-date=2023-06-02|archive-url=https://web.archive.org/web/20230602214528/https://www.theregister.com/2021/07/06/github_copilot_autocoder_caught_spilling/|url-status=live}}</ref> OpenAI has stated that Codex can complete approximately 37% of requests and is meant to make human programming faster rather than to replace it.
According to OpenAI's blog, Codex excels most at "mapping [...] simple problems to existing code", which they describe as "probably the least fun part of programming".<ref name="SH">{{Cite news|last=Dorrier|first=Jason|date=August 15, 2021|title=OpenAI's Codex Translates Everyday Language Into Computer Code|work=[[SingularityHub]]|url=https://singularityhub.com/2021/08/15/openais-codex-translates-everyday-language-into-computer-code/|access-date=2021-09-03|archive-date=2023-05-26|archive-url=https://web.archive.org/web/20230526045651/https://singularityhub.com/2021/08/15/openais-codex-translates-everyday-language-into-computer-code/|url-status=live}}</ref><ref name="VB">{{Cite news|last=Dickson|first=Ben|date=August 16, 2021|title=What to expect from OpenAI's Codex API|work=[[VentureBeat]]|url=https://venturebeat.com/2021/08/16/what-to-expect-from-openais-codex-api/|access-date=2021-09-03|archive-date=2023-02-03|archive-url=https://web.archive.org/web/20230203201913/https://venturebeat.com/ai/what-to-expect-from-openais-codex-api/|url-status=live}}</ref> [[Jeremy Howard (entrepreneur)|Jeremy Howard]], co-founder of [[Fast.ai]], stated that "[[Codex]] is a way of getting code written without having to write as much code", and that "it is not always correct, but it is just close enough".<ref name="NYT">{{Cite news|last=Metz|first=Cade|date=September 9, 2021|title=A.I. Can Now Write Its Own Computer Code. That's Good News for Humans.|work=[[The New York Times]]|url=https://www.nytimes.com/2021/09/09/technology/codex-artificial-intelligence-coding.html|access-date=2021-09-16|archive-date=2022-03-30|archive-url=https://web.archive.org/web/20220330010719/https://www.nytimes.com/2021/09/09/technology/codex-artificial-intelligence-coding.html|url-status=live}}</ref> According to a paper written by OpenAI researchers, when Codex attempted each test case 100 times, it generated working solutions for 70.2% of prompts.<ref name="arXiv">{{Cite arXiv|last1=Chen|first1=Mark|last2=Tworek|first2=Jerry|last3=Jun|first3=Heewoo|last4=Yuan|first4=Qiming|last5=Pinto|first5=Henrique Ponde de Oliveira|last6=Kaplan|first6=Jared|last7=Edwards|first7=Harri|last8=Burda|first8=Yuri|last9=Joseph|first9=Nicholas|last10=Brockman|first10=Greg|last11=Ray|first11=Alex|date=2021-07-14|title=Evaluating Large Language Models Trained on Code |eprint=2107.03374 |class=cs}}</ref>
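For illustration, a completion of the kind Codex might suggest for the moving-average comment above could resemble the following sketch (a hypothetical example written for this article, not verified Codex output):

```python
def moving_average(arr, window_size):
    """Compute the moving average of arr for the given window size."""
    if window_size <= 0 or window_size > len(arr):
        raise ValueError("window size must be between 1 and len(arr)")
    # Average each consecutive run of window_size elements.
    return [sum(arr[i:i + window_size]) / window_size
            for i in range(len(arr) - window_size + 1)]
```

As the paper's pass-rate figures suggest, such a completion may need review and correction by the programmer before use.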
 
OpenAI claims that Codex can create code in over a dozen programming languages, including [[Go (programming language)|Go]], [[JavaScript]], [[Perl]], [[PHP]], [[Ruby (programming language)|Ruby]], [[Shell (programming language)|Shell]], [[Swift (programming language)|Swift]], and [[TypeScript]], though it is most effective in Python.<ref name="OAI" /> According to ''[[VentureBeat]]'', demonstrations uploaded by OpenAI showed impressive [[coreference resolution]] capabilities. The demonstrators were able to create a [[browser game]] in JavaScript and generate data science charts using [[matplotlib]].<ref name="VB" />
 
 
OpenAI showed that Codex can interface with services and apps such as [[Mailchimp]], [[Microsoft Word]], [[Spotify]], and [[Google Calendar]].<ref name="VB" /><ref name="Verge">{{Cite news|last=Vincent|first=James|date=August 10, 2021|title=OpenAI can translate English into code with its new machine learning software Codex|work=[[The Verge]]|url=https://www.theverge.com/2021/8/10/22618128/openai-codex-natural-language-into-code-api-beta-access|access-date=2021-09-03|archive-date=2021-09-02|archive-url=https://web.archive.org/web/20210902142401/https://www.theverge.com/2021/8/10/22618128/openai-codex-natural-language-into-code-api-beta-access|url-status=live}}</ref> [[Microsoft]] is {{vague|text=reportedly interested in exploring|reason=this sentence says only marginally more than nothing|date=March 2023}} Codex's capabilities.<ref name="Verge" />
 
== Issues ==
OpenAI demonstrations showcased flaws such as inefficient code and one-off quirks in code samples.<ref name="VB" /> In an interview with ''[[The Verge]]'', OpenAI [[chief technology officer]] [[Greg Brockman]] said that "sometimes [Codex] doesn't quite know exactly what you're asking" and that it can require some trial and error.<ref name="Verge" /> OpenAI researchers found that Codex struggles with multi-step and {{clarify|text=higher-level|date=March 2023}} prompts, often failing or yielding counter-intuitive behavior. Additionally, they brought up several safety issues, such as over-reliance by novice programmers, biases based on the training data, and security impacts due to vulnerable code.<ref name="arXiv" />
 
''VentureBeat'' has stated that because Codex is trained on public data, it could be vulnerable to "data poisoning" via intentional uploads of malicious code.<ref name="VB" /> According to a study by researchers from [[New York University]], approximately 40% of code generated by [[GitHub Copilot]] (which uses Codex) in scenarios relevant to high-risk [[Common Weakness Enumeration|CWEs]] included glitches or other exploitable design flaws.<ref name="RegTC">{{cite arXiv |last1=Pearce |first1=Hammond |last2=Ahmad |first2=Baleegh |last3=Tan |first3=Benjamin |last4=Dolan-Gavitt |first4=Brendan |last5=Karri |first5=Ramesh |date=2021-12-16 |title=Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions |class=cs.CR |eprint=2108.09293 }}</ref>
 
===Copyright===
The [[Free Software Foundation]] has expressed concerns that code snippets generated by Copilot and Codex could unknowingly [[copyright infringement|violate copyright]], and in particular the condition of the [[GPL]] that requires [[derivative work]]s to be licensed under equivalent terms.<ref name="IW-FSF">{{Cite news|last=Krill|first=Paul|date=August 2, 2021|title=GitHub Copilot is 'unacceptable and unjust,' says Free Software Foundation|work=[[InfoWorld]]|url=https://www.infoworld.com/article/3627319/github-copilot-is-unacceptable-and-unjust-says-free-software-foundation.html|access-date=2021-09-03|archive-date=2021-09-03|archive-url=https://web.archive.org/web/20210903201419/https://www.infoworld.com/article/3627319/github-copilot-is-unacceptable-and-unjust-says-free-software-foundation.html|url-status=live}}</ref> Issues they raised include whether training on public repositories falls into [[fair use]] or not, how developers could discover infringing generated code, whether trained [[machine learning]] models could be considered modifiable source code or a compilation of the training data, and if machine learning models could themselves be copyrighted and by whom.<ref name="IW-FSF" /><ref name="FSF">{{Cite news|last=Robertson|first=Donald|date=2021-07-28|title=FSF-funded call for white papers on philosophical and legal questions around Copilot: Submit before Monday, August 23, 2021|work=[[Free Software Foundation]]|url=https://www.fsf.org/blogs/licensing/fsf-funded-call-for-white-papers-on-philosophical-and-legal-questions-around-copilot|access-date=2021-09-04|archive-date=2021-08-11|archive-url=https://web.archive.org/web/20210811003717/https://www.fsf.org/blogs/licensing/fsf-funded-call-for-white-papers-on-philosophical-and-legal-questions-around-copilot|url-status=live}}</ref> An internal GitHub study found that approximately 0.1% of generated code contained direct copies from the training data. 
In one example, the model outputted training data code implementing the [[fast inverse square root]] algorithm, including comments and an incorrect [[copyright notice]].<ref name="RegTA"/>
 
In response, OpenAI has stated that "legal uncertainty on the copyright implications of training AI systems imposes substantial costs on AI developers and so should be authoritatively resolved."<ref name="RegTA" />

The copyright issues with Codex have been compared to the ''[[Authors Guild, Inc. v. Google, Inc.]]'' court case, in which judges ruled that [[Google Books]]'s use of text snippets from millions of [[Book scanning|scanned books]] constituted fair use.<ref name="RegTA" /><ref name="WIRED">{{Cite magazine|last=Barber|first=Gregory|date=July 12, 2021|title=GitHub's Commercial AI Tool Was Built From Open Source Code|magazine=[[WIRED]]|url=https://www.wired.com/story/github-commercial-ai-tool-built-open-source-code/|access-date=2021-09-04|archive-date=2021-07-25|archive-url=https://web.archive.org/web/20210725233825/https://www.wired.com/story/github-commercial-ai-tool-built-open-source-code/|url-status=live}}</ref> However, text snippets from books retain a clear reference to the copyright owner, whereas code generated from a model's compiled training data carries no such reference.
 
== References ==
{{reflist}}
 
{{OpenAI navbox}}
[[Category:Artificial intelligence]]
[[Category:Deep learning software applications]]
[[Category:Copyright infringement of software]]
[[Category:Generative pre-trained transformers]]
[[Category:OpenAI|Codex]]