News outlets are accusing Perplexity of plagiarism and unethical web scraping

Ambiguity around copyright laws and AI web crawlers complicates matters

In the age of generative AI, when chatbots can provide detailed answers to questions based on content pulled from the internet, the line between fair use and plagiarism, and between routine web scraping and unethical summarization, is a thin one. 

Perplexity AI is a startup that combines a search engine with a large language model to generate detailed answers rather than just links. Unlike OpenAI and Anthropic, the makers of ChatGPT and Claude, Perplexity doesn’t train its own foundational AI models, instead using open or commercially available ones to take the information it gathers from the internet and translate that into answers.

But a series of accusations in June suggests the startup’s approach borders on being unethical. Forbes called out Perplexity for allegedly plagiarizing one of its news articles in the startup’s beta Perplexity Pages feature. And Wired has accused Perplexity of illicitly scraping its website, along with other sites. 

Perplexity, which as of April was working to raise $250 million at a near-$3 billion valuation, maintains that it has done nothing wrong. The Nvidia- and Jeff Bezos-backed company says that it has honored publishers’ requests to not scrape content and that it is operating within the bounds of fair use under copyright law.

The situation is complicated. At its heart are nuances surrounding two concepts. The first is the Robots Exclusion Protocol, a standard used by websites to indicate that they don’t want their content accessed or used by web crawlers. The second is fair use in copyright law, which sets up the legal framework for allowing the use of copyrighted material without permission or payment in certain circumstances. 

Surreptitiously scraping web content

Wired’s June 19 story claims that Perplexity has ignored the Robots Exclusion Protocol to surreptitiously scrape areas of websites that publishers do not want bots to access. Wired reported that it observed a machine tied to Perplexity doing this on its own news site, as well as across other publications under its parent company, Condé Nast. 

The report noted that developer Robb Knight conducted a similar experiment and came to the same conclusion. 

Both Wired reporters and Knight tested their suspicions by asking Perplexity to summarize a series of URLs and then watching on the server side as an IP address associated with Perplexity visited those sites. Perplexity then “summarized” the text from those URLs — though in the case of one dummy website with limited content that Wired created for this purpose, it returned text from the page verbatim. 
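The server-side half of that experiment amounts to filtering an access log for requests from one address to one page. The following is a minimal sketch of the idea, not the method Wired or Knight actually used: the log lines, IP addresses (taken from ranges reserved for documentation) and the `/test-page` path are all invented for illustration, and the log is assumed to follow the common Apache/Nginx access-log layout.

```python
import re

# Matches the start of a Common/Combined Log Format line:
# ip identity user [timestamp] "METHOD path protocol" ...
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)'
)

def hits_from(log_lines, watched_ip, watched_path):
    """Return (timestamp, path) for each request to watched_path from watched_ip."""
    hits = []
    for line in log_lines:
        m = LOG_LINE.match(line)
        if m and m["ip"] == watched_ip and m["path"] == watched_path:
            hits.append((m["time"], m["path"]))
    return hits

# Hypothetical log excerpt; none of these addresses belong to a real service.
SAMPLE_LOG = [
    '198.51.100.7 - - [19/Jun/2024:10:00:01 +0000] "GET /index.html HTTP/1.1" 200 512',
    '203.0.113.42 - - [19/Jun/2024:10:00:05 +0000] "GET /test-page HTTP/1.1" 200 1024',
]

if __name__ == "__main__":
    for when, path in hits_from(SAMPLE_LOG, "203.0.113.42", "/test-page"):
        print(f"{when}: watched address requested {path}")
```

A hit shortly after the chatbot is asked to summarize the page suggests, but does not by itself prove, that the request came from that service.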

This is where the nuances of the Robots Exclusion Protocol come into play. 

Technically, web scraping is when automated pieces of software known as crawlers scour the web to index and collect information from websites. Search engines like Google do this so that web pages can be included in search results. Other companies and researchers use crawlers to gather data from the internet for market analysis, academic research and, as we’ve come to learn, training machine learning models.

Web scrapers that comply with the Robots Exclusion Protocol first look for a “robots.txt” file at the root of a site to see what is permitted and what is not. Today, what is not permitted is usually scraping a publisher’s site to build massive training datasets for AI. Search engines and AI companies, including Perplexity, have stated that they comply with the protocol, but they aren’t legally obligated to do so.
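For readers unfamiliar with how that check works in practice, here is a minimal sketch using Python’s standard-library `urllib.robotparser`. The robots.txt rules, the crawler name “ExampleBot” and the example.com URLs are all hypothetical; this is not any publisher’s actual file or any company’s actual crawler logic.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named bot entirely, and keep
# every other bot out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    story = "https://example.com/news/story"
    print(is_allowed(ROBOTS_TXT, "ExampleBot", story))  # blocked by the first rule group
    print(is_allowed(ROBOTS_TXT, "OtherBot", story))    # falls through to the * group
```

The check is entirely voluntary: the file only states the publisher’s wishes, and nothing in the protocol stops a client from skipping this step and requesting the page anyway.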

Perplexity’s head of business, Dmitry Shevelenko, told TechCrunch that summarizing a URL isn’t the same thing as crawling. “Crawling is when you’re just going around sucking up information and adding it to your index,” Shevelenko said. He noted that Perplexity’s IP might show up as a visitor to a website that is “otherwise kind of prohibited from robots.txt” only when a user puts a URL into their query, which “doesn’t meet the definition of crawling.” 

“We’re just responding to a direct and specific user request to go to that URL,” Shevelenko said.

In other words, when a user manually provides a URL, Perplexity says its AI isn’t acting as a web crawler but rather as a tool to assist the user in retrieving and processing information they requested.

But to Wired and many other publishers, that’s a distinction without a difference because visiting a URL and pulling the information from it to summarize the text sure looks a whole lot like scraping if it’s done thousands of times a day.

(Wired also reported that Amazon Web Services, one of Perplexity’s cloud service providers, is investigating the startup for ignoring the robots.txt protocol to scrape web pages that users cited in their prompts. AWS told TechCrunch that Wired’s report is inaccurate and that it told the outlet it was processing the media inquiry as it does any other report alleging abuse of the service.)

Plagiarism or fair use?

[Image: screenshot of Perplexity Pages. Forbes accused Perplexity of plagiarizing its scoop about former Google CEO Eric Schmidt developing AI-powered combat drones. Image Credits: Perplexity / Screenshot]

Wired and Forbes have also accused Perplexity of plagiarism. Ironically, Wired says Perplexity plagiarized the very article that called out the startup for surreptitiously scraping its web content. 

Wired reporters said the Perplexity chatbot “produced a six-paragraph, 287-word text closely summarizing the conclusions of the story and the evidence used to reach them.” One sentence exactly reproduces a sentence from the original story; Wired says this constitutes plagiarism. The Poynter Institute’s guidelines say it might be plagiarism if the author (or AI) used seven consecutive words from the original source work.  
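The seven-consecutive-words rule of thumb is simple enough to check mechanically. The sketch below is one assumed way to do it, not Poynter’s or Wired’s actual procedure: it lowercases both texts, strips punctuation, and reports any run of seven words that appears in both. The sample sentences are invented.

```python
import re

def word_runs(text: str, n: int) -> set:
    """Return every run of n consecutive words in text, ignoring case and punctuation."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_runs(original: str, candidate: str, n: int = 7) -> set:
    """Return the n-word runs that appear in both texts."""
    return word_runs(original, n) & word_runs(candidate, n)

# Invented sample texts for illustration only.
ORIGINAL = "The quick brown fox jumps over the lazy dog near the river bank"
COPIED = "A summary says the quick brown fox jumps over the lazy animal"
PARAPHRASED = "A fast fox leaps above a sleepy dog close to the water"

if __name__ == "__main__":
    for run in sorted(shared_runs(ORIGINAL, COPIED)):
        print(" ".join(run))
    print(len(shared_runs(ORIGINAL, PARAPHRASED)), "shared runs in the paraphrase")
```

A match only flags overlapping wording. It says nothing about attribution, and it is not a legal test for copyright infringement.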

Forbes also accused Perplexity of plagiarism. The news site published an investigative report in early June about how former Google CEO Eric Schmidt’s new venture is recruiting heavily and testing AI-powered drones with military applications. The next day, Forbes editor John Paczkowski posted on X saying that Perplexity had republished the scoop as part of its beta feature, Perplexity Pages.

Perplexity Pages, which is only available to certain Perplexity subscribers for now, is a new tool that promises to help users turn research into “visually stunning, comprehensive content,” according to Perplexity. Examples of such content on the site come from the startup’s employees, and include articles like “Beginner’s Guide to Drumming,” or “Steve Jobs: Visionary CEO.” 

“It rips off most of our reporting,” Paczkowski wrote. “It cites us, and a few that reblogged us, as sources in the most easily ignored way possible.” 

Forbes reported that many of the posts that were curated by the Perplexity team are “strikingly similar to original stories from multiple publications, including Forbes, CNBC and Bloomberg.” Forbes said the posts gathered tens of thousands of views and didn’t mention any of the publications by name in the article text. Rather, Perplexity’s articles included attributions in the form of “small, easy-to-miss logos that link out to them.”

Furthermore, Forbes said the post about Schmidt contains “nearly identical wording” to Forbes’ scoop. The aggregation also included an image created by the Forbes design team that appeared to be slightly modified by Perplexity. 

Perplexity CEO Aravind Srinivas responded to Forbes at the time by saying the startup would cite sources more prominently in the future — a solution that’s not foolproof, as citations themselves face technical difficulties. ChatGPT and other models have hallucinated links, and since Perplexity uses OpenAI models, it is likely to be susceptible to such hallucinations. In fact, Wired reported that it observed Perplexity hallucinating entire stories. 

Other than noting Perplexity’s “rough edges,” Srinivas and the company have largely doubled down on Perplexity’s right to use such content for summarizations. 

This is where the nuances of fair use come into play. Plagiarism, while frowned upon, is not technically illegal. 

According to the U.S. Copyright Office, it is legal to use limited portions of a work, including quotes, for purposes like commentary, criticism, news reporting and scholarly reports. AI companies like Perplexity posit that providing a summary of an article is within the bounds of fair use.

“Nobody has a monopoly on facts,” Shevelenko said. “Once facts are out in the open, they are for everyone to use.”

Shevelenko likened Perplexity’s summaries to how journalists often use information from other news sources to bolster their own reporting. The unfair advantage of AI companies, however, is that they can compile in seconds what it took several journalists hours to create.

Mark McKenna, a professor of law at the UCLA Institute for Technology, Law & Policy, told TechCrunch the situation isn’t an easy one to untangle. In a fair use case, courts would weigh whether the summary uses a lot of the expression of the original article, versus just the ideas. They might also examine whether reading the summary might be a substitute for reading the article. 

“There are no bright lines,” McKenna said. “So [Perplexity] saying factually what an article says or what it reports would be using non-copyrightable aspects of the work. That would be just facts and ideas. But the more that the summary includes actual expression and text, the more that starts to look like reproduction, rather than just a summary.”

Unfortunately for publishers, unless Perplexity is using full expressions (and apparently, in some cases, it is), its summaries might well be considered fair use.

How Perplexity aims to protect itself

AI companies like OpenAI have signed media deals with a range of news publishers to access their current and archival content on which to train their algorithms. In return, OpenAI promises to surface news articles from those publishers in response to user queries in ChatGPT. (But even that has some kinks that need to be worked out, as Nieman Lab reported last week.)

Perplexity has held off from announcing its own slew of media deals, perhaps waiting for the accusations against it to blow over. But the company is “full speed ahead” on a series of advertising revenue-sharing deals with publishers. 

The idea is that Perplexity will start including ads alongside query responses, and publishers that have content cited in any answer will get a slice of the corresponding ad revenue. Shevelenko said Perplexity is also working to allow publishers access to its technology so they can build Q&A experiences and power things like related questions natively inside their sites and products. 

But is this just a fig leaf for systemic IP theft? Perplexity isn’t the only chatbot that threatens to summarize content so completely that readers fail to see the need to click out to the original source material. 

And if AI scrapers like this continue to take publishers’ work and repurpose it for their own businesses, publishers will have a harder time earning ad dollars. That means eventually, there will be less content to scrape. When there’s no more content left to scrape, generative AI systems will then pivot to training on synthetic data, which could lead to a hellish feedback loop of potentially biased and inaccurate content. 
