
Why the Data Ocean Is Being Sectioned Off

Gary McGraw, Dan Geer, Harold Figueroa
Wednesday, July 10, 2024, 1:00 PM
Bigger is better approaches in AI create an inexhaustible appetite for users’ data, leading to a rise in user data expropriation, sectioning off of the internet, and “data feudalism.”
Big Data (Bob Mical, CC BY 3.0)

Published by The Lawfare Institute in Cooperation With Brookings

Welcome to the era of data feudalism. Large language model (LLM) foundation models require huge oceans of data for training—the more data trained upon, the better the result. But while the massive data collections began as a straightforward harvesting of public observables, those collections are now being sectioned off. To describe this situation, consider a land analogy: The first settlers coming into what was a common wilderness are stringing that wilderness with barbed wire. If and when entire enormous parts of the observable internet (say, Google search data, Twitter/X postings, or GitHub code piles) are cordoned off, it is not clear what hegemony will accrue to those first movers; they are little different from squatters trusting their “open and notorious occupation” will lead to adverse possession. Meanwhile, originators of large data sets (for example, the New York Times) have come to realize that their data are valuable in a new way and are demanding compensation even after those data have become part of somebody else’s LLM foundation model. Who can gain access control for the internet’s publicly reachable data pool, and why? Lock-in for early LLM foundation model movers is a very real risk.

Below, we define and discuss data feudalism, providing context by determining where data needed to create the latest generation of machine learning (ML) models come from, how much we need, who owns it, and who should own it. We describe the data ocean and its constituent parts. We discuss recursive pollution. We wonder if less can be more.                 

First, some definitions.

Machine learning: “We” (meaning computer scientists and practitioners) have been building computer programs for a long time, and we’re pretty good at it. When we know HOW to describe something programmatically, we write a program to do that. Machine learning is what you end up doing when you don’t know HOW to do something in clear enough terms to write a program to do it. After all, if we knew how to solve a certain problem, we would just write a program to do so! 

When we don’t know HOW, we often know WHAT. See that picture? That’s a WHAT (“tiger,” say). If a computer scientist is hoping to achieve a certain computational state, that computational state is a WHAT. When they don’t know HOW, they can use machine learning to take a (huge) pile of WHAT and become a WHAT machine after training. That is, the machine, when appropriately trained, in some sense becomes the WHAT. (Of course, we may well have piled up the WHAT we used to train a machine learning system, yet still we don’t know exactly HOW the computation works after training.)

In other words, don’t throw machine learning at something that you already know HOW to do. Only throw machine learning at stuff where you know lots about the WHAT, but you don’t know HOW.
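
To make the HOW/WHAT distinction concrete, here is a deliberately toy sketch of our own (not from any production system). The first function is a HOW we can simply write down; the second is trained from a handful of labeled examples, the WHAT, and the resulting model in some sense becomes that WHAT. The features and labels are invented for illustration.

```python
# HOW: we can state the rule, so we just write the program.
def is_even(n: int) -> bool:
    return n % 2 == 0

# WHAT: we cannot write down the rule for "is this a tiger?", but we can
# pile up labeled examples and let a learner induce a rule from them.
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for observations: [has_stripes, weight_kg] -> tiger or not.
X = [[1, 200], [1, 180], [0, 4], [0, 300]]
y = [1, 1, 0, 0]  # 1 = tiger, 0 = not a tiger

model = DecisionTreeClassifier().fit(X, y)
print(model.predict([[1, 190]]))  # the trained model now embodies the WHAT
```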

Large language models: LLMs are auto-associative predictive generators that map an input space of text to an output space through a number of neural network layers that compress and re-represent the input data. LLMs are stochastic by design, so even prompts that a human might identify as being meaningfully identical often result in output that is not. Output from an LLM may appear to be the result of logic, understanding, and reasoning, but it is not.

Ultimately, an LLM is “trained” by statistically characterizing an enormous collection of word sequences (sometimes called a “data ocean”). After that training, when we present the LLM with a new word sequence as a prompt, it replies with what it predicts would be a good sequence to follow that. In short, it can predict anything but knows nothing.
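
A minimal sketch of that idea, using a toy table of word-following counts in place of a real model's billions of learned parameters; the vocabulary, counts, and temperature mechanics below are our own illustration, not how any particular LLM is implemented.

```python
import random

# Toy "training": counts of which word follows which, tallied from text.
next_word_counts = {
    "the": {"cat": 3, "dog": 2, "data": 5},
    "data": {"ocean": 4, "are": 3, "feudalism": 1},
}

def predict_next(word: str, temperature: float = 1.0) -> str:
    counts = next_word_counts[word]
    # Sampling (rather than always taking the top choice) is why two prompts
    # a human would call identical can produce different output.
    weights = [c ** (1.0 / temperature) for c in counts.values()]
    return random.choices(list(counts.keys()), weights=weights)[0]

print(predict_next("data"))  # e.g., "ocean" -- predicted, not "known"
```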

Building an LLM from scratch is prohibitively expensive for non-billionaires—on the order of $100 million—so the world has converged on the idea of creating specific LLM applications by building on top of a class of LLMs called foundation models.

LLM foundation model: An LLM is trained on an enormous corpus of word sequences (by unsupervised learning enhanced with attention mechanisms) to draw out global I/O (input to output) inter-relationships. LLM foundation models are fine-tuned using more specialized inputs, resulting in the modification of layers of the neural network in a quick, more supervised training process. Some LLMs such as GPT-4 then accept sizable input prompts that themselves can be carefully constructed to produce better results. This process is called prompt engineering, and while it gives LLM foundation models a large degree of flexibility in post-training operations, it also creates a problem that consistently vexes LLM engineers—LLMs’ propensity to deliver randomly wrong, unsavory, incorrect, unethical, or otherwise unwanted responses. 
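
As a hedged illustration of prompt engineering, the sketch below assembles a few-shot prompt from a template. The template, role text, and examples are our own assumptions, not any vendor's recommended format, and a stochastic model may still answer the same prompt differently on different runs.

```python
def build_prompt(question: str, examples: list[tuple[str, str]]) -> str:
    """Few-shot prompt: role instructions, worked examples, then the task."""
    parts = ["You are a careful assistant. Answer concisely."]
    for q, a in examples:
        parts.append(f"Q: {q}\nA: {a}")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

prompt = build_prompt(
    "Who owns the data used to train large language models?",
    examples=[("What is a foundation model?",
               "A large pretrained model that downstream applications build on.")],
)
print(prompt)  # this text, not any code change, is what steers the model
```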

Ultimately, LLM foundation models are black boxes. Since most organizations don’t have the resources (neither money nor data resources) to build their own LLMs, they start with a black box LLM foundation model built by someone else. Choosing the right one is often unobvious. Using foundation models is broadly accessible through simple apps, application programming interfaces (APIs), and the use of natural language prompting. However, prompting and evolving models amount to trying to get results using an undocumented, unstable API with periodically unanticipated behavior. (And just for the record, that makes the job of securing an LLM application exceedingly challenging.)

Feudalism: According to the American Heritage Dictionary, feudalism is “a political and economic system based on the holding of all land in fief or fee and the resulting relation of lord to vassal and characterized by homage, legal and military service of tenants, and forfeiture.” 

Or, to put it plainly: The monarch owns all the land and grants special individuals (vassals) use of great chunks of that land in exchange for loyalty, taxes, and military support. These vassals do likewise, except that the people under them (serfs) have only their lord (vassal) to answer to and to receive protection from, for which they pay with their work product and military service. Plus, they can't relocate.

Data feudalism: So, with these definitions in mind, what exactly is data feudalism? Here’s what ChatGPT said when we asked: “Data feudalism is the concentration of power and control in the hands of a few powerful digital companies, reminiscent of the feudal systems of the Middle Ages where a few powerful lords held all the power and control over their serfs and vassals.” Well, not quite; in the analogy we’re going for, data are land.

Imagine data are land. The internet is like the wilderness—lots of unclaimed land to be exploited by the first comers. And there are a few large public data sets (like public parks). And there are platforms that hoover up data (to get more effective advertising results, build ML models, and so on)—much like institutional buyers assembling large landholdings by acquiring smaller parcels through straw buyers.

When data are land, one can take the perspective that some data are the behavior of people, belong to those people, are shared by users under stipulated conditions, and so forth. Such data might be protected under the umbrella of privacy. Other data are created through people working for organizations, belong to authors and organizations, can be bought or sold, and so forth. Such data might be protected under the umbrella of copyright. Either umbrella is a species of property rights. Meanwhile, the various tech "platforms" that can pay the price to build LLM foundation models do so, relying on their extensive collection and distribution mechanisms that turn surveilled human behaviors and work products into data resources. Ostensibly volunteer contributors and ostensibly beneficial platforms are both necessary; each is essential to the existence of the data resource.

Why the Fixation on Data?

Simply put, observational data generate great value. This should be obvious enough, since observational data have been the raw material of surveillance capitalism for more than a decade. The primary motivation in piling up data has heretofore been to serve highly targeted ads to internet and mobile phone users. But these data have other uses, such as in electioneering and opinion microtargeting (see, especially, Cambridge Analytica). Yet that's almost beside the point. Data turn out to be absolutely necessary for training LLMs, and as the source of economic value of user data extends from its use in targeting to artificial intelligence (AI) training, the internet is becoming less open and user data expropriation is on the rise.

With machine learning, there is software, and, yes, it is occasionally improved. But the software is beside the point. What matters more is not source code but source data. All the predictive decision-making an LLM does is based on numbers condensed out of training data. Where software quality derives from careful construction and rigorous test harnesses, LLM quality comes from source data quality and quantity.

When software is constructed, there is always some careful curation of the “source code,” and that source code is the ultimate documentation of what the software does. Source code repository management tools are many. So are analysis tools used to look for security vulnerabilities in source code. The more careful builders of software products not only maintain the cleanest and least vulnerable source code pool they can but also keep around all the tooling that takes source code from its as-written state to its ready-for-delivery state. Being able to re-create a past version of the software with the same tools used when that past version was prepared for distribution is an essential component of forensics (cause-finding) when vulnerabilities are discovered after distribution. That is sometimes called “forensic readiness.” We need such practices and tools for data, and these in turn require some form of data openness and transparency.
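
What might forensic readiness look like when applied to data rather than source code? One minimal, hypothetical sketch is a manifest recording a content hash and size for every file in a training-data snapshot, so that a past data set can later be verified or diffed. The directory name and manifest format are assumptions for illustration, not an established standard.

```python
import hashlib
import json
from pathlib import Path

def build_manifest(data_dir: str) -> dict:
    """Record sha256 and size for every file under data_dir."""
    manifest = {}
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[str(path)] = {"sha256": digest, "bytes": path.stat().st_size}
    return manifest

if __name__ == "__main__":
    # "training_data" is a hypothetical snapshot directory.
    snapshot = build_manifest("training_data")
    Path("manifest.json").write_text(json.dumps(snapshot, indent=2))
```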

How much do we need? State-of-the-art LLMs require on the order of trillions to tens of trillions of data points. This has led some people to wonder whether we will run out of data. While it is continuously true that 90 percent of the world’s data was collected in the past two years, the idea that data may be running out is not as ridiculous as it might seem. 

So, could data run out? Given a projection from current data appetites and ML improvement, Pablo Villalobos and others suggest that, “if rapid growth in dataset sizes continues, models will utilize the full supply of public human text data at some point between 2026 and 2032, or one or two years earlier if frontier models are overtrained. At this point, the availability of public human text data may become a limiting factor in further scaling of language models.” Scaling laws associated with vision and language models are well studied and thus far seem to imply that not only is more data better, but way more is way better.
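
For a sense of what such scaling laws assert, here is an illustrative sketch of a Chinchilla-style loss estimate, L(N, D) = E + A/N^alpha + B/D^beta, where N is parameter count and D is training tokens. The constants below are placeholders in the spirit of published fits, not the fits themselves.

```python
def estimated_loss(n_params: float, n_tokens: float,
                   E: float = 1.7, A: float = 400.0, B: float = 400.0,
                   alpha: float = 0.34, beta: float = 0.28) -> float:
    """Toy scaling-law curve: loss falls as parameters and tokens grow."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Holding model size fixed, more data keeps lowering the predicted loss --
# "more is better, way more is way better," until the data run out.
for tokens in (1e11, 1e12, 1e13):
    print(f"{tokens:.0e} tokens -> predicted loss ~ {estimated_loss(7e10, tokens):.3f}")
```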

Where will data come from? Well, humans, of course. Data are a human artifact. If we limit ourselves to the kind of human data found on social media platforms, blogs, and forums, then we can build estimates around things such as total world population, internet penetration, and data produced per person. Those estimates imply that data will run out if we keep expanding models at the rate we’ve been doing so far, just as Villalobos predicts.
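
A back-of-envelope sketch of that kind of estimate follows. Every number in it is an assumption chosen for illustration, not a measurement, and real estimates (like those of Villalobos and colleagues) are considerably more careful.

```python
# Rough stock of public human text, built from assumed inputs.
world_population = 8.0e9
internet_penetration = 0.65           # fraction of people online (assumed)
public_words_per_person_per_day = 50  # publicly posted text only (assumed)
years_of_accumulated_posting = 20     # assumed
tokens_per_word = 1.3                 # rough tokenizer ratio (assumed)

total_public_tokens = (world_population * internet_penetration *
                       public_words_per_person_per_day * 365 *
                       years_of_accumulated_posting * tokens_per_word)
print(f"~{total_public_tokens:.1e} public human-written tokens, very roughly")

# A single frontier training run already consumes on the order of 1e13 tokens,
# and appetites have been growing faster than the stock of fresh human text.
```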

Here’s a list of some of the biggest sources of (unlabeled) data:

  • books
  • news articles
  • scientific papers
  • Wikipedia
  • filtered web content
  • open-source code
  • recorded speech
  • internet users
  • popular platforms (Meta, X, Reddit ...)
  • Common Crawl
  • indexed websites

Big data sets that are well known and commonly used include The Pile, MassiveText, and ROOTS. The even bigger corpora used to train models such as PaLM and Chinchilla remain private and less well documented.

Who owns the data? This is a matter of active debate, negotiation, division, and angst. Until just lately, most large ML companies were content to share large data piles liberally among themselves. That all stopped when ChatGPT became wildly popular. Since price allocates scarcity, data cannot remain “free” if data become scarce. Land price is all about “location, location, location,” so as data are partitioned off, the price of data can only rise—the squatter’s dream.

Platforms are increasingly limiting data access and sidelining users' interests, risking a negative impact on the voluntary activities that have created the data troves in the first place! X is a prime example, but analysts have also observed user discontent on Stack Overflow and Reddit.

Basically, peoples’ contributions are becoming less accessible and being used without real consent, typically a coerced consent, which is very feudal.

Are All Data Created Equal?

So, where did the data come from? Was it collected with its use as an LLM training set front of mind from the get-go? No—the big LLMs scraped as much of the internet as they could reach. If you are comfortable saying that “the internet is full of useful data that was put there innocent of mal-intent,” then the “whole internet training set” is quite likely good enough for making coherent sentences and treating widely enough repeated claims as facts. Going forward, that may be less so as the outputs of LLMs are put back on the internet to be picked up by the makers of new training sets. Entities that scraped the internet no later than 2021 are the only ones whose scrapings are not polluted with output of other LLMs. Remember, characteristics of the training set data are what an LLM is made of.

The dividing line between acceptably good input versus malicious input is often blurred, so it’s difficult to anticipate when things are going to go sideways. Looking at the work of David Evans at the University of Virginia and some others, it is clear that if you are pursuing generalizability—which is the power that you want out of machine learning—your generalizability is dragging along a susceptibility to adversarial manipulation. If that susceptibility is unavoidable but also intolerable, then your design question is how much generalizability can you skip in the name of safety? As is the case in other statistical domains, accuracy (bias) and precision (variability) are locked in a trade-off regime.

As things stand, a small number of foundation models trained by an even smaller number of corporate entities are what is getting the use. While their makers may report that they scraped all of the internet, they nevertheless retain the data that they scraped; it’s their “data lake,” their hunting preserve. Of course, their goal is to make their data lake a data moat (in other words, to use their data, per se, as a sustainable barrier to entry for other potential competitors). Perhaps easier said than done, but an observable strategy often enough. The more tightly held the training data of an LLM, the more the LLM becomes a black box. Black boxes are not analyzable; you either accept them as they are, or you don’t. 

If those data moats—the training data that produce trillion-parameter LLMs—do their job, then the historical analogy that comes promptly to mind is that of a feudal society. And so we come to the point of this piece: a concentration of power and control in the hands of a few powerful digital companies, reminiscent of the feudal systems of the Middle Ages, where a few powerful lords held all the power and control over their vassals and their vassals' serfs.

Parting the Data Ocean (or Fencing the Data Wilderness)

Surveillance capitalism loves data. Governments love data. Machine learning systems really love data. Data are the new coin of the realm, convertible almost directly into the other modern currency: attention.

Cleaning up the enormous data sets used to train an LLM foundation model is an enormous task, perhaps ultimately undoable. To be sure, nobody has yet built an LLM foundation model using completely cleaned-up data. That means the WHAT we’ve been using so far is full of poison, garbage, nonsense, and noise, much of which is difficult or impossible to scrub out.

There are many ways machine learning can go wrong, but to a first approximation the main reason is data and how the data are internalized by a model. As one example, the statistical term “overfitting” describes when a learning algorithm is more accurate in fitting known data (hindsight) but less accurate in predicting new data (foresight); information from all past experience can be divided into (a) information that is relevant for the future and (b) irrelevant information (noise). All else being equal, the more difficult a criterion is to predict—for example, the higher its uncertainty—the more the noise that exists in past information needs to be ignored. The problem, of course, is determining which part to ignore. A learning algorithm that can reduce the risk of fitting noise is called “robust.” 
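
The classic demonstration is a toy one: fit polynomials of low and high degree to noisy samples of a simple curve and compare hindsight (training error) against foresight (test error). The curve, noise level, and degrees below are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, x_train.size)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)   # noise-free ground truth

for degree in (3, 12):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")

# Typically the high-degree fit wins on hindsight (lower training error) and
# loses on foresight (higher test error): it has learned the noise.
```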

Because an LLM is typically constructed with maximum generalizability in mind (all the wisdom of the internet in one place, so to speak), its susceptibility to adversarial data input is insensitive to actions of the consuming user. These are the black box foundation models; they brook no elucidation of why they do what they do. Many users of foundation models might be better served by smaller, simpler, less generalized models. Or a collection of smaller models. The crucial point: Generalizability expands the possible event space—not always a good idea.

What thus follows from the current situation is data feudalism. Organizations that own good clean data are hoarding it because good clean data are rare and because they can. Good clean data have become an asset like baseball cards or patent portfolios to be traded among those in the know. At the same time, “good” and “clean” are not crisply defined, and so data quality differences mean that data assets are not exactly fungible.

The upshot of all this is that data piles once generally accessible by anybody who bothered to look are becoming not quite so easy to access. The scope of accessible data sets is in some important sense shrinking. By our land analogy, castles have moats, but wildernesses do not.

From Wilderness to Superfund Site

The emerging data accessibility problem is pretty obvious when considered solely from the perspective of magnitude, accessibility, and division, but there is one further huge issue making things worse. (And making them worse fast.) It’s what we define as “recursive pollution.” 

LLMs can sometimes be spectacularly wrong, and confidently so. If (when) LLM output is pumped back into the training data ocean (by being put on the internet, for example), some future LLMs will inevitably end up being trained on these very same polluted data. This is one kind of "feedback loop" problem. Ilia Shumailov and colleagues wrote an excellent paper on this phenomenon (also see Sina Alemohammad and colleagues). Recursive pollution is a serious threat to LLM integrity. Machine learning systems should not eat their own output, just as mammals should not consume the brains of their own genus. Wikipedia is a key resource for training LLMs, due to its highly reviewed coverage of many topics, and it represents a very significant, largely volunteer curation effort. This article is not about the debate over using LLMs to author Wikipedia content; Wikipedia is probably not at risk of being fenced in, but recursive pollution rears its head there all the same.
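
A toy simulation conveys the flavor of recursive pollution (in the spirit of the cited model-collapse results, not a reproduction of them): each "generation" is fit only to samples generated by the previous generation's model, with no fresh human data.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=0.0, scale=1.0, size=200)   # the original "human" data

for generation in range(6):
    mu, sigma = data.mean(), data.std()           # "train" a model on the data
    print(f"generation {generation}: mean {mu:+.2f}, std {sigma:.2f}")
    # The next generation trains on the previous model's output, not on
    # fresh human data -- the feedback loop described above.
    data = rng.normal(loc=mu, scale=sigma, size=200)

# Over generations the estimated distribution drifts and its spread tends to
# shrink: the tails of the original data are progressively forgotten.
```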

So, not only is the data equivalent of arable land not growing fast enough while being actively segmented with data moats, it is also being polluted as you read this.

Doing More With Less

Maybe there is hope. Oddly, we have fixated on a particular form of modeling and deployment strategy as an end in itself, probably for the simple reason that it works. After all, LLMs speak to us in our most familiar modality—natural language—and it's "easy" to use them through an API. Temptingly, rather than having to set up our own ML infrastructure (including modeling approach, data sets, annotation, hardware, evaluators and operators, monitors, and so on), we can just use commercial big-vendor foundation models, counting on the feudal lords of the data to do all the hard stuff like protection and administration.

We may well find ourselves in a weird situation where ML services are like the ancient Oracles, and our prompts are like prayers.

In any case, the current narrative (and its associated attention budget) is very favorable to the feudalistic approach and even includes characterizations like the "GPU-rich" and the "GPU-poor." People have already started making fun of the serfs. We've seen Hugging Face members with "GPU-poor" in their bios as a badge of honor in this newly minted class struggle, and recent efforts by @huggingface also target this divide. People's take on this concept is an indicator of which class they identify with. "Don't even try, you are GPU-poor!"

The good news is that there are many other forms of modeling, including various forms of model-based simulation, that can be more appropriate for answering certain questions within a given domain. Of course, these smaller models are not as "moaty" and certainly not as "general" (in the artificial general intelligence sense). Even when thinking about foundation models, the creation of specialized models using specialized data sets controlled by a particular community of users can make all kinds of sense. The example of the medical pathology community is an interesting one, where careful data set curation by domain experts has been shown to have great impact on model building. This example also speaks to the question of yet-untapped data and model building opportunities. A lot of very good medical data and expertise exist to be exploited. Outside of the medical domain, social scientists are also experimenting with general models versus smaller fine-tuned models to perform tasks such as "coding" data (in the social science way). Unsurprisingly, they find that fine-tuned models are more reliable than general models.

For another source of good news from smaller models, we turn back to language models. There is a growing recognition that open models, open data, and corresponding open processes of curation and training will be required to understand both the potential and the limitations of LLMs scientifically. Open-weight model efforts such as the Llama models (and more recently the Mistral and Mixtral models) are a step forward and have allowed for greater experimentation and progress. Recently, the OLMo project has gone significantly further by releasing the data, tools, and models representing all the steps in the pipeline for building an LLM, further opening the black box.

In a statistical situation, where you have a complex outcome space and complex data going in, you have a choice: (a) Am I trying to understand causality? or (b) Am I trying to exercise control? Do I want to run the U.S. economy (control), or do I want to understand what this gene does (causality)? If you're going for causality, you want as few input variables as possible; you want parsimony, and Occam's razor is your guide. If, by contrast, you're going for control, throw in everything you've got, but do not then take the parameter values that appear in the control model and treat them as if they are meaningful, because in the end they are simply artifacts of the data used to train the model. The model becomes the data, so saying, for example, that "the most important input is this one because its coefficient in the equation is the biggest" fails hard. Unless and until you have put real effort into getting the number of parameters to their absolute minimum (statistical independence), the parameter values in your model's final equation have an effect but have no meaning. That differentiation between parsimony and control is crucial. If you say, "We're gonna use it for control," that's fine, but don't make any pronouncements or decisions based on the relative magnitude of the parameters that appear in the final equation that does your calculations. They are not meaningful. This caveat confounds the search for explainability in generative AI systems.
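
A small synthetic sketch makes the "effect without meaning" point concrete: with two nearly collinear inputs, prediction (control) stays good while the individual coefficients swing wildly from one resample to the next, so their relative magnitudes carry no causal meaning. The data below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)    # nearly a copy of x1
y = 3 * x1 + rng.normal(scale=0.5, size=n)  # only x1 actually matters
X = np.column_stack([x1, x2])

for trial in range(3):
    idx = rng.integers(0, n, size=n)        # bootstrap resample
    coef, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    mse = np.mean((X @ coef - y) ** 2)
    print(f"trial {trial}: coefficients {np.round(coef, 1)}, prediction MSE {mse:.2f}")

# Control (prediction error) is stable; the split of credit between x1 and x2
# (the would-be "meaning" of the coefficients) is an artifact of the sample.
```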

We are here to make a point: Machine learning is mighty and comes with risks that are in no way limited to the applications of ML models and their users but also (and perhaps most important) include the relationship between foundation models and the broader society that creates the data the models are trained on, and that requires reliable information to function. Yes, there are impacts of foundation models that would meet a definition of “bad” with which nearly everyone would agree. Discussion of this reality is springing up everywhere all at once, and we are content to let those debates proceed as they will.

Meanwhile, just know that it’s training data all the way down.


Gary McGraw is co-founder of the Berryville Institute of Machine Learning, where his work focuses on machine learning security. He is a globally recognized authority on software security and the author of eight best-selling books on this topic. His titles include “Software Security,” “Exploiting Software,” “Building Secure Software,” “Java Security,” “Exploiting Online Games,” and six other books; and he is editor of the Addison-Wesley Software Security series. McGraw has also written over 100 peer-reviewed scientific publications. He serves on the advisory boards of Calypso AI, Legit, Irius Risk, MaxMyInterest, and Red Sift.  He has also served as a board member of Cigital and Codiscope (acquired by Synopsys) and as adviser to CodeDX (acquired by Synopsys), Black Duck (acquired by Synopsys), Dasient (acquired by Twitter), Fortify Software (acquired by HP), and Invotas (acquired by FireEye). McGraw produced the monthly Silver Bullet Security Podcast for IEEE Security & Privacy magazine for 13 years. His dual Ph.D. is in cognitive science and computer science from Indiana University, where he serves on the Dean’s Advisory Council for the Luddy School of Informatics, Computing, and Engineering.
Dan Geer has a long history. Milestones: The X Window System and Kerberos (1988), the first information security consulting firm on Wall Street (1992), convenor of the first academic conference on mobile computing (1993), convenor of the first academic conference on electronic commerce (1995), the “Risk Management Is Where the Money Is” speech that changed the focus of security (1998), the presidency of USENIX Association (2000), the first call for the eclipse of authentication by accountability (2002), principal author of and spokesman for “Cyberinsecurity: The Cost of Monopoly” (2003), co-founder of SecurityMetrics.Org (2004), convener of MetriCon (2006-2019), author of “Economics & Strategies of Data Security” (2008), and author of “Cybersecurity & National Policy” (2010). Creator of the Index of Cyber Security (2011) and the Cyber Security Decision Market (2012). Lifetime Achievement Award, USENIX Association, (2011). Expert for NSA Science of Security award (2013-present). Cybersecurity Hall of Fame (2016) and ISSA Hall of Fame (2019). Six times entrepreneur. Five times before Congress, of which two were as lead witness. He is a Senior Fellow at In-Q-Tel.
Harold Figueroa is co-founder of the Berryville Institute of Machine Learning. Previously he has directed machine intelligence research efforts for intelligence applications leading efforts to integrate advances in machine learning and artificial intelligence, language sciences, and network science into products with applications in intelligence. In a previous life at the Cornell Lab of Ornithology, he developed a broadly used software platform and analysis techniques for environmental monitoring through sound. Before that, he earned a master’s degree and worked toward his doctorate in applied math at the Cornell Center for Applied Mathematics, where he was involved in research and teaching in image and signal processing, numerical optimization, and epidemiology and ecology.
