AI training data and identity

Every large AI system is trained on a corpus of human-generated content. That corpus is not neutral raw material. It is a compression of human expression, values, identities, and social arrangements into a statistical artifact that then shapes how AI systems understand, generate, and make inferences about people. What goes into that corpus, whose expression is represented and whose is absent, how consent is or is not obtained, and what it means when a person's identity is encoded into a model without their knowledge or agreement: these are among the most consequential governance questions of the current era, and among the least resolved. At the collective scale, AI training data is not just a technical question about model performance. It is a question about whose humanity gets to shape the intelligent systems that increasingly mediate collective life.

The scale of contemporary AI training corpora is unprecedented. Large language models have been trained on hundreds of billions of words scraped from the internet, books, academic publications, code repositories, and countless other sources. In aggregate, this is arguably the largest appropriation of human cultural expression in history. The individuals who created that expression (writers, thinkers, coders, artists, ordinary people who posted on social media or maintained blogs) generally did not consent to the use of their work as training material. They receive no compensation or recognition when their stylistic patterns, factual contributions, or creative innovations are reproduced by AI systems trained on their work. This is not merely an intellectual property question, though it is that as well. It is a question about the relationship between human expressive identity and the systems that claim to represent human knowledge and capability.

The identity dimensions of AI training data operate at both the individual and the group level. At the individual level, a person whose distinctive writing style, professional expertise, or personal history is heavily represented in training data may find that AI systems reproduce their work in ways they never authorized, incorporating their identity into a commercial product they did not choose to contribute to and from which they derive no benefit. At the group level, some communities are underrepresented in training corpora because their languages are less common on the internet, because they have historically faced barriers to digital participation, or because their cultural forms are not well captured by text-based scraping. AI systems trained without their contributions perform poorly on tasks involving those communities, encode assumptions derived from better-represented groups, and systematically reproduce the representational asymmetries of the training data in their outputs.

Law 2 — the law of pattern and correspondence — is directly implicated here. AI training encodes patterns. Those patterns correspond to the social arrangements, power structures, and representational priorities that shaped the training corpus. If the training corpus over-represents English-language, Western, educated, urban, economically secure, and male expression — which available evidence suggests it does — then the AI systems built on it will encode those patterns as default human experience, treating deviations as marked categories requiring special handling. This is how AI training data launders social inequality into algorithmic infrastructure: the patterns of historical marginalization become the baseline from which AI systems reason, and outputs that reproduce those patterns appear neutral because they reflect the statistical norms of the training data.
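The over-representation described above can, at least in principle, be made measurable. The following is a minimal sketch, not an established auditing method: it assumes a hypothetical sample of corpus documents that have already been labeled by group (here, language codes) and a hypothetical table of reference shares for the population the system is meant to serve. The function name `representation_ratios` and all numbers are illustrative inventions, not real statistics.

```python
from collections import Counter


def representation_ratios(doc_labels, reference_shares):
    """Compare each group's share of a corpus sample with its reference share.

    A ratio above 1 means the group is over-represented in the sample
    relative to the reference population; below 1 means under-represented.
    """
    counts = Counter(doc_labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("doc_labels must not be empty")

    ratios = {}
    for group, ref_share in reference_shares.items():
        if ref_share <= 0:
            raise ValueError(f"reference share for {group!r} must be positive")
        corpus_share = counts.get(group, 0) / total
        ratios[group] = corpus_share / ref_share
    return ratios


# Hypothetical labels for a ten-document sample (made-up data).
sample_labels = ["en"] * 6 + ["es"] * 2 + ["hi"] * 1 + ["sw"] * 1

# Hypothetical reference shares for the intended user population (made-up data).
reference = {"en": 0.2, "es": 0.2, "hi": 0.4, "sw": 0.2}

ratios = representation_ratios(sample_labels, reference)
for group, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{group}: {ratio:.2f}x its reference share")
```

A ratio like this captures only volume, not the quality or provenance of what is represented, and it depends entirely on how groups are labeled and which reference population is chosen. Those choices are themselves governance decisions rather than neutral technical inputs.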

The consent dimension is particularly complex. The internet-scale scraping that produces AI training corpora involves no individual consent from content creators. Terms of service agreements — which users typically never read and would not understand as licensing their content for AI training — have been retroactively interpreted to permit such use, though this interpretation is being contested in multiple ongoing legal cases. The GDPR provides some purchase on this question: personal data used in AI training must have a lawful basis, and processing must respect data subjects' rights including, in principle, the right to object. But enforcement of these provisions against AI training has been halting and inconsistent, partly because of the difficulty of removing specific individuals' data from already-trained models, and partly because of the enormous commercial stakes involved in restraining the development of AI systems that have rapidly become core economic infrastructure.

Law 5 — the law of integration and wholeness — is implicated in the deeper question of what happens to cultural identity when the collective expression of a culture is ingested into a commercial AI system. The oral traditions, specialized vocabularies, contextual knowledge, and relationship-embedded meaning that constitute much of what makes cultures distinctive are poorly captured by text scraping and are therefore systematically underweighted in the pattern structures of large language models. When AI systems trained primarily on Western internet text are deployed globally — as writing assistants, educational tools, healthcare interfaces, legal support systems — they carry with them the cultural assumptions of their training data and impose those assumptions on users whose own cultural frameworks are inadequately represented. This is not accidental; it is the systematic consequence of a training data governance model that treats comprehensiveness of scale as a substitute for genuine representational diversity.

Stewardship of AI training data — Law 4's core obligation in this domain — requires confronting several interrelated governance challenges. First, who has the authority to authorize or prohibit the use of specific content for AI training? Current practice assumes that publicly posted content is freely usable; GDPR and emerging copyright doctrine complicate this assumption without yet resolving it. Second, what obligations do AI developers have toward people whose data is used in training — rights of access, correction, removal, compensation? Third, how should the collective interest in the cultural heritage embedded in training data be represented in governance decisions made primarily by commercial AI developers? Fourth, what mechanisms can ensure that training data is sufficiently representative to produce AI systems that serve diverse populations rather than encoding the preferences of data-rich groups?

These are not abstract questions. The AI systems being built now on largely unaccountable training data governance will be operating for decades. The identity patterns encoded in them will shape what people see, what information they can access, what opportunities are presented to them, and how they are represented in systems that matter. Getting training data governance right is not a technical footnote to AI development — it is a foundational condition for building AI systems that serve collective human flourishing rather than merely amplifying the identity patterns of whoever happened to produce the most internet content.