Alvaro Rivas

Can language models develop a creole?

31 Oct 2025

When two populations that speak mutually unintelligible languages come into sustained contact (often through trade, migration, or colonisation), pressures for communication can give rise to a pidgin language. A pidgin is a contact language that develops with a simplified grammar and a lexicon drawn largely from the languages in contact, typically from the socially dominant one. Crucially, a pidgin has no native speakers: it functions as an auxiliary language, used by members of each group as a second language to facilitate mutual understanding.1

Over time, a pidgin may expand into a fully fledged language, with a complete vocabulary and grammar, and become the first language of a community. When this happens, a creole language is formed.2 Examples include Haitian Creole (French-based), Jamaican Patois (English-based), Krio (English-based), and Nubi (Arabic-based).3 Creoles can emerge with remarkable speed, as intense contact situations create strong communicative pressures that favour rapid grammatical expansion and stabilisation.4

Creoles emerge when speakers of different languages, faced with the necessity of communication, creatively assemble a new linguistic system. Language models, too, are adaptive systems that generate and internalise patterns from linguistic input.5 One may then ask: can two language models, trained on different languages, develop a creole language when they come into contact?

When two language models trained on different languages are made to interact by completing each other's texts, they simulate the contact and interaction between communities of speakers of mutually unintelligible languages. If we then allow the models to learn from these exchanges over successive generations, we may observe processes analogous to pidginisation and creolisation: lexical borrowing, syntactic simplification, and eventual convergence towards a shared, rule-governed code. The experiment therefore offers a way to explore whether the pressures that drive linguistic unification in human communities can yield similar structural outcomes in artificial agents.

To explore this, I propose the following experiment.

The basic experiment

Take two similar corpora, one in LanguageA and the other in LanguageB. Train two language models of identical architecture and size, ModelA on the LanguageA corpus and ModelB on the LanguageB corpus. Each model internalises the grammar of its own language, but is ignorant of that of the other. The two models are then brought into contact through a process of inter-generational text exchange.

In each generation, the models engage in a fixed number of episodes of interaction. In half of these, ModelA generates an initial text, which ModelB then completes. In the other half, the roles are reversed: ModelB begins, and ModelA continues. The resulting set of mixed-language texts forms a small corpus of interactions between LanguageA and LanguageB: the linguistic output of that generation.
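
The episode structure described above can be sketched with toy stand-ins for the two models. This is only an illustration under loud assumptions: `BigramModel`, `run_episodes`, the invented vocabularies and the back-off rule are all hypothetical, and a real experiment would use neural language models rather than bigram counts.

```python
import random
from collections import defaultdict

class BigramModel:
    """Toy stand-in for a language model: bigram counts over whitespace tokens."""

    def __init__(self, corpus):
        self.counts = defaultdict(lambda: defaultdict(int))
        for text in corpus:
            tokens = ["<s>"] + text.split()
            for prev, nxt in zip(tokens, tokens[1:]):
                self.counts[prev][nxt] += 1

    def generate(self, prompt, n_tokens, rng):
        """Continue `prompt` by `n_tokens` tokens sampled from the bigram counts."""
        tokens = prompt.split()
        prev = tokens[-1] if tokens else "<s>"
        for _ in range(n_tokens):
            # Assumption: an unseen context (e.g. a word of the other
            # language) falls back to the sentence-start distribution.
            options = self.counts.get(prev) or self.counts["<s>"]
            words, weights = zip(*options.items())
            prev = rng.choices(words, weights=weights)[0]
            tokens.append(prev)
        return " ".join(tokens)

def run_episodes(model_a, model_b, n_episodes, rng, n_tokens=4):
    """Alternate roles: ModelA starts even episodes, ModelB starts odd ones."""
    corpus = []
    for i in range(n_episodes):
        starter, finisher = (model_a, model_b) if i % 2 == 0 else (model_b, model_a)
        opening = starter.generate("", n_tokens, rng)
        corpus.append(finisher.generate(opening, n_tokens, rng))
    return corpus

# Invented vocabularies standing in for LanguageA and LanguageB.
CORPUS_A = ["ka mi lo", "mi lo ka"]
CORPUS_B = ["zu ne ta", "ne ta zu"]

rng = random.Random(0)
mixed = run_episodes(BigramModel(CORPUS_A), BigramModel(CORPUS_B), 4, rng)
for text in mixed:
    print(text)
```

In this first generation each text is simply a LanguageA stretch followed by a LanguageB stretch, or the reverse, because neither toy model knows anything of the other's vocabulary.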

Once the mixed-language corpus is obtained, we fine-tune each model on it, simulating the process by which speakers of each language are first made aware of the other language and begin to learn from one another's utterances. Once the fine-tuning is completed, we obtain the next generation of models.

We repeat the process over successive generations: each time we produce a mixed-language corpus of texts started by one of the models and completed by the other, and we fine-tune both on the resulting shared corpus. This iterative cycle of exchange and adaptation continues for multiple generations.
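
The full cycle of exchange and fine-tuning might look like the sketch below. To keep it short I assume a deliberately crude `UnigramModel` (hypothetical, like every other name here) whose "fine-tuning" is just adding token counts and whose completions ignore the opening text; it shows the shape of the loop, not a realistic training procedure.

```python
import random
from collections import Counter

class UnigramModel:
    """Toy stand-in for a language model: a bag of token counts."""

    def __init__(self, corpus):
        self.counts = Counter()
        self.fine_tune(corpus)

    def fine_tune(self, corpus):
        # Assumption: "fine-tuning" here just adds the new token counts.
        for text in corpus:
            self.counts.update(text.split())

    def generate(self, n_tokens, rng):
        words = list(self.counts)
        weights = list(self.counts.values())
        return rng.choices(words, weights=weights, k=n_tokens)

def run_generation(model_a, model_b, n_episodes, rng, n_tokens=5):
    """Build one mixed-language corpus, then fine-tune both models on it."""
    mixed = []
    for i in range(n_episodes):
        starter, finisher = (model_a, model_b) if i % 2 == 0 else (model_b, model_a)
        tokens = starter.generate(n_tokens, rng) + finisher.generate(n_tokens, rng)
        mixed.append(" ".join(tokens))
    model_a.fine_tune(mixed)
    model_b.fine_tune(mixed)
    return mixed

def evolve(model_a, model_b, n_generations, n_episodes, seed=0):
    """Repeat the exchange for several generations; return every corpus."""
    rng = random.Random(seed)
    history = []
    for _ in range(n_generations):
        history.append(run_generation(model_a, model_b, n_episodes, rng))
    return history

# Invented vocabularies standing in for LanguageA and LanguageB.
model_a = UnigramModel(["ka mi lo"])
model_b = UnigramModel(["zu ne ta"])
history = evolve(model_a, model_b, n_generations=3, n_episodes=4)
print(history[-1][0])
```

Even in this toy setting, both models hold tokens from both vocabularies after the first generation, which is the minimal precondition for any later convergence.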

Over time, each model learns from the other's language, and we can test whether the two models' outputs converge on a stable, mutually intelligible code: a kind of artificial creole.

Evaluating the new contact language

After a sufficient number of generations, or throughout the evolution of the models, we can evaluate whether their outputs exhibit signs of convergence, such as increased mutual intelligibility, lexical borrowing, grammatical regularisation, or the emergence of consistent structural patterns distinct from either original language.

We can do this through two complementary lenses: computational convergence metrics and linguistic–typological diagnostics.

Computational convergence metrics
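
As one illustration, cross-perplexity can be read as the perplexity each model assigns to text produced by the other; that reading, the add-one smoothed unigram models and the function names below are my assumptions for the sake of a runnable sketch. Falling values in both directions across generations would suggest growing mutual intelligibility.

```python
import math
from collections import Counter

def unigram_counts(corpus):
    """Toy 'model': token counts over whitespace-split text."""
    return Counter(tok for text in corpus for tok in text.split())

def perplexity(counts, texts, vocab_size):
    """Perplexity of `texts` under an add-one smoothed unigram model."""
    total = sum(counts.values())
    log_prob, n_tokens = 0.0, 0
    for text in texts:
        for tok in text.split():
            prob = (counts[tok] + 1) / (total + vocab_size)
            log_prob += math.log(prob)
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)

def cross_perplexity(counts_a, counts_b, sample_a, sample_b):
    """Perplexity of each model on text produced by the other."""
    vocab_size = len(set(counts_a) | set(counts_b))
    return {
        "A_on_B": perplexity(counts_a, sample_b, vocab_size),
        "B_on_A": perplexity(counts_b, sample_a, vocab_size),
    }

# Invented vocabularies standing in for LanguageA and LanguageB.
sample_a = ["ka mi lo", "ka lo mi"]
sample_b = ["zu ne ta", "ta ne zu"]
counts_a = unigram_counts(sample_a)
counts_b = unigram_counts(sample_b)
scores = cross_perplexity(counts_a, counts_b, sample_a, sample_b)
print(scores)
```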

Linguistic–typological diagnostics
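
Lexical borrowing, for instance, could be tracked by classifying each output token by its source lexicon. The sketch below assumes whitespace tokenisation and two known word lists; `lexical_origin_shares` and the category labels are hypothetical names chosen for illustration.

```python
from collections import Counter

def lexical_origin_shares(texts, lexicon_a, lexicon_b):
    """Fraction of tokens traceable to LanguageA, LanguageB, both, or neither."""
    tally = Counter()
    for text in texts:
        for tok in text.split():
            in_a, in_b = tok in lexicon_a, tok in lexicon_b
            if in_a and in_b:
                tally["shared"] += 1
            elif in_a:
                tally["A"] += 1
            elif in_b:
                tally["B"] += 1
            else:
                tally["novel"] += 1  # forms found in neither source lexicon
    total = sum(tally.values())
    if total == 0:
        return {key: 0.0 for key in ("A", "B", "shared", "novel")}
    return {key: tally[key] / total for key in ("A", "B", "shared", "novel")}

# Invented vocabularies standing in for LanguageA and LanguageB.
lexicon_a = {"ka", "mi", "lo"}
lexicon_b = {"zu", "ne", "ta"}
shares = lexical_origin_shares(["ka mi zu", "ne"], lexicon_a, lexicon_b)
print(shares)
```

Applied to ModelA's output, the "B" share is a rough borrowing rate; a rising "novel" share would point to forms that belong to neither parent language.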

Further improvements to the experiment

There are a few improvements or tweaks we can make to the basic experiment:

Possible outcomes

I anticipate a few potential outcomes of this experiment.

The simplest possibility is collapse: the models fail to develop a consistent shared code, and their exchanges degenerate into incoherent sequences. This is akin to two communities trying to communicate with each other, but failing to make themselves understood.

Another possibility is that one of the languages may come to dominate, with the other gradually adapting to it. This would be reflected in the evaluation metrics proposed above: convergence would be asymmetric, and cross-perplexity, lexical borrowing, and other metrics would be skewed towards one of the languages.
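
One simple way to put a number on such skew, offered only as an assumed illustration, is a normalised difference between how much each model has borrowed from the other. The function name and the sign convention are hypothetical.

```python
def borrowing_asymmetry(foreign_share_a, foreign_share_b):
    """Return a score in [-1, 1]; 0 means symmetric borrowing.

    foreign_share_a: fraction of ModelA's output tokens that come from LanguageB.
    foreign_share_b: fraction of ModelB's output tokens that come from LanguageA.
    A positive score means ModelA has shifted further, i.e. LanguageB dominates.
    """
    total = foreign_share_a + foreign_share_b
    if total == 0:
        return 0.0  # neither model has borrowed anything yet
    return (foreign_share_a - foreign_share_b) / total

print(borrowing_asymmetry(0.6, 0.2))
```

A score that stays near zero while both shares rise would indicate mutual convergence; a score drifting towards either extreme would indicate that one language is absorbing the other.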

Alternatively, the models might produce mixed-language sequences, alternating or blending lexical and grammatical material from both sources without regularisation. This resembles code-switching or the formation of a pidgin that remains a flexible contact variety rather than a fully stabilised system.

However, it is also possible that under suitable conditions the interactions between the models could yield a new, stable linguistic system distinct from either parent language but drawing from both. The resulting code might exhibit typical features of creoles: reduced irregular morphology, analytic tense–mood–aspect marking, fixed word order, and lexical borrowing from both sources. This would constitute the strongest parallel to natural-language creolisation.

I am not suggesting that this process faithfully reproduces how creoles emerge in human societies, nor that it captures the full sociolinguistic and cognitive realities of creolisation. Real-world creoles arise through complex sociohistorical processes of migration, power, and identity, which the proposed experiment does not capture. Nevertheless, the experiment remains linguistically interesting because it isolates, in a controlled and observable way, the structural dynamics that accompany language contact: borrowing, simplification, convergence, and the creation of new grammatical regularities. By observing how such processes unfold in artificial agents exposed to comparable pressures, we can gain insight into the general principles that govern how linguistic systems adapt and reorganise under contact, and thus illuminate, in abstract form, the mechanisms that make human language so remarkably self-organising.

Notes

1 J. Holm, An introduction to pidgins and creoles (Cambridge University Press, 2000).

2 Ibid.

3 S.M. Michaelis et al., The Atlas of Pidgin and Creole Language Structures (Oxford University Press, 2013).

4 S.G. Thomason and T. Kaufman, Language Contact, Creolization, and Genetic Linguistics (University of California Press, 1992).

5 See for example S. Gururangan et al., “Don't Stop Pretraining: Adapt Language Models to Domains and Tasks” (arXiv:2004.10964, 2020).

6 J. Holm, An introduction to pidgins and creoles (Cambridge University Press, 2000).

7 Ibid.

8 D. Lewis, Convention: A Philosophical Study (John Wiley & Sons, 2002).