Information loss in speech across languages

26 Apr 2023

It is evident to anyone exposed to different languages that not all languages are spoken at the same speed. English speakers, for instance, talk more slowly than Spanish speakers.

However, I once read an article showing that, despite the disparity in speed (measured in syllables per second), the amount of information transmitted per second is roughly constant across languages.¹ I found this very interesting.
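The trade-off can be sketched in a few lines, assuming that information rate is simply speech rate multiplied by the average information carried per syllable. The function name and all the numbers below are hypothetical and chosen only to make the arithmetic visible; they are not measurements of any real language.

```python
def information_rate(syllables_per_second, bits_per_syllable):
    """Information rate in bits per second: speech rate times information per syllable."""
    return syllables_per_second * bits_per_syllable


# Hypothetical values for illustration only, not measured data.
fast_but_light = information_rate(syllables_per_second=8.0, bits_per_syllable=5.0)
slow_but_dense = information_rate(syllables_per_second=5.0, bits_per_syllable=8.0)

# A faster language with lighter syllables can end up at the same rate
# as a slower language with denser syllables.
print(fast_but_light, slow_but_dense)
```

In this toy setting, the faster language does not transmit more per second, because each of its syllables carries less.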

I then started thinking about how robust different languages are to mishearing. In other words, if part of a sentence is misheard – because a word was mispronounced, for instance – is the meaning of the sentence still roughly intelligible?

Psycholinguists frame this as intelligibility under masking noise or lexical deletion, a property that can be measured experimentally.²

My limited knowledge of languages – I speak only three, a negligible number compared to the number of languages spoken worldwide – doesn’t give me a good overall idea of where languages stand on this. However, I suspect that English is well placed in this regard. At least compared to Spanish, English seems more robust to the mishearing of individual words.

When part of a sentence is mispronounced or hasn’t been heard properly, it is usually individual words that are dropped. Spanish tends to pack information about subject, action, tense, etc. into the verb, which makes a sentence harder to understand when that word is dropped because it wasn’t understood. Consider, for example, the sentences

(Sp) Iré a París.

(En) I will go to Paris.

The word “Iré” contains information about the subject (Yo), action (ir) and tense (future). Each of these is split into its own word in English: “I will go”. If the word “Iré” is lost in communication, the recipient will understand

(Sp’) [...] a París.

This is not enough to reconstruct the meaning. In English, however, if one of the corresponding words is lost, we get one of

(En’) [...] will go to Paris,

(En’’) I [...] go to Paris,

(En’’’) I will [...] to Paris.

In (En’), little meaning is lost, since the subject can usually be inferred from context. In (En’’) and (En’’’), some meaning is lost, but part of it is preserved: each dropped word takes only one piece of information with it. With the aid of the context of the conversation, a fluent conversation can still be had without misunderstandings. I suspect this is why it’s easier to understand foreigners with a difficult English accent than foreigners with a difficult Spanish accent.
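The comparison above can be made concrete with a toy model. This is a minimal sketch, assuming each word carries a small set of meaning features and that dropping a word loses exactly the features no other word supplies. The subject, action and tense labels follow the example above; the "direction" and "destination" labels, and all function names, are assumptions added for illustration.

```python
# Toy model: each word is paired with the meaning features it carries.
# "direction" and "destination" are illustrative labels, not from any standard.
SENTENCES = {
    "Sp": [("Iré", {"subject", "action", "tense"}),
           ("a", {"direction"}),
           ("París", {"destination"})],
    "En": [("I", {"subject"}),
           ("will", {"tense"}),
           ("go", {"action"}),
           ("to", {"direction"}),
           ("Paris", {"destination"})],
}


def features(words):
    """Union of the features carried by a list of (word, features) pairs."""
    result = set()
    for _, feats in words:
        result |= feats
    return result


def single_deletions(words):
    """Yield (dropped word, features lost) for every single-word deletion."""
    full = features(words)
    for i, (word, _) in enumerate(words):
        remaining = words[:i] + words[i + 1:]
        yield word, full - features(remaining)


def worst_case_loss(words):
    """Largest number of features lost by dropping any one word."""
    return max(len(lost) for _, lost in single_deletions(words))


for lang, words in SENTENCES.items():
    for word, lost in single_deletions(words):
        print(lang, "drop", repr(word), "-> loses", sorted(lost))
    print(lang, "worst case:", worst_case_loss(words))
```

Under these assumptions, losing “Iré” removes three features at once, while no single English word removes more than one. The model ignores context, redundancy and how likely each word is to be misheard, so it only illustrates the argument rather than testing it.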

Further cross‑linguistic examples support the same pattern:

(It) Andrò a Parigi.

(Jp) パリに行くよ。(Pari ni iku yo).

In Italian, the synthetic andrò once again fuses person, tense and lexical stem, whereas in Japanese the lexical verb 行く (iku, “to go”) carries only the action, with subject left implicit and destination marked by the postposition に. If 行く is masked, listeners still recover “...to Paris”, yet must infer the verb from context.

Typologists have linked such “information packaging” differences to morphological type: analytic languages like English distribute grammatical categories across separate words, while fusional languages like Spanish compress them.³

Taken together, these observations suggest that while languages equalise their average information rate, they may differ in their robustness to information loss such as lexical deletion.

Notes

¹ C. Coupé et al., “Different languages, similar encoding efficiency: Comparable information rates across the human communicative niche” (Science Advances, 5(9): eaaw2594, 2019).

² See for example C. Connine, C. Clifton and A. Cutler, “Effects of lexical stress on phonetic categorization” (Phonetica, 44(3): 133-146, 1987).

³ P. Trudgill, Sociolinguistic Typology: Social Determinants of Linguistic Complexity (Oxford University Press, 2011).