Science

Researchers report that AI models trained mainly on Global North data treat regional words from Brazil's Center-West and Northeast as statistical noise, and argue that fixing this requires more than regional datasets: it requires treating data as a cultural system of meaning.

23.05.2026

5 Comments

Cycl_ps on 23.05.2026 2:56 p.m.

I mean, go figure the Brazilian model speaks better Portuguese. Any reason to think that Portuguese texts are anything more than “statistical noise” in the training set?
WTFwhatthehell on 23.05.2026 5:28 p.m.

>This article examines the processes of construction of meaning in generative AI through the lens of discursive semiotics, focusing on how Big Data and datafication operate as semiotic regimes. Drawing upon the concepts of semiotic practices and forms of life the analysis describes how the intangible and dynamic process of datafication configures practical scenes that, once stabilized within Big Data, privilege particular forms of life.

This reads like some kind of parody.

Taking a look at their methods, it may still be parody. They basically played “what do I have in my pocket” with chat models.

Asking ChatGPT and a Brazilian model about an obscure Brazilian slang term with zero context, then building a narrative about how it's “cultural erasure” that ChatGPT brings up other historical uses of the same term rather than what the academic was thinking of.

Big surprise: the Brazilian model assumes you’re talking about something in the context of Brazil.
TheRealPomax on 23.05.2026 7:08 p.m.

Wow, treating data as what it actually is? Heaven forbid we do that, we’re already losing billions.
MadScience_Gaming on 24.05.2026 1:45 a.m.

It IS true that fixing the problems with “AI” would require rewriting it to fundamentally be a different sort of thing than it is.
DailyBreads on 24.05.2026 5:53 a.m.

AI trained mostly on Global North data will naturally treat regional Brazilian language as “noise” because it sees dominant cultures as the default baseline. The bigger issue is that language isn’t just vocabulary — it carries identity, history, humor, and context. You can’t fully fix that by just feeding the model more words.