The final (!) session at this wonderful AoIR 2024 conference is on content analysis, and starts with Ahrabhi Kathirgamalingam. Her interest is especially in questions of agreement and disagreement between content codings; the gold standard here has long been intercoder reliability, but this tends to presume a single ground truth that may not exist in all coding contexts.
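As an aside, the standard workflow Ahrabhi problematises here can be illustrated with a minimal sketch (my own illustration, not from the talk), assuming two coders have applied a binary ‘racist / not racist’ label to the same texts and agreement is summarised with Cohen’s kappa:

```python
# Minimal illustrative sketch (not from the talk): measuring intercoder
# reliability for a binary coding task with Cohen's kappa.
# The codings below are invented; 1 = "racist", 0 = "not racist".
from sklearn.metrics import cohen_kappa_score

coder_a = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]  # hypothetical codings by coder A
coder_b = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]  # hypothetical codings by coder B

kappa = cohen_kappa_score(coder_a, coder_b)
print(f"Cohen's kappa: {kappa:.2f}")

# A single kappa score treats disagreement as error to be minimised; it
# presumes one 'correct' code per text, which is exactly the assumption
# questioned here for constructs such as racism.
```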
The concept of ‘constructs of marginalisation’ might be useful here: marginalised people are underrepresented; existing structures of power determine who gets to define such constructs; the constructs are historically and culturally shaped; and explicit as well as ambiguous and evasive language that discriminates and marginalises needs to be considered. The texts to be coded thus introduce variance of their own, and coders may bring significant biases that come through in their coding.
This is complicated further if LLMs are also being used in coding: reliability is a general issue here too, LLMs have their own biases, and prompt engineering may further affect coding results; indeed, the human-coded data that LLMs were trained on may themselves have introduced biases into the process.
Ahrabhi explores this through a case study of coding for racism in German news media texts, which also surveyed the coders about their own experiences of racism – such experiences directly affected coding results. The project then also asked two LLMs to code the same content from the perspective of persons who had or had not been affected by racism.
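To make the LLM setup concrete, persona-conditioned coding of this kind might look something like the sketch below. This is a hypothetical reconstruction on my part: the study’s actual prompts, coding instructions, and German-language materials are not given here, and only the model names (GPT-3.5, GPT-4o) come from the talk.

```python
# Hypothetical sketch of persona-conditioned LLM coding (not the study's
# actual prompts or pipeline). Requires the `openai` package and an API key.
from openai import OpenAI

client = OpenAI()

PERSONAS = {
    "affected": "You are a person who has personally experienced racism.",
    "not_affected": "You are a person who has never experienced racism.",
}

def code_text(text: str, persona: str, model: str = "gpt-4o") -> str:
    """Ask the model to code a news text as 'racist' or 'not racist'
    from the perspective of the given persona."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": PERSONAS[persona]},
            {
                "role": "user",
                "content": (
                    "Does the following German news text contain racism? "
                    "Answer only 'racist' or 'not racist'.\n\n" + text
                ),
            },
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# Usage (hypothetical): compare codings across personas and models, e.g.
# label = code_text(some_article, "affected", model="gpt-3.5-turbo")
```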
This found, first, that GPT-3.5 generally coded more content as racist than GPT-4o; assigning different personas also led to substantially different codings. On average, GPT-4o was closer to the human coding, but it systematically diverged from human coders in specific contexts (reporting on crime and marginalised people; quantitative information and migration). All of this affects the reliability of human and LLM-based coding, both individually and in combination.