Snurblog — Axel Bruns

Experiences in Using LLMs to Code Open-Ended Survey Responses

Snurb — Thursday 16 October 2025 23:15
Politics | Polarisation | 'Big Data' | Artificial Intelligence | AoIR 2025 | Liveblog |

We start the first day of the AoIR 2025 conference proper with a panel on LLMs in research that involves several members of my QUT team. The first paper is by Paul Pressmann, whose focus is on using LLMs to process open-text responses from survey studies; the interest here is especially in questions of polarisation.

The data for this come from the POLTRACK project, which investigates the interrelations between individualised online information environments and polarisation. This combines Web tracking and surveys of some 2,000 participants. The survey component includes both closed- and open-ended questions that are used to measure the degree of polarisation towards certain topics (e.g. climate change, and specific climate policies).

Closed-ended questions are easy to quantify, but capture only positions on these issues, not the reasoning leading to those positions; open-ended survey responses explore such reasoning, and the project has some 31,000 such responses on issues such as climate change, gender-inclusive language, trans rights, and the war in Ukraine. The form and length of such responses vary widely across participants, from a few short words to much longer comments.

How can we work with and quantify such diverse open-ended responses, then? What might a five-point scale that quantifies the direction and intensity of participant stances on these issues look like, and how might it be tested across these diverse issues? What LLM prompts might be used to help with this, and how do the various available LLM options perform on these tasks?

The classification scale itself ranges from strongly negative (rejection, strong language, conspiracy theories) to strongly positive (acceptance, calls for action). The first, generic LLM prompt defined the output format, asked for a judgement of stance direction and stance intensity, and provided some generic examples of what this might look like; GPT-5-mini performed most strongly against human-coded reference data and generalised better across all four topics.

For an individual topic like climate change, though, Mistral and Gemini performed better, and GPT did worse. Model validity thus can depend on the specific topical domain. A refined prompt with issue-specific instructions showed Mistral and Gemini performing best, and GPT again performing more poorly.
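As a concrete illustration, here is a minimal Python sketch of what such a generic stance-coding prompt might look like; the scale wording, the output format, and all names here are my own assumptions for illustration, not the POLTRACK project's actual prompt or code:

```python
# Hypothetical sketch of a generic stance-coding prompt: it fixes an output
# format and asks the model for stance direction and intensity on a
# five-point scale, as described in the talk.

FIVE_POINT_SCALE = {
    -2: "strongly negative (rejection, strong language, conspiracy theories)",
    -1: "somewhat negative",
    0: "neutral or ambivalent",
    1: "somewhat positive",
    2: "strongly positive (acceptance, calls for action)",
}

def build_stance_prompt(topic: str, response_text: str) -> str:
    """Assemble a generic stance-coding prompt for one open-ended response."""
    scale_lines = "\n".join(
        f"{score:+d}: {label}" for score, label in sorted(FIVE_POINT_SCALE.items())
    )
    return (
        f"Classify the stance of the following survey response towards {topic}.\n"
        f"Use this five-point scale:\n{scale_lines}\n"
        "Reply with exactly one line in the format 'STANCE: <score>'.\n\n"
        f"Response: {response_text}"
    )

def parse_stance(model_output: str) -> int:
    """Extract the integer score from a 'STANCE: <score>' reply."""
    for line in model_output.splitlines():
        if line.strip().upper().startswith("STANCE:"):
            return int(line.split(":", 1)[1].strip())
    raise ValueError("no stance found in model output")
```

An issue-specific refinement of the kind the paper tested would then add topic-tailored instructions and examples to this generic template rather than changing its output format.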

The project then also tested the open-ended stances against the closed-ended responses. Correlation was substantive, showing that the LLM coding of open-ended results aligns well with such closed-ended responses. LLM coding thus approximates the performance of human coders; it can reveal depth in such responses, but may also be biased towards more expressive and prolix respondents over those giving only short answers. LLMs will likely not replace human researchers and coders, then, but can substantially assist them.



Except where otherwise noted, this work is licensed under a Creative Commons BY-NC-SA 4.0 Licence.