Using Large Language Models to Code Policy Feedback Submissions

The first session at the ACSPRI 2024 conference is on generative AI, and starts with Lachlan Watson. He is interested in the use of AI assistance to analyse public policy submissions, here in the context of Animal Welfare Victoria’s draft cat management strategy. Feedback could take the form of written submissions, surveys, or both, and needed to be analysed using quantitative approaches given the substantial volume of submissions.

The organisation chose Relevance AI as a tool for this – a low-code AI solution not unlike ChatGPT, but one where data is hosted in a private environment and none of the data are used to train Large Language Models. Relevance AI was then used to support the development of a code frame for the analysis of text submissions and to determine positive and negative sentiment towards the proposed cat management strategy.

This process required a review of AI output quality, a balanced approach to quality checking, iteration and simulation over small samples of data, and efficacy checks of AI-generated outputs (also against human coding). The first step here was code frame generation: the AI was provided with general context and a standard prompt, which were then applied to sets of 100 responses to the strategy. This produced a range of code frames over several iterations, which were manually reviewed and aggregated into a master list of codes in collaboration with the client, Animal Welfare Victoria.
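As a rough illustration of this first step, a batching-and-prompting loop might look like the sketch below. The function names, prompt wording, and the call_llm placeholder are all assumptions for illustration – the talk does not describe Relevance AI’s actual API.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a call to the privately hosted model; Relevance AI's
    actual interface was not described in the talk, so wire in a provider here."""
    raise NotImplementedError

def generate_code_frames(responses: list[str], context: str,
                         batch_size: int = 100) -> list[str]:
    """Prompt the model once per batch of 100 responses; each call yields a
    candidate code frame that is later manually reviewed and merged."""
    frames = []
    for start in range(0, len(responses), batch_size):
        batch = responses[start:start + batch_size]
        listing = "\n".join("- " + r for r in batch)
        prompt = (
            f"{context}\n\n"
            f"Read the following {len(batch)} public submissions and propose "
            "a code frame: a short list of thematic codes, each with a "
            "one-line definition.\n\n"
            f"{listing}"
        )
        frames.append(call_llm(prompt))
    return frames  # candidate frames, aggregated into a master list by humans
```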

The second step applied this code frame to the data. Context and a standard prompt including the code frame were provided to the LLM (at this stage, ChatGPT 3.5, which performed poorly, and later ChatGPT 4, which did better but not perfectly), and 8,000 or more submissions were coded through this process, over several iterations. This required an onerous review of at least a sample of the coding results (which still saved time compared to fully human coding of all submissions).
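The coding step might then look something like the following sketch, which sends each submission to the model together with the master code frame and falls back to a human-review flag when the reply cannot be parsed. The JSON reply format and prompt wording are assumptions; the talk only notes that context, a standard prompt, and the code frame were supplied.

```python
import json
import random

def code_submission(submission: str, code_frame: list[str],
                    context: str) -> list[str]:
    """Assign codes from the master frame to one submission (call_llm is
    the provider placeholder from the earlier sketch)."""
    frame_listing = "\n".join("- " + code for code in code_frame)
    prompt = (
        f"{context}\n\nCode frame:\n{frame_listing}\n\n"
        "Assign every applicable code to the submission below and reply "
        f"with a JSON list of code names only.\n\nSubmission: {submission}"
    )
    try:
        return json.loads(call_llm(prompt))
    except json.JSONDecodeError:
        return []  # unparseable reply: route this submission to human review

def review_sample(coded: list[dict], k: int = 200) -> list[dict]:
    """Draw a random sample of coded submissions for the (still onerous)
    human check of coding quality; the sample size is an assumption."""
    return random.sample(coded, min(k, len(coded)))
```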

Third, Relevance AI was used to assess sentiment (on a scale from extremely negative to extremely positive); this was repeated multiple times and an average taken for the final sentiment score. This was also compared to human-generated scores, and this checking showed that AI sentiment scores were highly problematic: negative sentiment towards pet owners was mistaken for negative sentiment towards the proposed strategy, and positive sentiment towards cats was mistaken for positive sentiment towards the strategy. The model lacked the nuance to ascribe sentiment contextually.
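A minimal version of this repeated-scoring-and-averaging approach might map the verbal scale onto numbers and average across runs, as sketched below. The five-point label set and the number of runs are assumptions: the talk only specifies a scale from extremely negative to extremely positive and that scoring was repeated multiple times.

```python
from statistics import mean

# Assumed numeric mapping for the verbal scale described in the talk.
SCALE = {"extremely negative": -2.0, "negative": -1.0, "neutral": 0.0,
         "positive": 1.0, "extremely positive": 2.0}

def sentiment_score(submission: str, runs: int = 5) -> float | None:
    """Query the model several times and average the results; call_llm is
    the provider placeholder from the earlier sketches."""
    scores = []
    for _ in range(runs):
        label = call_llm(
            "Rate the sentiment of this submission towards the draft cat "
            "management strategy as exactly one of: extremely negative, "
            f"negative, neutral, positive, extremely positive.\n\n{submission}"
        ).strip().lower()
        if label in SCALE:
            scores.append(SCALE[label])
    return mean(scores) if scores else None  # None: no usable replies
```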

Relevance AI was useful for determining the relevance of submissions, however; irrelevant submissions could be identified by the extremity of their sentiment, at least, and removed from the dataset following further human review. This aided the process, though the LLM should not be relied upon to do this by itself.
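The relevance check could then be as simple as flagging extreme averaged scores for human inspection, along the lines of the sketch below; the threshold value and the record structure are assumptions, and removal remains a human decision, as the talk stresses.

```python
def flag_for_relevance_review(scored: list[dict],
                              threshold: float = 1.5) -> list[dict]:
    """Surface submissions with extreme averaged sentiment (stored under an
    assumed 'sentiment' key) as candidates for removal; a human reviewer
    makes the final call on relevance."""
    return [s for s in scored
            if s["sentiment"] is not None and abs(s["sentiment"]) >= threshold]
```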