The final session at the Social Media Access Days at the German National Library focusses on the question of systemic risks, which is a key criterion for the approval of DSA data access requests under EU regulations. We start with Hanna Gawel, whose interest is in archiving hacktivism as a part of digital heritage. Hacktivism may be regarded as a high-risk activity, and therefore requires particular care from archivists.
The content to be archived here might include screenshots of defacements, leaked manifestos, memes, protest videos, and various other forms of often very ephemeral materials; secure procedures are needed for archiving them, and Hanna proposes the concept of derivative collections to address this.
Such derivatives might include redaction and anonymisation in order to reduce legal and ethical risks, for instance; some of this might be done via LLMs, too. A derivative collection is thus a secondary, modified dataset which down-samples from the original artefacts. The collection is specifically created to enable scholarly research on politically sensitive, ephemeral, and legally uncertain hacktivist materials and practices.
Hanna’s project specifically explores three hacktivist campaigns: Anonymous’s Project Chanology; WikiLeaks’ Cablegate; and the Syrian Electronic Army defacements. Raw data were sourced from open repositories including the Internet Archive and Wikimedia Commons, and were manually evaluated to flag sensitive personal data, copyright issues, or security risks. From these, derivatives that applied the necessary modifications were created; these derivatives were further enriched with metadata covering custom descriptors such as the risk category; and finally a tiered access process was established via the Zenodo repository.
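The evaluation-and-tiering step described above might be sketched roughly as follows; this is a minimal illustration, and the field names, risk flags, and tiering rule are my assumptions rather than the project's actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical record for one artefact in the manual evaluation step;
# the attribute names and tier labels are illustrative only.
@dataclass
class ArtefactRecord:
    source_url: str
    campaign: str                                    # e.g. "Project Chanology"
    risk_flags: list = field(default_factory=list)   # e.g. "personal_data", "copyright", "security"
    access_tier: str = "open"                        # "open" | "restricted" | "closed"

    def assign_tier(self):
        # A simple illustrative tiering rule: any security risk closes
        # access, any other flag restricts it, otherwise the derivative
        # can remain openly accessible.
        if "security" in self.risk_flags:
            self.access_tier = "closed"
        elif self.risk_flags:
            self.access_tier = "restricted"
        return self.access_tier
```

In practice the manual evaluation would populate such records, and the tier would then govern how the derivative is deposited in Zenodo.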
The next question, then, is whether and how this process might be further automated. The manual process is time-consuming and repetitive; the project created a multimodal prototype in Google AI Studio to build a scalable and auditable infrastructure for the process. This draws on a matrix of transformative actions which might be applied to the data – e.g. video down-sampling, code alteration, redaction of personal names, or an adjustment of defacement images to reduce colour vibrancy.
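Two of the transformative actions mentioned above can be sketched in a few lines of stdlib Python; these are simplified stand-ins for the prototype's richer matrix, and the function names and matrix keys are assumptions:

```python
import re

def redact_names(text, names):
    # Replace each listed personal name with a redaction marker.
    for name in names:
        text = re.sub(re.escape(name), "[REDACTED]", text)
    return text

def desaturate_pixel(rgb, factor=0.3):
    # Pull one (R, G, B) pixel towards its grey value to reduce colour
    # vibrancy; factor=0 gives full greyscale, factor=1 leaves it unchanged.
    r, g, b = rgb
    grey = 0.299 * r + 0.587 * g + 0.114 * b
    return tuple(round(grey + factor * (c - grey)) for c in (r, g, b))

# A matrix of transformative actions keyed by media type (hypothetical keys);
# the real prototype also covers video down-sampling and code alteration.
TRANSFORMATIONS = {
    "text": [redact_names],
    "image": [desaturate_pixel],
}
```

A real pipeline would apply the image transform per-pixel via an imaging library rather than one tuple at a time; the point here is only the shape of the matrix.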
Implemented as a tool, this enables users to upload their content, have the tool analyse the content and apply selected transformations, and save the derived content into a new dataset. Such modifications are then also stored in the extended metadata information attached to the content. Finally, then, the derived dataset can be uploaded to Zenodo for sharing.
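The provenance step, where applied modifications are stored in the extended metadata, could look something like this; the key names are my assumptions, not the tool's actual schema:

```python
import hashlib
import json

def build_derivative_metadata(original_bytes, derived_bytes, transformations):
    # Record which transformations produced the derivative, plus content
    # hashes linking it back to (and distinguishing it from) the original.
    return {
        "original_sha256": hashlib.sha256(original_bytes).hexdigest(),
        "derived_sha256": hashlib.sha256(derived_bytes).hexdigest(),
        "transformations": transformations,   # e.g. ["redact_names"]
    }

meta = build_derivative_metadata(b"raw defacement page",
                                 b"redacted page",
                                 ["redact_names"])
print(json.dumps(meta, indent=2))
```

Serialised as JSON, such a record could accompany the derived dataset in the Zenodo deposit, keeping the modification history auditable.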
Such a derivation process enables digital archivists to navigate the fine line between preserving important content and managing the significant legal and ethical risks; this represents a practicable compromise between protecting the data and maintaining their usability for research. This may also address legal challenges related to privacy, intellectual property, and malicious content concerns.
Such challenges are typical for counterculture media, which are affected by ethical challenges such as the amplification of victimisation, the dilemma of consent, and the conundrum of dual use. LLMs can help with these processes, but must not replace the archivists themselves; humans must remain in the loop. When used in this way, this derivation process can address gaps in current archival workflows, and provide a multi-tiered approach to accessing such sensitive data.