| dc.contributor.author | Ugarte Querejeta, Miriam | |
| dc.contributor.author | Valle Entrena, Pablo | |
| dc.contributor.author | Parejo, Jose Antonio | |
| dc.contributor.author | Segura, Sergio | |
| dc.contributor.author | Arrieta, Aitor | |
| dc.date.accessioned | 2025-11-25T11:44:39Z | |
| dc.date.available | 2025-11-25T11:44:39Z | |
| dc.date.issued | 2025 | |
| dc.identifier.isbn | 979-8-3315-0179-2 | en |
| dc.identifier.issn | 2833-9061 | en |
| dc.identifier.other | https://katalogoa.mondragon.edu/janium-bin/janium_login_opac.pl?find&ficha_no=200362 | en |
| dc.identifier.uri | https://hdl.handle.net/20.500.11984/13991 | |
| dc.description.abstract | Large Language Models (LLMs) have recently gained significant attention due to their ability to understand and generate sophisticated, human-like content. However, ensuring their safety is paramount, as they might provide harmful and unsafe responses. Existing LLM testing frameworks address various safety-related concerns (e.g., drugs, terrorism, animal abuse) but often face challenges due to unbalanced and obsolete datasets. In this paper, we present ASTRAL, a tool that automates the generation and execution of test cases (i.e., prompts) for testing the safety of LLMs. First, we introduce a novel black-box coverage criterion to generate balanced and diverse unsafe test inputs across a diverse set of safety categories as well as linguistic writing characteristics (i.e., different styles and persuasive writing techniques). Second, we propose an LLM-based approach that leverages Retrieval Augmented Generation (RAG), few-shot prompting strategies, and web browsing to generate up-to-date test inputs. Lastly, similar to current LLM test automation techniques, we leverage LLMs as test oracles to distinguish between safe and unsafe test outputs, enabling a fully automated testing approach. We conduct an extensive evaluation on well-known LLMs, revealing the following key findings: i) GPT-3.5 outperforms other LLMs when acting as the test oracle, accurately detecting unsafe responses and even surpassing more recent LLMs (e.g., GPT-4) as well as LLMs specifically tailored to detect unsafe LLM outputs (e.g., LlamaGuard); ii) the results confirm that our approach can uncover nearly twice as many unsafe LLM behaviors with the same number of test inputs compared to currently used static datasets; and iii) our black-box coverage criterion combined with web browsing can effectively guide the LLM in generating up-to-date unsafe test inputs, significantly increasing the number of unsafe LLM behaviors. | en |
| dc.language.iso | eng | en |
| dc.publisher | IEEE | en |
| dc.rights | © 2025 IEEE | en |
| dc.subject | Large Language Models | en |
| dc.subject | ODS 9 Industria, innovación e infraestructura | es |
| dc.subject | ODS 10 Reducción de las desigualdades | es |
| dc.title | ASTRAL: Automated Safety Testing of Large Language Models | en |
| dcterms.accessRights | http://purl.org/coar/access_right/c_abf2 | en |
| dcterms.source | IEEE/ACM International Conference on Automation of Software Test (AST) | en |
| local.contributor.group | Ingeniería del software y sistemas | es |
| local.description.peerreviewed | true | en |
| local.identifier.doi | https://doi.org/10.1109/AST66626.2025.00018 | en |
| local.contributor.otherinstitution | https://ror.org/00wvqgd19 | es |
| local.contributor.otherinstitution | https://ror.org/03yxnpp24 | es |
| local.source.details | Ottawa (Canada), 28-29 April 2025 | en |
| oaire.format.mimetype | application/pdf | en |
| oaire.file | $DSPACE\assetstore | en |
| oaire.resourceType | http://purl.org/coar/resource_type/c_c94f | en |
| oaire.version | http://purl.org/coar/version/c_ab4af688f83e57aa | en |
| dc.unesco.tesauro | http://vocabularies.unesco.org/thesaurus/concept450 | en |
| oaire.funderName | Comisión Europea | en |
| oaire.funderName | Gobierno de España | en |
| oaire.funderName | Gobierno Vasco | en |
| oaire.funderIdentifier | https://ror.org/00k4n6c32 / http://data.crossref.org/fundingdata/funder/10.13039/501100000780 | en |
| oaire.funderIdentifier | https://ror.org/038jjxj40 / http://data.crossref.org/fundingdata/funder/10.13039/501100010198 | en |
| oaire.funderIdentifier | https://ror.org/00pz2fp31 / http://data.crossref.org/fundingdata/funder/10.13039/501100003086 | en |
| oaire.fundingStream | HORIZON-CL4-2021-HUMAN-01 | en |
| oaire.fundingStream | Plan Estatal 2021-2023 - Proyectos Investigación No Orientada | en |
| oaire.fundingStream | Ikertalde Convocatoria 2022-2023 | en |
| oaire.awardNumber | 101069364 | en |
| oaire.awardNumber | PID2021-126227NB-C22 | en |
| oaire.awardNumber | IT1519-22 | en |
| oaire.awardTitle | Next Generation Internet Discovery and Search (NGI Search) | en |
| oaire.awardTitle | Mejorando el desarrollo, fiabilidad y gobierno de servicios digitales por medio de la colaboración bot-humano | en |
| oaire.awardTitle | Ingeniería de Software y Sistemas (IKERTALDE 2022-2023) | en |
| oaire.awardURI | https://doi.org/10.3030/101069364 | en |
| oaire.awardURI | No information | en |
| oaire.awardURI | No information | en |
| dc.unesco.clasificacion | http://skos.um.es/unesco6/120317 | en |