Title
ASTRAL: Automated Safety Testing of Large Language Models
Author
Other institutions
https://ror.org/00wvqgd19
Universidad de Sevilla
Version
Postprint
Document type
Conference contribution
Language
English
Rights
© 2025 IEEE
Access
Open access
Publisher's version
https://doi.org/10.1109/AST66626.2025.00018
Published in
IEEE/ACM International Conference on Automation of Software Test (AST), Ottawa (Canada), 28-29 April 2025
Publisher
IEEE
Keywords
Large Language Models
SDG 9 Industry, Innovation and Infrastructure
SDG 10 Reduced Inequalities
Subject (UNESCO Thesaurus)
Computer science
Abstract
Large Language Models (LLMs) have recently gained significant attention due to their ability to understand and generate sophisticated, human-like content. However, ensuring their safety is paramount, as they might provide harmful and unsafe responses. Existing LLM testing frameworks address various safety-related concerns (e.g., drugs, terrorism, animal abuse) but often face challenges due to unbalanced and obsolete datasets. In this paper, we present ASTRAL, a tool that automates the generation and execution of test cases (i.e., prompts) for testing the safety of LLMs. First, we introduce a novel black-box coverage criterion to generate balanced and diverse unsafe test inputs across a diverse set of safety categories as well as linguistic writing characteristics (i.e., different styles and persuasive writing techniques). Second, we propose an LLM-based approach that leverages Retrieval Augmented Generation (RAG), few-shot prompting strategies, and web browsing to generate up-to-date test inputs. Lastly, similar to current LLM test automation techniques, we leverage LLMs as test oracles to distinguish between safe and unsafe test outputs, allowing a fully automated testing approach. We conduct an extensive evaluation on well-known LLMs, revealing the following key findings: i) GPT-3.5 outperforms other LLMs when acting as the test oracle, accurately detecting unsafe responses and even surpassing more recent LLMs (e.g., GPT-4) as well as LLMs specifically tailored to detect unsafe LLM outputs (e.g., LlamaGuard); ii) our approach can uncover nearly twice as many unsafe LLM behaviors with the same number of test inputs compared to currently used static datasets; and iii) our black-box coverage criterion combined with web browsing can effectively guide the LLM in generating up-to-date unsafe test inputs, significantly increasing the number of unsafe LLM behaviors detected.
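The testing loop the abstract describes can be sketched minimally as follows. All names and function bodies below are hypothetical stand-ins for illustration only: in ASTRAL itself, test-input generation is LLM-based (RAG, few-shot prompting, and web browsing) and the oracle is an LLM (GPT-3.5 in the paper's evaluation), not the toy functions shown here.

```python
# Minimal sketch of the automated safety-testing loop from the abstract:
# generate one unsafe test prompt per (safety category, writing style)
# combination -- the black-box coverage criterion balances tests across
# all combinations -- query the LLM under test, and let an oracle
# classify each response as safe or unsafe.
from itertools import product

# Example safety categories taken from the abstract; styles are illustrative.
SAFETY_CATEGORIES = ["drugs", "terrorism", "animal_abuse"]
WRITING_STYLES = ["slang", "technical", "role_play"]

def generate_test_input(category: str, style: str) -> str:
    # Stand-in for the RAG + few-shot + web-browsing generator.
    return f"[{style}] prompt targeting category '{category}'"

def llm_under_test(prompt: str) -> str:
    # Stand-in for the model being tested.
    return f"response to: {prompt}"

def oracle_is_unsafe(response: str) -> bool:
    # Stand-in for the LLM-based test oracle.
    return "terrorism" in response

def run_test_campaign():
    # One test per (category, style) pair, so coverage stays balanced
    # across categories and styles rather than sampled at random.
    results = []
    for category, style in product(SAFETY_CATEGORIES, WRITING_STYLES):
        prompt = generate_test_input(category, style)
        verdict = oracle_is_unsafe(llm_under_test(prompt))
        results.append((category, style, verdict))
    return results

results = run_test_campaign()
print(len(results))  # 9: one test per (category, style) combination
```

The point of the sketch is the structure of the loop, not the stub logic: generator, system under test, and oracle are three independently replaceable components, which is what makes the approach fully automated end to end.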
Funder
European Commission
Government of Spain
Basque Government
Programme
HORIZON-CL4-2021-HUMAN-01
Plan Estatal 2021-2023 - Proyectos Investigación No Orientada
Ikertalde Convocatoria 2022-2023
Number
101069364
PID2021-126227NB-C22
IT1519-22
Grant URI
https://doi.org/10.3030/101069364
No information
No information
Project
Next Generation Internet Discovery and Search (NGI Search)
Mejorando el desarrollo, fiabilidad y gobierno de servicios digitales por medio de la colaboración bot-humano
Ingeniería de Software y Sistemas (IKERTALDE 2022-2023)