Thesis: AI Agents for the Extraction of Chemical Toxicity Data from Scientific Literature

Location: Gothenburg
Published: 2025-10-08
Apply by: Open until further notice

About the job

High Level Description

Chemical pollution threatens human health and ecosystems worldwide. Chemical hazard data is critical for detecting and mitigating potential negative impacts at an early stage. Recently, computational tools such as transformers have been used to accurately predict, for example, the toxicity of chemicals to the environment. Training such models relies on large amounts of toxicity assay data that are extracted from scientific publications, reports, and safety documents and collected in open-access databases such as ECOTOX.

This master's thesis aims to investigate how AI (LLM) agents can be used to automatically extract chemical toxicity data from scientific literature, enabling more accurate hazard prediction and supporting effective environmental legislation.

Project Description

The project will involve designing and implementing AI agents that can process scientific publications, reports, and safety documents to extract structured chemical toxicity data. Students will develop a proof-of-concept pipeline that accepts a list of publications as input and outputs validated data, with the ECOTOX database serving as a reference for evaluation.

The research will focus on handling noisy PDFs (including figures, tables, and OCR errors), designing evaluation metrics for extraction accuracy and traceability, and integrating domain-specific validation against ECOTOX. Depending on student interest, the project may also include fine-tuning LLMs to optimize extraction performance.
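
As a rough illustration of what an evaluation metric for extraction accuracy and traceability could look like, the Python sketch below compares extracted records with reference records using precision and recall. Everything in it is an assumption made for illustration: the ToxicityRecord schema and its field names are hypothetical and do not reflect the ECOTOX schema, the precision_recall helper is not part of any existing tool, exact matching is a simplification, and all identifiers and values are invented placeholders rather than real toxicity data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToxicityRecord:
    """One extracted toxicity result (hypothetical schema, not the ECOTOX schema)."""
    chemical_id: str   # placeholder chemical identifier
    species: str       # test organism
    endpoint: str      # type of measurement, e.g. "LC50"
    value: float       # reported effect concentration
    unit: str          # e.g. "mg/L"
    source: str = ""   # traceability: document and page the value came from

    def key(self):
        # Fields compared during evaluation; the source is deliberately excluded.
        return (self.chemical_id, self.species.lower(), self.endpoint.upper(),
                round(self.value, 6), self.unit)


def precision_recall(extracted, reference):
    """Compare extracted records with reference records by exact key match."""
    extracted_keys = {r.key() for r in extracted}
    reference_keys = {r.key() for r in reference}
    true_positives = len(extracted_keys & reference_keys)
    precision = true_positives / len(extracted_keys) if extracted_keys else 0.0
    recall = true_positives / len(reference_keys) if reference_keys else 0.0
    return precision, recall


# Invented placeholder data, for illustration only.
reference = [
    ToxicityRecord("chem-A", "species one", "LC50", 5.0, "mg/L"),
    ToxicityRecord("chem-A", "species two", "LC50", 40.0, "mg/L"),
]
extracted = [
    ToxicityRecord("chem-A", "species one", "LC50", 5.0, "mg/L",
                   source="paper_01.pdf, p. 4"),
    # Simulated extraction error: the value differs from the reference.
    ToxicityRecord("chem-A", "species two", "LC50", 14.0, "mg/L",
                   source="paper_02.pdf, p. 7"),
]

precision, recall = precision_recall(extracted, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case one of the two extracted records matches the reference, so both precision and recall are 0.50. A real evaluation would likely need tolerant matching (unit conversion, species synonyms, numeric tolerance) and a way to score whether the recorded source actually contains the value.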

Who are we looking for?

We are looking for two students who want to write a 30-credit MSc thesis during spring 2026. You should have:
  • Required: Some programming experience (Python, pipelines, basic ML).
  • Nice to have: Experience with OpenWebUI (or similar), web scraping, NLP, pipelines, databases, or MLOps tools.

Suitable students are most likely enrolled in a master's program in computational science or a program that involves software development.


    Purpose
    The purpose of this research is to explore whether AI agents can overcome the bottleneck of manual toxicity data extraction. By creating accurate, traceable, and automated extraction pipelines, the thesis aims to enable large-scale hazard prediction and contribute to more effective environmental protection.

    An Exciting Journey with Knightec Group
    Semcon and Knightec have joined forces as Knightec Group. Together, we are Northern Europe's leading strategic partner in product and digital service development. With a unique combination of cross-functional expertise and a holistic business understanding, we help our clients realize their strategies - from idea to complete solution.

    Practical Information
    This is a master's thesis position, located at our office at Lindholmsallén 2, Gothenburg. The start date is by agreement.

    Please submit your application as soon as possible, but no later than 2025-11-30. If you have any questions, you are welcome to contact Magnus Svensson. Note that due to GDPR, we only accept applications through our careers page.

    Knightec Group AB

    Company: Knightec Group AB