Thesis: Out-of-Distribution Detection Techniques on Trained Chemical Transformer Models

Location: Gothenburg
Published: 2025-10-15
Apply by: Open until further notice

About the job

High-Level Description

Recent advances in deep learning have made it possible to represent chemical structures as dense, high-dimensional embeddings in which the AI model has captured subtle relationships from its training samples. These embeddings are used to predict various chemical properties, which is crucial in fields such as drug discovery and chemical risk assessment. However, in real-world scenarios these models often encounter input samples that fall outside the scope of the model, so-called out-of-distribution (OOD) samples. Detecting OOD samples is critical for ensuring that predictions are accurate and reliable.

Project Description


This thesis aims to systematically categorize recent OOD detection methods from the scientific literature and evaluate their applicability to chemical embeddings from a transformer model trained for chemical toxicity prediction [1]. The trained transformer model represents a chemical structure as a sequence of text tokens, as modern LLMs do, and updates the token embeddings iteratively over several layers. The final layer outputs a single embedding, which is used for toxicity prediction.

The central focus of the thesis is to investigate how OOD detection can be applied to the token embeddings to quantify how far a new chemical lies from the model's in-distribution (training) data. Using energy-based or distance-based measures, such as cosine similarity, the project aims to evaluate OOD detection on the embedding vectors and assess how well the methods apply to TRIDENT models [1,2]. The data to be used comprises ~10 000 chemicals from the ECOTOX database [3].
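To make the distance-based idea concrete, here is a minimal sketch of one possible scoring scheme. It is not the TRIDENT implementation. It assumes the final-layer embeddings have already been extracted into NumPy arrays, and it scores a new chemical by its mean cosine distance to the k nearest training embeddings. The function names, the choice of k, and the quantile used for the threshold are all hypothetical.

```python
import numpy as np


def cosine_knn_score(train_emb, query_emb, k=5):
    """Mean cosine distance from each query to its k nearest training embeddings.

    Higher scores mean the query lies further from the training data.
    """
    # Normalize rows to unit length so a dot product equals cosine similarity.
    train = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    query = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    dist = 1.0 - query @ train.T            # shape: (n_query, n_train)
    nearest = np.sort(dist, axis=1)[:, :k]  # k smallest distances per query
    return nearest.mean(axis=1)


def fit_threshold(in_dist_scores, quantile=0.95):
    """Pick a threshold from scores of held-out in-distribution samples (assumed setup)."""
    return float(np.quantile(in_dist_scores, quantile))


if __name__ == "__main__":
    # Synthetic stand-ins for real embeddings, for illustration only.
    rng = np.random.default_rng(0)
    train = np.ones((200, 16)) + 0.1 * rng.normal(size=(200, 16))
    held_out = np.ones((50, 16)) + 0.1 * rng.normal(size=(50, 16))
    far_away = -np.ones((5, 16)) + 0.1 * rng.normal(size=(5, 16))

    threshold = fit_threshold(cosine_knn_score(train, held_out))
    flagged = cosine_knn_score(train, far_away) > threshold
    print(flagged)  # boolean mask: True marks a query flagged as OOD
```

In this sketch the threshold is calibrated on held-out in-distribution samples, so roughly 5% of in-distribution inputs would be flagged at the 0.95 quantile. Energy-based scores would need the model's output head and are not covered here.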

Who are we looking for?

We are looking for students who want to write a 30-credit MSc thesis. You should have:
  • Required: Programming experience (Python), basic understanding of AI/ML concepts, interest in both software and hardware integration
  • Nice-to-haves: Experience with DNN architectures, PyTorch, LLMs and LLM APIs, statistics

Students should have studied computer science, AI/ML, robotics, or related fields where software and algorithms are relevant. An interest in data science is helpful but not required.


    Purpose
    The purpose of this research is to explore the use of OOD detection in life science by surveying the existing state of the art and applying it to a real-world scenario. By creating accurate OOD detection methods, this thesis aims to contribute towards more trustworthy AI models that can be incorporated into data-driven life science.

    An Exciting Journey with Knightec Group
    Semcon and Knightec have joined forces as Knightec Group. Together, we are Northern Europe's leading strategic partner in product and digital service development. With a unique combination of cross-functional expertise and a holistic business understanding, we help our clients realize their strategies - from idea to complete solution.

    Practical Information
    This is a master's thesis position, located at our office in Gothenburg. The start date is January 2026. Please submit your application as soon as possible, but no later than 2025-11-30. If you have any questions, you are welcome to contact Magnus Svensson. Note that due to GDPR, we only accept applications through our careers page.

    References

    [1] Mikael Gustavsson et al., Transformers enable accurate prediction of acute and chronic chemical toxicity in aquatic organisms. Sci. Adv. 10, eadk6669 (2024). DOI: 10.1126/sciadv.adk6669

    [2] TRIDENT prediction tool: https://trident.serve.scilifelab.se/

    [3] ECOTOX database: https://cfpub.epa.gov/ecotox/index.cfm

    Knightec Group AB
