Colloquium: Stefan Schipper

Colloquium
STEPHAN Substrate To Enzyme Prediction: Hierarchical Analysis of Novelties
Stefan Schipper
Date
Thursday 04 Jun 2026
Time
16:30 - 17:00
Location
BW018
Supervisor
Gerard van Westen
Jury
Lars Jeuken

Enzymes are efficient, often highly selective, and environmentally friendly biological catalysts, making them attractive tools for industrial and pharmaceutical applications. Classical enzyme-optimization methods such as directed evolution and rational design are labor- and cost-intensive. Advances in computational methods and AI have driven the emergence of enzyme-substrate prediction models. This literature review evaluates the recent state-of-the-art enzyme-substrate prediction models ProSmith, ESP, VIPER, FusionESP, EZSpecificity, EMMA, CataPro, and PocketGNN on their technical architecture, training dataset, performance on a random split, and ease of use, covering installation, running the pre-trained model, and training on new data. Most models use transformer networks to translate protein sequences and substrate SMILES into embeddings that feed downstream AI models. EZSpecificity and PocketGNN use docking poses as input for graph neural networks. Another key difference is how and when the embeddings interact. For classification, EMMA is unique in adding an "inhibitor" class, while CataPro and PocketGNN are regression models that predict the kinetic parameters Km and kcat. All models performed well on a random split of training and validation data; the comparison between models becomes more informative on unseen data. The models are evaluated on different datasets, data splits, and performance metrics. The structure-based models EZSpecificity and PocketGNN do not show a large performance improvement over sequence-based models but are considerably more complex and computationally expensive. For users with limited computational resources and programming experience, FusionESP is considered the preferred model for binary prediction because it is extensively validated, performs slightly better, and has a low computational cost.
To improve the performance of FusionESP, the high-quality datasets of EMMA and VIPER can be used as additional training data. CataPro is the best option for regression predictions, owing to its performance and ease of implementation. With a view to improving enzyme efficiency, the CataPro authors also evaluated the model on mutant proteins; it showed no predictive value there, although only limited training and test data were available. The remaining challenge is truly unseen data, where predictions could improve substantially with larger, higher-quality datasets and, potentially, with future models that capture transition states.
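The difference between a random split and truly unseen data, which the abstract returns to several times, can be illustrated with a minimal sketch. It is not the splitting protocol of any reviewed model: the function names, the toy (enzyme, substrate, label) tuples, and the 20% test fraction are all assumptions made for illustration. A random split lets the same enzyme appear in both training and test data, whereas an enzyme hold-out split keeps every test enzyme out of training.

```python
import random
from collections import defaultdict


def random_split(pairs, test_fraction=0.2, seed=0):
    """Shuffle individual enzyme-substrate pairs; enzymes may end up on both sides."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def enzyme_holdout_split(pairs, test_fraction=0.2, seed=0):
    """Assign whole enzymes to train or test, so test enzymes are unseen in training."""
    by_enzyme = defaultdict(list)
    for pair in pairs:
        by_enzyme[pair[0]].append(pair)
    enzymes = sorted(by_enzyme)
    rng = random.Random(seed)
    rng.shuffle(enzymes)
    n_test = max(1, int(len(enzymes) * test_fraction))
    test = [p for e in enzymes[:n_test] for p in by_enzyme[e]]
    train = [p for e in enzymes[n_test:] for p in by_enzyme[e]]
    return train, test


def shared_enzymes(train, test):
    """Enzymes that occur in both the training and the test set."""
    return {p[0] for p in train} & {p[0] for p in test}


# Hypothetical toy data: 10 enzymes, 10 substrates each, alternating binary labels.
pairs = [(f"enzyme_{i % 10}", f"substrate_{i}", i % 2) for i in range(100)]

rand_train, rand_test = random_split(pairs)
hold_train, hold_test = enzyme_holdout_split(pairs)

print("random split, enzymes in both sets:", len(shared_enzymes(rand_train, rand_test)))
print("hold-out split, enzymes in both sets:", len(shared_enzymes(hold_train, hold_test)))
```

A model scored on the random split can benefit from having seen the same enzymes during training, which is one reason performance on a random split tends to look better than performance on held-out enzymes. Real benchmarks often go further and split by sequence similarity, which this sketch does not attempt.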