Ullrich, Katrin: Multi-View Kernel Methods for Binding Affinity Prediction. - Bonn, 2021. - Dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn.
Online edition in bonndoc: https://nbn-resolving.org/urn:nbn:de:hbz:5-63908
@phdthesis{handle:20.500.11811/9314,
urn = {https://nbn-resolving.org/urn:nbn:de:hbz:5-63908},
author = {Ullrich, Katrin},
title = {Multi-View Kernel Methods for Binding Affinity Prediction},
school = {Rheinische Friedrich-Wilhelms-Universität Bonn},
year = 2021,
month = sep,

note = {In the present thesis, we focus on the potential and limits of multi-view regression techniques in the field of ligand affinity prediction.
Multi-view learning (MVL) denotes machine learning approaches that utilise different representations (views) of the data. It is known that MVL improves the performance of machine learning algorithms in many important real-world applications, but there is hardly any thorough evaluation of MVL in the life science domain. The binding of small compounds to large protein molecules is central to the activity of the cell, as such processes are involved in the majority of biochemical pathways. A real-valued binding affinity characterises the binding strength of the protein-ligand complex. The identification of these affinities can serve as a starting point for the discovery of drugs correlated with the respective pathways. For binding affinity prediction in ligand-based virtual screening, single-view support vector regression (SVR) utilising molecular fingerprints is the state-of-the-art approach.
The special situation with respect to the representation and availability of data suggests the application of multi-view regression for affinity prediction. Views are data representations canonically related to so-called kernel functions. On the one hand, different representations of data instances are naturally available for affinity prediction, as a large variety of molecular descriptors designed for different purposes exists. On the other hand, labelled data is typically not abundant because of the huge number of existing proteins. We address these challenges and present multi-view kernel approaches to overcome the mentioned difficulties in three different learning scenarios. The general question of the thesis is whether affinity prediction benefits from the diversity of useful representations for molecular compounds via MVL.
Firstly, we consider the case of supervised learning where labelled training compounds for a specific protein are available. Moreover, we assume that different molecular fingerprints based on the graph structure of molecules exist. We enhance the set of existing fingerprints with feature vectors of systematically enumerated cyclic, tree, shortest path, and Weisfeiler-Lehman label patterns. We are the first to apply multiple kernel learning (MKL), which identifies a linear combination of the utilised set of views, in the context of affinity prediction. In addition to the rich set of data representations, we investigate both a loss function known from regularised least squares regression (RLSR) and one from SVR. In our practical experiments, we analyse the influence of different patterns on the affinity prediction performance. We suggest a scheme for a systematic preselection of graph patterns of molecular compounds. In our empirical analysis, we show that MKL with a preselection of graph patterns or standard molecular fingerprints outperforms state-of-the-art algorithms for ligand affinity prediction.
Secondly, we take into account the small number of compounds with known affinity and exploit the availability of unlabelled data. In addition to empirical risk minimisation in the supervised case, the technique of co-regularisation permits semi-supervision via an adjustment of predictions for unlabelled instances. This adjustment occurs for pairwise predictions from different views. We define co-regularised support vector regression (CoSVR) analogously to the approach of co-regularised least squares regression (CoRLSR). We investigate different co-regularisation terms and CoSVR variants with a decreasing number of optimisation variables. Furthermore, we derive a multi-view CoSVR version with single-view complexity and a Rademacher bound for the corresponding function class. We show empirically that ligand affinity prediction profits from the application of CoSVR in comparison to the baseline approaches.
Finally, we consider the unsupervised task of orphan screening where no labelled training data is available for the considered protein. We propose corresponding projections (CP) as a novel kernel method for unsupervised or transfer learning. CP can be defined both as a single-view and as a multi-view algorithm and is also applicable to other learning tasks such as classification. Our empirical results show that CP outperforms the orphan screening baseline of the target-ligand kernel approach and approximates the performance of supervised algorithms that utilise very few labelled training examples.},

url = {https://hdl.handle.net/20.500.11811/9314}
}
