Mixture models, theory with application to species richness, clustering, classification and all that jazz
Open access
Author
Date
2022
Type
- Doctoral Thesis
ETH Bibliography
yes
Abstract
Mixture models occur in numerous settings, including random and fixed effects models, clustering, deconvolution, empirical Bayes problems and many others. They are often used to model data originating from a heterogeneous population consisting of several homogeneous subpopulations, so the problem of finding a good estimator for the number of components in the mixture arises naturally. Estimating the order of a finite mixture model is a hard statistical task, and multiple techniques have been suggested for solving it. In this thesis we concentrate on several such methods that have not gained much popularity but are nonetheless interesting from a theoretical viewpoint and deserve the attention of practitioners. These methods fall into three groups: tools built upon the determinant of the Hankel matrix of moments of the mixing distribution, minimum distance estimators, and likelihood ratio tests. A valuable feature of all of these approaches is that they come with theoretical guarantees of consistency. We address the theoretical pillars underlying each of the statistical techniques and present the results of a comparative numerical study conducted under various scenarios. In addition to the above-mentioned methods, we also include the results of a neural-network-based approach. According to the results, none of the methods proves to be a "magic pill". The results uncover limitations of the techniques and provide practical hints for choosing the best-suited tool under specific conditions.
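To illustrate why the determinant of the Hankel matrix of moments carries information about the mixture order, here is a minimal sketch at the population level. It relies on the standard fact that, for a mixing distribution with exactly m support points, the s x s Hankel matrix with entries m_{i+j} (the raw moments of the mixing distribution) is nonsingular for s <= m and singular for s = m + 1. The atoms, weights and function names below are hypothetical and chosen only for illustration. The sketch uses exact moments rather than moments estimated from data, so it is not the estimator studied in the thesis.

```python
from fractions import Fraction


def moments(atoms, weights, upto):
    """Raw moments m_0..m_upto of a discrete mixing distribution."""
    return [sum(w * a**j for a, w in zip(atoms, weights)) for j in range(upto + 1)]


def det(matrix):
    """Exact determinant via Gaussian elimination over Fractions."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = Fraction(1)
    for col in range(n):
        # Find a row with a nonzero entry in this column to use as pivot.
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result  # a row swap flips the sign
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result


def hankel_det(m, size):
    """Determinant of the size x size Hankel matrix with entries m[i + j]."""
    return det([[m[i + j] for j in range(size)] for i in range(size)])


# Hypothetical mixing distribution with three support points (assumption).
atoms = [Fraction(1), Fraction(2), Fraction(4)]
weights = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]

max_size = 5
m = moments(atoms, weights, 2 * (max_size - 1))
dets = [hankel_det(m, s) for s in range(1, max_size + 1)]

# The first singular Hankel matrix has size (order + 1).
order = next(s for s, d in enumerate(dets, start=1) if d == 0) - 1
print(order)
```

With estimated moments the determinant is never exactly zero, which is why a data-driven procedure needs some form of thresholding or penalisation; exact rational arithmetic is used here only to keep the illustration free of rounding error.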
After discussing the relevant theory and analysing some simulation results, we introduce the software that allows for convenient and flexible implementation of the discussed methods for simulated data as well as for real datasets whenever the data at hand is univariate. We also demonstrate the performance of the studied techniques on real-world data.
We further discuss the feasibility of extensions of some of these methods to the multidimensional setting and present some simulation results. Afterwards we apply the multidimensional extensions of the studied approaches in the context of the clustering problem when analysing two multivariate real-world datasets. Studying the data structure using the multivariate mixture model for one of the sets and experimenting with the solutions has led us to the final result we did not expect to achieve when starting our work.
In the scope of this thesis we also consider the application of mixture models to the problem of species richness estimation under the assumption of complete monotonicity of the distribution of species abundances. Complete monotonicity, which can be shown to be linked to mixtures of geometric distributions, is the natural substitute for k-monotonicity when k is large. Since the k-monotone model has already been applied quite successfully to the species richness estimation problem, the complete monotonicity approach is a natural alternative to the k-monotone solution when k is large. An extended simulation study indicates that the completely monotone estimator is quite competitive when compared to other available estimators, and this remains true even when complete monotonicity is not satisfied. Using four real datasets, we further illustrate how our method can be applied in practice.
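The link between complete monotonicity and mixtures of geometric distributions can be checked numerically on a small example. The sketch below assumes a probability mass function on {0, 1, 2, ...} of the form p(j) = sum_i w_i (1 - t_i) t_i^j and tests the usual definition of complete monotonicity for sequences, namely (-1)^r Delta^r p(j) >= 0 for all r and j, where Delta is the forward difference. The support convention, the parameter values and the function names are assumptions made for illustration and are not taken from the thesis.

```python
from fractions import Fraction


def geometric_mixture_pmf(probs, weights, n_terms):
    """p(j) = sum_i w_i * (1 - t_i) * t_i**j for j = 0..n_terms-1."""
    return [
        sum(w * (1 - t) * t**j for t, w in zip(probs, weights))
        for j in range(n_terms)
    ]


def forward_difference(seq):
    """Delta seq[j] = seq[j + 1] - seq[j]."""
    return [b - a for a, b in zip(seq, seq[1:])]


def is_completely_monotone_on_prefix(seq):
    """Check (-1)^r * Delta^r seq[j] >= 0 for every r and j available
    in the finite prefix. This is a necessary check only: passing it
    does not prove complete monotonicity of the infinite sequence."""
    current, sign = list(seq), 1
    while current:
        if any(sign * x < 0 for x in current):
            return False
        current, sign = forward_difference(current), -sign
    return True


# Hypothetical two-component geometric mixture (assumption).
probs = [Fraction(1, 4), Fraction(3, 4)]
weights = [Fraction(2, 5), Fraction(3, 5)]
pmf = geometric_mixture_pmf(probs, weights, 12)
mixture_ok = is_completely_monotone_on_prefix(pmf)

# A unimodal sequence with an interior mode fails already at first order.
not_monotone = [Fraction(1, 10), Fraction(4, 10), Fraction(3, 10), Fraction(2, 10)]
counterexample_ok = is_completely_monotone_on_prefix(not_monotone)

print(mixture_ok, counterexample_ok)
```

For a geometric mixture each alternating difference equals sum_i w_i (1 - t_i)^(r+1) t_i^j, which is nonnegative term by term, so the check passes for any prefix length.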
Permanent link
https://doi.org/10.3929/ethz-b-000581661
Publication status
published
Publisher
ETH Zurich
Organisational unit
08845 - Balabdaoui, Fadoua (Tit.-Prof.)