Parameter validation for a genomics population analysis

Buj Douirin, Paloma Marie
A study by Ollé, J. and Viñas, J. aims to support guidelines for the regulation of Atlantic Bonito, an overfished species found in the Mediterranean, along the coasts of the Iberian Peninsula, and off northwestern Africa. Because the species is commercially important, its populations need to be delimited in order to design a conservation plan that protects biodiversity and prevents the loss of genetic variability. To this end, samples have been collected from 92 individuals from Tunisia, Spain, northern Portugal, southern Portugal, Morocco, Mauritania, Senegal, and Ivory Coast. Restriction site-associated DNA sequencing (RADseq) has been used to genotype all individuals and to detect genetic differences among populations. The resulting sequence data must be processed with an assembly program called Stacks, whose parameters m, M, and n determine how the assemblies are built and influence both the coverage and the number of polymorphic sites detected. This study therefore investigates how m, M, and n affect a small subset of the data, namely three individuals per population, in order to find the values that most efficiently recover the maximum number of true polymorphic loci with high sequence coverage. Eleven tests are generated with different parameter combinations, corresponding to mid-range values within each parameter's range of possible values. The coverage and polymorphic loci data obtained from Stacks are then compared with the nonparametric Kruskal-Wallis test, which reveals significant differences between the tests for both polymorphic loci and coverage. Coverage depends mainly on the parameter m. Although the highest and lowest assigned values of m differ significantly, coverage is already high for all the values of m tested.
Finally, the polymorphic site data reveal many more significant differences between groups, which depend primarily on M and n. The most important conclusion of this project is that every RADseq data set requires a prior parameter validation step, since optimal values can vary between species. For the data used in this project, there is no single correct combination of parameters. The results therefore serve as a guide for future projects that use data from Atlantic Bonito or related species when choosing optimal values for a specific data set.
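To illustrate the kind of comparison described above, the following is a minimal sketch of the Kruskal-Wallis H statistic with the standard tie correction, written in plain Python. It is not the analysis code used in the study: the function name `kruskal_wallis_h`, the test labels, and the loci counts are all hypothetical and made up for illustration only.

```python
from itertools import chain


def kruskal_wallis_h(groups):
    """Return (H, degrees of freedom) for a list of samples.

    H is the tie-corrected Kruskal-Wallis statistic. Under the null
    hypothesis it is approximately chi-square distributed with
    len(groups) - 1 degrees of freedom.
    """
    values = sorted(chain.from_iterable(groups))
    n_total = len(values)

    # Give every distinct value the average of the ranks it occupies.
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n_total:
        j = i
        while j < n_total and values[j] == values[i]:
            j += 1
        t = j - i                              # size of this tie group
        rank_of[values[i]] = (i + 1 + j) / 2   # mean of ranks i+1 .. j
        tie_term += t**3 - t
        i = j

    correction = 1 - tie_term / (n_total**3 - n_total)
    if correction == 0:
        raise ValueError("all observations are identical")

    # Sum of (rank sum squared / group size) over the groups.
    weighted = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12.0 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)
    return h / correction, len(groups) - 1


# Hypothetical polymorphic loci counts per individual for three
# parameter combinations (invented numbers, not data from the study).
loci_counts = {
    "test_A": [4100, 4250, 3980, 4175],
    "test_B": [5200, 5050, 5310, 4990],
    "test_C": [4600, 4720, 4580, 4810],
}
h_stat, dof = kruskal_wallis_h(list(loci_counts.values()))
print(f"H = {h_stat:.3f}, df = {dof}")
```

A p-value would then be read from the chi-square distribution with the returned degrees of freedom. In practice a library routine such as `scipy.stats.kruskal` performs the whole test in one call, and a significant result is usually followed by pairwise post hoc comparisons to see which tests differ.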
This document is subject to a Creative Commons license: Attribution - NonCommercial - NoDerivatives (by-nc-nd), Creative Commons by-nc-nd 4.0