Bicocca Open Archive

Two of the main bioinformatics fields that have been influenced by the introduction of the Next-Generation Sequencing (NGS) techniques are transcriptomics and metagenomics. The adoption of these new methods to sequence DNA/RNA molecules has drastically changed the kind and also the amount of produced data. The effect is that all the developed algorithms and tools working on traditional data cannot be applied on NGS data. For this reason, in this thesis we face two central problems in two fields: transcriptmics and metagenomics. The first one regards the characterization of the Alternative Splicing (AS) events starting from NGS sequences coming from transcripts (called RNA-Seq reads). To this aim we have modeled the structure of a gene, with respect to the AS variations occurring in it, by using a graph representation (called splicing graph). More specifically, we have identified the conditions for the correct reconstruction of the splicing graph, starting from RNA-Seq data, and we have realized an algorithm for its construction. Moreover, our method is able to correct reconstruct the splicing graph even when the input RNA-Seq reads do not respect the identified conditions. Finally, we have performed an experimental analysis of our procedure in order to validated the obtained results. The second problem we face in this thesis is the assignment of NGS read, coming from a metagenomic sample, to a reference taxonomic tree, in order to assess the composition of the sample and classify the unknown micro-organisms in it. This is done by aligning the reads to the taxonomic tree and then choosing (when there are more valid matches) the node that best represents the read. This choice is based on the calculation of a Penalty Score (PS) function for all the nodes descending from the lowest common ancestor of the valid matches in the tree. We have realized an optimal algorithm for the computation of the PS function, based on the so called skeleton tree, which improve the performances of the taxonomic assignment procedure. We have also implemented the method by using more efficient data structures, with respect to the one used in the previous version of the procedure. Finally, we have offered the possibility to switch among different taxonomies by developing a method to map trees and translate the input alignments.

(2013). Algorithms for next generation sequencing data analysis. (Tesi di dottorato, Università degli Studi di Milano-Bicocca, 2013).

Algorithms for next generation sequencing data analysis

BERETTA, STEFANO

2013

Abstract

Two of the main bioinformatics fields that have been influenced by the introduction of the Next-Generation Sequencing (NGS) techniques are transcriptomics and metagenomics. The adoption of these new methods to sequence DNA/RNA molecules has drastically changed the kind and also the amount of produced data. The effect is that all the developed algorithms and tools working on traditional data cannot be applied on NGS data. For this reason, in this thesis we face two central problems in two fields: transcriptmics and metagenomics. The first one regards the characterization of the Alternative Splicing (AS) events starting from NGS sequences coming from transcripts (called RNA-Seq reads). To this aim we have modeled the structure of a gene, with respect to the AS variations occurring in it, by using a graph representation (called splicing graph). More specifically, we have identified the conditions for the correct reconstruction of the splicing graph, starting from RNA-Seq data, and we have realized an algorithm for its construction. Moreover, our method is able to correct reconstruct the splicing graph even when the input RNA-Seq reads do not respect the identified conditions. Finally, we have performed an experimental analysis of our procedure in order to validated the obtained results. The second problem we face in this thesis is the assignment of NGS read, coming from a metagenomic sample, to a reference taxonomic tree, in order to assess the composition of the sample and classify the unknown micro-organisms in it. This is done by aligning the reads to the taxonomic tree and then choosing (when there are more valid matches) the node that best represents the read. This choice is based on the calculation of a Penalty Score (PS) function for all the nodes descending from the lowest common ancestor of the valid matches in the tree. We have realized an optimal algorithm for the computation of the PS function, based on the so called skeleton tree, which improve the performances of the taxonomic assignment procedure. We have also implemented the method by using more efficient data structures, with respect to the one used in the previous version of the procedure. Finally, we have offered the possibility to switch among different taxonomies by developing a method to map trees and translate the input alignments.

Scheda breve

Scheda completa

Scheda completa (DC)

	Tutor afferente a Bicocca
	
			BONIZZONI, PAOLA
		
	Supervisori e coordinatori esterni
	
			DELLA VEDOVA, GIANLUCA
		
	Parole chiave
	
			bioinformatics, computational biology, alternative splicing, transcript, gene structure, NGS, next generation sequencing, RNA-seq, algorithms, metagenomics, splicing graph
		
	* Settore disciplinare della tesi
	
			INF/01 - INFORMATICA
		
	* Lingua del contenuto
	
			English
		
	* Data di discussione
	
			22-feb-2013
		
	* Corso di dottorato
	
			INFORMATICA - 22R
		
	* Ciclo di dottorato
	
			25
		
	* Anno accademico di conseguimento titolo
	
			2011/2012
		
	Fulltext
	
			open
		
	Citazione
	
			(2013). Algorithms for next generation sequencing data analysis. (Tesi di dottorato, Università degli Studi di Milano-Bicocca, 2013).
		
	Appare nelle tipologie:
	
			07 - Tesi di dottorato Bicocca post 2009

File in questo prodotto:

File	Dimensione	Formato
Phd_unimib_057617.pdf accesso aperto Tipologia di allegato: Doctoral thesis Dimensione 2.63 MB Formato Adobe PDF Visualizza/Apri	2.63 MB	Adobe PDF	Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/10281/42355

Citazioni

ND

ND

Social impact