Approximate pattern matching using search schemes and in-text verification

Renders, Luca; Depuydt, Lore; Fostier, Jan

Approximate pattern matching using search schemes and in-text verification

Luca Renders (UGent) , Lore Depuydt (UGent) and Jan Fostier (UGent)

(2022) Faculty of Engineering and Architecture Research Symposium 2022 (FEARS 2022), Abstracts.

Author

Luca Renders (UGent) , Lore Depuydt (UGent) and Jan Fostier (UGent)

Organization

Department of Information technology

Project

Search schemes for sequence alignment

Abstract

Approximate pattern matching entails finding approximate occurrences of a search pattern P in a search text T. In a typical bioinformatics scenario, P is a DNA fragment (a read) and T a (collection of) reference genomes(s). Search schemes are an efficient computational technique that guarantee to exhaustively identify all occurrences within a pre-specified number of allowed errors (substitutions and indels). Using a bidirectional FM-index, they describe how to traverse the search space in such a way that runtime is minimized. Despite their optimal time complexity, basic operations on the FM-index require expensive random memory access. We examine whether in-text verification, in which a candidate occurrence is validated in the search text via a bit-parallel pairwise alignment approach, can be used to supplement in-index matching using search schemes. In comparison to pure in-index matching, we find that hybrid in-index/in-text matching can reduce running time by more than a factor of two. We introduce Columba 1.1, an open-source C++ software program that efficiently implements these ideas. Columba 1.1 can locate, within a maximum edit distance of 4, all occurrences of 100,000 reads (150 bp) in the human reference genome in about 30 seconds. This outperforms existing, state-of-the-art lossless alignment tools such as Bwolo, Yara and GEM and is comparable to the lossy aligner BWA in mem mode.

Downloads

Download

fears 2022 pitch.pdf
- full text (Author's original)
- |
- open access
- |
- PDF
- |
- 75.86 KB

Citation

Please use this url to cite or link to this publication: http://hdl.handle.net/1854/LU-01GMVAZ64V3YYZZA12S990AGQC

MLA: Renders, Luca, et al. “Approximate Pattern Matching Using Search Schemes and In-Text Verification.” Faculty of Engineering and Architecture Research Symposium 2022 (FEARS 2022), Abstracts, 2022, doi:10.5281/zenodo.7401042.
APA: Renders, L., Depuydt, L., & Fostier, J. (2022). Approximate pattern matching using search schemes and in-text verification. Faculty of Engineering and Architecture Research Symposium 2022 (FEARS 2022), Abstracts. Presented at the Faculty of Engineering and Architecture Research Symposium 2022 (FEARS 2022), Ghent, Belgium. https://doi.org/10.5281/zenodo.7401042
Chicago author-date: Renders, Luca, Lore Depuydt, and Jan Fostier. 2022. “Approximate Pattern Matching Using Search Schemes and In-Text Verification.” In Faculty of Engineering and Architecture Research Symposium 2022 (FEARS 2022), Abstracts. https://doi.org/10.5281/zenodo.7401042.
Chicago author-date (all authors): Renders, Luca, Lore Depuydt, and Jan Fostier. 2022. “Approximate Pattern Matching Using Search Schemes and In-Text Verification.” In Faculty of Engineering and Architecture Research Symposium 2022 (FEARS 2022), Abstracts. doi:10.5281/zenodo.7401042.
Vancouver: 1.
Renders L, Depuydt L, Fostier J. Approximate pattern matching using search schemes and in-text verification. In: Faculty of Engineering and Architecture Research Symposium 2022 (FEARS 2022), Abstracts. 2022.
IEEE: [1]
L. Renders, L. Depuydt, and J. Fostier, “Approximate pattern matching using search schemes and in-text verification,” in Faculty of Engineering and Architecture Research Symposium 2022 (FEARS 2022), Abstracts, Ghent, Belgium, 2022.

@inproceedings{01GMVAZ64V3YYZZA12S990AGQC,
  abstract     = {{Approximate pattern matching entails finding approximate occurrences of a search pattern P in a search text T. In a typical bioinformatics scenario, P is a DNA fragment (a read) and T a (collection of) reference genomes(s). Search schemes are an efficient computational technique that guarantee to exhaustively identify all occurrences within a pre-specified number of allowed errors (substitutions and indels). Using a bidirectional FM-index, they describe how to traverse the search space in such a way that runtime is minimized. Despite their optimal time complexity, basic operations on the FM-index require expensive random memory access. We examine whether in-text verification, in which a candidate occurrence is validated in the search text via a bit-parallel pairwise alignment approach, can be used to supplement in-index matching using search schemes. In comparison to pure in-index matching, we find that hybrid in-index/in-text matching can reduce running time by more than a factor of two. We introduce Columba 1.1, an open-source C++ software program that efficiently implements these ideas. Columba 1.1 can locate, within a maximum edit distance of 4, all occurrences of 100,000 reads (150 bp) in the human reference genome in about 30 seconds. This outperforms existing, state-of-the-art lossless alignment tools such as Bwolo, Yara and GEM and is comparable to the lossy aligner BWA in mem mode.}},
  author       = {{Renders, Luca and Depuydt, Lore and Fostier, Jan}},
  booktitle    = {{Faculty of Engineering and Architecture Research Symposium 2022 (FEARS 2022), Abstracts}},
  language     = {{eng}},
  location     = {{Ghent, Belgium}},
  pages        = {{1}},
  title        = {{Approximate pattern matching using search schemes and in-text verification}},
  url          = {{http://doi.org/10.5281/zenodo.7401042}},
  year         = {{2022}},
}

Altmetric: View in Altmetric

Academic Bibliography

Approximate pattern matching using search schemes and in-text verification

Downloads

Citation

Contents

Support

Contact