Scoring Unusual Words with Varying Mismatch Errors

Apostolico, Alberto; Pizzi, Cinzia

doi:10.1007/s11786-007-0032-4

Patterns consisting of strings with a bounded number of mismatches are central to coding theory and find multiple applications in text processing and computational biology. In this latter field, the presence of over-represented patterns of this kind has been linked, for instance, to modeling regulatory regions in biosequences. The study and computation of expected number of occurrences and related scores for these patterns is made difficult by the sheer explosion of the roster of candidates that need to be evaluated. In recent work, properties of pattern saturation and score monotonicity have proved capable to mitigate this problem. In such a context, expectation and score monotonicity has been established within the i.i.d. model for all cases of interest except that of a fixed word length with a varying number of mismatches. The present paper completes this investigation by showing that the expected number of occurrences in a textstring for such a word is bi-tonic, that is, behaves as a unimodal function of the number of errors. This extends to this case the time and space savings brought about by discovery algorithms based on pattern saturation.