Title
Improving compression ratio on human-readable text by masking predictable characters
Abstract
A text preprocessing algorithm was developed that improves the compression ratio of a standard compression tool, bzip2. During preprocessing, characters in the original text files were replaced by a few special characters so that the Burrows-Wheeler Transform stage of bzip2 performed better. These special characters made the words less recognizable, but the words remained recoverable by exploiting the semantic relations between words in the text. The recovery process used a static English dictionary and a pretrained static neural network model, Word2vec. Experiments showed that this method increased the compression ratio by an average of 2.9% for text files over 100 KB.
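The abstract does not give the exact masking scheme, but the core idea can be sketched: when a prefix of a word uniquely determines that word in a static dictionary, the predictable suffix can be replaced by a single special character, and the recovery pass can restore it by the same lookup. The sketch below is a minimal, hypothetical illustration using only a dictionary (the actual method also exploits Word2vec-based semantic relations, which are omitted here); the mask byte `"\x01"` and the helper names are assumptions, not the paper's notation.

```python
MASK = "\x01"  # assumed placeholder byte; the paper uses "a few special characters"

def mask_word(word, dictionary):
    """Replace a word's predictable suffix with MASK if some proper prefix
    of the word matches exactly one dictionary entry."""
    for i in range(1, len(word)):
        prefix = word[:i]
        matches = [w for w in dictionary if w.startswith(prefix)]
        if matches == [word]:
            return prefix + MASK  # suffix is predictable, so drop it
    return word  # no uniquely identifying prefix; leave unchanged

def unmask_word(token, dictionary):
    """Invert mask_word: expand a masked prefix back to its unique word."""
    if token.endswith(MASK):
        prefix = token[:-1]
        matches = [w for w in dictionary if w.startswith(prefix)]
        if len(matches) == 1:
            return matches[0]
    return token

# Toy dictionary and round trip (illustrative only):
words = ["compression", "compressor", "transform", "text"]
masked = [mask_word(w, words) for w in ["transform", "text", "compression"]]
restored = [unmask_word(t, words) for t in masked]
```

Replacing many distinct suffixes with one repeated symbol makes the input more regular, which is the kind of redundancy the Burrows-Wheeler Transform stage exploits, while the dictionary keeps the transformation lossless.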
Funding information
This research was supported by the Undergraduate Research Opportunities Program (UROP).
Suggested Citation
Trinh, Nam H. (2020). Improving compression ratio on human-readable text by masking predictable characters. Retrieved from the University of Minnesota Digital Conservancy, https://hdl.handle.net/11299/216732.