Masters Thesis

Whirlpool: A microservice style scalable continuous topical web crawler

Historically, web crawlers/bots/spiders have been well known for indexing, ranking websites on the internet. This thesis augments the crawling activity but approaches the problem through the lens of a data engineer. Whirlpool as a continuous, topical web crawling tool is also a data ingestion pipeline implemented from bottom-up using RabbitMQ which is a high performance messaging buffer to organize the data flow within its network. It is based on a open, standard blueprint design of mercator. This paper discusses the high and low level design of this complex program covering auxiliary data structures, object-oriented design, addressing scalability concerns, and deployment on AWS. The project name Whirlpool is used as an analogy referring to the naturally occurring phenomenon where opposing water currents in sea cause water to spin round and round drawing various objects into it.

Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.