Whirlpool: A microservice style scalable continuous topical web crawler

Pereira, Rihan Stephen

Masters Thesis

Historically, web crawlers (also called bots or spiders) have been best known for indexing and ranking websites on the internet. This thesis augments the crawling activity but approaches the problem through the lens of a data engineer. Whirlpool, a continuous, topical web crawling tool, is also a data ingestion pipeline implemented from the bottom up using RabbitMQ, a high-performance messaging buffer, to organize the data flow within its network. It is based on the open, standard blueprint design of Mercator. This thesis discusses the high- and low-level design of this complex program, covering auxiliary data structures, object-oriented design, scalability concerns, and deployment on AWS. The project name Whirlpool is an analogy to the naturally occurring phenomenon in which opposing water currents in the sea cause water to spin round and round, drawing various objects into it.

Date

2019-12

Resource Type

Masters Thesis

Creator

Pereira, Rihan Stephen

Advisor

Soltys, Dr. Michael

Campus

Channel Islands

Publisher

California State University, Channel Islands

Degree Level

Masters

Date Accessioned

2020-01-27T18:02:21Z

Handle

http://hdl.handle.net/10211.3/214919

Made available in DSpace on 2020-01-27T18:02:21Z (GMT). No. of bitstreams: 1. Pereira, Rihan MSCS Thesis Fall 19_OCRDone.pdf: 20672243 bytes, checksum: 8b3c867cd5445d69a0c5ed9cdb51c5e0 (MD5). Previous issue date: 2019-12.
Submitted by Sebastian Hunt (sebastian.hunt262@myci.csuci.edu) on 2020-01-27T18:02:21Z.

Language

English

Title	Date Uploaded	Visibility
Pereira__Rihan_MSCS_Thesis_Fall_19_OCRDone.pdf	2020-04-08	Public
