YouTube data analysis using Hadoop

Khosla, Charu

Graduate Project

Analysis of structured data has seen tremendous success in the past. However, analysis of large-scale unstructured data such as video remains a challenging area. YouTube, a Google company, has over a billion users and generates billions of views. Because YouTube data is created in huge volumes and at high speed, there is strong demand to store, process, and carefully study this data to make it usable. The main objective of this project is to demonstrate, using Hadoop concepts, how data generated from YouTube can be mined and used to make targeted, real-time, and informed decisions. The project uses the YouTube Data API (Application Programming Interface), which lets applications and websites incorporate the functions that the YouTube application itself uses to fetch and view information. The Google Developers Console is used to generate a unique API key, which is required to fetch public YouTube channel data. Once the API key is generated, a .NET (C#) console application uses the YouTube API to fetch video information based on search criteria. The text file produced by the console application is stored in HDFS (Hadoop Distributed File System) and then loaded into a Hive database. HDFS is a core Hadoop component, and a user can interact with it directly through the shell-like commands that Hadoop supports. The project then runs SQL-like queries on this data using Hive to extract meaningful output that management can use for analysis.
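
As a rough illustration of the final analysis step, the sketch below aggregates view counts per category from a tab-separated text file, which is the kind of result a Hive GROUP BY query would produce. It is written in Python only to keep the example self-contained; the project itself uses a C# console application and Hive. The four-column layout (video id, title, category, view count), the sample rows, and the function name views_by_category are assumptions made for illustration and are not taken from the project.

```python
import csv
import io
from collections import defaultdict

# Hypothetical record layout (an assumption, not the project's actual schema):
# video_id <TAB> title <TAB> category <TAB> view_count
SAMPLE = (
    "v1\tIntro to Hadoop\tEducation\t1200\n"
    "v2\tHive basics\tEducation\t800\n"
    "v3\tCat compilation\tEntertainment\t5000\n"
)


def views_by_category(lines):
    """Sum view counts per category, highest total first.

    Roughly what a Hive query of the form
    SELECT category, SUM(view_count) ... GROUP BY category ORDER BY 2 DESC
    would return for the same rows.
    """
    totals = defaultdict(int)
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # skip rows that do not match the assumed layout
        _video_id, _title, category, views = row
        try:
            totals[category] += int(views)
        except ValueError:
            continue  # skip rows whose view count is not an integer
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for category, total in views_by_category(io.StringIO(SAMPLE)):
        print(f"{category}\t{total}")
```

In a Hadoop deployment the same aggregation would run inside Hive over the file stored in HDFS, so the grouping would be distributed across the cluster rather than performed in a single process.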
