Data Cleaning on Edge
Abstract
The Internet of Things (IoT) has enabled new ecosystems and business models that now affect nearly every aspect of human life, and AI has become a key component of these ecosystems. The data they generate, together with machine learning algorithms, act as the brain of the system through a centralized cloud model. However, to meet the needs of critical applications that require low latency and to take advantage of private data, machine learning is shifting from the centralized cloud to the distributed edge. ML models are only as good as their input data, so data quality becomes the key success factor, which creates the need for real-time cleaning at the edge. Techniques in use today either require manual intervention to clean the data or, when fully automated, do not work efficiently. The two-phase process proposed in this research combines two different techniques that complement each other to remove almost all of the outliers. The first phase prepares a base for the second phase to avoid overfitting, while the second phase splits the data into subsets based on trends to remove the outliers.
The data is then imputed, which gives a near-perfect representation of the cleaned data in a completely automated way. We compare the two algorithms derived from this process with standard algorithms and find that both outperform the standard algorithms. The technique is univariate, but it can be extended to a multivariate technique through an ensemble method, which makes it possible to clean an entire data set and obtain a better representation of the complete data. It is useful not only in the IoT domain but also in the telecom domain, where data-driven decisions at the edge are becoming critical with the advent of 5G. The algorithm also facilitates AutoML and federated learning.
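The abstract does not name the two techniques it combines, so the following is only a minimal sketch of the general shape of a two-phase univariate cleaner, not the thesis's algorithm. It assumes a modified z-score (median and MAD) as the outlier test in both phases, fixed-size windows as a stand-in for the trend-based subsets, and linear interpolation as the imputation step. The function names and the `coarse`, `fine`, and `window` parameters are hypothetical.

```python
from statistics import median


def mad_mask(values, threshold):
    """Flag values whose modified z-score (median/MAD based) exceeds threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread to measure against; flag nothing rather than divide by zero.
        return [False] * len(values)
    return [0.6745 * abs(v - med) / mad > threshold for v in values]


def interpolate(values, flagged):
    """Replace flagged points by linear interpolation between valid neighbours."""
    good = [i for i, f in enumerate(flagged) if not f]
    if not good:
        raise ValueError("no valid points left to interpolate from")
    out = list(values)
    for i, is_bad in enumerate(flagged):
        if not is_bad:
            continue
        left = max((g for g in good if g < i), default=None)
        right = min((g for g in good if g > i), default=None)
        if left is None:
            out[i] = values[right]   # leading gap: copy the first valid value
        elif right is None:
            out[i] = values[left]    # trailing gap: copy the last valid value
        else:
            t = (i - left) / (right - left)
            out[i] = values[left] + t * (values[right] - values[left])
    return out


def clean_series(values, coarse=10.0, fine=3.5, window=8):
    """Two-phase cleaning sketch (assumed design, not the thesis's method)."""
    n = len(values)
    # Phase 1 (assumption): a loose global test removes only gross outliers,
    # so they cannot distort the local statistics used in phase 2.
    flagged = mad_mask(values, coarse)
    # Phase 2 (assumption): fixed windows stand in for trend-based subsets;
    # each window is tested against its own median with a tighter threshold.
    for start in range(0, n, window):
        idx = [i for i in range(start, min(start + window, n)) if not flagged[i]]
        if len(idx) < 3:
            continue  # too few points for a meaningful local test
        local = mad_mask([values[i] for i in idx], fine)
        for i, is_bad in zip(idx, local):
            if is_bad:
                flagged[i] = True
    # Imputation (assumption): linear interpolation over the flagged points.
    return interpolate(values, flagged), flagged
```

In a series with a level shift, a single global test tends to miss moderate outliers that sit between the two levels, which is the kind of case the per-subset second pass is meant to catch in this sketch.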
Citation
Hooli, Mayuresh (2019). Data Cleaning on Edge. Master's thesis, Texas A&M University. Available electronically from https://hdl.handle.net/1969.1/188746.