Please use this identifier to cite or link to this item: https://hdl.handle.net/2440/133575
Type: Journal article
Title: Generating name-like vectors for testing large-scale entity resolution
Author: Herath, S.
Roughan, M.
Glonek, G.
Citation: IEEE Access, 2021; 9:145288-145300
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Issue Date: 2021
ISSN: 2169-3536
Statement of Responsibility: Samudra Herath, Matthew Roughan and Gary Glonek
Abstract: Entity resolution (ER), the problem of identifying and linking records that belong to the same real-world entities in structured and unstructured data, is a primary task in data integration. Accurate and efficient ER has a major practical impact on various applications across commercial, security and scientific domains. Recently, scalable ER techniques have received enormous attention with the increasing need to combine large-scale datasets. The shortage of training and ground truth data impedes the development and testing of ER algorithms. Good public datasets, especially those containing personal information, are restricted in this area and usually small in size. Due to privacy and confidentiality issues, testing algorithms or techniques with real datasets is challenging in ER research. Simulation is one technique for generating synthetic datasets that have characteristics similar to those of real data for testing algorithms. Many existing simulation tools in ER lack support for generating large-scale data and have problems in complexity, scalability, and limitations of resampling. In our work, we propose a simple, inexpensive, and fast synthetic data generation tool. Our tool only generates entity names in the first stage, but these are commonly used as identification keys in ER algorithms. We avoid the detail-level simulation of entity names using a simple vector representation that delivers simplicity and efficiency. In this paper, we discuss how to simulate simple vectors that approximate the properties of entity names. We describe the overall construction of the tool based on data analysis of a namespace that contains entity names collected from the actual environment.
Keywords: Entity resolution; data integration; data linkage; data matching; information systems; large-scale synthetic data; record linkage
Rights: This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
DOI: 10.1109/ACCESS.2021.3122451
Grant ID: ARC
Published version: http://dx.doi.org/10.1109/access.2021.3122451
Appears in Collections: Electrical and Electronic Engineering publications

Files in This Item:
File: hdl_133575.pdf
Description: Published Version
Size: 1.26 MB
Format: Adobe PDF

