Distributed Clustering for Homogeneous and Heterogeneous Databases

Title:	Distributed Clustering for Homogeneous and Heterogeneous Databases (同質性與異質性資料庫之分散式分群)
Authors:	Huang, Chih-Yuan (黃致遠); Lin, Ja-Chen (林志青); Institute of Computer Science and Engineering
Keywords:	clustering; distributed clustering; homogeneous databases; heterogeneous databases
Date issued:	2017
Abstract:	Clustering is a common and important technique widely used in many fields. Traditional centralized clustering approaches require gathering all distributed data at a single site before clustering them. As data volumes keep growing, however, bandwidth limitations, communication cost, and privacy concerns often make it infeasible to transfer all data to a central site. To overcome these problems, we propose distributed clustering methods. In this study, we divide the distribution of databases into homogeneous and heterogeneous structures and propose two methods that address both distributed clustering problems. The first method, the vector-quantization (VQ) method, finds representatives of each local site, transmits them to the central site, and uses all the local representatives to determine the final clustering result for the whole data. The second method, the affinity-aggregation (AA) method, aims at a greater saving in transmission cost than the first. Instead of transferring the local representatives in full, it transfers to the central site only representative information or some important characteristics of the representatives. This reduced information is then used to determine the pairwise similarity of the whole data, from which the final clustering result is obtained. Three datasets, Iris, Wine, and Breast cancer, are tested in our experiments. To simulate the homogeneous and heterogeneous database structures, we split each dataset into several subsets. Both proposed methods effectively reduce the transmission cost. More importantly, their clustering results are very close to those of centralized clustering approaches, which use all the data from all the databases.
URI:	http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070456122 http://hdl.handle.net/11536/140912
Appears in collections:	Theses
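
The representative-based idea that the abstract describes for the first method can be illustrated with a minimal sketch. This is not the thesis's algorithm, whose details this record does not give. It assumes plain k-means as the vector quantizer, a homogeneous split (every site holds the same features for different records), and synthetic two-dimensional data in place of Iris, Wine, or Breast cancer. The names `kmeans`, `distributed_cluster`, `local_k`, and `global_k` are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    for _ in range(iters):
        labels = [nearest(p, centers) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    labels = [nearest(p, centers) for p in points]
    return centers, labels

def distributed_cluster(sites, local_k, global_k):
    """Cluster data spread over several sites by sending only representatives."""
    # Local step: each site compresses its own data into local_k representatives.
    local = [kmeans(points, local_k) for points in sites]
    # Transmission step: only the representatives reach the central site.
    reps = [center for centers, _ in local for center in centers]
    # Central step: cluster the representatives (assumption: unweighted,
    # i.e. the number of points behind each representative is ignored).
    _, rep_labels = kmeans(reps, global_k)
    # Each point inherits the global label of its local representative.
    result, offset = [], 0
    for centers, labels in local:
        result.append([rep_labels[offset + lab] for lab in labels])
        offset += len(centers)
    return result, len(reps)

def make_blob(rng, cx, cy, n):
    """Synthetic Gaussian blob, a stand-in for one class of a real dataset."""
    return [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)) for _ in range(n)]

rng = random.Random(0)
site_a = make_blob(rng, 0, 0, 20) + make_blob(rng, 10, 10, 20)
site_b = make_blob(rng, 0, 0, 20) + make_blob(rng, 10, 10, 20)
labels, n_sent = distributed_cluster([site_a, site_b], local_k=4, global_k=2)
print(f"sent {n_sent} representatives instead of {len(site_a) + len(site_b)} points")
```

In this sketch the transmission saving comes from sending `local_k` representatives per site rather than every record. The second method described in the abstract reduces the transmitted information further and is not modeled here.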