
臺灣博碩士論文加值系統 (National Digital Library of Theses and Dissertations in Taiwan)


Detailed Record

Author: 林承翰
Author (English): Cheng-Han Lin
Title: 在不平衡的資料中改進少數類別分類正確率的方法之研究
Title (English): Under-Sampling Approaches for Improving Classification Accuracy of Minority Class in an Imbalanced Dataset
Advisors: 顏秀珍, 徐嘉連
Advisor (English): Show-Jane Yen
Degree: Master's
Institution: 輔仁大學 (Fu Jen Catholic University)
Department: 資訊工程學系 (Computer Science and Information Engineering)
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Year of publication: 2005
Graduation academic year: 93 (2004–2005)
Language: English
Pages: 51
Keywords (Chinese): 資料探勘, 分類分析, 不平衡類別分佈, 分群技術, 減少多數法
Keywords (English): Data Mining, Classification, Imbalanced Class Distribution, Clustering, Under-Sampling Approach
Usage statistics:
  • Cited by: 2
  • Views: 1013
  • Downloads: 83
  • Bookmarked: 1
With the rapid development of data mining techniques, there are countless examples of their application to real-world databases. Classification is one of the most widely discussed and used mining techniques. The quality of a classification result depends on the prediction accuracy of the trained model, and one of the most important factors affecting prediction accuracy is the training data. Ideally, the class distribution of the training data should be fairly even, but in real data one class often accounts for the great majority of the records, and a classification system trained on such data tends to predict the class that holds most of the data. For example, in a bank's customer database, fiduciary-loan customers may account for only ten percent of all customers. If the bank applies classification to predict whether customers will take fiduciary loans in the future and trains the classifier on all of the data, the resulting system will predict that the vast majority of customers will not take a loan. This contradicts the bank's goal of identifying potential loan customers: although the bank holds a large amount of customer information, it obtains a prediction system with poor accuracy and little practical value. In this class-imbalance problem, selecting appropriate training data is therefore a crucial factor.
This study proposes under-sampling methods that incorporate clustering: in data with an imbalanced class distribution, representative training samples are selected from the majority class and used, together with an artificial neural network algorithm, to improve the classification performance of the prediction system, with other under-sampling methods from the literature serving as baselines. The experimental results show that the cluster-based under-sampling methods reasonably improve the prediction accuracy of the classification system and reduce the time needed to train it.
Data analysis with data mining techniques is becoming increasingly common in daily life. Classification is one of the most important and well-known data mining methods, and one of the most important factors for improving classification accuracy is the training data. However, data in real-world applications often have an imbalanced class distribution; that is, most of the data are in the majority class and only a few are in the minority class. In this case, if all of the data are used as training data, the classifier tends to predict that most of the incoming data belong to the majority class. For example, suppose a bank would like to construct a classifier to predict whether its customers will take fiduciary loans in the future, and the customers who have had fiduciary loans make up only ten percent of all customers. If the bank uses all of the customer data to train the classifier, the classifier will predict that most customers will not take fiduciary loans in the future. Such a classifier violates the bank's goal of finding as many potential fiduciary-loan customers as possible: though the bank has a lot of customer data, it cannot derive a prediction system that is both accurate at finding the target customers and practical. Hence, it is important to select suitable training data for classification under an imbalanced class distribution.
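To make the bank example concrete, here is a small self-contained Python sketch (the 900/100 split and variable names are illustrative, not taken from the thesis) showing that a classifier which always predicts the majority class reaches 90% accuracy while finding none of the target customers:

```python
# Toy illustration (figures are hypothetical): a 90/10 class split, and a
# classifier that always predicts the majority class ("no fiduciary loan").
y_true = [0] * 900 + [1] * 100   # 0 = no loan (majority), 1 = loan (minority)
y_pred = [0] * 1000              # always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
minority_recall = true_positives / 100

print(f"accuracy = {accuracy:.0%}, minority recall = {minority_recall:.0%}")
# accuracy = 90%, minority recall = 0% -- high accuracy, useless for the bank
```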
In this study, we propose cluster-based under-sampling approaches that select representative data as training data, in order to improve the classification accuracy for the minority class under an imbalanced class distribution. The experimental results show that our cluster-based under-sampling approaches outperform the other under-sampling techniques from previous studies.
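As a rough illustration of the general idea (a minimal sketch, assuming a k-means clustering step; the function name and parameters are hypothetical and this is not necessarily the thesis's exact algorithm), the majority class can be partitioned into clusters and sampled from each cluster in proportion to its size, so that the retained majority examples match the minority class in number:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_based_undersample(X, y, majority_label, k=5, random_state=0):
    """Select representative majority-class samples by clustering.

    The majority class is partitioned into k clusters with k-means, and
    samples are drawn from each cluster in proportion to its size, so the
    selected majority samples roughly match the minority class in number.
    """
    rng = np.random.default_rng(random_state)
    maj_idx = np.where(y == majority_label)[0]
    min_idx = np.where(y != majority_label)[0]

    # Cluster only the majority-class points.
    labels = KMeans(n_clusters=k, n_init=10,
                    random_state=random_state).fit_predict(X[maj_idx])

    target = len(min_idx)  # keep about as many majority samples as minority ones
    keep = []
    for c in range(k):
        members = maj_idx[labels == c]
        # Proportional allocation: larger clusters contribute more samples
        # (rounding means the total is only approximately `target`).
        n_c = max(1, round(target * len(members) / len(maj_idx)))
        keep.extend(rng.choice(members, size=min(n_c, len(members)),
                               replace=False))

    sel = np.concatenate([np.array(keep, dtype=int), min_idx])
    return X[sel], y[sel]
```

A balanced training set built this way can then be fed to any classifier, for instance a backpropagation neural network as in this thesis.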
Chapter 1. Introduction
1.1 Classification Analysis
1.2 Motivation and Objectives
1.3 Architecture

Chapter 2. Related Work
2.1 Backpropagation Neural Network
2.2 Developed Methods for Solving Imbalanced Problem

Chapter 3. Approach
3.1 Under-Sampling Based On Clustering
3.2 Under-Sampling Based On Clustering and Distance

Chapter 4. Experiments
4.1 Generate the Artificial Datasets
4.2 Evaluation Standard
4.3 Experimental Results and Analyses of Artificial Datasets
4.4 Experimental Results and Analyses of Real Application

Chapter 5. Conclusion and Future Work
5.1 Conclusion
5.2 Future Work
[1]N. V. Chawla. “C4.5 and Imbalanced Datasets: Investigating the Effect of Sampling Method, Probabilistic Estimate, and Decision Tree Structure.” In Proceedings of the ICML’03 Workshop on Class Imbalances, August 2003.
[2]N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer. “SMOTE: Synthetic Minority Over-Sampling Technique.” Journal of Artificial Intelligence Research, vol. 16, pages 321-357, 2002.
[3]Doina Caragea, Dianne Cook, Vasant Honavar. “Gaining Insights into Support Vector Machine Pattern Classifiers Using Projection-Based Tour Methods.” In Proceedings of the KDD Conference, San Francisco, CA, pages 251-256, 2001.
[4]N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer. “Smoteboost: Improving Prediction of the Minority Class in Boosting.” In Proceedings of the Seventh European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 107-119, Dubrovnik, Croatia, 2003.
[5]P. Clark and T. Niblett. “The CN2 Induction Algorithm.” Machine Learning, 3(4):261-283, 1989.
[6]Chris Drummond, Robert C. Holte. “C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling.” In Proceedings of the ICML’03 Workshop on Learning from Imbalanced Datasets, 2003.
[7]C. Elkan. “The Foundations of Cost-sensitive Learning.” In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, pages 973-978, 2001.
[8]Yoav Freund, H. Sebastian Seung, Eli Shamir, Naftali Tishby. “Selective Sampling Using the Query by Committee Algorithm.” Machine Learning, v.28 n.2-3, pages 133-168, Aug./Sept. 1997.
[9]Rafael del-Hoyo, David Buldain, Alvaro Marco. “Supervised Classification with Associative SOM.” Lecture Notes in Computer Science 2686, pages 334-341. 7th International Work-Conference on Artificial and Natural Neural Networks, IWANN 2003.
[10]N. Japkowicz, editor. Proceedings of the AAAI’2000 Workshop on Learning from Imbalanced Data Sets, AAAI Tech Report WS-00-05. AAAI, 2000.
[11]N. Japkowicz. “Concept-learning in the Presence of Between-class and Within-class imbalances.” In Proceedings of the Fourteenth Conference of the Canadian Society for Computational Studies of Intelligence, pages 67-77, 2001.
[12]T. Jo and N. Japkowicz. “Class Imbalances versus Small Disjuncts.” SIGKDD Explorations, 6(1):40-49, 2004.
[13]Marti A. Hearst. “Trends Controversies: Support Vector Machine.” IEEE Intelligent System, 13(4):18-28, 1998.
[14]M. Maloof. “Learning when Data Sets are Imbalanced and when Costs are Unequal and Unknown.” In Proceedings of the ICML’03 Workshop on Learning from Imbalanced Data sets, 2003.
[15]L. M. Manevitz and M. Yousef. “One-class SVMs for Document Classification.” Journal of Machine Learning Research, 2:139-154, 2001.
[16]J. R. Quinlan. “C4.5: Programs for Machine Learning.” Morgan Kaufmann, San Mateo, CA, 1993.
[17]Rumelhart, D. E., Hinton, G. E., and Williams, R.J. “Learning Internal Representations by Error Propagation.” In Parallel Distributed Processing: Explorations in the Microstructures of Cognition, Vol. 1, D. E. Rumelhart and J. L. McClelland (Eds.), MIT Press, Cambridge, MA, pages 318-362, 1986.
[18]B. Raskutti and A. Kowalczyk. “Extreme Rebalancing for SVMs: A Case Study”. SIGKDD Explorations, 6(1):60-69, 2004.
[19]Shlomo Argamon-Engelson, Ido Dagan. “Committee-based Sample Selection for Probabilistic Classifiers.” Journal of Artificial Intelligence Research (JAIR), Vol. 11, pages 335-360, 1999.
[20]N. E. Sondak, V. K. Sondak. “Neural Networks and Artificial Intelligence.” In Proceedings of the twentieth SIGCSE technical symposium on Computer science education. 1989.
[21]P. Turney. “Types of Cost in Inductive Concept Learning.” In Proceedings of the ICML'2000 Workshop on Cost-Sensitive Learning, pages 15-21, 2000.
[22]B. Zadrozny and C. Elkan. “Learning and Making Decisions when Costs and Probabilities are Both Unknown.” In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 204-213, 2001.
[23]J. Zhang and I. Mani. “kNN Approach to Unbalanced Data Distributions: A Case Study Involving Information Extraction.” In Proceedings of the ICML’2003 Workshop on Learning from Imbalanced Datasets, 2003.
[24]齊玉美. “A Study on Asymmetric Classification Analysis” (不對稱性分類分析之研究). Master’s thesis, Institute of Information Management, National Sun Yat-sen University, 2003.
[25]林育臣. “A Study on Clustering Techniques” (群集技術之研究). Master’s thesis, Institute of Information Management, Chaoyang University of Technology, 2002.