Filtering Clinical Text Using Rule-based Natural Language Processing

윤은실

Filtering Clinical Text Using Rule-based Natural Language Processing (original Korean title: 규칙기반 자연어처리 기술을 이용한 의료문서 필터링)

Authors: 윤은실

Advisor: 최진욱

Major: College of Engineering, Interdisciplinary Program in Bioengineering

Issue Date: 2014-08

Publisher: Seoul National University Graduate School

Keywords: Natural language processing ; Rule-based classifier ; Decision Tree ; Clinical document filtering

Description: Thesis (Master's) -- Seoul National University Graduate School : Interdisciplinary Program in Bioengineering, 2014. 8. 최진욱.

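Abstract: Clinicians read ultrasound reports of thyroid patients every day, and patients whose reports raise a suspicion of thyroid cancer make up a small share compared with normal patients. Even so, clinicians read every report one by one to decide whether the patient is normal or has cancer. This study was undertaken to reduce the time and effort clinicians spend reading the reports of normal patients.
This study aims to develop a system that classifies ultrasound reports into normal cases and reports that need a second look, while assigning to the normal class only those cases that are definitely normal. The purpose is to prevent, as far as possible, reports that need a second look from being mixed in with the normal cases. To this end, we analyzed the reports, extracted meaningful keywords from them using natural language processing techniques, generated patterns from those keywords, and applied the patterns to a Rule-based Classifier. The Rule-based Classifier was developed in JAVA and achieved an accuracy of 92.34%. Weighted across all classes, the average precision was 0.946, recall 0.923, and F-measure 0.931.
In particular, the most important goal of this study, that the data classified as normal contain only truly normal patients, can be measured by the precision of the normal class, which was 1.0.
In addition, to compare and review the performance of the Rule-based Classifier, we used a Decision Tree and a Machine Learning Technique. In overall accuracy, the Rule-based Classifier was 0.74% lower than the Decision Tree. In terms of document count, however, this amounts to three misclassified documents. Moreover, the precision for normal patients, which this study treats as the key measure, was 0.962 for the Decision Tree and 0.945 for the Machine Learning Technique, both lower than 1.0.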
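To make the per-class precision and the weighted average quoted above concrete, here is a minimal Java sketch that computes both from a confusion matrix. The two-class layout (normal vs. needs review), the class name `PrecisionSketch`, and all counts are assumptions chosen for illustration. They are not the thesis data, and the thesis may have computed its metrics with different tooling.

```java
import java.util.Arrays;

final class PrecisionSketch {
    // confusion[i][j] = number of documents whose true class is i
    // and whose predicted class is j.

    /** Precision of one class: correct predictions / all predictions of that class. */
    static double precision(int[][] confusion, int cls) {
        int predicted = 0;
        for (int[] row : confusion) {
            predicted += row[cls];
        }
        return predicted == 0 ? 0.0 : (double) confusion[cls][cls] / predicted;
    }

    /** Per-class precision averaged with each class weighted by its true document count. */
    static double weightedPrecision(int[][] confusion) {
        int total = 0;
        double sum = 0.0;
        for (int i = 0; i < confusion.length; i++) {
            int support = Arrays.stream(confusion[i]).sum();
            sum += support * precision(confusion, i);
            total += support;
        }
        return total == 0 ? 0.0 : sum / total;
    }

    public static void main(String[] args) {
        // Hypothetical counts, NOT taken from the thesis.
        // Row/column 0 = normal, row/column 1 = needs review.
        int[][] confusion = {
            {8, 2},   // truly normal: 8 predicted normal, 2 sent to review
            {0, 10},  // truly needs review: none predicted normal
        };
        System.out.println("normal precision   = " + precision(confusion, 0));
        System.out.println("weighted precision = " + weightedPrecision(confusion));
    }
}
```

In this toy matrix the normal class has a precision of 1.0 even though two normal reports were sent to review. This is the trade-off the abstract describes: some normal reports may still reach the clinician, but nothing that needs review is filed as normal.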
Physicians encounter ultrasound reports of thyroid neoplasm patients every day. The reports fall into three types: RECUR, a report of a patient whose cancer has recurred; INTER, a report in which it is not certain whether the cancer has recurred; and NED, which stands for No Evidence of Disease. The three types are not evenly distributed: NED reports are more common than RECUR or INTER reports. Nevertheless, physicians have to review all the reports manually. Because physicians want to concentrate on the details of the recurrence reports, filtering out reports that show no evidence of disease is important and can reduce human workload. Because these documents are clinical texts, classifying a RECUR document as NED is unacceptable. We developed a rule-based classifier in JAVA that detects keywords in the reports and uses patterns built from them to classify the reports into the three categories. The evaluation showed 92.34% accuracy in classifying reports into the three types. A crucial result of this work is a precision of 1.0 for the NED class, which means that every document classified as NED was actually an NED document. In addition, to evaluate the rule-based classifier, we experimented with a decision tree and a machine learning technique, both run in WEKA, using 80 keywords as the feature set. The overall accuracy of the decision tree was 0.74% higher than that of the rule-based classifier, a difference that amounts to three misclassified documents. The precision for NED was 0.962 with the decision tree and 0.945 with the machine learning technique, both lower than the 1.0 achieved by the rule-based classifier.
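The keyword-and-pattern approach described above can be sketched as follows. This is a minimal illustration and not the thesis's classifier. The abstract does not list the actual keywords or rules, so the three regular expressions, the class name `ReportFilterSketch`, the rule ordering, and the assumption that reports contain English phrases are all hypothetical. The sketch does mirror one stated requirement: a report is labeled NED only when an explicit NED phrase is present and no other warning keyword remains, and anything unmatched falls back to INTER so that a human reviews it.

```java
import java.util.regex.Pattern;

final class ReportFilterSketch {
    enum Label { NED, INTER, RECUR }

    // Hypothetical patterns for illustration only; not the thesis's keyword set.
    static final Pattern NED_PHRASE = Pattern.compile(
        "\\bno evidence of (disease|recurrence)\\b", Pattern.CASE_INSENSITIVE);
    static final Pattern UNCERTAIN = Pattern.compile(
        "\\b(suspicious|indeterminate|equivocal)\\b", Pattern.CASE_INSENSITIVE);
    static final Pattern RECUR = Pattern.compile(
        "\\b(recurrence|recurrent|recurred|metastasis|metastatic)\\b",
        Pattern.CASE_INSENSITIVE);

    static Label classify(String report) {
        boolean hasNedPhrase = NED_PHRASE.matcher(report).find();
        // Remove negated phrases first so that "no evidence of recurrence"
        // does not trigger the RECUR keyword "recurrence".
        String rest = NED_PHRASE.matcher(report).replaceAll(" ");
        if (UNCERTAIN.matcher(rest).find()) {
            return Label.INTER;
        }
        if (RECUR.matcher(rest).find()) {
            return Label.RECUR;
        }
        // Conservative default: without an explicit NED phrase, send to review.
        return hasNedPhrase ? Label.NED : Label.INTER;
    }

    public static void main(String[] args) {
        String[] reports = {
            "No evidence of recurrence in the thyroid bed.",
            "Recurrent tumor in the left thyroid bed.",
            "No evidence of recurrence on the right. Suspicious node on the left.",
            "Routine follow-up.",
        };
        for (String report : reports) {
            System.out.println(classify(report) + " <- " + report);
        }
    }
}
```

Checking the uncertainty and recurrence keywords before granting NED deliberately trades recall for precision in the NED class, which is the direction the abstract says matters most for clinical text.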

Language: Korean

URI: https://hdl.handle.net/10371/122436

Files in This Item:

000000021017.pdf 1.46 MB

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Program in Bioengineering (협동과정-바이오엔지니어링전공)
  - Theses (Master's Degree_협동과정-바이오엔지니어링전공)
