A Study on Structure Extraction, Identification and Data Extraction in Tabular-form Document Images Processing

Title:	表格文件結構抽取, 辨識與資料擷取之研究 A Study on Structure Extraction, Identification and Data Extraction in Tabular-form Document Images Processing
Authors:	陳俊霖 Jiun-Lin Chen 李錫堅 Hsi-Jian Lee Institute of Computer Science and Engineering
Keywords:	tabular form;form document images processing;form structure extraction;form document identification;form line removal;field data extraction
Issue Date:	2000
Abstract:	In this thesis, we propose methods for solving several problems in an automatic tabular form document processing system. These problems include form structure extraction, form document identification, form line removal and field data grouping. First, an efficient strip projection method is presented for extracting the form structure. We first segment the input form image into equal-sized vertical and horizontal strips. Since most form lines are vertical or horizontal, we can locate form lines by projecting the image of each vertical strip horizontally and that of each horizontal strip vertically. 
The peak positions in these projection profiles denote possible locations of lines in the form image. We then extract the lines from the source image, starting at these candidate positions. After all candidate lines have been extracted, redundant lines are removed using a line-verification algorithm and broken lines are linked using a line-merging algorithm. Experimental results show that the proposed method can extract the form structure from an A4-sized document in about 3 seconds, which is very efficient compared with methods based on the Hough transform and with run-based line-detection algorithms. Second, a form document identification algorithm is proposed that is based on the similarity values of extracted form fields. The field similarity is computed from the coordinate differences between the top-left corner points and between the bottom-right corner points of the template field and the input field. Since the input form image can be de-skewed according to the results of form structure extraction, each extracted field is a rectangle. Thus, a field can be represented by its top-left and bottom-right corner points. Moreover, a short boundary line does not imply that the fields bounded by it are small, so short form lines that are not correctly extracted can make form document identification more difficult. By comparing extracted fields, our method is more efficient than methods that compare extracted form lines and is less sensitive to mis-extracted form lines. A slice-removing method for form line removal that preserves filled-in data is also proposed in this thesis, since filled-in data often overlap form lines. In this method, we first traverse a given form line to estimate its width. Then, we traverse the line again to calculate the effective line width at each position and compare it with the estimated line width. The effective line width is the maximum length to which the line slice at that position can be extended in the direction orthogonal to the form line. 
If the effective line width at a position is larger than the estimated line width, we retain the line slice at this position, since this slice should be located on filled-in data. Otherwise, the slice is removed. This thesis also proposes a novel approach to grouping handwritten Chinese field data filled in on form documents using a gravitation-based algorithm. This algorithm is designed to extract handwritten field data, which are often written outside the form fields. First, form lines are extracted and removed from the input form image. Connected components are then detected in the remaining data and treated as objects, and the gravitational force on each connected component is computed from the law of universal gravitation, using its black pixel count as its mass. Next, we move the connected components according to the forces acting on them. Filled-in data generally have the locality property, i.e., data of the same field are normally written consecutively in a local area. Therefore, the relationship among these connected components can be determined from this property. Repeatedly computing the forces and moving the connected components allows us to determine which connected components belong to the same field and should be extracted together. Experimental results demonstrate the effectiveness of the proposed method in extracting field data.
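The gravitation step described in the abstract can be illustrated with a minimal sketch. The abstract only states that each connected component's black pixel count serves as its mass and that components move according to the force they receive; everything else here is an assumption made for illustration: the function names `gravitational_forces` and `move_step`, the use of component centroids as positions, the constant `g`, and the capped step size `max_shift`. It is not the thesis's actual implementation.

```python
import math


def gravitational_forces(masses, centers, g=1.0):
    """Return the net 2-D force on each component from all other components.

    masses  -- black pixel count of each connected component (its "mass")
    centers -- (x, y) centroid of each component (assumed representation)
    g       -- gravitational constant; the value 1.0 is an arbitrary assumption
    """
    forces = []
    for i, (mi, (xi, yi)) in enumerate(zip(masses, centers)):
        fx = fy = 0.0
        for j, (mj, (xj, yj)) in enumerate(zip(masses, centers)):
            if i == j:
                continue
            dx, dy = xj - xi, yj - yi
            dist = math.hypot(dx, dy)
            if dist == 0.0:
                continue  # coincident centroids: no defined direction, skip
            magnitude = g * mi * mj / (dist * dist)  # F = G * m1 * m2 / r^2
            fx += magnitude * dx / dist
            fy += magnitude * dy / dist
        forces.append((fx, fy))
    return forces


def move_step(masses, centers, max_shift=1.0, g=1.0):
    """Move each component once along its net force.

    The displacement is force / mass, capped at max_shift pixels. Both the
    displacement rule and the cap are assumptions, not taken from the thesis.
    """
    moved = []
    forces = gravitational_forces(masses, centers, g)
    for m, (x, y), (fx, fy) in zip(masses, centers, forces):
        norm = math.hypot(fx, fy)
        if norm == 0.0:
            moved.append((x, y))  # no net force: component stays in place
            continue
        shift = min(max_shift, norm / m)
        moved.append((x + shift * fx / norm, y + shift * fy / norm))
    return moved
```

Calling `move_step` repeatedly pulls nearby components together, so components that end up close to each other can be treated as data of the same field. The stopping criterion and the final grouping rule are not given in the abstract and are left out of this sketch.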
URI:	http://140.113.39.130/cdrfb3/record/nctu/#NT890392101 http://hdl.handle.net/11536/66894
Appears in Collections:	Thesis