Files in this item:
File: 2833089.pdf
Size: 667.59 kB
Format: Adobe PDF
Title: Integrated Parallel Sentence and Fragment Extraction from Comparable Corpora: A Case Study on Chinese--Japanese Wikipedia
Authors: Chu, Chenhui (ORCID: https://orcid.org/0000-0001-9848-6384, unconfirmed)
Nakazawa, Toshiaki
Kurohashi, Sadao
Alternative author names: 中澤, 敏明
黒橋, 禎夫
Issue date: Feb-2016
Publisher: Association for Computing Machinery (ACM)
Journal title: ACM Transactions on Asian and Low-Resource Language Information Processing
Volume: 15
Issue: 2
Article number: 10
Abstract: Parallel corpora are crucial for statistical machine translation (SMT); however, they are quite scarce for most language pairs and domains. As comparable corpora are far more available, many studies have been conducted to extract either parallel sentences or fragments from them for SMT. In this article, we propose an integrated system to extract both parallel sentences and fragments from comparable corpora. We first apply parallel sentence extraction to identify parallel sentences from comparable sentences. We then extract parallel fragments from the comparable sentences. Parallel sentence extraction is based on a parallel sentence candidate filter and classifier for parallel sentence identification. We improve it by proposing a novel filtering strategy and three novel feature sets for classification. Previous studies have found it difficult to accurately extract parallel fragments from comparable sentences. We propose an accurate parallel fragment extraction method that uses an alignment model to locate the parallel fragment candidates and an accurate lexicon-based filter to identify the truly parallel fragments. A case study on the Chinese--Japanese Wikipedia indicates that our proposed methods outperform previously proposed methods, and the parallel data extracted by our system significantly improves SMT performance.
Rights: © 2015 ACM. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in 'ACM Transactions on Asian and Low-Resource Language Information Processing', 15(2), 10, http://dx.doi.org/10.1145/2833089.
This is not the published version. Please cite only the published version.
URI: http://hdl.handle.net/2433/265843
DOI (publisher's version): 10.1145/2833089
Appears in collections: Journal articles, etc.
