Automatic extraction of target parts from a web page (Information Extraction / Translation Knowledge Acquisition)

新納, 浩幸 / 佐々木, 稔; SHINNOU, Hiroyuki / SASAKI, Minoru

http://hdl.handle.net/10109/1786

File

20100351.pdf (858.5 kB)

Item type

Technical Report(1)

Publication date

2011-01-06

Title

Webページ内の目的部分の自動抽出(情報抽出・翻訳知識獲得)

Title (en)

Automatic extraction of target parts from a web page

Language

jpn

Resource type

Resource type identifier

http://purl.org/coar/resource_type/c_18gh

Resource type

technical report

Title in Japanese

Webページ内の目的部分の自動抽出(情報抽出・翻訳知識獲得)

Title (English)

Automatic extraction of target parts from a web page

Authors

新納, 浩幸 / 佐々木, 稔

SHINNOU, Hiroyuki / SASAKI, Minoru

Authors (reading)

Full name

シンノウ, ヒロユキ / ササキ, ミノル

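Description

Description type

Other

Description

This paper proposes a method for automatically extracting the text of target parts from a web page. The task addressed is to extract the title and body of a news article from a web news page. The method first uses a text browser to convert the web page into text, and then learns extraction rules from the resulting text file. Specifically, it uses the START/END method, with each line as an instance, and a state transition diagram that incorporates constraints such as the order in which classes appear and positional information. The method is a kind of Wrapper learning but, unlike previous Wrapper learning, it does not use HTML tags as extraction clues. It can therefore be expected to learn extraction rules that also apply to pages from different sites. In the experiments, extraction was tested on pages taken from the site that supplied the training data and on pages taken from another site. Pages with a simple layout were extracted with high accuracy, but extraction failed on pages with a complex layout. The method also lends itself to a variety of applications; here we show that it can be applied to the automatic construction of a bilingual corpus. In future work, natural-language information will be incorporated into the features. For the present task, we will improve results by raising the accuracy of title identification.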
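As a rough illustration of the first step in the description above (turning a web page into plain-text lines before any learning takes place), the following sketch uses Python's standard `html.parser` module in place of the text browser the authors mention. It is not the authors' implementation: the class name `LineExtractor`, the function `html_to_lines`, and the `BLOCK_TAGS` list are hypothetical choices made for this example.

```python
from html.parser import HTMLParser

# Tags treated as line boundaries; this list is an assumption for illustration.
BLOCK_TAGS = {"p", "div", "br", "h1", "h2", "h3", "li", "tr", "title"}


class LineExtractor(HTMLParser):
    """Collects visible text and splits it into lines at block-level tags."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._buf = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def _flush(self):
        # Collapse runs of whitespace and keep only non-empty lines.
        text = " ".join("".join(self._buf).split())
        if text:
            self.lines.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def html_to_lines(html):
    """Return the visible text of an HTML string as a list of lines."""
    parser = LineExtractor()
    parser.feed(html)
    parser.close()
    return parser.lines
```

Each returned line would then serve as one instance for the line-based learner; the HTML tags themselves are discarded, in keeping with the description's point that tags are not used as extraction clues.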

Description

Description type

Other

Description

This paper proposes a new method for extracting target parts from a web page. Our task is to extract the title and the article body from a web news page. First, our method converts the HTML-formatted web page into a plain text file, and then learns extraction rules from such plain text files. Concretely, we use the START/END method with a line as an instance, together with a state transition diagram that incorporates constraints such as the class sequence and the distance between classes. Our method is a Wrapper learning method. However, unlike traditional Wrapper learning methods, it does not use HTML tags as clues for extraction. It can therefore be expected to learn extraction rules that also apply to pages from other sites. We conducted experiments using other pages from the same site and pages from another site. The extraction rules learned by our method worked well for pages with a simple layout, but failed on pages with a complex layout. Our method also has a variety of applications; to illustrate its usefulness, we conducted an experiment on automatically constructing a bilingual corpus. In the future, we will use language information as features and improve the judgment of the title part for this task.
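The class-sequence constraint mentioned in the description above can be illustrated with a small sketch. It assumes that some line classifier (not shown, and not specified in this record) has already assigned each line a score for being the title line, the START line of the body, and the END line of the body. A brute-force search then stands in for the state transition diagram: it picks the best combination in which the title precedes the body. The function name `best_spans` and the score lists are hypothetical.

```python
def best_spans(title_score, start_score, end_score):
    """Choose a title line t and a body span [s, e] with t < s <= e.

    Each argument is a list with one score per text line. The ordering
    t < s <= e is a simplified stand-in for the class-order constraint;
    it is not the authors' state transition diagram.
    """
    n = len(title_score)
    best, best_total = None, float("-inf")
    for t in range(n):                 # candidate title line
        for s in range(t + 1, n):      # body must start after the title
            for e in range(s, n):      # body must end at or after its start
                total = title_score[t] + start_score[s] + end_score[e]
                if total > best_total:
                    best, best_total = (t, s, e), total
    return best


# Hypothetical scores for a six-line page. Line 0 looks like a body START
# in isolation, but the order constraint rules it out once line 1 is the title.
title = [0.1, 0.9, 0.2, 0.1, 0.0, 0.0]
start = [0.8, 0.1, 0.7, 0.2, 0.1, 0.0]
end = [0.0, 0.1, 0.1, 0.2, 0.9, 0.3]
print(best_spans(title, start, end))
```

The cubic search is only meant to make the constraint visible; a practical decoder would track the best preceding state per line instead of enumerating every triple.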

Bibliographic information

情報処理学会研究報告. 自然言語処理研究会報告

Vol. 2004, No. 73, pp. 33-40, issued 2004-07-15

Bibliographic record ID

Source identifier type

NCID

Source identifier

AN10115061

Rights

Rights information

情報処理学会

Rights

Rights information

The full-text data was reproduced from CiNii with the permission of the academic society.

Format

Description type

Other

Description

application/pdf

Author version flag

Publication type

VoR

Publication type resource

http://purl.org/coar/version/c_970fb48d4fbd8a85

Title (reading)

Other title

Web ページ ナイ ノ モクテキ ブブン ノ ジドウ チュウシュツ ジョウホウ チュウシュツ ホンヤク チシキ カクトク

Publisher

情報処理学会

Publisher (reading)

ジョウホウ ショリ ガッカイ

Publisher (other language)

Information Processing Society of Japan

Resource type

Description type

Other

Description

Technical report

Resource type (NII)

Technical Report

Resource type (DCMI)

text

Is version of

http://ci.nii.ac.jp/naid/110002911725/

Versions

Ver.1

2023-05-15 17:04:07.101311
