AUTOMATIC MAIN TEXT EXTRACTION FROM WEB PAGES

Phan Thi Ha, Ha Hai Nam

Abstract


This paper presents a novel method for extracting body text from web pages for use in building text corpora. The method extends the Body Text Extraction (BTE) algorithm proposed by Aidan Finn [1] with several enhancements. Experimental results on a set of websites show that the proposed method significantly improves the performance of body text extraction without a decrease in accuracy compared to the original algorithm.
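For orientation, the baseline BTE idea that the paper builds on can be sketched as follows. This is a minimal illustration, not the authors' enhanced method, whose details the abstract does not give. It assumes the commonly described formulation of BTE: treat the page as a sequence of tag tokens and word tokens, then choose the span that maximizes tags outside the span plus words inside it. The regex tokenizer, the function names, and the use of a single maximum-subarray scan to find the span are all assumptions made for this sketch.

```python
import re

# Naive tokenizer (an assumption of this sketch): a token is either a whole
# tag such as <p> or a run of non-space, non-'<' characters (a word).
# It does not special-case scripts, styles, or comments.
TOKEN_RE = re.compile(r"<[^>]+>|[^<\s]+")


def tokenize(html):
    """Return a list of (token, is_tag) pairs."""
    return [(tok, tok.startswith("<")) for tok in TOKEN_RE.findall(html)]


def extract_body_text(html):
    """Return the words inside the span [i, j] that maximizes
    (tags before i) + (words in i..j) + (tags after j).

    That objective equals total_tags + sum over i..j of (+1 word, -1 tag),
    so the best span is a maximum-sum contiguous run, found here with one
    linear scan (Kadane's algorithm).
    """
    tokens = tokenize(html)
    best_sum, best_i, best_j = 0, 0, -1  # empty span by default
    cur_sum, cur_start = 0, 0
    for k, (_, is_tag) in enumerate(tokens):
        cur_sum += -1 if is_tag else 1
        if cur_sum > best_sum:
            best_sum, best_i, best_j = cur_sum, cur_start, k
        if cur_sum < 0:
            # A span with negative score can never help; restart after k.
            cur_sum, cur_start = 0, k + 1
    return " ".join(tok for tok, is_tag in tokens[best_i:best_j + 1] if not is_tag)
```

On a page whose navigation and footer are tag-heavy and whose article is word-heavy, the selected span covers the article paragraph and leaves out the short link texts around it.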

Keywords


HTML, BTE, body text extraction, main content text



DOI: https://doi.org/10.15625/0866-708X/51/1/9557




Index: Google Scholar; Crossref; VCGate; Asean Citation Index

Published by Vietnam Academy of Science and Technology