Semi-structured text parsing

The problem

In order to classify, index and present semi-structured text we need to parse the texts. Parsing semi-structured text presents a number of problems:

  1. It is usually not possible, or very complicated, to specify a complete formal grammar for semi-structured text.
  1. Semi-structured text is usually not context free.
  1. Semi-structured text often means natural language texts. Different languages and script systems present different parsing challenges.

However, we have some degrees of freedom when parsing semi-structured text in the context of search engines:

  1. We are seldom interested in specifying the complete grammar of the texts. We are typically interested in identifying chapters, sections and paragraphs. Occasionally we are interested in tables or lists.
  1. We can rely on term databases for recognizing key tokens.

Landmark/keyword based parsing

Most pragmatic approaches to parsing of semi-structured text are based on hand written landmark or keyword based parsing algorithms. A landmark/keyword signals the start of an interesting part of the text and a complementary landmark/keyword marks the end of the interesting part of the text. An example:

Chapter 1: The beginning <-- Landmark "Chapter N" signals the start of a Chapter.
Chapter 2: The adventure <-- Landmark "Chapter N" signals the end of the previous and the start of the next Chapter.

In the above example we can also easily extract the title associated with each chapter. The landmark/keyword approach relies on the fact that the text is semi-structured where the landmarks are simply used to identify the structured elements in the text!

Theory and research

Since the advent of the web and search engines there has been an increased interest in the parsing and analysis of semi-structured and natural language texts. An entry point is of course the  Wikipedia article on Search Engine Indexing. Here are some papers that can be of interest:

  1. Information Extraction with and without Parsing Semi-structured Documents by Daisuke Ikeda and Yasuhiro Yamada, Computing and Communications Center, Kyushu University, Japan.
  1. Learning to Extract Information from Semi-structured Text using a Discriminative Context Free Grammar by Paul Viola and Mukund Narasimhand, Microsoft.
  1. Top-down Extraction of Semi-Structured Data by Berthier Ribeiro-Neto, Alberto H. F. Laender and Altigran S. da Silva, Federal University of Minas Gerais, Belo Horizonte, Brazil.

  1. Searching Semi-Structured Data Using Landmarks by Andrew Davison, Prince of Songkla University, Thailand.
  1. Semi-structured information extraction applying automatic pattern discovery by Chia-Hui Chang, Shao-Chen Lui and Yen-Chin Wu, National Central University, Taiwan.