
Extracting records from text using Invisible XML

I have the OCR'd text of a bibliography of periodicals that contains structured entries. I would like to use the Invisible XML standard to extract and parse the entries.

Example input:

1  2  Hype.  1990?- 1993.  Frequency:  Bimonthly.  River  Edge, 

NJ.  Published  by  Word  Up!  Video,  Inc.  Last  issue  66  pages. 
Height  28  cm.  Line  drawings;  Photographs  (some  in  color); 
Commercial  advertising;  Table  of  contents.  Previous  editor(s): 
Marica  A.  Cole.  ISSN  1056-4632.  LC  card  no.  sn91-1965. 
OCLC  no.  23715422.  Subject  focus  and/or  Features:  Hip  hop 
culture,  Music,  Rap  music. 

WHi  v.l,  n.6;  v.2,  n.5  Pam  01-5450  Aug,  1992;  Aug,  1993 

6561  The  Zora  Neale  Hurston  Forum.  1986-.  Frequency: 
Semiannual.  Ruth  T.  Sheffey,  Editor,  The  Zora  Neale  Hurston 
Forum,  P.O.  Box  550,  Morgan  State  University,  Baltimore, 

MD  21239.  $15  for  individuals  and  institutions.  Telephone: 
(301)  444-3435.  Published  by  Zora  Neale  Hurston  Society. 

Last  issue  69  pages.  Last  volume  142  pages.  Height  23  cm. 
Photographs;  Table  of  contents.  ISSN  1051-6867.  LC  card  no. 
90-649339.  OCLC  no.  15610848.  Subject  focus  and/or  Features:  Hurston,  Zora  Neale,  Literature,  Literary  criticism. 
MdBMC  v.l,  n.l-v.8,  n.2  Special  Collections  Fall,  1986-Spring, 

1994 
TxDw  v.l,  n.l;  v.2,  n.l  Woman’s  Collection  Fall,  1986;  Fall,  1987 
WU  v.l,  n.l-  AP/Z893/N345  Fall,  1986
6562  Zwanna:  Son  of  Zulu.  1993-.  Frequency:  Unknown. 
Nabile  P.  Hage,  Editor,  Zwanna,  P.O.  Box  38261,  Atlanta,  GA 
30334.  Published  by  Dark  Zulu  Lies  Comics,  Inc.  Last  issue  32 
pages.  Height  28  cm.  Line  drawings  (some  in  color);  Commercial  advertising.  OCLC  no.  28389961.  Subject  focus  and/or 
Features:  Comic  books,  strips,  etc. 

WHi  v.l,  n.l  Pam  00-305  Apr/May,  1993 

Each entry begins with an entry number, followed by one or more whitespace characters, followed by descriptive text split over newlines.

iXML grammar

data: entry+ .
entry: -#a, entrynum, " "+, content .
entrynum: -digit+ .
digit: ["1"-"9"] .
content: ~[]+; -#a+ .

This initial attempt at an iXML grammar produces an ambiguous parse (using the CoffeePot iXML processor).


<data xmlns:ixml="" ixml:state="ambiguous">
    <content>2 Hype. 1990?- 1993. Frequency: Bimonthly. River Edge, NJ. Published by Word Up! Video,
      Inc. Last issue 66 pages. Height 28 cm. Line drawings; Photographs (some in color); Commercial
      advertising; Table of contents. Previous editor(s): Marica A. Cole. ISSN 1056-4632. LC card
      no. sn91-1965. OCLC no. 23715422. Subject focus and/or Features: Hip hop culture, Music, Rap
      music. WHi v.l, n.6; v.2, n.5 Pam 01-5450 Aug, 1992; Aug, 1993 6561 The Zora Neale Hurston
      Forum. 1986-. Frequency: Semiannual. Ruth T. Sheffey, Editor, The Zora Neale Hurston Forum,
      P.O. Box 550, Morgan State University, Baltimore, MD 21239. $15 for individuals and
      institutions. Telephone: (301) 444-3435. Published by Zora Neale Hurston Society. Last issue
      69 pages. Last volume 142 pages. Height 23 cm. Photographs; Table of contents. ISSN 1051-6867.
      LC card no. 90-649339. OCLC no. 15610848. Subject focus and/or Features: Hurston, Zora Neale,
      Literature, Literary criticism. MdBMC v.l, n.l-v.8, n.2 Special Collections Fall, 1986-Spring,
      1994 TxDw v.l, n.l; v.2, n.l Woman’s Collection Fall, 1986; Fall, 1987 WU v.l, n.l-
      AP/Z893/N345 Fall, 1986</content>
    <content>Zwanna: Son of Zulu. 1993-. Frequency: Unknown. Nabile P. Hage, Editor, Zwanna, P.O.
      Box 38261, Atlanta, GA 30334. Published by Dark Zulu Lies Comics, Inc. Last issue 32 pages.
      Height 28 cm. Line drawings (some in color); Commercial advertising. OCLC no. 28389961.
      Subject focus and/or Features: Comic books, strips, etc. WHi v.l, n.l Pam 00-305 Apr/May, 1993

As a start, I would like to understand how to chunk the entries, and then begin to parse the content: e.g., each entry number is followed by one or more spaces, then an alphanumeric title, which is followed by a period, etc.
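
To make that second step concrete, here is a minimal Python sketch (not iXML) of the rule just described: digits, one or more spaces, then a title that runs up to the first period followed by whitespace. The names `HEAD` and `split_head` are hypothetical, and the rule itself is only an assumption about the data: a title that contains a period followed by a space would be cut short.

```python
import re

# Entry number, whitespace, then the shortest run of text that ends
# at a period followed by whitespace (assumed to close the title).
HEAD = re.compile(r"^(\d+)\s+(.+?)\.\s")

def split_head(line):
    """Return (entry number, title) for an entry's first line, or None."""
    match = HEAD.match(line)
    if match is None:
        return None
    number, title = match.groups()
    # Collapse the doubled spaces left by the OCR.
    return number, " ".join(title.split())

print(split_head("6562  Zwanna:  Son  of  Zulu.  1993-.  Frequency:  Unknown. "))
print(split_head("1  2  Hype.  1990?- 1993.  Frequency:  Bimonthly.  River  Edge, "))
```

On the sample lines this yields `('6562', 'Zwanna: Son of Zulu')` and `('1', '2 Hype')`, so the same pattern could serve as a cross-check for whatever title rule the grammar ends up using.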


  • Your grammar is highly ambiguous because "~[]" matches any character, including #a, so there are dozens of ways to parse the input. You have to decide how to identify the start of an entry unambiguously. If the rule is 'it starts with a number', then you also have to prevent lines that begin with a number from being recognised as 'content', for example:

    content: line+.
    line: ~["0"-"9"], ~[#a]*, #a.

    If you want to track down ambiguity, you can try my implementation, which is much slower than Norm's but gives potentially useful information about the source of the ambiguity.

    Here is a reasonable first try for your content, but note that the lone 1994 in the content is treated as an entry number:

    ocr: entry+.
    entry: numbered, unnumbered*.
    -numbered: number, (line*; -#a), blank-line.
    -blank-line: -#a.
    -line: ~[#a]+, -#a.
    @number: ["0"-"9"]+, -" ".
    -unnumbered: ~["0"-"9"; #a], line+, blank-line.
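
The false positive on the lone 1994 can be reproduced outside an iXML processor. The following is a minimal Python sketch, not a translation of the grammar: it applies the same boundary rule (a line that starts with a number opens a new entry) to a few lines adapted from the sample. The name `chunk_entries`, the `require_text` flag, and the shortened sample are assumptions made for illustration. The flag shows one possible heuristic, accepting a number only when text follows it on the same line; it would still be fooled by a wrapped line that begins with a year followed by more text.

```python
import re

# A line that starts with digits followed by whitespace (or by the end
# of the line) is treated as the start of a new entry.
ENTRY_START = re.compile(r"^(\d+)(?:\s+|$)")

def chunk_entries(text, require_text=False):
    """Group non-blank lines into (number, [lines]) pairs."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines carry no content in this sketch
        match = ENTRY_START.match(line)
        rest = line[match.end():] if match else ""
        if match and (rest or not require_text):
            entries.append((match.group(1), [rest] if rest else []))
        elif entries:
            entries[-1][1].append(line)  # continuation of the current entry
    return entries

sample = """\
6561  The  Zora  Neale  Hurston  Forum.  1986-.  Frequency:
Semiannual.
MdBMC  v.l,  n.l-v.8,  n.2  Special  Collections  Fall,  1986-Spring,

1994

TxDw  v.l,  n.l;  v.2,  n.l
6562  Zwanna:  Son  of  Zulu.  1993-.  Frequency:  Unknown.
"""

print([number for number, _ in chunk_entries(sample)])
print([number for number, _ in chunk_entries(sample, require_text=True)])
```

The first call reports `['6561', '1994', '6562']`, which is the same misreading as in the grammar; the second reports `['6561', '6562']` and keeps 1994 inside entry 6561.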