python xml information-retrieval sgml

Extract plain text from SGML


I have a collection of 528k documents in SGML format. An example of one of the documents is as follows:

<DOC>
<DOCNO> FBIS4-46571 </DOCNO>
<HT>    "jpuma009__l94008" </HT>


<HEADER>
<AU>   JPRS-UMA-94-009-L </AU>
JPRS 
Central Eurasia 

</HEADER>

<ABS>  Military Affairs ARMAMENTS, POLITICS, CONVERSION Nos 1 &amp; 2, </ABS>


<TEXT>
1993 
<DATE1>   17 June 1994 </DATE1>
<F P=100></F>
<F P=101>   Arms, Military Equipment </F>
<H3> <TI>   `Vympel' State Machinebuilding Design Bureau Proposes </TI></H3>
<HT><F P=107><PHRASE>    `Vympel' State Machinebuilding Design Bureau Proposes </PHRASE></F></HT>
  Cooperation 

<F P=102> 94UM0312D Moscow VOORUZHENIYE, POLITIKA, 
KONVERSIYA in Russian No 2, 1993 (Signed to press 12 May 93) pp </F>

22-28--FOR OFFICIAL USE ONLY 
<F P=103> 94UM0312D </F>
<F P=104>  Moscow VOORUZHENIYE, POLITIKA, 
KONVERSIYA </F>

<F P=105>  Russian </F>
CSO 

<F P=106> [Article by "Vympel" State Machinebuilding Design Bureau </F>
Lorem ipsum ........ 

</TEXT>

</DOC>

I want to extract the plain text between <TEXT> and </TEXT>. The desired result is as follows:

1993
17 June 1994
Arms, Military Equipment
`Vympel' State Machinebuilding Design Bureau Proposes
`Vympel' State Machinebuilding Design Bureau Proposes
94UM0312D Moscow VOORUZHENIYE, POLITIKA, KONVERSIYA in Russian No 2, 1993 (Signed to press 12 May 93) pp
22-28--FOR OFFICIAL USE ONLY
94UM0312D
Moscow VOORUZHENIYE, POLITIKA, KONVERSIYA
Russian
CSO
[Article by "Vympel" State Machinebuilding Design Bureau
Lorem ipsum ........

Is there a library or tool in Python or Java that allows doing that?


Solution

  • You could use BeautifulSoup in Python.

    I tried this code on the sample document and got the required text.

    from bs4 import BeautifulSoup

    # Read the whole SGML file into memory
    with open('file.txt', 'r') as fo:
        sgml = fo.read()

    # html.parser lowercases tag names, so <TEXT> is matched as 'text'
    soup = BeautifulSoup(sgml, 'html.parser')
    for item in soup.find_all('text'):
        # Print each non-empty line with surrounding whitespace removed
        for line in item.text.split('\n'):
            if line.strip():
                print(line.strip())
    

    Output

    1993
    17 June 1994
    Arms, Military Equipment
    `Vympel' State Machinebuilding Design Bureau Proposes
    `Vympel' State Machinebuilding Design Bureau Proposes
    Cooperation
    94UM0312D Moscow VOORUZHENIYE, POLITIKA,
    KONVERSIYA in Russian No 2, 1993 (Signed to press 12 May 93) pp
    22-28--FOR OFFICIAL USE ONLY
    94UM0312D
    Moscow VOORUZHENIYE, POLITIKA,
    KONVERSIYA
    Russian
    CSO
    [Article by "Vympel" State Machinebuilding Design Bureau
    Lorem ipsum ........
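
The output above keeps wrapped lines split (for example the `<F P=102>` entry spans two lines), whereas the desired result in the question shows each element on a single line. A minimal sketch of one way to get closer to that, assuming it is acceptable to fold all whitespace inside a text node into single spaces; `text_lines` is a hypothetical helper name, not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup


def text_lines(sgml):
    """Return one whitespace-normalised string per text node inside <TEXT>."""
    soup = BeautifulSoup(sgml, "html.parser")
    lines = []
    # html.parser lowercases tag names, so <TEXT> is found as "text".
    for text_tag in soup.find_all("text"):
        # stripped_strings yields every non-empty text node, already stripped.
        for node in text_tag.stripped_strings:
            # split() + join() folds internal newlines and runs of spaces.
            lines.append(" ".join(node.split()))
    return lines


if __name__ == "__main__":
    sample = """<DOC>
<TEXT>
1993
<DATE1>   17 June 1994 </DATE1>
<F P=102> 94UM0312D Moscow VOORUZHENIYE, POLITIKA,
KONVERSIYA in Russian No 2, 1993 </F>
</TEXT>
</DOC>"""
    for line in text_lines(sample):
        print(line)
```

The trade-off is that a free-text node containing several real lines (such as the article body after the last `<F>` tag) would also be folded into one line, so this only fits if line breaks inside the body do not matter.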
    

    file.txt

    The sample document from the question, saved unchanged.
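
For the full set of 528k documents, it may help to keep each document's `<DOCNO>` next to its text and to avoid a third-party dependency. The following is a rough sketch using only the standard library's `html.parser` (the same parser that the `'html.parser'` option above relies on). It assumes that a file may hold several `<DOC>...</DOC>` blocks one after another, which the question does not state; `TextExtractor` and `iter_docs` are hypothetical names chosen for illustration.

```python
import re
from html.parser import HTMLParser

# Assumption: each document is delimited by <DOC> ... </DOC>, as in the sample.
DOC_RE = re.compile(r"<DOC>.*?</DOC>", re.DOTALL | re.IGNORECASE)


class TextExtractor(HTMLParser):
    """Collect the DOCNO and the text nodes found inside <TEXT>."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; in text data.
        super().__init__(convert_charrefs=True)
        self.stack = []   # currently open tag names (lowercased by HTMLParser)
        self.docno = None
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        cleaned = " ".join(data.split())  # fold newlines and repeated spaces
        if not cleaned:
            return
        if self.stack and self.stack[-1] == "docno":
            self.docno = cleaned
        elif "text" in self.stack:
            self.lines.append(cleaned)


def iter_docs(sgml):
    """Yield (docno, lines) for every <DOC> block found in a string."""
    for match in DOC_RE.finditer(sgml):
        parser = TextExtractor()
        parser.feed(match.group(0))
        parser.close()
        yield parser.docno, parser.lines
```

A loop such as `for docno, lines in iter_docs(sgml): ...` would then process one file's contents at a time. The corpus encoding is not given in the question, so reading the files with a suitable `encoding` argument is left to the caller.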