
Convert .sgm to .txt


I have some files in .sgm format and I have to evaluate them (apply a language model and obtain the perplexity of the text).

The main problem is that I need these files as plain text, i.e. in .txt format. However, I have searched the internet for an online converter or some kind of script that does this and could not find one.

A teacher of mine also sent me this Perl command:

perl -ne 'print $1."\n" if /<seg[^>]+>\s*(.*\S)\s*<.seg>/i;' < file.sgm > file

I have never used Perl and honestly know nothing about it. I think I have Perl installed:

$ perl -v

This is perl 5, version 18, subversion 2 (v5.18.2) built for darwin-thread-multi-2level
(with 2 registered patches, see perl -V for more detail)

Copyright 1987-2013, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl".  If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.

By the way, I am using Mac OS X.

Sample .sgm file:

<srcset setid="newsdiscusstest2015" srclang="any">
<doc sysid="ref" docid="39-Guardian" genre="newsdiscuss" origlang="en">
<p>
<seg id="1">This is perfectly illustrated by the UKIP numbties banning people with HIV.</seg>
<seg id="2">You mean Nigel Farage saying the NHS should not be used to pay for people coming to the UK as health tourists, and saying yes when the interviewer specifically asked if, with the aforementioned in mind, people with HIV were included in not being welcome.</seg>
<seg id="3">You raise a straw man and then knock it down with thinly veiled homophobia.</seg>

Output .txt file:

This is perfectly illustrated by the UKIP numbties banning people with HIV. You mean Nigel Farage saying the NHS should not be used to pay for people coming to the UK as health tourists, and saying yes when the interviewer specifically asked if, with the aforementioned in mind, people with HIV were included in not being welcome. You raise a straw man and then knock it down with thinly veiled homophobia.


Solution

  • You can try using this script to strip the SGML tags from the file:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    
    use HTML::Parser;
    
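    # Path of the SGML file to convert, taken from the first argument.
    my $file = $ARGV[0];
    
    # Ignore every event (tags, comments, declarations) by default
    # and print only the text found between tags.
    HTML::Parser->new(default_h => [""],
        text_h => [ sub { print shift }, 'text' ]
      )->parse_file($file) or die "Failed to parse $file: $!";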
    

    Use it as follows:

    ./strip_sgml.pl file.sgm > file.txt
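
The script above prints all text between tags, including the line breaks between them, whereas the output in the question has every segment on a single line. If that exact layout matters, a minimal sketch using only core Perl could look like the following. It assumes that each segment sits in a `<seg ...>...</seg>` pair with no nested tags, as in the sample file; the helper name `extract_segments` and the script name used below are made up for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Return the text of every <seg ...>...</seg> element in $sgml, in order.
# Assumption: a segment never contains nested tags, as in the sample file.
sub extract_segments {
    my ($sgml) = @_;
    my @segments;
    # /s lets "." match newlines, /i ignores tag case, /g finds all matches.
    while ($sgml =~ m{<seg\b[^>]*>\s*(.*?)\s*</seg>}gis) {
        push @segments, $1;
    }
    return @segments;
}

# For each file named on the command line, print all of its segments
# on one line, separated by single spaces.
for my $file (@ARGV) {
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my $sgml = do { local $/; <$fh> };   # slurp the whole file
    close $fh;
    print join(' ', extract_segments($sgml)), "\n";
}
```

Saved as, say, extract_segments.pl (a hypothetical name) and made executable, it would be used the same way:

./extract_segments.pl file.sgm > file.txt

Unlike the HTML::Parser version, this only keeps text inside seg elements, so anything outside them is dropped.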