Tags: xml, perl, parsing, sgml

Correct syntax for parsing an SGML file to XML using Perl?


I'm a Perl newbie trying to read an SGML file, parse it, and convert it to XML so that I can get the key/value pairs of all the elements. I found the SGML::DTDParse and XML::Simple modules, which I think are what I need for the task. My problem is that I can't find any documentation or code examples for DTDParse.

My code is below:

# use modules
use SGML::DTDParse;
use XML::Simple;
use Data::Dumper;

use warnings;
use strict;

my $xml;
my $data;
my $convert;

$/ = undef;
open FILE, "C:/..." or die $!;
my $file = <FILE>;

# Convert the DTD file to XML
dtdParse $file;

# Create the XML object
$xml = new XML::Simple;

# Read the XML file
$data = $xml->XMLin($file);

# print the output
print Dumper($data);

I get the following error on the dtdParse $file line: Can't call method "dtdParse" without a package or object reference at "my script name"

Any ideas as to the proper syntax here, and is this a valid approach for the task?

I reworked the code again and was able to do the DTD parsing with this:

$dtd = SGML::DTDParse::DTD->new();
$dtd->parse($file);
print $dtd;

I don't believe the parsed output can be considered XML, though, so maybe the correct way to get all the elements from it is a for loop.


Solution

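  • There is no function dtdParse, so Perl parses dtdParse $file as a method call on $file (indirect object syntax). That is where the "Can't call method" error comes from.

    dtdparse is a command-line program that ships with the SGML::DTDParse distribution.

    You can use it to dump a DTD file as XML. A quick example of how you could use dtdparse: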

    use strict;
    use warnings;
    
    use SGML::DTDParse;
    use XML::Simple;
    use Data::Dumper;
    
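    # Run the dtdparse program and capture the XML it writes to STDOUT
    my $dtd_xml = qx{dtdparse test.dtd};
    die "dtdparse failed (exit status $?)\n" if $? != 0;
    
    # Create the XML::Simple object
    my $xml = XML::Simple->new;
    
    # Parse the XML string into a Perl data structure
    my $result = $xml->XMLin($dtd_xml);
    
    # Print the resulting structure
    $Data::Dumper::Indent = 1;
    print Dumper($result);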
    

    where test.dtd looks like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <!ELEMENT DatabaseInventory (DatabaseName+)>
    <!ELEMENT DatabaseName (   GlobalDatabaseName
                             , OracleSID
                             , DatabaseDomain
                             , Administrator+
                             , DatabaseAttributes
                             , Comments)
    >
    <!ELEMENT GlobalDatabaseName (#PCDATA)>
    <!ELEMENT OracleSID          (#PCDATA)>
    <!ELEMENT DatabaseDomain     (#PCDATA)>
    <!ELEMENT Administrator      (#PCDATA)>
    <!ELEMENT DatabaseAttributes EMPTY>
    <!ELEMENT Comments           (#PCDATA)>
    
    <!ATTLIST Administrator       EmailAlias CDATA #REQUIRED>
    <!ATTLIST Administrator       Extension  CDATA #IMPLIED>
    <!ATTLIST DatabaseAttributes  Type       (Production|Development|Testing) #REQUIRED>
    <!ATTLIST DatabaseAttributes  Version    (7|8|8i|9i) "9i">
    
    <!ENTITY AUTHOR "Jeffrey Hunter">
    <!ENTITY WEB    "www.iDevelopment.info">
    <!ENTITY EMAIL  "[email protected]">
    

    This produces output similar to the following:

    $VAR1 = {
      'namecase-entity' => '0',
      'created-by' => 'DTDParse V2.00',
      'public-id' => '',
      'version' => '1.0',
      'attlist' => {
        'DatabaseAttributes' => {
          'attribute' => {
            'Type' => {
              'value' => 'Production Development Testing',
              'type' => '#REQUIRED',
              'default' => '',
              'enumeration' => 'yes'
            },
            'Version' => {
              'value' => '7 8 8i 9i',
              'type' => '',
              'default' => '9i',
              'enumeration' => 'yes'
            }
          },
          'attdecl' => '  Type       (Production|Development|Testing) #REQUIRED'
        },
        'Administrator' => {
          'attribute' => {
            'EmailAlias' => {
              'value' => 'CDATA',
              'type' => '#REQUIRED',
              'default' => ''
            },
            'Extension' => {
              'value' => 'CDATA',
              'type' => '#IMPLIED',
              'default' => ''
            }
          },
          'attdecl' => '       EmailAlias CDATA #REQUIRED'
        }
      },
      'element' => {
        'OracleSID' => {
          'content-type' => 'mixed',
          'content-model-expanded' => {
            'sequence-group' => {
              'pcdata' => {}
            }
          },
          'content-model' => {
            'sequence-group' => {
              'pcdata' => {}
            }
          }
        },
        'Comments' => {
          'content-type' => 'mixed',
          'content-model-expanded' => {
            'sequence-group' => {
              'pcdata' => {}
            }
          },
          'content-model' => {
            'sequence-group' => {
              'pcdata' => {}
            }
          }
        },
        'DatabaseAttributes' => {
          'content-type' => 'element',
          'content-model-expanded' => {
            'empty' => {}
          },
          'content-model' => {
            'empty' => {}
          }
        },
        'GlobalDatabaseName' => {
          'content-type' => 'mixed',
          'content-model-expanded' => {
            'sequence-group' => {
              'pcdata' => {}
            }
          },
          'content-model' => {
            'sequence-group' => {
              'pcdata' => {}
            }
          }
        },
        'Administrator' => {
          'content-type' => 'mixed',
          'content-model-expanded' => {
            'sequence-group' => {
              'pcdata' => {}
            }
          },
          'content-model' => {
            'sequence-group' => {
              'pcdata' => {}
            }
          }
        },
        'DatabaseInventory' => {
          'content-type' => 'element',
          'content-model-expanded' => {
            'sequence-group' => {
              'element-name' => {
                'occurrence' => '+',
                'name' => 'DatabaseName'
              }
            }
          },
          'content-model' => {
            'sequence-group' => {
              'element-name' => {
                'occurrence' => '+',
                'name' => 'DatabaseName'
              }
            }
          }
        },
        'DatabaseDomain' => {
          'content-type' => 'mixed',
          'content-model-expanded' => {
            'sequence-group' => {
              'pcdata' => {}
            }
          },
          'content-model' => {
            'sequence-group' => {
              'pcdata' => {}
            }
          }
        },
        'DatabaseName' => {
          'content-type' => 'element',
          'content-model-expanded' => {
            'sequence-group' => {
              'element-name' => {
                'Comments' => {},
                'OracleSID' => {},
                'DatabaseAttributes' => {},
                'DatabaseDomain' => {},
                'GlobalDatabaseName' => {},
                'Administrator' => {
                  'occurrence' => '+'
                }
              }
            }
          },
          'content-model' => {
            'sequence-group' => {
              'element-name' => {
                'Comments' => {},
                'OracleSID' => {},
                'DatabaseAttributes' => {},
                'DatabaseDomain' => {},
                'GlobalDatabaseName' => {},
                'Administrator' => {
                  'occurrence' => '+'
                }
              }
            }
          }
        }
      },
      'entity' => {
        'WEB' => {
          'text-expanded' => 'www.iDevelopment.info',
          'text' => 'www.iDevelopment.info',
          'type' => 'gen'
        },
        'AUTHOR' => {
          'text-expanded' => 'Jeffrey Hunter',
          'text' => 'Jeffrey Hunter',
          'type' => 'gen'
        },
        'EMAIL' => {
          'text-expanded' => '[email protected]',
          'text' => '[email protected]',
          'type' => 'gen'
        }
      },
      'system-id' => 'test.dtd',
      'unexpanded' => '1',
      'created-on' => 'Tue Feb 28 00:44:52 2012',
      'declaration' => '',
      'xml' => '0',
      'title' => '?untitled?',
      'namecase-general' => '1'
    };
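
To get from this structure to the per-element key/value pairs the question asks about, a plain loop over the hash should be enough. The following is only a minimal sketch and not part of the original answer. It assumes $result has the shape shown in the dump above (a small excerpt is inlined so the snippet runs on its own with core Perl), and the helper names child_names and attribute_summary are made up for illustration. One thing visible in the dump is that a single element-name child keeps its 'name' key, while several children are folded into a hash keyed by name, so the sketch handles both shapes.

```perl
use strict;
use warnings;

# Excerpt of the structure shown in the dump above (same keys, fewer entries).
my $result = {
    element => {
        DatabaseInventory => {
            'content-type'  => 'element',
            'content-model' => {
                'sequence-group' => {
                    'element-name' => { occurrence => '+', name => 'DatabaseName' },
                },
            },
        },
        DatabaseName => {
            'content-type'  => 'element',
            'content-model' => {
                'sequence-group' => {
                    'element-name' => {
                        GlobalDatabaseName => {},
                        Administrator      => { occurrence => '+' },
                    },
                },
            },
        },
        Administrator => {
            'content-type'  => 'mixed',
            'content-model' => { 'sequence-group' => { pcdata => {} } },
        },
    },
    attlist => {
        Administrator => {
            attribute => {
                EmailAlias => { value => 'CDATA', type => '#REQUIRED', default => '' },
                Extension  => { value => 'CDATA', type => '#IMPLIED',  default => '' },
            },
        },
    },
};

# Hypothetical helper: names of the child elements in one element's content model.
sub child_names {
    my ($element) = @_;
    my $group = $element->{'content-model'}{'sequence-group'} or return;
    my $names = $group->{'element-name'}                      or return;

    # A single child is stored as { name => ..., occurrence => ... }.
    return ($names->{name}) if exists $names->{name};

    # Several children are keyed by element name; the hash has lost the
    # order declared in the DTD, so they are returned alphabetically.
    return sort keys %$names;
}

# Hypothetical helper: one "name: value type" string per declared attribute.
sub attribute_summary {
    my ($attlist) = @_;
    return unless $attlist;
    my $attrs = $attlist->{attribute};
    return map { "$_: $attrs->{$_}{value} $attrs->{$_}{type}" } sort keys %$attrs;
}

# Walk every element and print its content type, children, and attributes.
for my $name (sort keys %{ $result->{element} }) {
    my $element  = $result->{element}{$name};
    my @children = child_names($element);
    my @attrs    = attribute_summary($result->{attlist}{$name});

    print "$name [$element->{'content-type'}]\n";
    print "  children:   @children\n"              if @children;
    print "  attributes: ", join('; ', @attrs), "\n" if @attrs;
}
```

If the mixed single/folded shape is inconvenient, XML::Simple's ForceArray and KeyAttr options can be passed to XMLin to make the structure more uniform, at the cost of changing the keys this sketch relies on.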