
Parsing XML file to count tag occurrence using XML::Simple


I'm attempting to parse an XML file using XML::Simple in order to count the occurrences of specific tags (i.e. the occurrences of different city locations specific to a title that gets repeated throughout the file), so that I can do further analysis on the numbers produced. For example:

<XML>
   <title> Teacher </title>
   <state> TX </state>
   <city> Dallas </city>

   <title> Teacher </title>
   <state> CA </state>
   <city> Los Angeles </city>

   <title> Engineer </title>
   <state> NY </state>
   <city> Manhattan </city>

   <title> Engineer </title>
   <state> NY </state>
   <city> Manhattan </city>
</XML>

I somehow need to count the number of times each title occurs and the number of distinct locations for it, so the output would be:

Teacher:2 Cities:2

Engineer:2 Cities:1

What I have:

    #!/usr/bin/env perl

    use strict;
    use warnings;

    use XML::Simple;
    use Data::Dumper;

    # initialize variables
    my $counter = 0;
    my @titlelist = ();
    my @citylist = ();

    # create object
    my $xml = XML::Simple->new;

    # read XML file
    my $jobs = $xml->XMLin("sample.xml");

    print Dumper($jobs);

    foreach my $titles(@{$jobs->{job}}) {
        push(@citylist, $titles->{city});
        push(@titlelist, $titles->{title});
    }

    print "@titlelist\n";
    print "@citylist\n";

I know this is super basic and I haven't really produced anything, and it's because I'm a beginner who's totally lost in terms of how to approach this logically. I really need help to understand the structure that I need to use to get some kind of output resembling this, and would appreciate any pointers in the right direction. I'm basically just pushing the results to arrays right now. Should I do string comparisons, and based on that increment city and title counters? Do I need a multidimensional array for this? Any ideas would help...thank you!


Solution

  • I can try to point you in the right direction.

    First, I'm going to assume that your XML has <job> tags around each job and actually looks like this:

    <XML>
        <job>
            <title> Teacher </title>
            <state> TX </state>
            <city> Dallas </city>
        </job>
        <!-- ...more jobs... -->
    </XML>
    

    Now, I'm going to suggest renaming the variables in your next bit of code to make it clearer what's going on

    my $xml_data = $xml->XMLin("sample.xml");
    
    # We want the list of things with the "<job>" tag 
    my $jobs = $xml_data->{job}; 
    
    print Dumper($jobs);   # this will now print a list (an arrayref)
    
    # Now we look at each job in the list of jobs
    # You can read this in English as "for each job in jobs"
    foreach my $job (@$jobs) {
        # each $job has a city and title:
        print "here is a job in the city $job->{city} with the title $job->{title}\n";
    }
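
    One thing to watch out for: by default XML::Simple only gives you an arrayref for {job} when there is more than one <job>, and it keeps the spaces around values like " Teacher ". A minimal sketch of options that should smooth this over is below. The inline XML string is just a made-up one-job sample, and I'm assuming you're happy with whitespace being normalised in every value.

    ```perl
    use strict;
    use warnings;
    use XML::Simple;

    # Hypothetical one-job document, inlined as a string for illustration.
    my $xml_text = '<XML><job><title> Teacher </title>'
                 . '<state> TX </state><city> Dallas </city></job></XML>';

    my $xml_data = XMLin(
        $xml_text,
        ForceArray     => ['job'],  # {job} is an arrayref even with a single <job>
        KeyAttr        => [],       # don't fold the list into a hash keyed on name/key/id
        NormaliseSpace => 2,        # trim and collapse whitespace in all values
    );

    my $jobs = $xml_data->{job};
    print scalar(@$jobs), " job(s); first title is '$jobs->[0]{title}'\n";
    ```

    With ForceArray in place, the foreach loop above works the same way whether the file has one job or a thousand.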
    

    That should help you out some. At this point you're going to have to read about how hashes work in Perl. The solution is going to look something like this, but it's not going to make sense if you haven't understood hashes yet.

    $num_jobs_for{$title}++;
    $num_jobs_for_title_in_city{$title}{$city}++;
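
    Once hashes start to make sense, those two lines can be put together roughly like this. Treat it as a sketch rather than the one right way: the $jobs list is hard-coded here as a stand-in for $xml_data->{job} (assuming the whitespace around the values has already been trimmed), and report_line is a helper name made up for the example.

    ```perl
    #!/usr/bin/env perl
    use strict;
    use warnings;

    # Hypothetical stand-in for $xml_data->{job}: an arrayref of hashrefs.
    my $jobs = [
        { title => 'Teacher',  state => 'TX', city => 'Dallas' },
        { title => 'Teacher',  state => 'CA', city => 'Los Angeles' },
        { title => 'Engineer', state => 'NY', city => 'Manhattan' },
        { title => 'Engineer', state => 'NY', city => 'Manhattan' },
    ];

    my (%num_jobs_for, %num_jobs_for_title_in_city);

    foreach my $job (@$jobs) {
        my ($title, $city) = @{$job}{qw(title city)};
        $num_jobs_for{$title}++;                       # jobs per title
        $num_jobs_for_title_in_city{$title}{$city}++;  # jobs per title per city
    }

    # The number of distinct cities for a title is the number of keys
    # in that title's inner hash.
    sub report_line {
        my ($title) = @_;
        my $num_cities = keys %{ $num_jobs_for_title_in_city{$title} };
        return sprintf '%s:%d Cities:%d', $title, $num_jobs_for{$title}, $num_cities;
    }

    print report_line($_), "\n" for sort keys %num_jobs_for;
    ```

    No string comparisons or multidimensional arrays are needed: the hash keys do the de-duplicating for you, and the inner counts are there if you later want jobs per city as well.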
    

    Good luck! And feel free to post again when you get farther.