
Add eight thousand lines in one commit


I have a log file and a Perl script that reads the log file and transcribes it. I want to insert all of its lines (eight thousand lines) in one commit.

My script:

# Connect to the database.
my $dbh = DBI->connect(
    "DBI:mysql:database=DB;host=>IP",
    "hostname", 'password',
    {'RaiseError' => 1,'AutoCommit'=> 0}
);

    open (FILE, 'file.log');
    while (<FILE>) {

        ($word1, $word2, $word3, $word4, $word5, $word6, $word7, $word8, $word9, $word10, $word11, $word12, $word13, $word14) = split(" ");

        $word13 =~ s/[^\d.]//g;
        if ($word2 eq "Feb") {
                $word2 = "02"  
        }

        print "'$word5-$word2-$word3 $word4', $word11, $word13 \n";

        eval {
            # we could use INSERT, but there would be duplicates, and here the table is unique
            my $sth = $dbh->prepare("INSERT INTO `test_query` (time, cstep, time_in_seconde) VALUES('$word5-$word2-$word3 $word4', $word11, $word13);");

            #print $sth->rows . " rows found.\n";
            #$sth->finish;          

            # do inserts, updates, deletes, queries here
            #$sth->execute() or die "execution failed: $dbh->errstr()";
            $sth->execute() or die "execution failed: $dbh->errstr()";
            $dbh->commit();

        };

        ### If something went wrong...

    }

$dbh->disconnect();

Thanks


Solution

  • For better performance, you want to simplify your code and move as much code as possible out of the loop:

    • prepare the statement outside the loop, using bind parameters: the statement is always the same, only the bind parameter values change

    • commit outside the loop: this increases performance and also has the advantage of making your process atomic. Since all changes occur within the same database transaction, either all lines are processed (and committed), or, if a failure occurs on any line, no line at all is committed. While implementing this optimization, watch the resource usage on your database (a single big transaction typically requires more space in the UNDO tablespace); if resources are not sufficient, either increase them or commit every Nth record, with N as high as possible (a sketch of this batched-commit variant follows the first code block below)

    • avoid printing inside the loop unless you really need it (I commented that line out)

    • you are building the connection with the RaiseError attribute enabled, but then you ignore errors that can occur at execute. If this is really what you want, just disable the RaiseError attribute on the statement handle and remove the eval around execute

    Other considerations in terms of coding practices:

    • always use strict and use warnings

    • use an array to store the parsed data instead of a list of scalars: this could make your code faster and will make it more readable

    Code:

    use strict;
    use warnings;
    use DBI;
    
    # Connect to the database.
    my $dbh = DBI->connect(
        "DBI:mysql:database=DB;host=IP",
        "hostname", 'password',
        {'RaiseError' => 1, 'AutoCommit' => 0}
    );
    
    # prepare the insert statement
    my $sth = $dbh->prepare("INSERT INTO `test_query` (time, cstep, time_in_seconde) VALUES(?, ?, ?)");
    $sth->{RaiseError} = 0;
    
    open (my $file, '<', 'file.log') or die "could not open: $!";
    while (<$file>) {
        my @words = split ' ';   # split on whitespace runs, like the original split(" ")
        $words[12] =~ s/[^\d.]//g;
        if ($words[1] eq "Feb") {
            $words[1] = "02";
        }
    
        # print "'$words[4]-$words[1]-$words[2] $words[3]', $words[10], $words[12] \n";
        $sth->execute( "$words[4]-$words[1]-$words[2] $words[3]", $words[10], $words[12] );
    
    }
    
    $dbh->commit;
    $dbh->disconnect;
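
    If the single big transaction turns out to be too heavy for your database resources, a minimal sketch of the batched-commit variant mentioned in the first bullet could look like this (the batch size of 1000 is an arbitrary value to tune against your resources):

    use strict;
    use warnings;
    use DBI;
    
    my $dbh = DBI->connect(
        "DBI:mysql:database=DB;host=IP",
        "hostname", 'password',
        {'RaiseError' => 1, 'AutoCommit' => 0}
    );
    
    my $sth = $dbh->prepare("INSERT INTO `test_query` (time, cstep, time_in_seconde) VALUES(?, ?, ?)");
    
    my $batch_size = 1000;   # example value: keep N as high as your resources allow
    my $count = 0;
    
    open (my $file, '<', 'file.log') or die "could not open: $!";
    while (<$file>) {
        my @words = split ' ';
        $words[12] =~ s/[^\d.]//g;
        if ($words[1] eq "Feb") {
            $words[1] = "02";
        }
        $sth->execute( "$words[4]-$words[1]-$words[2] $words[3]", $words[10], $words[12] );
        # commit every $batch_size rows instead of only once at the very end
        $dbh->commit unless ++$count % $batch_size;
    }
    
    $dbh->commit;   # commit the last, possibly partial, batch
    $dbh->disconnect;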
    

    The last solution, which would probably perform even faster than this one, is to use the DBI method execute_array to perform bulk database inserts. The ArrayTupleFetch attribute can be used to provide a code reference that DBI will invoke every time it is ready to perform the next INSERT: this code reference should read the next file line and return an array reference of values suitable for the INSERT. When the file is exhausted, the sub should return undef, which tells DBI that the bulk process is complete.

    Code:

    #!/usr/local/bin/perl
    
    use strict;
    use warnings;
    use DBI;
    
    # open the file
    open (my $file, '<', 'file.log') or die "could not open: $!";
    
    # connect the database
    my $dbh = DBI->connect("DBI:mysql:database=DB;host=ip", "hostname", 'password', {'RaiseError' => 1,'AutoCommit'=> 0});
    
    # prepare the INSERT statement
    my $sth = $dbh->prepare("INSERT INTO `test_query` (time, cstep, time_in_seconde) VALUES(?, ?, ?)");
    
    # run bulk INSERTS
    my $tuples = $sth->execute_array({ 
        ArrayTupleStatus => \my @tuple_status,
        ArrayTupleFetch => sub {
            my $line = <$file>;
            return unless defined $line;
            my @words = split ' ', $line;   # split $line, not the default $_
            # ... do anything you like with the array, then ...
            return [ "$words[4]-$words[1]-$words[2] $words[3]", $words[10], $words[12] ];
        }
    });
    
    if ($tuples) {
        print "Successfully inserted $tuples records\n";
    } else {
        # do something useful with @tuple_status, which contains the detailed results (see the sketch below)
    }
    
    $dbh->commit;
    $dbh->disconnect;
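
    Since RaiseError is enabled on this connection, a failed execute_array would die rather than return undef, so to reach the else branch you would trap the error (for instance by wrapping the execute_array call in an eval). Assuming that, here is a minimal sketch of inspecting @tuple_status: per the DBI documentation, each slot holds either the return value of the corresponding execute(), or an array reference of the form [ $err, $errstr, $state ] for a failed tuple.

    for my $i (0 .. $#tuple_status) {
        my $status = $tuple_status[$i];
        # failed tuples are reported as [ $err, $errstr, $state ]
        if (ref $status eq 'ARRAY') {
            my ($err, $errstr) = @$status;
            warn "tuple $i failed: $errstr (error $err)\n";
        }
    }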