Search code examples
perlhashexistsperl-data-structures

Avoid DBI duplicate primary key with exists operator


I want to insert information from a FASTA file into a table in a MySQL database. I used the Ensembl_id column as the primary key.

Some of my Ensembl_id are not unique, so I tried to use the exists operator to overcome this issue. But only 5 rows had been inserted into the table, and among them only one with duplicate Ensembl_id values.

#!/usr/bin/perl -w

#usage script.pl <username> <password> <database_name> <mouse_genes> <mouse_transcripts>

use DBI;
use Data::Dumper;

my $user              = shift @ARGV or die $!;
my $password          = shift @ARGV or die $!;
my $database          = shift @ARGV or die $!;
my $mouse_genes       = shift @ARGV or die $!;
my $mouse_transcripts = shift @ARGV or die $!;

my $dbh = DBI->connect( "dbi:mysql:$database:localhost", "$user", "$password",
    { RaiseError => 1 } );
my %gene;

$/ = "\n>";

open( FILE, "gzip -d -c /data.dash/class2016/student/Mus_musculus.GRCm38.cdna.all.fa.gz |" )
        or die $!;

LOOP:
while ( <FILE> ) {

    my $line = $_;
    chomp $line;

    if ( $line =~ /[a-z]/ ) {

        my @array = split( "\t", $line );

        if ( m/gene:(\w+\d+\.\w+)/ ) {

            my $Ensembl_id = $1;

            if ( !exists $gene{$Ensembl_id} ) {
                $gene{$Ensembl_id} = 1;
            }
            else {
                next;
            }

            if ( m/gene_biotype:(\w+)/ ) {

                my $gene_biotype = $1;

                if ( m/gene_symbol:(\w+\D\d+)/ ) {

                    my $gene_symbol = $1;

                    if ( m/description:(\w+\s+\w+\s+\w+\s+)/ ) {

                        my $gene_description = $1;

                        if ( m/MGI:(\d+)/ ) {

                            my $MGI_accession = $1;
                            my $sth           = $dbh->prepare(
                                qq{insert into $mouse_genes (Ensembl_id,gene_biotype,gene_symbol,gene_description,MGI_accession) values ("$Ensembl_id","$gene_biotype","$gene_symbol","$gene_description","$MGI_accession")}
                            );
                            $sth->execute();
                            $sth->finish();

                            next LOOP;
                        }
                    }
                }
            }
        }
    }
}

close FILE;

$dbh->disconnect();

How I can use the exists operator to move to the next line of the file if the primary key $Ensembl_id is duplicated?


Solution

  • I thought I saw a question very similar to this, but I can't find it

    The solution is to forget about hashes and use the IGNORE keyword to avoid raising an error. The MySQL documentation says this:

    If you use the IGNORE keyword, errors that occur while executing the INSERT statement are ignored. For example, without IGNORE, a row that duplicates an existing UNIQUE index or PRIMARY KEY value in the table causes a duplicate-key error and the statement is aborted. With IGNORE, the row is discarded and no error occurs. Ignored errors may generate warnings instead, although duplicate-key errors do not.

    You should also use placeholders in your SQL statement, so it should look like this

    Note that END_SQL must appear without and spaces before or after it. You may want to define the SQL statement at the top of your program to avoid spoiling the indenting

    my $sth = $dbh->prepare(<<END_SQL);
    INSERT IGNORE INTO $mouse_genes (
        Ensembl_id,
        gene_biotype,
        gene_symbol,
        gene_description,
        MGI_accession
    )
    VALUES ( ?, ?, ?, ?, ? )
    END_SQL
    
    $sth->execute($Ensembl_id, $gene_biotype, $gene_symbol, $gene_description, $MGI_accession);
    


    Update

    Your program can be tidied up a lot to make it easier to read. Here's how I would write it

    #!/usr/bin/perl
    
    use strict;
    use warnings 'all';
    
    # usage script.pl <username> <password> <database_name> <mouse_genes> <mouse_transcripts>
    
    use DBI;
    
    my $user              = shift @ARGV or die $!;
    my $password          = shift @ARGV or die $!;
    my $database          = shift @ARGV or die $!;
    my $mouse_genes       = shift @ARGV or die $!;
    my $mouse_transcripts = shift @ARGV or die $!; # Not used at present
    
    my $dbh = DBI->connect( "dbi:mysql:$database:localhost", $user, $password,
            { RaiseError => 1, PrintError => 0 } );
    
    my $sth = $dbh->prepare( <<END_SQL );
    INSERT IGNORE INTO $mouse_genes (
        Ensembl_id,
        gene_biotype,
        gene_symbol,
        gene_description,
        MGI_accession
    )
    VALUES ( ?, ?, ?, ?, ? )
    END_SQL
    
    my $cmd = 'gzip -d -c /data.dash/class2016/student/Mus_musculus.GRCm38.cdna.all.fa.gz';
    
    open my $cmd_fh, '-|', $cmd or die $!;
    
    $/ = "\n>";
    
    while ( <$cmd_fh> ) {
    
        next unless my ( $ensembl_id )       = /gene:(\w+\d+\.\w+)/;
        next unless my ( $gene_biotype )     = /gene_biotype:(\w+)/;
        next unless my ( $gene_symbol )      = /gene_symbol:(\w+\D\d+)/;
        next unless my ( $gene_description ) = /description:(\w+\s+\w+\s+\w+)\s/;
        next unless my ( $mgi_accession )    = /MGI:(\d+)/;
    
        $sth->execute( $ensembl_id, $gene_biotype, $gene_symbol, $gene_description, $mgi_accession );
    }
    
    $dbh->disconnect;