
How to quickly insert 1 million rows into MySQL using Rust


I'm still learning Rust. For processing data I find Rust runs very fast, but when inserting data into MySQL I haven't found a way to do it quickly: a Python script finishes in under 1 minute, while my Rust version needs 15+ minutes.

I'm using the Rust crate "mysql" to insert data into the DB with this code:

use mysql::*;
use mysql::prelude::*; // for the Queryable trait (exec_batch etc.)

#[derive(Debug, PartialEq, Eq)]
struct ListPhone {
    phone_no: String,
}

fn insert_mysql(data: Vec<ListPhone>) {
    let url = "mysql://user:pass@xxx.xxx.xxx.xxx:3306/learn";

    let pool = mysql::Pool::new(url).expect("error pool conn");
    let mut conn = pool.get_conn().expect("error conn");

    // exec_batch runs the parameterized INSERT once per row.
    conn.exec_batch(
        r"INSERT INTO listPhone (phone_no)
          VALUES (:phone_no)",
        data.into_iter().map(|p| {
            params! {
                "phone_no" => p.phone_no,
            }
        }),
    )
    .expect("error running");
    println!("done")
}

Has anyone explored quickly inserting large datasets into MySQL using Rust (batched or row by row), or is there a library that can help?

Edit: when I log all the queries, I can see the difference between the queries generated by the Python and Rust scripts. The Python script produces a single statement like this:

insert into listPhone (phone_no) values (1),(2),(3)... all the data

and executes it as one statement, but Rust generates:

"insert into listPhone (phone_no) values (1)",

"insert into listPhone (phone_no) values (2)",

"insert into listPhone (phone_no) values (3)", ... the rest of the data

and executes each statement separately.

Is there a way to combine these into a single statement?

Edit 2:

I tried the solution below. I don't know how idiomatic or efficient it is, but it runs faster than my Python script. I still hope someone has a more elegant solution.

use std::env;

fn insert_mysql_batch_experiment(data: Vec<ListPhone>) {
    let url = env::var("DATABASE_URL").expect("load env failed");
    let pool = mysql::Pool::new(&*url).expect("error pool conn");
    let mut conn = pool.get_conn().expect("error conn");

    // Build one multi-row statement:
    // INSERT INTO listPhone (phone_no) VALUES ('1'),('2'),...
    // Note: the values are spliced into the SQL unescaped, so this is
    // only safe for trusted data.
    let mut vec = Vec::new();
    for x in data.iter() {
        vec.push(x.phone_no.clone());
    }
    let hasil = vec.join("'),('");
    let final_str = [
        "INSERT INTO listPhone (phone_no) VALUES ",
        "('",
        &hasil,
        "')",
    ]
    .join("");

    conn.query_iter(final_str).expect("error inserting data");
}
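
For comparison, here is a sketch of the same single-statement idea using ? placeholders, so the driver binds the values instead of splicing them into the SQL string. The function name is mine, it assumes the same imports as above, and it is not from the original post:

fn insert_mysql_single_statement(conn: &mut mysql::PooledConn, data: Vec<ListPhone>) {
    if data.is_empty() {
        return;
    }
    // One statement: INSERT INTO listPhone (phone_no) VALUES (?),(?),...
    // Assumes data.len() stays below MySQL's 65535-placeholder limit for
    // prepared statements and below max_allowed_packet; chunk otherwise.
    let placeholders = vec!["(?)"; data.len()].join(",");
    let query = format!("INSERT INTO listPhone (phone_no) VALUES {}", placeholders);
    let values: Vec<String> = data.into_iter().map(|p| p.phone_no).collect();
    conn.exec_drop(query, values).expect("error inserting data");
}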

Solution

  • From what I can see, exec_batch executes each statement independently: without an explicit transaction, every insert is committed on its own. Executing all inserts in the same transaction should give you a significant speed-up:

    let pool = mysql::Pool::new(url).expect("error pool conn");
    let mut conn = pool.get_conn().expect("error conn");
    let mut trans = conn
        .start_transaction(TxOpts::default())
        .expect("error trans");
    let res = trans.exec_batch(
        r"INSERT INTO listPhone (phone_no)
          VALUES (:phone_no)",
        data.into_iter().map(|p| {
            params! {
                "phone_no" => p.phone_no,
            }
        }),
    );
    res.expect("error batching");
    trans.commit().expect("error commit");
    

    Note however that you may run into lock contention (e.g. deadlocks or lock wait timeouts) if any of the rows touched by this transaction are modified by another client, so splitting the exec_batch into smaller batches with data.chunks(1234), committing after each batch, and retrying failed batches may be necessary, as sketched below.
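
    For illustration, a minimal sketch of that chunked variant (the chunk size of 1234 and the three-attempt retry are arbitrary choices, not from the original answer):

    fn insert_chunked(conn: &mut mysql::PooledConn, data: &[ListPhone]) -> mysql::Result<()> {
        for chunk in data.chunks(1234) {
            let mut attempts = 0;
            loop {
                attempts += 1;
                // One transaction per chunk; retry a failed chunk a few
                // times, e.g. after a deadlock.
                let result = (|| -> mysql::Result<()> {
                    let mut trans = conn.start_transaction(TxOpts::default())?;
                    trans.exec_batch(
                        r"INSERT INTO listPhone (phone_no)
                          VALUES (:phone_no)",
                        chunk.iter().map(|p| {
                            params! {
                                "phone_no" => p.phone_no.as_str(),
                            }
                        }),
                    )?;
                    trans.commit()
                })();
                match result {
                    Ok(()) => break,
                    Err(e) if attempts < 3 => eprintln!("retrying chunk: {}", e),
                    Err(e) => return Err(e),
                }
            }
        }
        Ok(())
    }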

    Another approach I've used to speed up inserts for Postgres is to prepare (e.g.) 14 statements like INSERT INTO … VALUES (row1), (row2), … for 1, 2, 4, 8, …, 8192 rows, and then insert the rows in chunks, using the largest prepared statement that fits the remaining data. I can't test this code right now, but it should look something like this:

    // Prepared statements for 1, 2, 4, ..., 8192 rows (14 statements).
    let preps = (0..14)
        .map(|i| {
            let mut query = "INSERT INTO listPhone (phone_no) VALUES (?)".to_owned();
            for _ in 1..(1 << i) {
                query.push_str(",(?)");
            }
            trans.prep(query)
        })
        .collect::<Result<Vec<_>, _>>()
        .expect("prepare");
    let mut data = &data[..];
    while !data.is_empty() {
        // Pick the largest prepared statement that fits the remaining rows.
        let prep = preps
            .iter()
            .rfind(|s| s.params().len() <= data.len())
            .unwrap();
        let chunk = data[..prep.params().len()]
            .iter()
            .map(|p| p.phone_no.as_str())
            .collect::<Vec<_>>();
        trans.exec_iter(prep, chunk).expect("insert");
        data = &data[prep.params().len()..];
    }
    

    Whether this yields any further speed-up with MySQL is unknown to me.