
Reading a float array from file and converting it to integer


I am looking for the Rust equivalent of the following Python code:

import numpy as np
import pandas as pd

X = pd.read_csv('./myfile.tsv', sep='\t')
X1 = (X > 0).astype(np.float64).T
X2 = X1.to_numpy()

I've seen that polars can be used as a Rust equivalent of pandas, but perhaps there is a better way of doing it, since no data frame manipulation is intended (the rest of my Python code operates on a numpy array, and pandas is used just as a convenient way of parsing a tsv file).

Related
How can I create an array from a CSV column encoded as a string using Polars in Rust? (not answered)
How to read local csv file as ndarray using Rust (seems to be the standard way (see also here), but it seems very verbose; perhaps the code provided does a lot more than what I need)


Solution

  • From the discussion in the comments, I assume that you use the following input file:

    0   1   2
    1.0279113360087446  -1.2832284218816041 -0.9511599763983775
    -1.1854242089984073 -0.008517913446124657   -1.3300888479137685
    -0.17521484641409052    -0.12088194195850789    -0.08723124550308935
    0.061450180456309234    0.6382691829655216  -0.3221706205270814
    -0.17264583969234573    0.3906165503608199  -0.7023512952269605
    -0.5688680458505405 0.7629597902952466  0.1591487223247267
    -0.2866576739505336 0.8416529504197675  -0.21334731046185212
    -0.3731653844853498 -0.03664374978977539    1.0659217203299267
    0.2522037897994046  -1.2283963325279825 0.582406079711755
    1.066724230889717   -0.630727285387302  0.9536582516683726
    0.629243230148583   -0.6960460436000655 0.4521748684016147
    -1.5540598822950011 0.9873241509921236  0.6415246342947979
    -0.0284157295982256 -0.18702110687157966    1.7770271811904519
    1.2382847672051143  -0.3760108519487906 -0.16110341746476323
    -0.2808874342459878 0.6487504756926984  1.9778474878186199
    -0.37522505843289716    1.7209367591622693  -0.19706519630516914
    -0.33530410802770294    -0.04999186121022599    -0.675375947654844
    -2.0252464624551956 -0.27944625298143044    1.385051832284722
    1.2957606931360681  0.7365431841643268  1.3572525489892076
    -1.3877762318274933 1.166958182611204   0.685506702653605
    

    Which, combined with this code:

    import numpy as np
    import pandas as pd
    
    X = pd.read_csv('./myfile.tsv', sep='\t')
    X1 = (X > 0).astype(np.float64).T
    X2 = X1.to_numpy()
    
    print(X2)
    

    produces the following output:

    [[1. 0. 0. 1. 0. 0. 0. 0. 1. 1. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0.]
     [0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 0. 1. 1.]
     [0. 0. 0. 0. 0. 1. 0. 1. 1. 1. 1. 1. 1. 0. 1. 0. 0. 1. 1. 1.]]
    

    Algorithmically, I assume:

    • You read a CSV file with tab separators, treating the first row as column headers rather than data (this is what pd.read_csv(.., sep='\t') does by default)
    • You convert the data into 1.0 or 0.0, depending on whether the value is larger than zero ((X > 0).astype(np.float64))
    • You transpose the data (.T)

    And I assume you want the data to be stored somewhat efficiently, so that not every row/column is its own object.


    That said, there are many ways to achieve this in Rust. The fact that you use numpy and pandas in Python suggests that you would rather build on existing high-level data manipulation libraries than implement everything yourself.

    In Python, that choice may also be driven by performance, since iterating over data with explicit loops is slow there. In Rust, a hand-written for loop over the data typically performs about as well as a library call, because Rust values and primitives carry very little overhead compared to their Python counterparts.
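
    To illustrate the hand-written route, here is a minimal sketch that uses only the standard library. The function name binarize_transposed and the (rows, cols, flat buffer) return shape are my own choices for illustration, not part of any crate. It assumes a tab-separated input whose first line is a header, and it stores the transposed result in a single flat, row-major Vec so that rows are not separate objects.

```rust
/// Parse tab-separated text, skip the header line, map each value to 1.0 or 0.0,
/// and return the transposed result as (rows, cols, flat row-major buffer).
/// Hypothetical helper for illustration; not taken from any library.
fn binarize_transposed(text: &str) -> Result<(usize, usize, Vec<f64>), String> {
    // Parse every data line into a row of f64 values; the first line is the header.
    let mut rows: Vec<Vec<f64>> = Vec::new();
    for (idx, line) in text.lines().enumerate().skip(1) {
        if line.trim().is_empty() {
            continue; // tolerate blank lines
        }
        let row = line
            .split('\t')
            .map(|field| field.trim().parse::<f64>())
            .collect::<Result<Vec<f64>, _>>()
            .map_err(|e| format!("line {}: {}", idx + 1, e))?;
        rows.push(row);
    }

    let n_rows = rows.len();
    let n_cols = rows.first().map_or(0, |r| r.len());
    if rows.iter().any(|r| r.len() != n_cols) {
        return Err("rows have differing lengths".to_string());
    }

    // The output has n_cols rows and n_rows columns (transposed), stored row-major.
    let mut out = vec![0.0; n_rows * n_cols];
    for (i, row) in rows.iter().enumerate() {
        for (j, &val) in row.iter().enumerate() {
            if val > 0.0 {
                out[j * n_rows + i] = 1.0;
            }
        }
    }
    Ok((n_cols, n_rows, out))
}

fn main() {
    // Two data rows with three columns, preceded by a header line.
    let text = "0\t1\t2\n1.5\t-1.2\t-0.9\n-1.1\t-0.008\t0.3\n";
    let (rows, cols, data) = binarize_transposed(text).expect("parse failed");
    println!("{rows} x {cols}: {data:?}");
}
```

    For real input you would pass the result of std::fs::read_to_string to the function. Note that this sketch does not handle quoting or escaped delimiters, which is one reason to prefer a proper CSV parser for anything beyond simple numeric files.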

    There are a couple of high-level Rust crates that fill roles similar to your Python libraries. For data deserialization, I recommend serde together with a format crate built on it, in this case probably csv.

    Then for data representation, I recommend ndarray.

    You seem to have found those two already yourself, but I just wanted to confirm that those two are good choices.


    Here is a possible Rust equivalent of your Python code.

    Dependencies (in Cargo.toml):

    [dependencies]
    csv = "1.3.1"
    ndarray = "0.16.1"
    ndarray-csv = "0.5.3"
    

    Code:

    use ndarray_csv::Array2Reader;
    
    fn main() {
        let arr = csv::ReaderBuilder::new() // Configure your own CSV reader (required because tab separated)
            .delimiter(b'\t') // Specify tab separated
            .from_path("./myfile.tsv") // Open file
            .expect("Unable to open input file!") // Handle file open error
            .deserialize_array2_dynamic::<f64>() // Deserialize as f64 (f64 is the equivalent to Python floats, so I assume you want this)
            .expect("Unable to parse file content!") // Handle file parsing error
            .mapv_into(|val| if val > 0.0 { 1.0 } else { 0.0 }) // Perform the `X > 0` conversion
            .reversed_axes(); // Transpose in-place without actually copying any data
    
        println!("{:?}", arr); // Debug print
    }

    Output:
    
    [[1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0],
     [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0]], shape=[3, 20], strides=[1, 3], layout=Ff (0xa), const ndim=2
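
    The shape=[3, 20], strides=[1, 3] part of the debug output shows why the transpose is free: the buffer still holds the original 20x3 row-major data, and only the per-axis strides (counted in elements) were swapped. The following standard-library sketch mimics that idea on a tiny 2x3 example. The offset helper is hypothetical and only illustrates the indexing arithmetic; it is not ndarray's actual implementation.

```rust
/// Offset of element (i, j) in a flat buffer, given per-axis strides in elements.
/// Hypothetical helper for illustration only.
fn offset(index: (usize, usize), strides: (usize, usize)) -> usize {
    index.0 * strides.0 + index.1 * strides.1
}

fn main() {
    // A 2x3 matrix stored row-major: [[1, 2, 3], [4, 5, 6]], so the strides are (3, 1).
    let buf = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    let strides = (3, 1);

    // A transposed 3x2 view over the same buffer: only the strides are swapped.
    let t_strides = (strides.1, strides.0);
    for i in 0..3 {
        let row: Vec<f64> = (0..2).map(|j| buf[offset((i, j), t_strides)]).collect();
        println!("{row:?}");
    }
}
```

    With the strides from the output above, element (i, j) of the 3x20 result sits at offset i * 1 + j * 3 in the buffer, so the last element (2, 19) is at offset 59, the final slot of the 60-element buffer.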