Tags: python, pandas, csv, audio, wav

How can I use commas inside a csv column to separate multiple floating point values?


I have a question that is closely related to this question here:

How to convert .wav files into a Pandas DataFrame in order to feed it to a neural network?

I have created a pandas DataFrame with the following code:

df = pd.DataFrame(data={"wavsamples": pd.Series(wavsamples),
                        "wavsamplerate": pd.Series(wavsamplerate),
                        "wavname": pd.Series(wavname)},
                  copy=False,
                  columns=['wavsamples', 'wavsamplerate', 'wavname'])
df.index.name = 'filenumber'
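
For context, the three lists hold the raw samples, sample rate and file name of each .wav file; they were filled roughly along these lines (sketched here with the soundfile library - the exact loading code doesn't matter for the question):

import soundfile as sf

wavsamples, wavsamplerate, wavname = [], [], []
for path in ["Audio1.wav", "Audio2.wav"]:  # ... one entry per file, 12 in total
    data, rate = sf.read(path)  # data is an np.ndarray of float samples
    wavsamples.append(data)
    wavsamplerate.append(rate)
    wavname.append(path)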

If I print the 'wavsamples' column of my pandas DataFrame

with

print(df.wavsamples.to_string(index=False))

it shows me a pandas Series that looks like this:

[0.02709961, 0.06796265, -0.011810303, -0.23361...
[0.0068969727, 0.04547119, 0.043029785, -0.1025...
[-0.005432129, 0.021057129, 0.078063965, 0.0270...
[0.00079345703, 0.064941406, 0.09710693, -0.088...
[-0.0067749023, 0.008087158, 0.06536865, 0.0219...
[-0.008758545, 0.015106201, 0.08139038, 0.02600...
[-0.0034179688, 0.039733887, 0.07711792, 0.1164...
[-0.0008087158, -0.000579834, -0.00062561035, -...
[0.021026611, 0.029907227, 0.040527344, 0.05448...
[0.017288208, 0.026321411, 0.0340271, 0.0403137...
[0.019561768, 0.026611328, 0.03668213, 0.047576...
[0.022827148, 0.03414917, 0.056289673, 0.078018...

Each of these 12 rows represents the raw floating point sample values of one .wav file. Now, if I write these arrays into a column of a CSV file with:

df.to_csv("./test.csv", sep=',', columns = ['wavsamples','wavsamplerate','wavname'])

I get the following csv file:

filenumber,wavsamples,wavsamplerate,wavname
0,"[ 0.02709961  0.06796265 -0.0118103  ... -0.36627197 -0.36645508
 -0.3657837 ]",44100,Audio1.wav
1,"[ 0.00689697  0.04547119  0.04302979 ... -0.03359985 -0.03244019
 -0.03167725]",44100,Audio2.wav
2,"[-0.00543213  0.02105713  0.07806396 ...  0.45645142  0.45541382
  0.45510864]",44100,Audio3.wav
3,[0.00079346 0.06494141 0.09710693 ... 0.22116089 0.22421265 0.22741699],44100,Audio4.wav
4,"[-0.0067749   0.00808716  0.06536865 ...  0.24209595  0.23977661
  0.23754883]",44100,Audio5.wav
5,"[-0.00875854  0.0151062   0.08139038 ... -0.0256958  -0.0184021
 -0.01156616]",44100,Audio6.wav
6,"[-0.00341797  0.03973389  0.07711792 ...  0.41384888  0.41375732
  0.41348267]",44100,Audio7.wav
7,"[-0.00080872 -0.00057983 -0.00062561 ...  0.0100708   0.0100708
  0.01000977]",44100,Audio8.wav
8,[0.02102661 0.02990723 0.04052734 ... 0.00976562 0.00965881 0.00990295],44100,Audio9.wav
9,[0.01728821 0.02632141 0.0340271  ... 0.01344299 0.01341248 0.01325989],44100,Audio10.wav
10,[0.01956177 0.02661133 0.03668213 ... 0.0141449  0.01400757 0.01402283],44100,Audio11.wav
11,[0.02282715 0.03414917 0.05628967 ... 0.01019287 0.01037598 0.01025391],44100,Audio12.wav

So the column 'wavsamples' lost all of its commas. If I now read the csv file back and print the last row's 'wavsamples' cell with:

with open("./test.csv", "r") as csv_file:
    reader = csv.reader(csv_file)
    rows = list(reader)
    audiofile = rows[12][1]
    print(audiofile)

I just get:

[0.02282715 0.03414917 0.05628967 ... 0.01019287 0.01037598 0.01025391]

Not only have all the commas been removed, but because the wavsamples column is written as a character string, the three dots of the truncated array representation end up in the file as literal characters, so all the sample values in between are lost when writing to the csv ...

I know that csv is possibly the worst format for storing .wav data, as pointed out many times here on Stack Overflow ... but I'm just curious - is there any way to store audio arrays with commas between the floating point values inside a csv column?

I want to get a result like this when I read something from the csv:

[0.022827148, 0.03414917, 0.056289673, 0.078018...

Instead of this:

[0.02282715 0.03414917 0.05628967 ... 0.01019287 0.01037598 0.01025391]

How could I write the csv column so that I could read it correctly afterwards?


Solution

  • The CSV format does not support list types in a column; you need scalar values. What happens here is that pandas implicitly casts the column containing the list type to a string. It has nothing to do with your chosen delimiter.

    One possible way to handle this, if you must have the CSV format, is to parse the column back into a list type with ast.literal_eval, applied across that column when you read the data back in.

    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame({'a': [[1, 2], [2, 3], [3, 4]],
                       'b': [4, 5, 6]})
    print(df.head())
    
    df.to_csv('nested_test.csv', index=False)
    
    df = pd.read_csv('nested_test.csv')
    print(df.head())
    
    for _, row in df.iterrows():
        # Note that, though it *looked* like a list in df.head()
        # we just get [ printed, as the first character of the
        # string it actually is
        print(row['a'][0]) 
          
    import ast  
    df['a'] = df['a'].apply(ast.literal_eval)
    
    for _, row in df.iterrows():
        print(row['a'][0]) # Now we get the first item in the list
    

    If you used polars instead of pandas, this implicit cast would not be allowed and it would throw an exception, despite the fact that polars has a List type as a first-class citizen. For this kind of data, you should really be looking at a format such as parquet, which is not only many times faster to read, but also handles the nested structure of your column(s) natively.
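
    As a rough sketch of the parquet route (this assumes the pyarrow engine is installed; note that the list column round-trips without being cast to a string):

    import pandas as pd

    df = pd.DataFrame({'a': [[1, 2], [2, 3], [3, 4]],
                       'b': [4, 5, 6]})

    df.to_parquet('nested_test.parquet')          # the nested column is stored natively
    df2 = pd.read_parquet('nested_test.parquet')

    print(df2['a'][0])     # [1 2] -- an array-like, not a string
    print(df2['a'][0][0])  # 1     -- elements are directly accessible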

    Finally, in your question you specify using the csv module to read the data back in. You can do this, but I don't think it's particularly elegant, given the restrictions on CSV that I mentioned. The following works for the example I gave; it assumes that all other (non-list) columns are int, otherwise you'll need to handle them one by one.

    import csv

    with open('nested_test.csv') as infile:
        reader = csv.reader(infile)
        headers = next(reader)  # skip the header row
        rebuilt = []
        for row in reader:
            # First field is the stringified list; the remaining fields are plain ints.
            rebuilt.append([ast.literal_eval(row[0]), *map(int, row[1:])])
        print(rebuilt)  # [[[1, 2], 4], [[2, 3], 5], [[3, 4], 6]]
    

    Just to complicate things further, you don't actually have lists in your column but np.ndarray objects. When those are converted to strings, numpy's string form (str(arr), as opposed to repr(arr)) drops the commas, on top of the other complications.

    arr = np.array([1., 2., 3.])
    print(arr)  # prints: [1. 2. 3.] -- no commas
    

    Save yourself an extra headache by using:

    df['a'] = df['a'].apply(np.ndarray.tolist)
    

    before df.to_csv()
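
    Putting the pieces together for the original DataFrame from the question, a minimal sketch could look like this (it assumes df['wavsamples'] holds np.ndarray values, as in the question, and uses read_csv's converters= argument to run ast.literal_eval on that column while reading):

    import ast
    import numpy as np
    import pandas as pd

    # Convert the ndarray column to plain Python lists so the commas survive to_csv().
    df['wavsamples'] = df['wavsamples'].apply(np.ndarray.tolist)
    df.to_csv("./test.csv", sep=',',
              columns=['wavsamples', 'wavsamplerate', 'wavname'])

    # Read it back, parsing each stringified list into a real list of floats.
    df2 = pd.read_csv("./test.csv",
                      converters={'wavsamples': ast.literal_eval})
    print(df2['wavsamples'][11][:3])  # e.g. [0.022827148, 0.03414917, 0.056289673]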

    ... you might now be seeing why CSV is not a great format here ...