Tags: python, r, large-files

Column selection for very large tab-delimited text files


I am working with a very large tab-delimited table (about 20000 rows * 30000 columns) stored as a .txt file, which makes it larger than 20GB. Given the RAM on my desktop, there is no way I can read the file directly with read.delim in R or with pandas in Python. I am currently planning to make a smaller subset of the table by filtering the columns, so the new text file would be about 20000 rows * 1200 columns. I have already saved the names of the target columns in a separate file, so my next step is to check whether the name of each column in the original file is on that list and subset the data frame accordingly. My question is: how can I do this filtering most concisely?

I know pandas has a chunking mechanism, but it seems to be about splitting the data frame by a given number of rows, and the bash script I had been working on also sliced by rows. I would really appreciate it if someone could suggest a concise way of filtering the large data frame by column name rather than by row names/numbers. I would prefer a solution in Python, but R or bash is also welcome.
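
To make the filtering I have in mind concrete, something like the pandas call below is roughly what I am picturing (the file names are just placeholders), though I am not sure how well it behaves on a 20GB file:

    import pandas as pd

    # Placeholder file names; 'target_columns.txt' holds one target column name per line.
    with open('target_columns.txt') as f:
        wanted = set(line.strip() for line in f)

    # A callable passed to usecols keeps only the columns whose names are on the list,
    # so only the ~1200 selected columns are stored in memory.
    subset = pd.read_csv('big_table.txt', sep='\t',
                         usecols=lambda name: name in wanted)
    subset.to_csv('subset_table.txt', sep='\t', index=False)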


Solution

  • Assuming the column you want is the 5th column, just process the file one line at a time as shown below. Since the file is tab-delimited, split each line on tabs rather than commas.

    output = list()
    with open('yourfile.txt', 'r') as _f:
        for line in _f:
            # Keep the 5th tab-separated field (index 4) of each row.
            output.append(line.split('\t')[4])
    print(output)
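
  • If the goal is instead to keep every column whose name appears in your saved list rather than a single position, the same one-line-at-a-time idea can be extended by reading the header row first. Here is a minimal sketch, assuming a tab-delimited file and a placeholder 'target_columns.txt' that holds one wanted column name per line; it streams the 20000 x 1200 subset to a new file without ever loading the whole table into memory.

    import csv

    # Placeholder file names; adjust to the real paths.
    with open('target_columns.txt') as f:
        wanted = set(line.strip() for line in f)

    with open('yourfile.txt', newline='') as src, \
         open('subset.txt', 'w', newline='') as dst:
        reader = csv.reader(src, delimiter='\t')
        writer = csv.writer(dst, delimiter='\t')

        # Work out which column indices to keep from the header row.
        header = next(reader)
        keep = [i for i, name in enumerate(header) if name in wanted]
        writer.writerow([header[i] for i in keep])

        # Stream the remaining rows, writing only the selected columns,
        # so memory use stays at one row at a time.
        for row in reader:
            writer.writerow([row[i] for i in keep])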