Search code examples
pythonpandaslistfiledirectory

Using os.listdir(os.path.join) with pandas series to obtain list of file in the folder from a variable


Task

In excel I have in cell A1=numero PPAS, A2= 1973.01, A3=1975.01, A4=1975.02 I use the cells A2, A3, A4 that are name of are folder "1973.01", "1975.01", "1975.02". I use them to access to the directories F:/Comune/Breggia_test/1973.01, F:/Comune/Breggia_test/1975.01, F:/Comune/Breggia_test/1975.02. For every directory I want the list of the files.

import pandas as pd
df = pd.read_excel (r'P:/Breggia_Tresa_ZP_test.xlsx')
y=df['numero PPAS']
print(y)

the result is the following:

0 1973.01

1 1975.01

2 1975.02

Name: numero PPAS, dtype: float64

Next step I transform the series into string and I remove the disturbing indexes (0, 1, 2) before the cell values.

for index, value in y.items():
    z=f" {index} : {value}"
    k=z[-7:]
    print(k)

The result is the following and it is a string (confirmed by type function not shown):

1973.01

1975.01

1975.02

I know that os.path.join accept only string and now it should be ok because of the above for loop with items function. Now I want to obtain three list of the files in 1973.01 (first iteration), 1975.01 (second iteration), 1975.01 (third iteration).

for item in k:
    item=os.listdir(os.path.join('F:/Comune/Breggia_test', k) )
    print(item)

but unfortunately the result is the list of F:\Comune\Breggia_test\1975.02 repeated for seven times, the same number of the caracter of the string created with k=z[-7:]:

['apm_19761129.pdf', 'apcst_19780823.pdf', 'apada_19771213.pdf']

['apm_19761129.pdf', 'apcst_19780823.pdf', 'apada_19771213.pdf']

['apm_19761129.pdf', 'apcst_19780823.pdf', 'apada_19771213.pdf']

['apm_19761129.pdf', 'apcst_19780823.pdf', 'apada_19771213.pdf']

['apm_19761129.pdf', 'apcst_19780823.pdf', 'apada_19771213.pdf']

['apm_19761129.pdf', 'apcst_19780823.pdf', 'apada_19771213.pdf']

['apm_19761129.pdf', 'apcst_19780823.pdf', 'apada_19771213.pdf']

The wished result had to be three list that came from the followin directories:

F:\Comune\Breggia_test\1973.01

F:\Comune\Breggia_test\1975.01

F:\Comune\Breggia_test\1975.02

Could someone explain what does not work?


Solution

  • I don't know if I understood what you're trying to do, but here's how you can take a Pandas Series, combine it with a base path, and list all directories inside:

    
    from pathlib import Path
    import pandas as pd
    
    
    # Path we'll be using as common base path.
    base_path = Path(r'/content/sample_data')
    
    # Our initial dataset. We'll be using `Pandas.Series`, and `pandas.DataFrame` common operation called:
    
    
    # `.asype` at the end of the next code block represents the conversion into strings.
    y = pd.Series([1973.01, 1975.01, 1975.02], name='PPAS').astype(str)
    
    
    

    Now, choose one of the following code, depending on what you want to retrieve.

    Option 1: Retrieve only the immediate files and directories

    Code:

    
    base_path = Path(r'/content/sample_data')
    
    list_of_subdirs = y.astype(str).apply(
        lambda value: [
            str(file) for file in base_path.joinpath(value).glob('*')]
    ).to_list()
    
    

    In my case, it returns:

    
    [['/content/sample_data/1973.01/README.md'],
     ['/content/sample_data/1975.01/california_housing_test.csv',
      '/content/sample_data/1975.01/1975.01',
      '/content/sample_data/1975.01/.ipynb_checkpoints'],
     ['/content/sample_data/1975.02/california_housing_train.csv']]
    
    

    Option 2: Retrieve only the immediate files

    
    base_path = Path(r'/content/sample_data')
    list_of_subdirs = y.astype(str).apply(
        lambda value: [str(file) for file in base_path.joinpath(value).glob('*') if file.is_file()]
    ).to_list()
    list_of_subdirs
    
    

    In my case, it returns:

    
    [['/content/sample_data/1973.01/README.md'],
     ['/content/sample_data/1975.01/california_housing_test.csv'],
     ['/content/sample_data/1975.02/california_housing_train.csv']]
    
    

    Option 3: Retrieve all subdirectories recursively

    
    base_path = Path(r'/content/sample_data')
    list_of_subdirs = y.astype(str).apply(
        lambda value: [str(file) for file in base_path.joinpath(value).glob('**/*')]
    ).to_list()
    list_of_subdirs
    
    

    In my case, it returns:

    
    [['/content/sample_data/1973.01/README.md'],
     ['/content/sample_data/1975.01/california_housing_test.csv',
      '/content/sample_data/1975.01/1975.01',
      '/content/sample_data/1975.01/.ipynb_checkpoints',
      '/content/sample_data/1975.01/1975.01/mnist_test.csv'],
     ['/content/sample_data/1975.02/california_housing_train.csv']]
    
    

    Option 4: Retrieve only files from all subdirectories

    
    base_path = Path(r'/content/sample_data')
    list_of_subdirs = y.astype(str).apply(
        lambda value: [str(file) for file in base_path.joinpath(value).glob('**/*') if file.is_file()]
    ).to_list()
    list_of_subdirs
    
    

    In my case, it returns:

    
    [
        ['/content/sample_data/1973.01/README.md'],
        [
            '/content/sample_data/1975.01/california_housing_test.csv',
            '/content/sample_data/1975.01/1975.01/mnist_test.csv'
        ],
        ['/content/sample_data/1975.02/california_housing_train.csv']
    ]
    
    

    For some additional context, here's a tree view of all the subdirectories:

    sample_data
    ├── 1973.01
    │   └── README.md
    ├── 1975.01
    │   ├── 1975.01
    │   │   └── mnist_test.csv
    │   └── california_housing_test.csv
    ├── 1975.02
    │   └── california_housing_train.csv
    ├── anscombe.json
    └── mnist_train_small.csv