Using os.listdir(os.path.join) with pandas series to obtain list of file in the folder from a variable

Task

In excel I have in cell A1=numero PPAS, A2= 1973.01, A3=1975.01, A4=1975.02 I use the cells A2, A3, A4 that are name of are folder "1973.01", "1975.01", "1975.02". I use them to access to the directories F:/Comune/Breggia_test/1973.01, F:/Comune/Breggia_test/1975.01, F:/Comune/Breggia_test/1975.02. For every directory I want the list of the files.

import pandas as pd
df = pd.read_excel (r'P:/Breggia_Tresa_ZP_test.xlsx')
y=df['numero PPAS']
print(y)

the result is the following:

0 1973.01

1 1975.01

2 1975.02

Name: numero PPAS, dtype: float64

Next step I transform the series into string and I remove the disturbing indexes (0, 1, 2) before the cell values.

for index, value in y.items():
    z=f" {index} : {value}"
    k=z[-7:]
    print(k)

The result is the following and it is a string (confirmed by type function not shown):

1973.01

1975.01

1975.02

I know that os.path.join accept only string and now it should be ok because of the above for loop with items function. Now I want to obtain three list of the files in 1973.01 (first iteration), 1975.01 (second iteration), 1975.01 (third iteration).

for item in k:
    item=os.listdir(os.path.join('F:/Comune/Breggia_test', k) )
    print(item)

but unfortunately the result is the list of F:\Comune\Breggia_test\1975.02 repeated for seven times, the same number of the caracter of the string created with k=z[-7:]:

['apm_19761129.pdf', 'apcst_19780823.pdf', 'apada_19771213.pdf']

The wished result had to be three list that came from the followin directories:

F:\Comune\Breggia_test\1973.01

F:\Comune\Breggia_test\1975.01

F:\Comune\Breggia_test\1975.02

Could someone explain what does not work?

Solution

I don't know if I understood what you're trying to do, but here's how you can take a Pandas Series, combine it with a base path, and list all directories inside:


from pathlib import Path
import pandas as pd


# Path we'll be using as common base path.
base_path = Path(r'/content/sample_data')

# Our initial dataset. We'll be using `Pandas.Series`, and `pandas.DataFrame` common operation called:


# `.asype` at the end of the next code block represents the conversion into strings.
y = pd.Series([1973.01, 1975.01, 1975.02], name='PPAS').astype(str)

Now, choose one of the following code, depending on what you want to retrieve.

Option 1: Retrieve only the immediate files and directories

Code:


base_path = Path(r'/content/sample_data')

list_of_subdirs = y.astype(str).apply(
    lambda value: [
        str(file) for file in base_path.joinpath(value).glob('*')]
).to_list()

In my case, it returns:


[['/content/sample_data/1973.01/README.md'],
 ['/content/sample_data/1975.01/california_housing_test.csv',
  '/content/sample_data/1975.01/1975.01',
  '/content/sample_data/1975.01/.ipynb_checkpoints'],
 ['/content/sample_data/1975.02/california_housing_train.csv']]

Option 2: Retrieve only the immediate files


base_path = Path(r'/content/sample_data')
list_of_subdirs = y.astype(str).apply(
    lambda value: [str(file) for file in base_path.joinpath(value).glob('*') if file.is_file()]
).to_list()
list_of_subdirs

In my case, it returns:


[['/content/sample_data/1973.01/README.md'],
 ['/content/sample_data/1975.01/california_housing_test.csv'],
 ['/content/sample_data/1975.02/california_housing_train.csv']]

Option 3: Retrieve all subdirectories recursively


base_path = Path(r'/content/sample_data')
list_of_subdirs = y.astype(str).apply(
    lambda value: [str(file) for file in base_path.joinpath(value).glob('**/*')]
).to_list()
list_of_subdirs

In my case, it returns:


[['/content/sample_data/1973.01/README.md'],
 ['/content/sample_data/1975.01/california_housing_test.csv',
  '/content/sample_data/1975.01/1975.01',
  '/content/sample_data/1975.01/.ipynb_checkpoints',
  '/content/sample_data/1975.01/1975.01/mnist_test.csv'],
 ['/content/sample_data/1975.02/california_housing_train.csv']]

Option 4: Retrieve only files from all subdirectories


base_path = Path(r'/content/sample_data')
list_of_subdirs = y.astype(str).apply(
    lambda value: [str(file) for file in base_path.joinpath(value).glob('**/*') if file.is_file()]
).to_list()
list_of_subdirs

In my case, it returns:


[
    ['/content/sample_data/1973.01/README.md'],
    [
        '/content/sample_data/1975.01/california_housing_test.csv',
        '/content/sample_data/1975.01/1975.01/mnist_test.csv'
    ],
    ['/content/sample_data/1975.02/california_housing_train.csv']
]

For some additional context, here's a tree view of all the subdirectories:

sample_data
├── 1973.01
│   └── README.md
├── 1975.01
│   ├── 1975.01
│   │   └── mnist_test.csv
│   └── california_housing_test.csv
├── 1975.02
│   └── california_housing_train.csv
├── anscombe.json
└── mnist_train_small.csv