Search code examples
pythonpandasdataframedata-extraction

Extracting a row value from one file, and putting that value to another row in another file (with filename corresponds to row of the previous file)


I have a list of CSV filenames (which is in another CSV file called CSV_file_1). However, I want to put additional two columns in CSV_file_1, in which the row values will come from the thousands of individual CSV files.

CSV_file_1 contents are as follows:

1.csv
2.csv
3.csv

In the thousands of files I have in another folder, it contains values that I want to put in CSV_file_1. Example, 1.csv contains the following rows:

LATITUDE : ;13.63345
LONGITUDE : ;123.207083  

2.csv contains the following rows:

LATITUDE : ;13.11111
LONGITUDE : ;123.22222 

3.csv contains the following rows:

LATITUDE : ;13.22222
LONGITUDE : ;123.11111 

and so on.

The result that I want to have for CSV_file_1 is as follows:

FILENAME:      LATITUDE:     LONGITUDE:
    1.csv      13.63345      123.207083
    2.csv      13.11111      123.22222
    3.csv      13.22222      123.11111

I already managed to have my CSV_file_1 but without the LATITUDE AND LONGITUDE yet (which will come from individual files which are delimited as shown above).

My code is like this:

import pandas as pd

import glob

print(glob.glob("D:/2021/*.csv")) 

#list of all the filenames collated and put in CSV_file_1
CSV_file_1 = pd.DataFrame(glob.glob("D:/2021/*.csv")) 


 #creating blank columns in CSV_file_1
CSV_file_1 ['Latitude'] = ""
CSV_file_1 ['Longitude'] = ""

#here im trying to access each file in the given folder(file name must correspond to the row in CSV_file_1), extract the data (latitude and longitude) and copy it to CSV_file_1
 import csv
 with open('D:/2021/*.csv','rt')as file:
      data = csv.reader(file)
      for row in file:
            if glob.glob("D:/2021/*.csv") = CSV_file_1['FILENAME']:
                CSV_file_1.iloc[i] ['LATITUDE:'] ==file.iloc[i]
        
        
        
    CSV_file_1.to_csv('D:/2021/CSV_file_1.csv', index = False)

but I get invalid syntax.

 if glob.glob("D:/2021/*.csv") = CSV_file_1['FILENAME']:
            ^
SyntaxError: invalid syntax

I am a python newbie so I would like to seek help to fix my code.


Solution

  • If I understand your problem correctly I think your approach is a little bit complex. I implemented a script that is creating the desired output.

    First, the CSV file with the names of the other files is read directly into the first column of the data frame. Then, the file names are used to extract the longitude and latitude from each file. For this, I created a function, which you can see in the first part of the script. In the end, I add the extracted values to the data frame and store it in a file in the desired format.

    import pandas as pd
    import csv
    
    # Function that takes 
    def get_lati_and_long_from_csv(csv_path):
        with open(csv_path,'rt') as file:
            # Read csv file content to list of rows
            data = list(csv.reader(file, delimiter =';'))
            
            # Take values from row zero and one
            latitude = data[0][1]
            longitude = data[1][1]
          
            
            return (latitude, longitude)
    
    def main():      
        # Define path of first csv file
        csv_file_1_path = "CSV_file_1.csv"
    
        # Read data frame from csv file and create correct column name
        CSV_file_1 = pd.read_csv(csv_file_1_path, header=None)
        CSV_file_1.columns = ['FILENAME:']
        
        # Create list of files to read the coordinates
        list_of_csvs = list(CSV_file_1['FILENAME:'])
    
        # Define empty lists to add the coordinates
        lat_list = []
        lon_list = []
        
        # Iterate over all csv files and extract longitude and latitude
        for csv_path in list_of_csvs:
            lat, lon = get_lati_and_long_from_csv(csv_path)
            lat_list.append(lat)
            lon_list.append(lon)
            
        # Add coordinates to the data frame
        CSV_file_1['Latitude:'] = lat_list
        CSV_file_1['Longitude:'] = lon_list
     
        # Save final data frame to csv file
        CSV_file_1.to_csv(csv_file_1_path+'.out', index = False, sep='\t')
        
    if __name__ == "__main__":
        main()
    

    Test input file content:

    1.csv
    2.csv
    3.csv
    

    Test output file content:

    FILENAME:   Latitude:   Longitude:
    1.csv   13.63345    123.207083  
    2.csv   13.11111    123.22222 
    3.csv   13.22222    123.11111
    

    EDIT: If your files do not contain any other data, I would suggest simplifying things and removing pandas as it is not needed. The following main() function produces the same result but uses only the CSV module.

    def main():      
        # Define path of first csv file
        csv_file_1_path = "CSV_file_1.csv"
    
        # Read file to list containing the paths of the other csv files
        with open(csv_file_1_path,'rt') as file:
            list_of_csvs = file.read().splitlines()
            
        print(list_of_csvs)
        # Define empty lists to add the coordinates
        lat_list = []
        lon_list = []
        
        # Iterate over all csv files and extract longitude and latitude
        for csv_path in list_of_csvs:
            lat, lon = get_lati_and_long_from_csv(csv_path)
            lat_list.append(lat)
            lon_list.append(lon)
        
        # Combine the three different lists to create the rows of the new csv file
        data = list(zip(list_of_csvs, lat_list, lon_list))
        
        # Create the headers and combine them with the other rows
        rows = [['FILENAME:', 'Latitude:', 'Longitude:']]
        rows.extend(data)
        
        # Write everything to the final csv file
        with open(csv_file_1_path + '.out','w') as file:
            csv_writer = csv.writer(file, dialect='excel', delimiter='\t')
            csv_writer.writerows(rows)