Search code examples
pythonpandascsvtwitter

how to create and rearrange a new csv file based on existing file meeting certain conditions?


I have a CSV file with tweets having 4 columns (user_Id, status, tweet_Id, tweet_text) and more than 50,000 rows. The first column user_id has 4 unique Id's which are repeated throughout the column. The second column status is binary classification having either 0 or 1 for each tweet. The third column is tweet Id and fourth being the text of a tweet. For the first column.

The input file is already sorted based on two columns, first the tweet_id and then user_id.The file looks like this:

  Sr#,       user_id,     status,      tweet_id,                 tweet_text

   1,         3712,          1,         444567,       It is not easy to to do this you know...

   2,         3713,          0,         444567,       It is not easy to to do this you know...

   3,         3714,          1,         444567,       It is not easy to to do this you know...

   4,         3715,          1,         444567,       It is not easy to to do this you know...

   5,         3712,          1,         444572,       The process is yet to start

   6,         3713,          0,         444572,       The process is yet to start

   7,         3714,          0,         444572,       The process is yet to start

   8,         3712,          1,         444580,       I am betting on this

   9,         3714,          0,         444580,       I am betting on this

  10,         3715,          0,         444580,       I am betting on this

    and so on.......

If we observe the first 4 rows, user_id value is different but tweet_id and text are the same. Similarly for rows no. 4, 5 and 6, user_id is different but tweet_id and text is the same.

I have to write a new CSV file in which for each tweet_id and text, all user id's of the first column (in this example 4) are created as new columns and for each user id column, the value of classification for that tweet which is status column is written under the new id column. If there is no status value for a worker_id, then the status value of that user_id is left blank.

The output file may look like this.

Sr#,         tweet_text,                        tweet_id,    3712,    3713,    3714,   3715

1,    It is not easy to to do this you know...,  444567,       1,       0,       1,     1

2,    The process is yet to start,               444572,       1,       0,       0,

3,    I am betting on this,                      444580,       1,                0,     0

I tried the idea that whenever tweet_id changes, the tweet_id, tweet_text and four unique Id's status are written into new file. The code i used it given below:

 import csv
 import pandas as pd

 with open('combined_csvFinalSortedClean2.csv', 'w', newline='') as csvfile:
   filewriter = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
   filewriter.writerow(['tweet_id','tweet_text', '3712', '3713', '3714', '3714'])

 df = pd.read_csv('combined_csvFinalSortedClean2.csv', sep=',', header=None, index_col=False)

 with open("combined_csvFinalSorted2.csv", "r", encoding="utf-8") as csv_file:
   reader = csv.reader(csv_file, delimiter=',')
   header = next(reader) # get header
   curr_tweet=0
   curr_wid=0
   count=0

   for row in reader:
     wid=row[0]
     id=row[2]

     if (curr_tweet!=id) and (curr_wid!=wid):
      curr_tweet=id
      curr_wid=wid
      count=1
      df[0]=id
      df[1]=row[3]

     if wid==3712:
       df[2]=row[2]
     else: 
       df[2] = None

     if wid==3713:
       df[3]=row[2]
     else: 
       df[3]= None

     if wid==3714:
       df[4]=row[2]
     else: 
       df[4] = None

     if wid==3715:
       df[5]=row[2]
     else: 
       df[5] = None

     df.to_csv('output_file.csv', sep=',', encoding='utf-8', index=False)
     count+=1

     #else:
       #None
       #count+=1

I tried it but problem is that pandas' to_csv module is only writing the last row into the new output file and it's not writing anything into the four unique Id's columns based on given if...else conditions. I will appreciate some help.

Thanks.


Solution

  • Here's a way to do using pivot_table:

    newdf = (pd
            .pivot_table(df, 
                  index=['tweet_id','tweet_text'], 
                  columns=['user_id'], 
                  values='status', 
                  fill_value=0)
            .reset_index()
            .rename({'user_id': 'sr'}))
    
    print(newdf)
    
    user_id  tweet_id                                       tweet_text  3712  \
    0          444567         It is not easy to to do this you know...     1   
    1          444572                      The process is yet to start     1   
    2          444580                             I am betting on this     1   
    
    user_id  3713  3714  3715  
    0           0     1     1  
    1           0     0     0  
    2           0     0     0