I have a CSV file with tweets having 4 columns (user_Id, status, tweet_Id, tweet_text) and more than 50,000 rows. The first column user_id has 4 unique Id's which are repeated throughout the column. The second column status is binary classification having either 0 or 1 for each tweet. The third column is tweet Id and fourth being the text of a tweet. For the first column.
The input file is already sorted based on two columns, first the tweet_id and then user_id.The file looks like this:
Sr#, user_id, status, tweet_id, tweet_text
1, 3712, 1, 444567, It is not easy to to do this you know...
2, 3713, 0, 444567, It is not easy to to do this you know...
3, 3714, 1, 444567, It is not easy to to do this you know...
4, 3715, 1, 444567, It is not easy to to do this you know...
5, 3712, 1, 444572, The process is yet to start
6, 3713, 0, 444572, The process is yet to start
7, 3714, 0, 444572, The process is yet to start
8, 3712, 1, 444580, I am betting on this
9, 3714, 0, 444580, I am betting on this
10, 3715, 0, 444580, I am betting on this
and so on.......
If we observe the first 4 rows, user_id value is different but tweet_id and text are the same. Similarly for rows no. 4, 5 and 6, user_id is different but tweet_id and text is the same.
I have to write a new CSV file in which for each tweet_id and text, all user id's of the first column (in this example 4) are created as new columns and for each user id column, the value of classification for that tweet which is status column is written under the new id column. If there is no status value for a worker_id, then the status value of that user_id is left blank.
The output file may look like this.
Sr#, tweet_text, tweet_id, 3712, 3713, 3714, 3715
1, It is not easy to to do this you know..., 444567, 1, 0, 1, 1
2, The process is yet to start, 444572, 1, 0, 0,
3, I am betting on this, 444580, 1, 0, 0
I tried the idea that whenever tweet_id changes, the tweet_id, tweet_text and four unique Id's status are written into new file. The code i used it given below:
import csv
import pandas as pd
with open('combined_csvFinalSortedClean2.csv', 'w', newline='') as csvfile:
filewriter = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
filewriter.writerow(['tweet_id','tweet_text', '3712', '3713', '3714', '3714'])
df = pd.read_csv('combined_csvFinalSortedClean2.csv', sep=',', header=None, index_col=False)
with open("combined_csvFinalSorted2.csv", "r", encoding="utf-8") as csv_file:
reader = csv.reader(csv_file, delimiter=',')
header = next(reader) # get header
curr_tweet=0
curr_wid=0
count=0
for row in reader:
wid=row[0]
id=row[2]
if (curr_tweet!=id) and (curr_wid!=wid):
curr_tweet=id
curr_wid=wid
count=1
df[0]=id
df[1]=row[3]
if wid==3712:
df[2]=row[2]
else:
df[2] = None
if wid==3713:
df[3]=row[2]
else:
df[3]= None
if wid==3714:
df[4]=row[2]
else:
df[4] = None
if wid==3715:
df[5]=row[2]
else:
df[5] = None
df.to_csv('output_file.csv', sep=',', encoding='utf-8', index=False)
count+=1
#else:
#None
#count+=1
I tried it but problem is that pandas' to_csv module is only writing the last row into the new output file and it's not writing anything into the four unique Id's columns based on given if...else conditions. I will appreciate some help.
Thanks.
Here's a way to do using pivot_table
:
newdf = (pd
.pivot_table(df,
index=['tweet_id','tweet_text'],
columns=['user_id'],
values='status',
fill_value=0)
.reset_index()
.rename({'user_id': 'sr'}))
print(newdf)
user_id tweet_id tweet_text 3712 \
0 444567 It is not easy to to do this you know... 1
1 444572 The process is yet to start 1
2 444580 I am betting on this 1
user_id 3713 3714 3715
0 0 1 1
1 0 0 0
2 0 0 0