I have a text file that I am trying to categorize based on the number of lines between each 'START' line and the following 'END /' line. The input file's structure:
START
Action1
Action2
Action3
END /
START
Action1
END /
START
Action1
Action2
END /
START
Action0
Action1
Action2
Action3
END /
START
Action1
END /
The code should count the lines between 'START' and 'END /' and categorize each block as follows: 'P1' if there is exactly one action line, 'P2' if there is more than one.
So the output for the input file shown above would be:
['P2', 'P1', 'P2', 'P2', 'P1']
The end goal is to export this output list into an Excel column (as shown below). I believe this can be done with the pandas library, but any other suggestions would be appreciated.
Category
P2
P1
P2
P2
P1
Initially I was able to print the line number of every line in the file, so I was also thinking of working from the line numbers. However, that idea has a major flaw, since the number of action lines varies from block to block.
with open('filepath.txt') as f:
    for index, line in enumerate(f):
        print("Line {}: {}".format(index, line.strip()))
Output of this initial, flawed idea:
Line 0:
Line 1: A
Line 2: Action1
Line 3: Action2
Line 4: Action3
Line 5: B
Line 6:
Line 7: A
Line 8: Action1
Line 9: B
Line 10:
Line 11: A
Line 12: Action1
Line 13: Action1
Line 14: B
Line 15:
Line 16: A
Line 17: Action0
Line 18: Action1
Line 19: Action2
Line 20: Action3
Line 21: B
Then I came up with the idea of detecting the opening (START) and closing (END) patterns, counting the lines in between, and using an if/else statement to assign the P1 or P2 category. I am currently stuck on how to count the lines within the pattern.
Any help with the code will be helpful, thank you!
If the file's contents are exactly as shown in your question, then the following code should work.
import pandas as pd

result = []
fp = 'your_file.txt'  # change this
with open(fp) as file:
    file_content = file.read().splitlines()

count = 0
# this is the logic you were after:
for item in file_content:
    if item.strip() == 'START':
        count = 0
    elif item.strip() == 'END /':
        if count <= 1:
            result.append('P1')
        else:
            result.append('P2')
    else:
        count += 1

print(result)
dataframe = pd.DataFrame(result, columns=['Category'])
# Note: Pandas module needs openpyxl module installed for this next step
dataframe.to_excel('excel.xlsx', index=False)
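The printed output in the question suggests the real file may contain blank lines between blocks. As a hedged variant of the same counting logic, here is a minimal sketch wrapped in a function that ignores blank lines and any text outside a START ... END / pair. The function name `categorize_blocks` and the choice to skip blocks with no action lines are assumptions for illustration, not something stated in the question.

```python
def categorize_blocks(lines, start="START", end="END /"):
    """Return 'P1' for blocks with exactly one action line, 'P2' for more."""
    categories = []
    count = None  # None means we are outside a START ... END / block
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines anywhere in the file
        if line == start:
            count = 0
        elif line == end:
            # Assumption: a block with no action lines gets no category.
            if count is not None and count > 0:
                categories.append("P1" if count == 1 else "P2")
            count = None
        elif count is not None:
            count += 1  # an action line inside the current block
    return categories


# The sample input from the question, as a list of lines.
sample = """START
Action1
Action2
Action3
END /
START
Action1
END /
START
Action1
Action2
END /
START
Action0
Action1
Action2
Action3
END /
START
Action1
END /""".splitlines()

print(categorize_blocks(sample))  # ['P2', 'P1', 'P2', 'P2', 'P1']
```

To use it on a file, pass the open file object directly, e.g. `categorize_blocks(open(fp))`, and hand the returned list to `pd.DataFrame(..., columns=['Category'])` as above.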