I have a text dataset whose content looks like this:
.I 1\n.T\nPreliminary Report-International Algebraic Language\n.B\nCACM December,
.I 2\n.T\nExtraction of Roots by,5\t3\n .I 3\n.T\nTechniquI 4\n.T\nGlossary of Computer
Here is a description of the dataset:
.I 1, .I 2, .I 3 are the document IDs, and the rest of the text is the content of each document. The task is to create a list of tuples: [(doc_id, content)]. Any help or suggestions would be highly appreciated!
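FILENAME = "your filename"
lst = []
# Read the whole file and split it into lines; "with" closes the file afterwards.
with open(FILENAME) as f:
    lines = f.read().splitlines()
# Assumes each record spans exactly three lines: id, content, then one line that is skipped.
for i in range(0, len(lines) - 1, 3):
    lst.append((lines[i], lines[i + 1]))
print(lst)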
->>>[('.i 1', 'lipsum lipsum lipsum lipsum B. lipsum lipsum '), ('.i 2', 'lipsum lipsum '), ('.i 3', 'lipsum lipsum lipsum lipsum')]
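The snippet above relies on every record taking exactly three lines. The sample in the question has records of varying length (.T, .B and other field lines), so a variant that starts a new document whenever it sees a ".I <number>" line may be more robust. This is only a sketch: the function name parse_documents, the regular expression, and the choice to keep the field markers inside the content are assumptions, not something stated in the question.

```python
import re

# Hypothetical helper: splits the text on ".I <number>" marker lines.
ID_LINE = re.compile(r"\s*\.I\s+(\d+)\s*$")

def parse_documents(text):
    """Return a list of (doc_id, content) tuples, one per '.I <n>' marker."""
    docs = []
    doc_id = None
    body = []
    for line in text.splitlines():
        match = ID_LINE.match(line)
        if match:
            # A new marker closes the previous document, if there was one.
            if doc_id is not None:
                docs.append((doc_id, "\n".join(body)))
            doc_id = int(match.group(1))
            body = []
        elif doc_id is not None:
            # Everything after a marker belongs to that document.
            body.append(line)
    # Flush the last document.
    if doc_id is not None:
        docs.append((doc_id, "\n".join(body)))
    return docs

sample = ".I 1\n.T\nPreliminary Report\n.B\nCACM December\n.I 2\n.T\nExtraction of Roots\n"
print(parse_documents(sample))
```

To run it on a file, read the file into a string first (for example with open(FILENAME) as f: text = f.read()) and pass that string to parse_documents. Lines that appear before the first marker are ignored.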