Tags: python, pandas, dataframe, zip

How to load downloaded and unzipped text files into a pandas dataframe?


The following code downloads and unzips a file containing thousands of text files:

import io
import zipfile

import requests

zip_file_url = "https://docsia-temp.s3-sa-east-1.amazonaws.com/docsia-desafio-dataset.zip"
res = requests.get(zip_file_url, stream=True)  # request the data
print("downloading...")
z = zipfile.ZipFile(io.BytesIO(res.content))
print("extracting the data")
z.extractall("./")
print("ok..")

How can these files be loaded into a pandas dataframe?


Solution

    • See the inline comments in the code.
    • The code uses the pathlib module to find the files that have been unzipped.
    • There are 20 article types, which means there are 20 keys in the dictionary of dataframes, dd.
    • The value of each key is a dataframe, which contains all the articles for that article type.
      • Each dataframe has 1000 rows, 1 row for each article.
    • In total, there are 20000 articles.
    • This implementation preserves the shape of each article.
      • When a row is printed from the dataframe, the article is readable, with newlines and punctuation intact.
    • To create a single dataframe from the individual dataframes:
      • dfc = pd.concat(dd.values()).reset_index(drop=True)
      • This is why the 'type' column is added when initially creating the dataframes: in a combined dataframe, the article type remains identifiable.
    • This answers the question of how to load all the files into a dataframe.
    • For further questions about processing the text, open a new question.
    from pathlib import Path
    from io import BytesIO
    import requests
    import pandas as pd
    from collections import defaultdict
    from zipfile import ZipFile
    
    ######################################################################
    # download and save zipped files
    
    # location to save files; this creates a pathlib object of the path, and pathlib objects have methods like rglob, parts, and is_file
    save_path = Path('data/zipped')
    
    zip_file_url = "https://docsia-temp.s3-sa-east-1.amazonaws.com/docsia-desafio-dataset.zip"
    res = requests.get(zip_file_url, stream=True)
    
    with ZipFile(BytesIO(res.content), 'r') as zip_ref:
        zip_ref.extractall(save_path)
    ######################################################################
    
    # find all the files; the methods in this list comprehension are pathlib methods
    files = [file for file in save_path.rglob('*') if file.is_file()]
    
    # dict to save dataframes for each file
    dd = defaultdict(list)
    for file in files:
        
        # extract the type of article from the path
        article_type = file.parts[-2].replace('.', '_')
        
        # open the file
        with file.open(mode='r', encoding='utf-8', errors='ignore') as f:
            # read the non-blank lines and combine them into one string inside a list
            data = [' '.join([line for line in f.readlines() if line.strip()])]
            
        # create a dataframe from data
        df = pd.DataFrame(data, columns=['article'])
        
        # add a column for the article type
        df['type'] = article_type
        
        # add the dataframe to the default dict
        dd[article_type].append(df.copy())
    
    # each value of the dict is a list of dataframes, iterate through all keys and create a single dataframe for each key
    for k, v in dd.items():
        # for each article type, combine all of its dataframes into a single dataframe
        dd[k] = pd.concat(v).reset_index(drop=True)
    
    print(dd.keys())
    [out]:
    dict_keys(['alt_atheism', 'comp_graphics', 'comp_os_ms-windows_misc', 'comp_sys_ibm_pc_hardware', 'comp_sys_mac_hardware', 'comp_windows_x', 'misc_forsale', 'rec_autos', 'rec_motorcycles', 'rec_sport_baseball', 'rec_sport_hockey', 'sci_crypt', 'sci_electronics', 'sci_med', 'sci_space', 'soc_religion_christian', 'talk_politics_guns', 'talk_politics_mideast', 'talk_politics_misc', 'talk_religion_misc'])
    
    # print the first article for the alt_atheism key
    print(dd['alt_atheism'].iloc[0, 0])
    [out]:
    Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:49960 alt.atheism.moderated:713 news.answers:7054 alt.answers:126
     Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!bb3.andrew.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!magnus.acs.ohio-state.edu!usenet.ins.cwru.edu!agate!spool.mu.edu!uunet!pipex!ibmpcug!mantis!mathew
     From: mathew <mathew@mantis.co.uk>
     Newsgroups: alt.atheism,alt.atheism.moderated,news.answers,alt.answers
     Subject: Alt.Atheism FAQ: Atheist Resources
     Summary: Books, addresses, music -- anything related to atheism
     Keywords: FAQ, atheism, books, music, fiction, addresses, contacts
     Message-ID: <19930329115719@mantis.co.uk>
     Date: Mon, 29 Mar 1993 11:57:19 GMT
     Expires: Thu, 29 Apr 1993 11:57:19 GMT
     Followup-To: alt.atheism
     Distribution: world
     Organization: Mantis Consultants, Cambridge. UK.
     Approved: news-answers-request@mit.edu
     Supersedes: <19930301143317@mantis.co.uk>
     Lines: 290
     Archive-name: atheism/resources
    ...
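The per-type dataframes in dd can be combined into one dataframe with pd.concat, as noted in the bullets above. A minimal sketch, using a small hypothetical two-key stand-in for dd (the real dict has 20 keys with 1000 rows each):

```python
import pandas as pd

# hypothetical stand-in for dd: two article types, two articles each
dd = {
    'alt_atheism': pd.DataFrame({'article': ['a1', 'a2'], 'type': 'alt_atheism'}),
    'sci_space': pd.DataFrame({'article': ['s1', 's2'], 'type': 'sci_space'}),
}

# combine all per-type dataframes; reset_index gives a clean 0..n-1 index
dfc = pd.concat(dd.values()).reset_index(drop=True)

print(dfc.shape)  # (4, 2)

# the 'type' column keeps the article type identifiable after combining
print(dfc['type'].value_counts())
```

With the real dd, dfc would have 20000 rows, and dfc['type'].value_counts() would show 1000 articles per type.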