Tags: python, pandas, dataframe, zip

How to load downloaded and unzipped text files into a pandas dataframe?


The following code downloads and unzips a file containing thousands of text files:

import io
import zipfile

import requests

zip_file_url = "https://docsia-temp.s3-sa-east-1.amazonaws.com/docsia-desafio-dataset.zip"
res = requests.get(zip_file_url, stream=True)  # request the data
print("downloading...")
z = zipfile.ZipFile(io.BytesIO(res.content))
print("extracting the data")
z.extractall("./")
print("ok..")

How can these files be loaded into a pandas dataframe?


Solution

    • See the inline comments in the code.
    • The code uses the pathlib module to find the files that have been unzipped.
    • There are 20 article types, which means there are 20 keys in the dictionary of dataframes, dd.
    • The value of each key is a dataframe, which contains all the articles for that article type.
      • Each dataframe has 1000 rows, 1 row for each article.
    • In total, there are 20000 articles.
    • This implementation preserves the shape of each article.
      • When a row is printed from the dataframe, the article is readable, with newlines and punctuation intact.
    • To create a single dataframe from the individual dataframes:
      • dfc = pd.concat(dd.values()).reset_index(drop=True)
      • This is why the 'type' column is added when initially creating the dataframes: in a combined dataframe, the article type remains identifiable.
    • This answers the question of how to load all the files into a dataframe.
    • For further questions about processing the text, open a new question.
    from pathlib import Path
    from io import BytesIO
    import requests
    import pandas as pd
    from collections import defaultdict
    from zipfile import ZipFile
    
    ######################################################################
    # download and save zipped files
    
    # location to save files; this creates a pathlib object of the path, and pathlib objects have methods like rglob, parts, and is_file
    save_path = Path('data/zipped')
    
    zip_file_url = "https://docsia-temp.s3-sa-east-1.amazonaws.com/docsia-desafio-dataset.zip"
    res = requests.get(zip_file_url, stream=True)
    
    with ZipFile(BytesIO(res.content), 'r') as zip_ref:
        zip_ref.extractall(save_path)
    ######################################################################
    
    # find all the files; the methods in this list comprehension are pathlib methods
    files = [file for file in save_path.rglob('*') if file.is_file()]
    
    # dict to save dataframes for each file
    dd = defaultdict(list)
    for file in files:
        
        # extract the type of article from the path
        article_type = file.parts[-2].replace('.', '_')
        
        # open the file
        with file.open(mode='r', encoding='utf-8', errors='ignore') as f:
            # read the non-blank lines and combine them into one string inside a list
            data = [' '.join([line for line in f.readlines() if line.strip()])]
            
        # create a dataframe from data
        df = pd.DataFrame(data, columns=['article'])
        
        # add a column for the article type
        df['type'] = article_type
        
        # add the dataframe to the default dict
        dd[article_type].append(df.copy())
    
    # each value of the dict is a list of dataframes, iterate through all keys and create a single dataframe for each key
    for k, v in dd.items():
        # for each article type, combine all of its dataframes into a single dataframe
        dd[k] = pd.concat(v).reset_index(drop=True)
    
    print(dd.keys())
    [out]:
    dict_keys(['alt_atheism', 'comp_graphics', 'comp_os_ms-windows_misc', 'comp_sys_ibm_pc_hardware', 'comp_sys_mac_hardware', 'comp_windows_x', 'misc_forsale', 'rec_autos', 'rec_motorcycles', 'rec_sport_baseball', 'rec_sport_hockey', 'sci_crypt', 'sci_electronics', 'sci_med', 'sci_space', 'soc_religion_christian', 'talk_politics_guns', 'talk_politics_mideast', 'talk_politics_misc', 'talk_religion_misc'])
    
    # print the first article for the alt_atheism key
    print(dd['alt_atheism'].iloc[0, 0])
    [out]:
    Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:49960 alt.atheism.moderated:713 news.answers:7054 alt.answers:126
     Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!bb3.andrew.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!magnus.acs.ohio-state.edu!usenet.ins.cwru.edu!agate!spool.mu.edu!uunet!pipex!ibmpcug!mantis!mathew
     From: mathew <mathew@mantis.co.uk>
     Newsgroups: alt.atheism,alt.atheism.moderated,news.answers,alt.answers
     Subject: Alt.Atheism FAQ: Atheist Resources
     Summary: Books, addresses, music -- anything related to atheism
     Keywords: FAQ, atheism, books, music, fiction, addresses, contacts
     Message-ID: <19930329115719@mantis.co.uk>
     Date: Mon, 29 Mar 1993 11:57:19 GMT
     Expires: Thu, 29 Apr 1993 11:57:19 GMT
     Followup-To: alt.atheism
     Distribution: world
     Organization: Mantis Consultants, Cambridge. UK.
     Approved: news-answers-request@mit.edu
     Supersedes: <19930301143317@mantis.co.uk>
     Lines: 290
     Archive-name: atheism/resources
    ...
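The per-type dataframes in dd can be combined into one dataframe with pd.concat, as noted in the bullets above. A minimal sketch, using a small hypothetical two-key stand-in for dd (the real dict has 20 keys with 1000 rows each):

```python
import pandas as pd

# hypothetical stand-in for dd: two article types, two articles each
dd = {
    'alt_atheism': pd.DataFrame({'article': ['a1', 'a2'], 'type': 'alt_atheism'}),
    'sci_space': pd.DataFrame({'article': ['s1', 's2'], 'type': 'sci_space'}),
}

# combine all per-type dataframes; reset_index gives a clean 0..n-1 index
dfc = pd.concat(dd.values()).reset_index(drop=True)

print(dfc.shape)  # (4, 2)

# the 'type' column keeps the article type identifiable after combining
print(dfc['type'].value_counts())
```

With the real dd, dfc would have 20000 rows, and dfc['type'].value_counts() would show 1000 articles per type.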