Search code examples
pythonpandasdataframeexport-to-csvkey-value

Convert a text file with different key value pairs to a csv file


I have a text file as shown below in the sample that I want to convert to a csv file (currently using Pandas).
The challenge is that I don't know in advance what the keys (column headers are) and in what order they are.
The final column order is not important.

Sample file:

name: john| dob: 10-06-1960| address: 4853 Radio Park Drive
name: jane| dob: 07-10-1973| address: 1537 Timbercrest Road| mobile: 706-289-6746
name: liam| dob: 12-08-1986| address: 4853 498 Fairmont Avenue| telephone number: 706-687-5021
name: chris| dob: 09-12-1965| address: 485 Green Avenue| state: California| Telephone Number: 510-855-5213

Desired output:

Name | dob        | address                  | mobile       | telephone number | state      |
-----+------------+--------------------------+--------------+------------------+------------+
john | 10-06-1960 | 4853 Radio Park Drive    |              |                  |            |
jane | 07-10-1973 | 1537 Timbercrest Road    | 706-289-6746 |                  |            |
liam | 12-08-1986 | 4853 498 Fairmont Avenue |              | 706-687-5021     |            |
chris| 09-12-1965 | 485 Green Avenue         |              | 510-855-5213     | California |

My Code:

import pandas as pd

df = pd.DataFrame()

file = open('D:\sample.log', 'r')
        
lines = file.readlines()

for line in lines:

    pairs = line.split('|')

    my_dict = {}

    for pair in pairs:
       key = pair.split(': ')[0].strip()
       value = pair.split(': ')[1].strip()
            
       my_dict[key] = value

       df.append(my_dict, ignore_index=True)

This way of appending is very slow. How can I make this faster.

Or is there a much better solution (for example via a json string)?


Solution

  • TL;DR:

    pd.DataFrame.from_records(
        dict(field.split(': ') for field in line.split('|'))
        for line in lines
    )
    

    Long Version

    Assuming you already split your data into lines you then need to process them into records such as:

    {' address': '4853 Radio Park Drive', ' dob': '10-06-1960', 'name': 'john'}
    

    Each line needs to be split into fields:

    >>> line = 'name: john| dob: 10-06-1960| address: 4853 Radio Park Drive'
    >>> line.split('|')
    ['name: john', ' dob: 10-06-1960', ' address: 4853 Radio Park Drive']
    

    Then each field needs to be split into the name of the column and the value itself:

    >>> field = 'name: John'
    >>> field.split(': ')
    ['name', 'john']
    

    Once you do this for every field in the line you end up with a list of these:

    >>> [field.split(': ') for field in line.split('|')]
    [['name', 'john'],
     [' dob', '10-06-1960'],
     [' address', '4853 Radio Park Drive']]
    

    A dictionary initialised with this list gets you the record from the beginning of the answer.

    Since you have many lines, you need to produce many records but it's better to produce these lazily, in other words using a generator:

    >>> (dict(field.split(': ') for field in line.split('|')) for line in s.split('\n'))
    <generator object <genexpr> at 0x7f0d06bf8dd0>
    

    Rather than producing you a whole list of records, the generator gives you one at a time when you iterate over it. This way you can start forming your dataframe without having to wait for all the records to be processed.

    There is a special syntax in Python called generator comprehension that let's you define generators to be passed as an argument to functions and constructors.

    Putting it all together, we construct a dataframe using the appropriate constructor (from_records) and the generator defined above:

    pd.DataFrame.from_records(
        dict(field.split(': ') for field in line.split('|'))
        for line in lines
    )
    

    This produces the following output:

        name         dob                   address        mobile  telephone number       state  Telephone Number
    0   john  10-06-1960     4853 Radio Park Drive           NaN               NaN         NaN               NaN
    1   jane  07-10-1973     1537 Timbercrest Road  706-289-6746               NaN         NaN               NaN
    2   liam  12-08-1986  4853 498 Fairmont Avenue           NaN      706-687-5021         NaN               NaN
    3  chris  09-12-1965          485 Green Avenue           NaN               NaN  California      510-855-5213
    

    As a bonus, you can speed this up further by reading the file lazily too. Define a custom generator for reading lines:

    def lines(path):
      with open(path) as file:
        while line := file.readline():
          yield line.rstrip()
    

    Note this will only work with Python 3.8+. Otherwise, instead of using the walrus operator you need to do this instead:

    def lines(path):
      with open(path) as file:
        while True:
          line = file.readline()
          if line:
            yield line.rstrip()
          else:
            return