Search code examples
pythondataframepreprocessor

Create DataFrame from raw input


I am getting data as follows:-

$0011:0524-08-2021
$0021:0624-08-2021
&0011:0724-08-2021
&0021:0924-08-2021
$0031:3124-08-2021
&0031:3224-08-2021
$0041:3924-08-2021
&0041:3924-08-2021
$0012:3124-08-2021
&0012:3324-08-2021

In $0011:0524-08-2021, $ denotes start of string, 001 denotes ID, 1:05 denotes time, 24-08-2021 denotes date. Similarly &0011:0624-08-2021 everything is same except & denotes end of string.

Taking the above data I want to create a data frame as follows:-

1. $0011:0524-08-2021   &0011:0724-08-2021
2. $0021:0624-08-2021   &0021:0924-08-2021
3. $0031:3124-08-2021   &0031:3224-08-2021
4. $0041:3924-08-2021   &0041:3924-08-2021
5. $0012:3124-08-2021   &0012:3324-08-2021

Basically I want to sort the entries into a data frame as shown above. There are few conditions that must be satisfied in doing so.

1.) Column1 should have only entries of $ and Column2 should have only & entries.

2.) Both the columns should be arranged in increasing order of time. Column1 with $ entries should be arranged in increasing order of time and same goes for column2 with & entries.


Solution

  • If you're getting the lines as you shown in your example, you can try:

    import pandas as pd
    
    
    def process_lines(lines):
        buffer = {}
        for line in map(str.strip, lines):
            id_ = line[1:4]
            if line[0] == "$":
                buffer[id_] = line
            elif line[0] == "&" and buffer.get(id_):
                yield buffer[id_], line
                del buffer[id_]
    
    
    txt = """$0011:0524-08-2021
    $0021:0624-08-2021
    &0011:0724-08-2021
    &0021:0924-08-2021
    $0031:3124-08-2021
    &0031:3224-08-2021
    $0041:3924-08-2021
    &0041:3924-08-2021
    $0012:3124-08-2021
    &0012:3324-08-2021"""
    
    df = pd.DataFrame(process_lines(txt.splitlines()), columns=["A", "B"])
    print(df)
    

    Prints:

                        A                   B
    0  $0011:0524-08-2021  &0011:0724-08-2021
    1  $0021:0624-08-2021  &0021:0924-08-2021
    2  $0031:3124-08-2021  &0031:3224-08-2021
    3  $0041:3924-08-2021  &0041:3924-08-2021
    4  $0012:3124-08-2021  &0012:3324-08-2021