python, dataframe, text-mining

Extract specific numbers from txt files and insert them into a data frame


Good evening, I need help extracting two numbers from each line of a text file and inserting them into a dataframe.

This is my txt file:

PING 10.0.12.100 (10.0.12.100) 56(84) bytes of data.
64 bytes from 10.0.12.100: icmp_seq=1 ttl=59 time=0.094 ms
64 bytes from 10.0.12.100: icmp_seq=2 ttl=59 time=0.070 ms
64 bytes from 10.0.12.100: icmp_seq=3 ttl=59 time=0.076 ms
64 bytes from 10.0.12.100: icmp_seq=4 ttl=59 time=0.075 ms
64 bytes from 10.0.12.100: icmp_seq=5 ttl=59 time=0.070 ms
64 bytes from 10.0.12.100: icmp_seq=6 ttl=59 time=0.060 ms
64 bytes from 10.0.12.100: icmp_seq=7 ttl=59 time=0.093 ms
64 bytes from 10.0.12.100: icmp_seq=8 ttl=59 time=0.080 ms
64 bytes from 10.0.12.100: icmp_seq=9 ttl=59 time=0.077 ms
64 bytes from 10.0.12.100: icmp_seq=10 ttl=59 time=0.082 ms
64 bytes from 10.0.12.100: icmp_seq=11 ttl=59 time=0.070 ms
64 bytes from 10.0.12.100: icmp_seq=12 ttl=59 time=0.075 ms
64 bytes from 10.0.12.100: icmp_seq=13 ttl=59 time=0.087 ms
64 bytes from 10.0.12.100: icmp_seq=14 ttl=59 time=0.069 ms
64 bytes from 10.0.12.100: icmp_seq=15 ttl=59 time=0.072 ms
64 bytes from 10.0.12.100: icmp_seq=16 ttl=59 time=0.079 ms
64 bytes from 10.0.12.100: icmp_seq=17 ttl=59 time=0.096 ms
64 bytes from 10.0.12.100: icmp_seq=18 ttl=59 time=0.071 ms

--- 10.0.12.100 ping statistics ---
18 packets transmitted, 18 received, 0% packet loss, time 17429ms
rtt min/avg/max/mdev = 0.060/0.077/0.096/0.013 ms

I would like a dataframe with two columns, ICMP_SEQ and TIME, with one row per ping reply.

This is my code:

import pandas as pd

df = pd.DataFrame(columns=["ICMP_SEQ", "TIME"])

with open("/content/H11H22_ping.txt", "r") as f:
  txt = f.read() 
  print(txt)

  # code to extract icmp_seq and time goes here

  df = df.append({"ICMP_SEQ": icmp_seq, "TIME": time}, ignore_index=True)

Thanks


Solution

  • Use str.extract with named capture groups; each group becomes a column of the result:

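    import pandas as pd
    # Load the log lines into a single 'logs' column, skipping the PING header line
    df = pd.read_csv('/content/H11H22_ping.txt', skiprows=1, header=None, names=['logs'])
    # Named groups become the column names; lines that do not match yield NaN
    res = df['logs'].str.extract(r'icmp_seq=(?P<icmp_seq>\d+)\b.+\btime=(?P<time>\d+\.\d+)', expand=True)
    print(res)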
    

    Output (partial)

       icmp_seq   time
    0         1  0.094
    1         2  0.070
    2         3  0.076
    3         4  0.075
    4         5  0.070
    5         6  0.060
    6         7  0.093
    7         8  0.080
    8         9  0.077
    9        10  0.082
    10       11  0.070
    11       12  0.075
    12       13  0.087
    13       14  0.069
    14       15  0.072
    15       16  0.079
    16       17  0.096
    17       18  0.071
    ...
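
Note that str.extract returns the captured text as strings, and lines that do not match (the blank line and the statistics lines) come back as NaN rows. A minimal sketch of one way to tidy that up, assuming the same regular expression as above: it builds the Series from the raw lines instead of read_csv, drops the non-matching rows, and converts the columns to numbers. The function name parse_ping_log and the shortened sample text are made up for illustration.

```python
import pandas as pd

# Same pattern as in the answer above: named groups become column names.
PATTERN = r'icmp_seq=(?P<icmp_seq>\d+)\b.+\btime=(?P<time>\d+\.\d+)'


def parse_ping_log(text):
    """Return one row per ping reply with numeric icmp_seq and time columns.

    parse_ping_log is a hypothetical helper name, not part of pandas.
    """
    lines = pd.Series(text.splitlines(), dtype="object")
    res = lines.str.extract(PATTERN, expand=True)
    # Lines without a match (header, blank line, statistics) are all-NaN rows.
    res = res.dropna().reset_index(drop=True)
    # The captured groups are strings; convert them to numeric dtypes.
    return res.astype({"icmp_seq": int, "time": float})


# Shortened sample taken from the log in the question.
sample = """PING 10.0.12.100 (10.0.12.100) 56(84) bytes of data.
64 bytes from 10.0.12.100: icmp_seq=1 ttl=59 time=0.094 ms
64 bytes from 10.0.12.100: icmp_seq=2 ttl=59 time=0.070 ms

--- 10.0.12.100 ping statistics ---
18 packets transmitted, 18 received, 0% packet loss, time 17429ms
rtt min/avg/max/mdev = 0.060/0.077/0.096/0.013 ms
"""

df = parse_ping_log(sample)
print(df)
```

For the file from the question, the text could be read with open("/content/H11H22_ping.txt").read() and passed to the same function. With numeric columns, calls such as df["time"].mean() work directly.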