Search code examples
pythonpandasdataframecsvdata-files

Reading data-file with new header in the middle of the file


I have a .txt data-file which contains a number of columns with different headers. I can read the file with all columns and rows. However, my problem is that the file contains an additional header with three columns, appended at the last row of the data from the initial header. How do I separate the three last columns from the first ones? Additionally I would like to remove the first of the three columns as it is a replica of the first column and append the other two columns column-wise to the columns at the top of the file. I have used pandas to read the file like this:

c = pd.read_csv('C:\filepath.txt', sep=',',header=None,names=['<Title1>','<Title2>','<Title3>','<Title4>','<Title5>','<Title6>','<Title7>','<Title8>','<Title9>','<Title10>','<Title11>','<Title12>'],skiprows=[0,1])

And the result is:

                         <Title1>  ... <Title12>
134849000   -0.420384078515376  ...    244.507248
135016000   -0.406915327374619  ...    244.507248
135183000   -0.406915327374619  ...    244.507248
135349000   -0.406915327374619  ...    244.507248
135516000   -0.406915327374619  ...    244.507248
...                        ...  ...           ... <-- (somewhere in here there is a new header with three columns)
2316226000   0.349323222511261  ...           NaN
2316393000   0.359268272664523  ...           NaN
2316560000   0.346797179431672  ...           NaN
2316726000   0.291363936474923  ...           NaN
2316893000   0.256587672540276  ...           NaN

[26188 rows x 12 columns] 

As can be seen the "4.th quandrant" (or column 4-12, row x, 1-indexed) of the data set contains NaN values because the three columns have been appended at last row of the first header and so they leave empty values because the file contains 12 columns from the top. In addition both headers have two lines in which the first is not needed, so I need to skip those lines.

Sample-file:

<Header1>
<Title1><Title2><Title3><Title4><Title5><Title6><Title7><Title8><Title9><Title10><Title11><Title12><Title13>
134849000,-0.420384078515376,-0.46532291072594,53.3941583535493,3.94861381238115,0.999999938482075,-0.000223083188831434,-0.000166347560402173,3.08661080398315E-06,304.11793518,274.23748016,189.97101594,244.50724792
135016000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999910346576,-0.000180534505822662,-0.000206991530844074,2.40981161937076E-06,304.0821228,274.15297698,189.97101594,244.50724792
135183000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999992511006,-0.000151940021895918,-0.000103313480817761,1.89050478219266E-06,304.0821228,274.15297698,189.97101594,244.50724792
135349000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999945135159,-0.000162536174319313,-7.40562207892995E-05,2.04948428941809E-06,304.0821228,274.15297698,189.97101594,244.50724792
135516000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.99999997640256,-0.000243086633501367,-6.9024988784798E-05,3.36047709420528E-06,304.0821228,274.15297698,189.97101594,244.50724792
135683000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.99999997640256,-0.000243086633501367,-6.9024988784798E-05,3.36047709420528E-06,304.0821228,274.15297698,189.97101594,244.50724792
135849000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999931122814,-0.000250245794219842,-0.000134729677676283,3.5093405085021E-06,304.0821228,274.15297698,189.97101594,244.50724792
136016000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999952747184,-0.000248275760427849,-0.000209879516698194,3.49816745295883E-06,304.0821228,274.15297698,189.97101594,244.50724792
136183000,-0.420384078515376,-0.46532291072594,53.3941583535493,3.94861381238115,0.99999992607031,-0.000294028840627048,-0.000210060717325711,4.25711234103981E-06,304.11793518,274.23748016,189.97101594,244.50724792
136349000,-0.420916391233475,-0.442738942185795,53.3941583535493,3.94861381238115,0.999999919180309,-0.00029795985581717,-0.000124844955889991,4.29227691325224E-06,304.11935424,274.17742156,189.97101594,244.50724792
136516000,-0.420384078515376,-0.46532291072594,53.3941583535493,3.94861381238115,0.999999888009148,-0.000316878274912839,-3.29402653026431E-05,4.57532859246546E-06,304.11793518,274.23748016,189.97101594,244.50724792
136683000,-0.420916391233475,-0.442738942185795,53.3941583535493,3.94861381238115,0.999999944701863,-0.000302288971167524,-0.000119271820769005,4.36801259359743E-06,304.11935424,274.17742156,189.97101594,244.50724792
136849000,-0.405802775661793,-0.444669714471277,53.3941583535493,3.94861381238115,0.999999944701863,-0.000302288971167524,-0.000119271820769005,4.36801259359743E-06,304.0791626,274.18255616,189.97101594,244.50724792
137016000,-0.420916391233475,-0.442738942185795,53.3941583535493,3.94861381238115,0.99999991055272,-0.00029252456348538,-0.000168782643050744,4.22385527217017E-06,304.11935424,274.17742156,189.97101594,244.50724792
137183000,-0.412309946883439,-0.450987020223235,53.3941583535493,3.94861381238115,0.999999942521442,-0.000255490185269549,-0.00024667166566595,3.6414759449141E-06,304.09646606,274.19935608,189.97101594,244.50724792
137349000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999876479583,-0.000264577733448331,-0.000298287883815869,3.80576077658318E-06,304.0821228,274.15297698,189.97101594,244.50724792
137516000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999903983449,-0.000251750438760731,-0.000355224963982992,3.60887866227011E-06,304.0821228,274.15297698,189.97101594,244.50724792
137683000,-0.391801749871831,-0.435460567656641,53.3941583535493,3.94861381238115,0.999999885967664,-0.000231035684436353,-0.000293282668086245,3.24666448882349E-06,304.04193116,274.1580658,189.97101594,244.50724792
137849000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999885967664,-0.000231035684436353,-0.000293282668086245,3.24666448882349E-06,304.0821228,274.15297698,189.97101594,244.50724792
<Header2>
<Title13(same as Title 1)><Title14><Title15>
134849000,0.120862187115588,0
135016000,0.171543242833847,0
135183000,0.146335932645973,0
135349000,0.09773669641824,0
135516000,0.0882672298282907,0
135683000,0.124406962864472,0
135849000,0.186013875486258,0
136016000,0.219045896500945,0
136183000,0.197246332120462,0
136349000,0.150083583561413,0
136516000,0.0838562129822536,0
136683000,0.00269632558524612,0
136849000,-0.0447052988191479,0
137016000,-0.00496292706410619,0
137183000,0.0799457149607322,0
137349000,0.137388731956788,0
137516000,0.142305654943302,0
137683000,0.115943857754048,0
137849000,0.0991913228381935,0

Solution

  • I managed to find a relatively robust and simple solution for this specific dataset.

    after reading the data, and skipping the first header:

    raw_data = pd.read_csv('C:datafile.txt', sep=',',header=None, skiprows=[0,1])
    

    I check the first column for non-numerical values to find out where the next header is located:

    a = pd.to_numeric(pd.to_numeric(raw_data[0], errors='coerce').isnull())
    

    result:

    0        False
    1        False
    2        False
    3        False
    4        False
             ...  
    26183    False
    26184    False
    26185    False
    26186    False
    26187    False
    Name: 0, Length: 26188, dtype: bool
    

    then I locate the indexes where the statement is true:

    a = np.where(a)[0]
    

    result:

    [13093 13094]
    

    from here I can simply index the data for the two headers, using the indexes:

    d = raw_data.iloc[:raw_data.index.get_loc(a[0])]
    e = raw_data.iloc[raw_data.index.get_loc(a[0])+2:, :3]
    

    in e I also make sure to index the columns as we only have three columns for the second header

    result:

    d =

                   0                   1   ...          11          12
    0       134849000  -0.420384078515376  ...  189.971016  244.507248
    1       135016000  -0.406915327374619  ...  189.971016  244.507248
    2       135183000  -0.406915327374619  ...  189.971016  244.507248
    3       135349000  -0.406915327374619  ...  189.971016  244.507248
    4       135516000  -0.406915327374619  ...  189.971016  244.507248
    ...           ...                 ...  ...         ...         ...
    13088  2316226000   -0.30945361835179  ...  188.914284  243.942856
    13089  2316393000    -0.4099956694033  ...  188.914284  243.942856
    13090  2316560000    -0.4099956694033  ...  188.914284  243.942856
    13091  2316726000    -0.4099956694033  ...  188.914284  243.942856
    13092  2316893000  -0.429752713005517  ...  188.914284  243.942856
    
    [13093 rows x 13 columns]
    

    e =

                    0                   1  2
    13095   134849000   0.120862187115588  0
    13096   135016000   0.171543242833847  0
    13097   135183000   0.146335932645973  0
    13098   135349000    0.09773669641824  0
    13099   135516000  0.0882672298282907  0
    ...           ...                 ... ..
    26183  2316226000   0.349323222511261  0
    26184  2316393000   0.359268272664523  0
    26185  2316560000   0.346797179431672  0
    26186  2316726000   0.291363936474923  0
    26187  2316893000   0.256587672540276  0
    
    [13093 rows x 3 columns]
    

    As both data sets have a common column (the first column for each header) I use merge to append the bottom data set to the top one:

    f = pd.merge(d,e, on=[0,0])
    

    result:

                    0                 1_x  ...                 1_y 2_y
    0       134849000  -0.420384078515376  ...   0.120862187115588   0
    1       135016000  -0.406915327374619  ...   0.171543242833847   0
    2       135183000  -0.406915327374619  ...   0.146335932645973   0
    3       135349000  -0.406915327374619  ...    0.09773669641824   0
    4       135516000  -0.406915327374619  ...  0.0882672298282907   0
    ...           ...                 ...  ...                 ...  ..
    13088  2316226000   -0.30945361835179  ...   0.349323222511261   0
    13089  2316393000    -0.4099956694033  ...   0.359268272664523   0
    13090  2316560000    -0.4099956694033  ...   0.346797179431672   0
    13091  2316726000    -0.4099956694033  ...   0.291363936474923   0
    13092  2316893000  -0.429752713005517  ...   0.256587672540276   0
    
    [13093 rows x 15 columns]
    

    And now I have the proper data set which I can save, defining my own headers with .to_csv!