Search code examples
pythonpandasstringbeautifulsoup

Convert string to dataframe after extracting using BeautifulSoup


import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
from io import StringIO

url = "https://www.tickertape.in/stocks/oil-and-natural-gas-corporation-ONGC"
r = requests.get(url=url)#,headers=headers)

soup = bs(r.content,'html5lib')
fin = soup.find_all(class_="financials-table-root")
for f in fin:
    str_data = f.text
    break

print(str_data)
df = pd.read_csv(StringIO(str_data))

print(df)

This is not giving me the desired result. I am not good with handing strings, please guide me a way to extract the values from the str_data, so that I can use it for further calculations.

Output of str_data:

Financial YearFY 2016FY 2017FY 2018FY 2019FY 2020FY 2021FY 2022FY 2023FY 2024TTMTotal Revenue1,33,084.682,95,230.293,33,135.154,32,783.974,06,772.743,15,162.695,00,202.156,41,531.266,57,654.976,57,751.89Raw Materials36,666.931,75,376.732,02,298.512,75,523.612,69,975.372,05,913.122,25,616.932,66,120.045,41,837.995,42,659.83Power & Fuel Cost1,182.381,957.932,177.502,139.732,078.921,980.442,239.162,174.44Employee Cost9,230.1715,128.1614,970.7215,850.5015,531.2614,135.1215,235.7414,898.79Selling & Administrative Expenses35,735.3136,378.9034,331.5839,621.4429,357.6623,004.0836,347.6144,365.47Operating & Other expenses10,068.00860.7212,038.3915,262.6036,742.869,385.991,34,092.392,38,475.37EBITDA40,201.8965,527.8567,318.4584,386.0953,086.6760,743.9486,670.3275,497.151,15,816.981,15,092.06Depreciation/Amortization16,384.0620,219.2023,111.9123,703.7026,634.8825,538.4726,883.1624,557.0628,762.7428,750.79PBIT23,817.8345,308.6544,206.5460,682.3926,451.7935,205.4759,787.1650,940.0987,054.2486,341.27Interest & Other Items3,765.583,591.114,999.045,836.737,489.345,079.035,696.047,889.3610,194.1710,194.17PBT20,052.2541,717.5439,207.5054,845.6618,962.4530,126.4454,091.1243,050.7376,860.0776,147.10Taxes & Other Items7,177.0417,298.2917,101.5824,299.628,158.8513,822.058,568.997,610.2827,638.6927,459.24Net Income12,875.2124,419.2522,105.9230,546.0410,803.6016,304.3945,522.1335,440.4549,221.3848,687.86EPS10.0319.0317.2324.048.5912.9636.1928.1739.1338.70DPS5.677.556.607.005.003.6010.5011.2512.2510.25Payout ratio0.560.400.380.290.580.280.290.400.310.26

Output of print(df):

Empty DataFrame
Columns: [Financial YearFY 2016FY 2017FY 2018FY 2019FY 2020FY 2021FY 2022FY 2023FY 2024TTMTotal Revenue1, 33, 084.682, 95, 230.293, 33.1, 135.154, 32, 783.974, 06, 772.743, 15, 162.695, 00, 202.156, 41, 531.266, 57, 654.976, 57.1, 751.89Raw Materials36, 666.931, 75, 376.732, 02, 298.512, 75.1, 523.612, 69, 975.372, 05, 913.122, 25, 616.932, 66, 120.045, 41.1, 837.995, 42, 659.83Power & Fuel Cost1, 182.381, 957.932, 177.502, 139.732, 078.921, 980.442, 239.162, 174.44Employee Cost9, 230.1715, 128.1614, 970.7215, 850.5015, 531.2614, 135.1215, 235.7414, 898.79Selling & Administrative Expenses35, 735.3136, 378.9034, 331.5839, 621.4429, 357.6623, 004.0836, 347.6144, 365.47Operating & Other expenses10, 068.00860.7212, 038.3915, 262.6036, 742.869, 385.991, 34, 092.392, 38, 475.37EBITDA40, 201.8965, 527.8567, 318.4584, 386.0953, 086.6760, 743.9486, 670.3275, 497.151, 15.1, 816.981, 15.2, 092.06Depreciation/Amortization16, 384.0620, 219.2023, 111.9123, 703.7026, 634.8825, 538.4726, 883.1624, 557.0628, 762.7428, 750.79PBIT23, 817.8345, 308.6544, 206.5460, 682.3926, 451.7935, ...]
Index: []

Trying to make a dataframe to use the values.


Solution

  • You can use read_html with fixing columns names by remove Unnamed columns names and filter by length of them:

    url = "https://www.tickertape.in/stocks/oil-and-natural-gas-corporation-ONGC"
    
    
    df = pd.read_html(url)[2]
    
    cols = [c for c in df.columns if not c.startswith('Unnamed')]
    out = df.iloc[:, :len(cols)].set_axis(cols, axis=1)
    

    print (out)
                           Financial Year    FY 2016    FY 2017    FY 2018    FY 2019    FY 2020    FY 2021    FY 2022    FY 2023    FY 2024        TTM
    0                       Total Revenue  133084.68  295230.29  333135.15  432783.97  406772.74  315162.69  500202.15  641531.26  657654.97  657751.89
    1                       Raw Materials   36666.93  175376.73  202298.51  275523.61  269975.37  205913.12  225616.93  266120.04  541837.99  542659.83
    2                   Power & Fuel Cost    1182.38    1957.93    2177.50    2139.73    2078.92    1980.44    2239.16    2174.44  541837.99  542659.83
    3                       Employee Cost    9230.17   15128.16   14970.72   15850.50   15531.26   14135.12   15235.74   14898.79  541837.99  542659.83
    4   Selling & Administrative Expenses   35735.31   36378.90   34331.58   39621.44   29357.66   23004.08   36347.61   44365.47  541837.99  542659.83
    5          Operating & Other expenses   10068.00     860.72   12038.39   15262.60   36742.86    9385.99  134092.39  238475.37  541837.99  542659.83
    6                              EBITDA   40201.89   65527.85   67318.45   84386.09   53086.67   60743.94   86670.32   75497.15  115816.98  115092.06
    7           Depreciation/Amortization   16384.06   20219.20   23111.91   23703.70   26634.88   25538.47   26883.16   24557.06   28762.74   28750.79
    8                                PBIT   23817.83   45308.65   44206.54   60682.39   26451.79   35205.47   59787.16   50940.09   87054.24   86341.27
    9              Interest & Other Items    3765.58    3591.11    4999.04    5836.73    7489.34    5079.03    5696.04    7889.36   10194.17   10194.17
    10                                PBT   20052.25   41717.54   39207.50   54845.66   18962.45   30126.44   54091.12   43050.73   76860.07   76147.10
    11                Taxes & Other Items    7177.04   17298.29   17101.58   24299.62    8158.85   13822.05    8568.99    7610.28   27638.69   27459.24
    12                         Net Income   12875.21   24419.25   22105.92   30546.04   10803.60   16304.39   45522.13   35440.45   49221.38   48687.86
    13                                EPS      10.03      19.03      17.23      24.04       8.59      12.96      36.19      28.17      39.13      38.70
    14                                DPS       5.67       7.55       6.60       7.00       5.00       3.60      10.50      11.25      12.25      10.25
    15                       Payout ratio       0.56       0.40       0.38       0.29       0.58       0.28       0.29       0.40       0.31       0.26
    

    For select by indicators is possible set first column to index:

    df = pd.read_html(url, index_col=0)[2]
    
    cols = [c for c in df.columns if not c.startswith('Unnamed')]
    out = df.iloc[:, :len(cols)].set_axis(cols, axis=1)
    

    print (out)
    
                                         FY 2016    FY 2017    FY 2018    FY 2019    FY 2020    FY 2021    FY 2022    FY 2023    FY 2024        TTM
    Financial Year                                                                                                                                 
    Total Revenue                      133084.68  295230.29  333135.15  432783.97  406772.74  315162.69  500202.15  641531.26  657654.97  657751.89
    Raw Materials                       36666.93  175376.73  202298.51  275523.61  269975.37  205913.12  225616.93  266120.04  541837.99  542659.83
    Power & Fuel Cost                    1182.38    1957.93    2177.50    2139.73    2078.92    1980.44    2239.16    2174.44  541837.99  542659.83
    Employee Cost                        9230.17   15128.16   14970.72   15850.50   15531.26   14135.12   15235.74   14898.79  541837.99  542659.83
    Selling & Administrative Expenses   35735.31   36378.90   34331.58   39621.44   29357.66   23004.08   36347.61   44365.47  541837.99  542659.83
    Operating & Other expenses          10068.00     860.72   12038.39   15262.60   36742.86    9385.99  134092.39  238475.37  541837.99  542659.83
    EBITDA                              40201.89   65527.85   67318.45   84386.09   53086.67   60743.94   86670.32   75497.15  115816.98  115092.06
    Depreciation/Amortization           16384.06   20219.20   23111.91   23703.70   26634.88   25538.47   26883.16   24557.06   28762.74   28750.79
    PBIT                                23817.83   45308.65   44206.54   60682.39   26451.79   35205.47   59787.16   50940.09   87054.24   86341.27
    Interest & Other Items               3765.58    3591.11    4999.04    5836.73    7489.34    5079.03    5696.04    7889.36   10194.17   10194.17
    PBT                                 20052.25   41717.54   39207.50   54845.66   18962.45   30126.44   54091.12   43050.73   76860.07   76147.10
    Taxes & Other Items                  7177.04   17298.29   17101.58   24299.62    8158.85   13822.05    8568.99    7610.28   27638.69   27459.24
    Net Income                          12875.21   24419.25   22105.92   30546.04   10803.60   16304.39   45522.13   35440.45   49221.38   48687.86
    EPS                                    10.03      19.03      17.23      24.04       8.59      12.96      36.19      28.17      39.13      38.70
    DPS                                     5.67       7.55       6.60       7.00       5.00       3.60      10.50      11.25      12.25      10.25
    Payout ratio                            0.56       0.40       0.38       0.29       0.58       0.28       0.29       0.40       0.31       0.26