Search code examples
pythonpandascsvheader

pandas read .csv from url, starting row has fewer header


I want to download a .csv file from this website(to directly download the csv, here). The problem I'm facing is, the row where i want to start importing has fewer columns than than rows in the later part, and I just cannot figure out how to read into pandas.

Indeed, this csv file is quite not beautiful.

enter image description here

Here is how I want to import the csv in pandas:

  1. Ignore the first row where there are "Trade Date"

  2. Separate data frame between sections(using for loop, separate wherever there is a blank row)

  3. Store JPX Code(such as 16509005) and Instrument(such as FUT_TOPIX_2009) in additional columns.

  4. Set headers ['institutions_sell_code', 'institutions_sell', 'institutions_sell_eng', 'amount_sell', 'institutions_buy_code', 'institutions_buy', 'institutions_buy_eng', 'amount_buy', 'JPX_code', 'instrument']

So the expected outcome will be:

enter image description here

Here is my try. I first tried to read the whole data into pandas:

import io
import pandas as pd
import requests
url = 'https://www.jpx.co.jp/markets/derivatives/participant-volume/nlsgeu000004vd5b-att/20200730_volume_by_participant_whole_day_J-NET.csv'
s=requests.get(url).content
colnames = ['institutions_sell_code', 'institutions_sell', 'institutions_sell_eng', 'amount_sell', 'institutions_buy_code', 'institutions_buy', 'institutions_buy_eng', 'amount_buy']
df=pd.read_csv(io.StringIO(s.decode('utf-8')), header=1, names = colnames)

ParserError: Error tokenizing data. C error: Expected 2 fields in line 6, saw 8

I assume this is because the header=1 has just two columns whereas other rows have eight. In fact when I set header=2 to exclude JPX Code and Instrument, it works. So how can I include the row with JPX Code and Instrument?

enter image description here


Solution

  • Pandas does not really support multiple documents in one CSV file like you have. What I have done to solve this, which worked fine, takes two steps:

    1. Call read_csv(usecols=[0]) once to read the leftmost column. Use this to determine where each table starts and ends.
    2. Open the file using open() just once, and for each table determined in step 1, call read_csv(skiprows=SKIP, nrows=ROWS) with appropriate values to read one table at a time. This is the key: by only letting Pandas read the properly rectangular rows, it will not become angry at the unhygienic nature of your CSV file.

    Opening the file just once is an optimization, to avoid scanning it over and over every time you execute step 2. You can actually use the same opened file object for step 1 as well, if you seek(0) to return to the beginning before beginning step 2.