I want to download a .csv file from this website(to directly download the csv, here). The problem I'm facing is, the row where i want to start importing has fewer columns than than rows in the later part, and I just cannot figure out how to read into pandas.
Indeed, this csv file is quite not beautiful.
Here is how I want to import the csv in pandas:
Ignore the first row where there are "Trade Date"
Separate data frame between sections(using for loop, separate wherever there is a blank row)
Store JPX Code(such as 16509005) and Instrument(such as FUT_TOPIX_2009) in additional columns.
Set headers ['institutions_sell_code', 'institutions_sell', 'institutions_sell_eng', 'amount_sell', 'institutions_buy_code', 'institutions_buy', 'institutions_buy_eng', 'amount_buy', 'JPX_code', 'instrument']
So the expected outcome will be:
Here is my try. I first tried to read the whole data into pandas:
import io
import pandas as pd
import requests
url = 'https://www.jpx.co.jp/markets/derivatives/participant-volume/nlsgeu000004vd5b-att/20200730_volume_by_participant_whole_day_J-NET.csv'
s=requests.get(url).content
colnames = ['institutions_sell_code', 'institutions_sell', 'institutions_sell_eng', 'amount_sell', 'institutions_buy_code', 'institutions_buy', 'institutions_buy_eng', 'amount_buy']
df=pd.read_csv(io.StringIO(s.decode('utf-8')), header=1, names = colnames)
ParserError: Error tokenizing data. C error: Expected 2 fields in line 6, saw 8
I assume this is because the header=1 has just two columns whereas other rows have eight. In fact when I set header=2
to exclude JPX Code and Instrument, it works. So how can I include the row with JPX Code and Instrument?
Pandas does not really support multiple documents in one CSV file like you have. What I have done to solve this, which worked fine, takes two steps:
read_csv(usecols=[0])
once to read the leftmost column. Use this to determine where each table starts and ends.open()
just once, and for each table determined in step 1, call read_csv(skiprows=SKIP, nrows=ROWS)
with appropriate values to read one table at a time. This is the key: by only letting Pandas read the properly rectangular rows, it will not become angry at the unhygienic nature of your CSV file.Opening the file just once is an optimization, to avoid scanning it over and over every time you execute step 2. You can actually use the same opened file object for step 1 as well, if you seek(0)
to return to the beginning before beginning step 2.