I have the following code:
import os
from numpy import genfromtxt
nysedatafile = os.getcwd() + '/nyse.txt'
nysedata = genfromtxt(nysedatafile, delimiter='\t', names=True, dtype=None)
nasdaqdatafile = os.getcwd() + '/nasdaq.txt'
nasdaqdata = genfromtxt(nasdaqdatafile, delimiter='\t', names=True, dtype=None)
Now I would like to merge the data from the 2 CSVs and I tried various functions:
For example:
import numpy as np
alldata = np.array(np.concatenate((nysedata, nasdaqdata)))
print('NYSE stocks:' + str(nysedata.shape[0]))
print('NASDAQ stocks:' + str(nasdaqdata.shape[0]))
print('ALL stocks:' + str(alldata.shape[0]))
returns:
TypeError: invalid type promotion
I also tried numpy.vstack, and calling np.array on the result.
I expect the last print to give the sum of the rows of the two previous csv files.
EDIT: These commands:
print('NYSE shape:' + str(nysedata.shape))
print('NASDAQ shape:' + str(nasdaqdata.shape))
print('NYSE dtype:' + str(nysedata.dtype))
print('NASDAQ dtype:' + str(nasdaqdata.dtype))
returns:
NYSE shape:(3257,)
NASDAQ shape:(2719,)
NYSE dtype:[('Symbol', 'S14'), ('Name', 'S62'), ('LastSale', 'S9'), ('MarketCap', '<f8'), ('ADR_TSO', 'S3'), ('IPOyear', 'S4'), ('Sector', 'S21'), ('industry', 'S62'), ('Summary_Quote', 'S38')]
NASDAQ dtype:[('Symbol', 'S14'), ('Name', 'S62'), ('LastSale', 'S7'), ('MarketCap', '<f8'), ('ADR_TSO', 'S3'), ('IPOyear', 'S4'), ('Sector', 'S21'), ('industry', 'S62'), ('Summary_Quote', 'S34')]
np.vstack (or np.concatenate) raises an error because the dtypes of the two arrays do not match.
Notice the very last field: ('Summary_Quote', 'S38') versus ('Summary_Quote', 'S34'). nysedata's Summary_Quote column is 38 bytes long, while nasdaqdata's column is only 34 bytes long.
(Edit: The LastSale column has the same problem: 'S9' versus 'S7'.)
This happened because genfromtxt guesses the dtype of each column when dtype=None is passed. For string columns, genfromtxt determines the minimum number of bytes needed to contain all the strings in that column, so the width depends on the data in each file.
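A minimal sketch of this width inference, using two small made-up tables (the contents and variable names here are illustrative, not from the question). Passing encoding="utf-8" is an assumption about a reasonably recent NumPy; with it, string columns come back as unicode ('U') rather than bytes ('S'), but the width is still chosen per file in the same way:

```python
import io
import numpy as np

# Two tiny tab-separated tables with the same columns but different string lengths.
short_txt = "Symbol\tSector\nA\tTech\nBB\tEnergy\n"
long_txt = "Symbol\tSector\nCCCC\tHealth Care\nDD\tFinance\n"

short = np.genfromtxt(io.StringIO(short_txt), delimiter="\t",
                      names=True, dtype=None, encoding="utf-8")
long_ = np.genfromtxt(io.StringIO(long_txt), delimiter="\t",
                      names=True, dtype=None, encoding="utf-8")

# Same field names, but each string field is only as wide as the
# longest value seen in its own file, so the dtypes differ.
print(short.dtype)
print(long_.dtype)
print(short.dtype == long_.dtype)
```

The two dtypes share field names but not field widths, which is the same kind of mismatch as 'S38' versus 'S34' above.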
So to stack the two arrays, the smaller one has to be promoted to the larger one's dtype:
import numpy.lib.recfunctions as recfunctions
recfunctions.stack_arrays([nysedata,nasdaqdata.astype(nysedata.dtype)], usemask = False)
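As a small self-contained illustration of the cast-then-stack step, here is a sketch with two hand-built structured arrays standing in for nysedata and nasdaqdata (the field names, widths and values are made up for the example):

```python
import numpy as np
import numpy.lib.recfunctions as recfunctions

# "wide" plays the role of nysedata, "narrow" the role of nasdaqdata:
# same field names, but the Quote field is narrower in the second array.
wide = np.array([(b"IBM", b"long-quote")],
                dtype=[("Symbol", "S4"), ("Quote", "S10")])
narrow = np.array([(b"GE", b"short"), (b"F", b"q")],
                  dtype=[("Symbol", "S4"), ("Quote", "S5")])

# Cast the narrower array to the wider dtype, then stack.
stacked = recfunctions.stack_arrays([wide, narrow.astype(wide.dtype)],
                                    usemask=False)

print(stacked.shape)   # 1-dimensional, one entry per input row
print(stacked.dtype)   # the wider dtype
```

Casting in this direction loses nothing because every value in the narrower array fits in the wider fields; casting the other way would truncate strings.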
(My previous answer used np.vstack. This results in a 2-dimensional array of shape (N,1). recfunctions.stack_arrays returns a 1-dimensional array of shape (N,). Since nysedata and nasdaqdata are 1-dimensional, I think it is better to return a 1-dimensional array too.)
Possibly an easier solution would be to concatenate the two csv files first and then call genfromtxt:
import numpy as np
import os
cwd = os.getcwd()
nysedatafile = os.path.join(cwd, 'nyse.txt')
nasdaqdatafile = os.path.join(cwd, 'nasdaq.txt')
alldatafile = os.path.join(cwd, 'all.txt')
with open(nysedatafile) as f1, open(nasdaqdatafile) as f2, open(alldatafile, 'w') as g:
    # Copy the first file in full, including its header row
    for line in f1:
        g.write(line)
    # Skip the header row of the second file, then copy its data rows
    next(f2)
    for line in f2:
        g.write(line)
# genfromtxt now sees every row at once, so each column gets a single dtype
alldata = np.genfromtxt(alldatafile, delimiter='\t', names=True, dtype=None)
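If writing an intermediate all.txt is not wanted, the same idea can be sketched in memory, since genfromtxt also accepts a list of lines. The helper name read_merged below is hypothetical, and encoding="utf-8" is an assumption about a reasonably recent NumPy; the sketch also assumes every file starts with the same header row:

```python
import numpy as np

def read_merged(paths, delimiter="\t"):
    """Hypothetical helper: parse several files that share one header row as one table."""
    lines = []
    for i, path in enumerate(paths):
        with open(path) as f:
            header = next(f)
            if i == 0:
                # Keep the header row from the first file only
                lines.append(header)
            # Keep the data rows, skipping blank lines
            lines.extend(line for line in f if line.strip())
    # All rows are parsed together, so each column gets a single dtype
    return np.genfromtxt(lines, delimiter=delimiter, names=True,
                         dtype=None, encoding="utf-8")
```

Usage would be along the lines of alldata = read_merged([nysedatafile, nasdaqdatafile]).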