python arrays pandas numpy genfromtxt

Convert genfromtxt array to regular numpy array


I can't post the data being imported because there is too much of it, but it has both numeric and string fields, with 5543 rows and 137 columns. I import the data with this code (ndnames and ndtypes hold the column names and column data types):

npArray2 = np.genfromtxt(fileName, 
                        delimiter="|", 
                        skip_header=1, 
                        dtype=(ndtypes), 
                        names=ndnames, 
                        usecols=np.arange(0,137)
                        )

This works and the resulting variable type is "void7520" with size (5543,). But this is really a 1D array of 5543 rows, where each element holds a sub-array that has 137 elements. I want to convert this into a normal numpy array of 5543 rows and 137 columns. How can this be done?

I have tried the following (using Pandas):

pdArray = pd.read_csv(fileName, 
                      sep=ndelimiter,
                      index_col=False, 
                      skiprows=1,
                      names=ndnames
                      )
# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() replaces it.
npArray = pdArray.to_numpy()

The resulting npArray has dtype object and shape (5543, 137), which at first looks promising. But because it is an object array, there are functions that can't be performed on it. Can this object array be converted into a normal numpy array?

Edit: ndtypes look like... [int,int,...,int,'|U50',int,...,int,'|U50',int,...,int] That is, 135 number fields with two string-type fields in the middle somewhere.


Solution

  • npArray2 is a 1d structured array with 5543 elements and 137 fields.

    What does npArray2.dtype look like? Equivalently, what is ndtypes? The dtype is built from the types and names that you provided. "void7520" identifies the record type of this array, but it tells us little beyond the record size: dtype.name reports that size in bits, so 7520 means 940 bytes per record.

    If all fields of the dtype are numeric, and better yet all the same numeric dtype (int or float), then it is fairly easy to convert the array to 2d with 137 columns as the second dimension. astype and view can be used.
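For the all-numeric case described above, a minimal sketch might look like the following. The three-field dtype and the sample values are hypothetical stand-ins for the real 137-column data. It shows two routes: the `structured_to_unstructured` helper from `numpy.lib.recfunctions` (available in NumPy 1.16 and later), and a plain `view`, which is only safe under the assumptions noted in the comments.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical stand-in for an all-numeric genfromtxt result:
# 3 records, each with fields "a", "b", "c" of the same integer type.
dt = np.dtype([("a", "i8"), ("b", "i8"), ("c", "i8")])
structured = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)], dtype=dt)

# Helper route: NumPy builds the 2D array, casting fields of
# differing numeric types to a common one if necessary.
arr2d = rfn.structured_to_unstructured(structured)

# view route: only valid because every field has the same dtype and
# the record has no padding; it reinterprets the same memory.
arr2d_view = structured.view("i8").reshape(len(structured), -1)

print(arr2d.shape)
print(arr2d_view.shape)
```

The helper is the more forgiving choice when the numeric fields are not all the same type, since `view` would then misinterpret the bytes.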

    (edit - it has both number and string fields - you can't convert it to a 2d array of numbers; it could be an array of strings, but you can't do numeric math on strings.)

    But if the dtypes are mixed then you can't convert it. All elements of a 2d array have to be the same dtype. You have to use the structured array approach if you want mixed types. (Well, there is dtype=object, but let's not go there.)
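One possible middle ground, sketched here under assumptions, is to keep the structured array but pull only the numeric fields out into a regular 2D array, leaving the string fields accessible by name. The field names and values below are hypothetical; the real array would have 135 numeric fields and two `U50` fields.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical miniature of the mixed layout: ints plus one string field.
dt = np.dtype([("id", "i4"), ("name", "U50"), ("x", "i4"), ("y", "i4")])
rec = np.array([(1, "foo", 10, 20), (2, "bar", 30, 40)], dtype=dt)

# Names of the fields whose dtype is numeric (signed/unsigned int or float).
num_names = [n for n in rec.dtype.names if rec.dtype[n].kind in "iuf"]
str_names = [n for n in rec.dtype.names if n not in num_names]

# Multi-field indexing selects the numeric fields; the helper then
# turns that selection into an ordinary 2D numeric array.
numeric2d = rfn.structured_to_unstructured(rec[num_names])

# String columns stay accessible by name on the original array.
names = rec["name"]

print(numeric2d.shape, numeric2d.dtype)
print(str_names)
```

The resulting `numeric2d` supports normal numeric operations, while row `i` of it still lines up with `rec[i]` for looking up the string fields.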

    Actually, pandas is going the object route. Evidently it thinks the only way to make one array from this data is to let each element be its own type. And the math of object arrays is severely limited. They are, in effect, a glorified (or debased) list.
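On the pandas side, a similar split can be sketched: converting the whole mixed frame yields an object array, while converting only the numeric columns yields a regular numeric array. The in-memory sample text and its column names are hypothetical and stand in for the real pipe-delimited file.

```python
import io
import pandas as pd

# Hypothetical pipe-delimited sample standing in for the real file.
text = "id|name|x|y\n1|foo|10|20\n2|bar|30|40\n"
df = pd.read_csv(io.StringIO(text), sep="|")

# The whole frame mixes numbers and strings, so it becomes dtype=object.
as_object = df.to_numpy()

# Only the numeric columns: a regular 2D integer array.
numeric = df.select_dtypes(include="number").to_numpy()

print(as_object.dtype)
print(numeric.shape, numeric.dtype)
```

This does not make the object array itself numeric; it sidesteps the problem by leaving the string columns in the DataFrame.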