I have a txt file with the following format (simplified):
date this that other
2007-05-25 11:00:00 10 20 30
2007-05-25 11:10:00 15 18 30
2007-05-25 11:20:00 10 27 30
2007-05-25 11:30:00 20 35 30
2007-05-25 11:50:00 30 20
2007-05-25 12:00:00 30 13
2007-05-25 12:10:00 30 13
The first row contains strings naming the columns below them. The first column is clearly a timestamp, and some rows are missing values. I do not want to delete the rows with missing values, since I want to do calculations on this data later. I thought I would import it with numpy.loadtxt:
data = numpy.loadtxt('data.txt')
This gives the error ValueError: could not convert string to float: b'date'
because of the header row. Using:
data = numpy.genfromtxt('data.txt')
gives the error Line #51028 (got 38 columns instead of 37)
for many lines, which is because some values are missing. What should I try?
Pandas is a NumPy-based library. Among many other things, it was made to work well with incomplete data.
You should be able to install pandas with a simple:
$ pip install pandas
I saved your example file under http://pastebin.com/NuNaTW9n and replaced the spaces between the columns with tabs.
>>> import pandas as pd
>>> from urllib.request import urlopen  # on Python 2, use `from urllib import urlopen`
>>> df = pd.read_csv(urlopen("http://pastebin.com/raw.php?i=NuNaTW9n"), sep='\t')
>>> df
date this that other
0 2007-05-25 11:00:00 10 20 30
1 2007-05-25 11:10:00 15 18 30
2 2007-05-25 11:20:00 10 27 30
3 2007-05-25 11:30:00 20 30 NaN
4 2007-05-25 11:50:00 30 20 NaN
5 2007-05-25 12:00:00 30 13 NaN
6 2007-05-25 12:10:00 30 13 NaN
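As an aside, read_csv can also parse the first column into real datetime values via its parse_dates argument. A small sketch, using an inline tab-separated sample in place of the pastebin file (the two rows here are just illustrative):

```python
import io
import pandas as pd

# Inline sample standing in for the pastebin file: tab-separated,
# with the trailing value missing on the second row.
raw = (
    "date\tthis\tthat\tother\n"
    "2007-05-25 11:00:00\t10\t20\t30\n"
    "2007-05-25 11:30:00\t20\t30\n"
)

# parse_dates turns the "date" column into datetime64 values,
# so you can later sort, filter, or resample by time.
df = pd.read_csv(io.StringIO(raw), sep="\t", parse_dates=["date"])
print(df.dtypes)
```

With a proper datetime column you can, for instance, select rows by time range instead of by string comparison.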
Once you have a handle on a data frame, you can start to explore your data:
>>> df["this"].sum()
145
>>> df["that"].mean()
20.142857142857142
>>> df[df["that"] < 20]["date"]
1 2007-05-25 11:10:00
5 2007-05-25 12:00:00
6 2007-05-25 12:10:00
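Since you want to keep the rows with missing values, note that pandas reductions like sum and mean skip NaN by default, and you can count or fill the gaps explicitly. A minimal sketch with an inline two-row sample (the column names come from your file):

```python
import io
import pandas as pd

raw = (
    "date\tthis\tthat\tother\n"
    "2007-05-25 11:00:00\t10\t20\t30\n"
    "2007-05-25 11:50:00\t30\t20\n"
)
df = pd.read_csv(io.StringIO(raw), sep="\t")

# Reductions ignore NaN by default: only the one present value counts.
print(df["other"].mean())

# Count how many values are missing in each column.
print(df.isna().sum())

# Replace the gaps without dropping any rows.
filled = df["other"].fillna(0)
```

This way no rows are erased; missing entries simply stay out of the statistics until you decide how to treat them.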
By default, pandas will try to guess the best data type for your values (e.g. it will guess that df["that"] should be an int64), but you can control this behavior by passing a dtype argument to read_csv.
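For example, a sketch with an inline file in place of the pastebin link; forcing the numeric columns to float64 keeps NaN representable even in columns that happen to have no gaps:

```python
import io
import pandas as pd

raw = (
    "date\tthis\tthat\tother\n"
    "2007-05-25 11:00:00\t10\t20\t30\n"
    "2007-05-25 12:00:00\t30\t13\n"
)

# Pass explicit dtypes instead of letting pandas guess.
df = pd.read_csv(io.StringIO(raw), sep="\t",
                 dtype={"this": "float64", "that": "float64"})
print(df["that"].dtype)
```

Without the dtype argument, pandas would infer int64 for "that" here, since that column has no missing values.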