I have a Matrix Market file that I need to use for text analysis.
The market file has the following structure:
%%MatrixMarket matrix coordinate integer general
2000 5000 23000
1 4300 1
1 2200 1
1 3000 1
1 600 1
The values in the second line indicate the number of rows, the number of columns, and the total number of non-zero values in the matrix. Every line after that contains 3 values: the row index, the column index, and the value of the entry, with indices starting from 1.
As suggested in many posts, I read this file using scipy.io.mmread and the new API for dealing with sparse data structures.
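To make the layout concrete, here is a minimal sketch that writes a tiny file in the same coordinate format and reads it back. The 3x4 matrix, its entries, and the name tiny.mtx are made up for illustration and are not taken from the real file. It shows that the 1-based indices in the file become 0-based positions once loaded.

```python
import os
import tempfile

from scipy.io import mmread

# Hypothetical 3x4 example in the same coordinate format as the file above:
# header line, then "rows cols nnz", then one "row col value" line per entry.
mtx_text = """%%MatrixMarket matrix coordinate integer general
3 4 3
1 4 1
1 2 1
3 1 5
"""

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tiny.mtx")  # illustrative file name
    with open(path, "w") as fh:
        fh.write(mtx_text)
    sparse = mmread(path)  # SciPy sparse matrix in coordinate (COO) form

dense = sparse.toarray()
# The entry written as "1 4 1" (1-based) ends up at dense[0, 3] (0-based).
print(dense)
```

So the off-by-one is only a matter of labels: the data itself is read correctly.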
In particular, I used the following code:
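import pandas as pd
from scipy.io import mmread

Matrix = mmread('file_name.mtx')  # sparse matrix read from the file
B = Matrix.todense()              # convert to a dense matrix
df = pd.DataFrame(B)
print(df.head())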
However, from this code I got a data frame indexed from 0:
0 1 2 3 4 5 6 7 8 9 ... 4872 \
0 1 0 1 0 0 0 0 0 1 0 ... 0
1 0 0 0 0 0 0 0 0 0 0 ... 0
2 0 0 0 0 0 0 0 0 0 0 ... 0
3 1 0 1 0 0 0 0 0 1 0 ... 0
4 0 0 1 0 0 0 0 0 0 0 ... 0
Ideally, I would like to preserve the indexing of the original Matrix Market file, with rows and columns indexed from 1.
Any ideas how to correct my code?
Thanks!
You can specify the index and the columns explicitly when constructing the DataFrame:
Matrix = (mmread('file_name.mtx'))
B = Matrix.todense()
df = pd.DataFrame(B, index=range(1, B.shape[0] + 1), columns=range(1, B.shape[1] + 1))
print(df.iloc[:5, :5])
1 2 3 4 5
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
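If the dense 2000 x 5000 array is larger than you would like, one possible alternative is to skip todense() and build a sparse-backed DataFrame with pd.DataFrame.sparse.from_spmatrix, which also accepts index and columns. The sketch below is an assumption-laden illustration rather than a drop-in replacement: it uses a small hand-built coo_matrix as a stand-in for the result of mmread, and the values are made up.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

# Stand-in for Matrix = mmread('file_name.mtx'); the entries are illustrative.
rows = np.array([0, 0, 2])   # 0-based positions, as SciPy stores them
cols = np.array([3, 1, 0])
vals = np.array([1, 1, 5])
Matrix = coo_matrix((vals, (rows, cols)), shape=(3, 4))

# Label rows and columns from 1 without converting to a dense array.
df = pd.DataFrame.sparse.from_spmatrix(
    Matrix,
    index=range(1, Matrix.shape[0] + 1),
    columns=range(1, Matrix.shape[1] + 1),
)

# .loc uses the 1-based labels, matching the coordinates in the .mtx file.
print(df.loc[1, 4])
```

With this, df.loc[i, j] uses the same (row, column) pair that appears in the file, while df.iloc stays 0-based.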