Transform a Matrix Market matrix into a pandas DataFrame in Python


I have a Matrix Market file that I need to use for text analysis.

The file has the following structure:

%%MatrixMarket matrix coordinate integer general
2000 5000 23000
1 4300 1
1 2200 1
1 3000 1
1 600  1

The values on the second line give the number of rows, the number of columns, and the total number of non-zero values in the matrix. Every line after that contains three values:

  • the row (indexed from 1), which represents my text document;
  • the column (indexed from 1), which represents a word;
  • the term frequency.
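
To make the layout concrete, here is a minimal sketch of a hand-rolled reader for this coordinate format. The helper name `parse_coordinate_text` and the tiny 3 x 4 sample are made up for illustration; in practice `scipy.io.mmread` does this work. It shows the shift from 1-based file indices to 0-based positions, which is why the resulting DataFrame ends up labelled from 0.

```python
def parse_coordinate_text(text):
    """Parse Matrix Market coordinate text into (shape, nnz, entries).

    Hypothetical helper for illustration only; it handles integer
    'coordinate ... general' files and nothing else.
    """
    # Drop blank lines and the banner/comment lines, which start with '%'.
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.startswith("%")]

    # First remaining line: number of rows, columns, and non-zero entries.
    n_rows, n_cols, nnz = (int(tok) for tok in lines[0].split())

    entries = {}
    for ln in lines[1:]:
        row, col, value = (int(tok) for tok in ln.split())
        # File indices are 1-based; array positions are 0-based.
        entries[(row - 1, col - 1)] = value

    return (n_rows, n_cols), nnz, entries


# Made-up sample: 3 documents, 4 words, 3 non-zero term frequencies.
sample = """%%MatrixMarket matrix coordinate integer general
3 4 3
1 4 1
2 2 5
3 1 2
"""

shape, nnz, entries = parse_coordinate_text(sample)
print(shape, nnz, entries)
```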

As suggested in many posts, I read this file using scipy.io.mmread and the new API for dealing with sparse data structures.

In particular, I used the following code:

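    import pandas as pd
    from scipy.io import mmread

    Matrix = mmread('file_name.mtx')  # sparse matrix in coordinate (COO) format
    B = Matrix.todense()              # dense copy; positions are 0-based
    df = pd.DataFrame(B)
    print(df.head())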

However, from this code I got a data frame indexed from 0:

        0     1     2     3     4     5     6     7     8     9     ...   4872  \
0     1     0     1     0     0     0     0     0     1     0  ...      0   
1     0     0     0     0     0     0     0     0     0     0  ...      0   
2     0     0     0     0     0     0     0     0     0     0  ...      0   
3     1     0     1     0     0     0     0     0     1     0  ...      0   
4     0     0     1     0     0     0     0     0     0     0  ...      0  

Ideally, the result would preserve the format of the original Matrix Market file, with rows and columns indexed from 1.

Any ideas how to correct my code?

Thanks!


Solution

  • You can specify the index and columns when constructing the DataFrame:

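    import pandas as pd
    from scipy.io import mmread

    Matrix = mmread('file_name.mtx')
    B = Matrix.todense()
    # Label rows and columns 1..n so they match the indices in the .mtx file
    df = pd.DataFrame(B, index=range(1, B.shape[0] + 1), columns=range(1, B.shape[1] + 1))
    print(df.iloc[:5, :5])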
    
       1  2  3  4  5
    1  0  0  0  0  0
    2  0  0  0  0  0
    3  0  0  0  0  0
    4  0  0  0  0  0
    5  0  0  0  0  0
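
If the dense copy is too large, one possible alternative is to keep the data sparse with `pd.DataFrame.sparse.from_spmatrix`, which also accepts `index` and `columns`. The sketch below is not part of the original answer. It assumes a pandas version that provides the `.sparse` accessor, and it writes a made-up 3 x 4 file to a temporary directory only so the example is self-contained.

```python
import os
import tempfile

import pandas as pd
from scipy.io import mmread

# Made-up stand-in for 'file_name.mtx': 3 documents, 4 words, 3 non-zeros.
mtx_text = """%%MatrixMarket matrix coordinate integer general
3 4 3
1 4 1
2 2 5
3 1 2
"""

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tiny_example.mtx")
    with open(path, "w") as handle:
        handle.write(mtx_text)
    matrix = mmread(path)  # sparse matrix with 0-based positions

n_rows, n_cols = matrix.shape

# Build a DataFrame backed by sparse columns, labelled 1..n on both axes
# so the labels match the row/column numbers used in the .mtx file.
df_sparse = pd.DataFrame.sparse.from_spmatrix(
    matrix,
    index=pd.RangeIndex(1, n_rows + 1),
    columns=pd.RangeIndex(1, n_cols + 1),
)

# Label-based lookup now uses the same numbers as the file:
# document 1, word 4 has term frequency 1.
print(df_sparse.loc[1, 4])
```

Label-based access (`df_sparse.loc[row, col]`) then uses the same numbers as the file, while positional access (`.iloc`) remains 0-based.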