Change order of columns in topic distribution file in MALLET


MALLET generates a tab-separated file with the topic distribution of each document when the --output-doc-topics parameter is set while training the topic model. It looks roughly like this:

doc#    filename    topic#    weight
0    file:/.../document_01.txt    3     0.2110215053763441    14    0.1330645161    ...

However, I need this file sorted differently for further processing. Right now, the topic/weight pairs in each row are sorted by descending topic weight (0.211..., 0.133..., etc.). Is it possible to sort them by ascending topic number (0, 1, 2, ...) instead, with each weight staying next to its topic?

Initially, I thought the sorting could be done with Excel, but the file is just too large (> 20 GB).

Is there maybe a MALLET parameter for this? I have already looked through the --help section, but did not find anything relevant.

Otherwise, could you recommend a tool or API that is capable of this kind of sorting?

Thank you!


Solution

  • If you upgrade to the latest version (2.0.8), the default is to print all topics, sorted by topic id. The relevant option is:

    --doc-topics-max INTEGER
      When writing topic proportions per document with --output-doc-topics, do not print more than INTEGER number of topics.  A negative value indicates that all topics should be printed.
      Default is -1
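
If upgrading or retraining is not an option, an existing file can also be reordered one row at a time, so the 20 GB never has to fit in memory. Below is a minimal Python sketch, not part of MALLET. It assumes the layout shown in the question: tab-separated rows with the document number and filename first, followed by alternating topic number and weight. The names reorder_line and reorder_file are made up for this example. Rows that do not match the assumed layout (such as the header) are passed through unchanged.

```python
def reorder_line(line):
    """Return one doc-topics row with its topic/weight pairs sorted by topic number."""
    text = line.rstrip("\r\n")
    fields = text.split("\t")
    head = fields[:2]  # assumed: document number and filename
    # Drop empty cells, e.g. the one produced by a trailing tab.
    rest = [cell for cell in fields[2:] if cell.strip()]
    if len(rest) < 2 or len(rest) % 2 != 0:
        return text  # not a complete list of pairs: leave the row as it is
    try:
        pairs = [(int(rest[i]), rest[i + 1]) for i in range(0, len(rest), 2)]
    except ValueError:
        return text  # header or unexpected content: leave the row as it is
    pairs.sort(key=lambda pair: pair[0])
    # Weights stay as the original strings, so no precision is lost.
    cells = head + [cell for topic, weight in pairs for cell in (str(topic), weight)]
    return "\t".join(cells)


def reorder_file(in_path, out_path):
    """Stream in_path to out_path row by row, reordering each row."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(reorder_line(line) + "\n")
```

Calling reorder_file("doc-topics.txt", "doc-topics-by-id.txt") (both filenames are placeholders) writes a new file and leaves the original untouched. Each row is handled independently, so memory use depends on the longest row rather than on the file size.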