MALLET generates a tab-separated file with the topic distribution of each document when the --output-doc-topics
parameter is passed while training the topic model. It looks roughly like this:
doc# filename topic# weight
0 file:/.../document_01.txt 3 0.2110215053763441 14 0.1330645161 ...
However, I need this file sorted differently for further processing. Right now the topic/weight pairs in each row are ordered by descending topic weight (0.211..., 0.133..., etc.). Is it also possible to order them by ascending topic number (0, 1, 2, ...), each followed by its corresponding weight?
Initially, I thought the sorting could be done with Excel, but the file is just too large (> 20 GB).
Is there maybe a MALLET parameter for this? I have already looked through the --help
section, but did not find anything relevant.
Otherwise, could you recommend a tool or API that is capable of this kind of sorting?
Thank you!
If you upgrade to the latest version (2.0.8), the default is to print all topics, sorted by topic ID. The relevant option is:
--doc-topics-max INTEGER
When writing topic proportions per document with --output-doc-topics, do not print more than INTEGER number of topics. A negative value indicates that all topics should be printed.
Default is -1
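If you already have a doc-topics file in the older weight-sorted layout and do not want to retrain, the pairs can be reordered one line at a time, so the 20 GB file never has to fit in memory. The following is a minimal Python sketch, not part of MALLET. It assumes each data row is tab-separated and consists of the document number, the filename, and then alternating topic number and weight fields; any row whose first field is not an integer (such as a header) is passed through unchanged. The function names and the script name in the usage note are made up for illustration.

```python
import sys


def reorder_line(line):
    """Return one doc-topics row with its (topic, weight) pairs sorted by topic number."""
    fields = line.rstrip("\n").split("\t")
    # Assumption: the first two fields are doc# and filename.
    head = fields[:2]
    # Drop empty fields, e.g. those produced by a trailing tab.
    rest = [f for f in fields[2:] if f != ""]
    if len(rest) % 2 != 0:
        raise ValueError("unpaired topic/weight field in row: " + head[0])
    pairs = list(zip(rest[0::2], rest[1::2]))
    # Sort numerically so that topic 3 comes before topic 14.
    pairs.sort(key=lambda pair: int(pair[0]))
    return "\t".join(head + [value for pair in pairs for value in pair])


def reorder_stream(src, dst):
    """Copy rows from src to dst, reordering every row that starts with a document number."""
    for line in src:
        first = line.split("\t", 1)[0].strip()
        if first.isdigit():
            dst.write(reorder_line(line) + "\n")
        else:
            # Header or blank line: keep as-is.
            dst.write(line)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Stream from the input path to the output path, one row at a time.
    with open(sys.argv[1], encoding="utf-8") as src, \
            open(sys.argv[2], "w", encoding="utf-8") as dst:
        reorder_stream(src, dst)
```

Saved as, say, reorder_doc_topics.py, it could be run as `python reorder_doc_topics.py doc-topics.txt doc-topics-sorted.txt`. Because each row is handled independently, memory use depends on the longest row rather than on the file size.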