bash, topic-modeling, mallet

How to run a topic model on 20000 documents at once?


I have 20000 news documents that I want to run topic modeling on.

I want to see the topic dynamics and evolution across the documents. I tried the following batch script with MALLET's topic modeling tools, but it does not work.

#!/bin/bash
for filename in /Users/JasonDou/code/internet_finance/bydocafterseg2; do
    ./bin/mallet import-dir --input /Users/JasonDou/code/internet_finance/bydocafterseg2/159047443.txt  --output bydoc-input.mallet --keep-sequence --remove-stopwords
done

Solution

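  • You are missing an asterisk, so the loop runs only once, over the directory path itself. The loop body also ignores $filename and always imports the same hard-coded file. Both are fixed here: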

    #!/bin/bash
    for filename in "/Users/JasonDou/code/internet_finance/bydocafterseg2/"*; do
        [ -e "$filename" ] || continue
        ./bin/mallet import-dir --input "$filename" \
          --output bydoc-input.mallet --keep-sequence --remove-stopwords
    done
    

    The above iterates over each file in bydocafterseg2. To restrict it to .txt files only, change the glob to: "bydocafterseg2/"*".txt"
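
To see what the asterisk changes, here is a minimal sketch that runs the three loop forms against a throwaway directory. The directory and the file names (a.txt, b.txt, notes.md) are made up for illustration and have nothing to do with your data; the loops only count iterations, they do not call mallet.

```shell
#!/bin/bash
# Hypothetical demo directory with three small files (names are illustrative).
demo_dir=$(mktemp -d)
touch "$demo_dir/a.txt" "$demo_dir/b.txt" "$demo_dir/notes.md"

# Without the asterisk the loop body runs once, with the directory path itself.
count_without=0
for filename in "$demo_dir"; do
    count_without=$((count_without + 1))
done

# With the asterisk the shell expands the glob to one entry per file.
count_with=0
for filename in "$demo_dir/"*; do
    [ -e "$filename" ] || continue   # skip the literal pattern if nothing matched
    count_with=$((count_with + 1))
done

# Restricting the glob to .txt files skips notes.md.
count_txt=0
for filename in "$demo_dir/"*".txt"; do
    [ -e "$filename" ] || continue
    count_txt=$((count_txt + 1))
done

echo "without asterisk: $count_without"
echo "with asterisk: $count_with"
echo "txt only: $count_txt"

# Clean up the throwaway directory.
rm -rf "$demo_dir"
```

This prints 1, 3 and 2 for the three loops. One thing to watch in the real script: every iteration passes the same --output bydoc-input.mallet, so, as far as I can tell, each run replaces the file written by the previous one. If you need all documents in a single .mallet file, that is worth checking before training.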