Tags: python, svm, text-classification, word-embedding, doc2vec

How to classify text documents in the legal domain


I've been working on a project about classifying text documents in the legal domain (the Legal Judgment Prediction class of problems).
The data set consists of 700 legal documents, well balanced between two classes. Preprocessing applies the usual best practices (removing stopwords, etc.); after it, each document consists of 3 paragraphs, which I could consider together or separately. On average, a document is 2,285 words long.

I aim to use something different from the classical n-grams model (which doesn't take word order or semantics into account):

  • Use a neural network (Doc2Vec) to transform the text of each document into a vector in a continuous domain, producing a dataset of document vectors and their corresponding labels (as I said, there are 2 possible labels: 0 or 1);
  • Train an SVM to classify the samples, evaluated with 10-fold cross-validation.

I was wondering if someone with experience in this particular domain could suggest other approaches, or ways to improve the model, since I'm not getting particularly good results: 74% accuracy.

Is it correct to use Doc2Vec to transform text into vectors and feed them to a classifier?

My model representation:



Solution

  • Doc2Vec is a reasonable way to transform a variable-length text into a summary vector, and these vectors are often useful for classification – especially topical or sentiment classification (two applications highlighted in the original 'Paragraph Vector' paper).

    However, 700 docs is an extremely small training set. Published work has tended to use corpora of tens of thousands to millions of documents.

    Also, your specific classification target – predicting a legal judgment – strikes me as much harder than topical or sentiment classification. Knowing how a case will be decided depends on a large body of outside law/precedent (that's not in the training set), and on logical deductions, sometimes about individual fine points of a situation. Those are things the fuzzy summary of a single text-vector is unlikely to capture.

    Against that, your reported 74% accuracy sounds downright impressive. (Would a lay person do as well, with just these summaries?) I wonder if there are certain 'tells' in the summaries – with word choices of the summarizer strongly hinting at, or downright revealing, the actual judgment. If that's the strongest signal in the text (barring actual domain knowledge & logical reasoning), you might get just-as-good results from a simpler n-grams/bag-of-words representation and classifier.

    Meta-optimizing your training parameters might incrementally improve results, but I'd think you'd need a lot more data, and perhaps far more advanced learning techniques, to really approximate the kind of legally-competent human-level predictions you may be aiming for.
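To make the n-grams/bag-of-words comparison concrete, here's one minimal way such a baseline could look in scikit-learn – the documents, labels, and parameter choices below are placeholder assumptions, not your data:

```python
# Bag-of-words/n-grams baseline: TF-IDF features + linear SVM, 10-fold CV.
# Documents and labels are toy stand-ins for the real corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

docs = ["the court dismissed the appeal with costs",
        "the court allowed the appeal and reversed the judgment"] * 20
labels = [0, 1] * 20

# Unigrams + bigrams capture a little local word order that unigrams miss.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
X = vectorizer.fit_transform(docs)

scores = cross_val_score(LinearSVC(), X, labels, cv=10)
print(round(scores.mean(), 2))
```

If this simple baseline matches or beats the Doc2Vec pipeline, that's evidence the signal is surface-level word choice rather than anything the dense vectors add.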
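And if you do meta-optimize, a grid search over the SVM stage under the same 10-fold scheme is a cheap first step – the random vectors and grid values below are placeholders for your inferred Doc2Vec vectors and whatever ranges you want to try:

```python
# Grid search over SVM hyperparameters with 10-fold cross-validation.
# Random vectors stand in for the inferred Doc2Vec document vectors.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 50))        # stand-in: n_documents x vector_size
y = np.array([0, 1] * 20)

param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```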