Performing Coreference Resolution on .mtx data file

I'm attempting to perform Coreference Resolution on this BBC dataset:

Using the Neuralcoref model seen here:

However, having never worked with the .mtx file format, I'm stumped as to how I should pass the BBC data from the .mtx format to the spaCy (and NeuralCoref) pipeline.

I realize I have to use SciPy's mmread function to read the data, but how exactly would I pass the .mtx data to spaCy and NeuralCoref? Here's what I've done so far:

from scipy.io import mmread

# Specify the path to the .mtx file
file_path = "data/bbc.mtx"

# Read the .mtx file
matrix = mmread(file_path)

# Print the matrix
print(matrix)

Then, Neuralcoref's sample goes like this:

# Load your usual SpaCy model (one of SpaCy English models)
import spacy
nlp = spacy.load("en_core_web_sm")

# Add neural coref to SpaCy's pipe
import neuralcoref
neuralcoref.add_to_pipe(nlp)

# You're done. You can now use NeuralCoref as you usually manipulate a SpaCy document annotations.
doc = nlp("My sister has a dog. She loves him.")


I tried simply passing the matrix variable as

doc = nlp(matrix)

but didn't get what I expected. Would really appreciate some help, as I feel I'm out of my depth.


  • This won't work because the matrix read from the .mtx file is a sparse matrix of numbers and doesn't contain the text required for coreference resolution. The nlp pipeline expects a string of raw text.

    I think you're looking for something like this:

    import spacy
    import neuralcoref
    # Load SpaCy model
    nlp = spacy.load("en_core_web_sm")
    # Add NeuralCoref to the pipeline
    neuralcoref.add_to_pipe(nlp)
    # Preprocess and concatenate the BBC text data
    # Replace this with your actual preprocessing code to extract the relevant text
    bbc_text = "My sister has a dog. She loves him."
    # Process the BBC text data
    doc = nlp(bbc_text)
    # Perform coreference resolution
    clusters = doc._.coref_clusters
    # Print the coreference clusters
    for cluster in clusters:
        main_mention = cluster.main
        mentions = cluster.mentions
        print(f"Main mention: {main_mention.text}")
        print(f"Mentions: {[mention.text for mention in mentions]}")
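
The "replace this with your actual preprocessing code" step above could look roughly like the sketch below. It assumes the raw article text is available as plain .txt files somewhere on disk; the directory name data/bbc_raw and the helper names iter_raw_articles and coref_clusters_per_article are made up for illustration, not part of spaCy or NeuralCoref. The pipeline is passed in as an argument, so the nlp object built above can be used unchanged.

```python
from pathlib import Path
from typing import Any, Callable, Dict, Iterator, Tuple


def iter_raw_articles(root_dir: str) -> Iterator[Tuple[str, str]]:
    """Yield (relative_path, text) for every non-empty .txt file under root_dir."""
    root = Path(root_dir)
    for path in sorted(root.rglob("*.txt")):
        # errors="replace" keeps going if a file is not valid UTF-8
        text = path.read_text(encoding="utf-8", errors="replace").strip()
        if text:
            yield path.relative_to(root).as_posix(), text


def coref_clusters_per_article(nlp: Callable[[str], Any], root_dir: str) -> Dict[str, Any]:
    """Run the pipeline on each article separately and collect its clusters."""
    results = {}
    for name, text in iter_raw_articles(root_dir):
        doc = nlp(text)  # nlp receives a plain string, never the sparse matrix
        results[name] = doc._.coref_clusters
    return results


# Hypothetical usage, assuming nlp was built as shown above:
# clusters = coref_clusters_per_article(nlp, "data/bbc_raw")
```

Processing one article at a time, rather than concatenating everything into a single string, keeps clusters from spanning unrelated articles and keeps memory use bounded.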