Search code examples
pythongraphnlptokenizetext-mining

textmining graph sentences in python


I'm trying to solve a text mining problem in python which consist on:

Target: Create a graph composed of nodes(sentences) by tokenizing a paragraph into sentences, their edges would be their similarity.

This isn't new at all, but the crux of the question is not very treated on the Internet. So, after getting the sentences from a paragraph the interesting point would be to compute a matrix of similarity between sentences(all combinations) to draw the graph.

Is there any package to perform the similairty between several vectors in an easy way?, even with a given list of strings make a graph of similarity...

A reproducible example:

# tokenize into sentences
>>> from nltk import tokenize
>>> p = "Help my code works. The graph isn't still connected. The code computes the relationship in a graph. "
>>> sentences=tokenize.sent_tokenize(p)
  ['Help my code works.', "The graph isn't still connected.", 'The code computes the relationship in a graph.']
>>> len (sentences)
3

# compute similarity with dice coeffcient.
>>>def dice_coefficient(a, b):
    """dice coefficient 2nt/na + nb."""
    a_bigrams = set(a)
    b_bigrams = set(b)
    overlap = len(a_bigrams & b_bigrams)
    return overlap * 2.0/(len(a_bigrams) + len(b_bigrams)
>>>dice_coefficient(sentences[1],sentences[2])
0.918918918918919

So, with this function I can do it manually and later make the graph with the nodes and the edges. But always a global solution (with n sentences) is the best one.

Any suggestion?


Solution

  • The following list comprehension creates a list of tuples where the first two elements are indexes and the last one is the similarity:

    edges = [(i,j,dice_coefficient(x,y)) 
             for i,x in enumerate(sentences) 
             for j,y in enumerate(sentences) if i < j]
    

    You can now remove the edges that are under a certain threshold, and convert the remaining edges into a graph with networkx:

    import networkx as nx
    G = nx.Graph()
    G.add_edges_from((i,j) for i,j,sim in edges if sim >= THRESHOLD)