I'm curious whether anyone understands how Google created its Popular Passages feature, or can point me to comprehensive literature or source code on it. If you know of any other application that does the same thing, please post that as an answer too.
If you do not know what I am referring to, here is a link to an example of Popular Passages. In the overview of the book Modelling the legal decision process for information technology applications ... By Georgios N. Yannopoulos, you can see something like:
Popular passages
... direction, indeterminate. We have not settled, because we have not anticipated, the question which will be raised by the unenvisaged case when it occurs; whether some degree of peace in the park is to be sacrificed to, or defended against, those children whose pleasure or interest it is to use these things. When the unenvisaged case does arise, we confront the issues at stake and can then settle the question by choosing between the competing interests in the way which best satisfies us. In doing... Page 86
Appears in 15 books from 1968-2003
This would be a world fit for "mechanical" jurisprudence. Plainly this world is not our world; human legislators can have no such knowledge of all the possible combinations of circumstances which the future may bring. This inability to anticipate brings with it a relative indeterminacy of aim. When we are bold enough to frame some general rule of conduct (eg, a rule that no vehicle may be taken into the park), the language used in this context fixes necessary conditions which anything must satisfy... Page 86
It must be an intensive pattern-matching process. The only things I can think of are n-gram models, text corpora, and automatic plagiarism detection. But n-grams are usually used as probabilistic models for predicting the next item in a sequence, and text corpora (to my knowledge) are created manually. And in this particular case, Popular Passages, each passage can run to a great many words.
I am really lost. If I wanted to create such a feature, how or where should I start? Also, please include in your response which programming languages are best suited for this kind of work: F# or any other functional language, Perl, Python, Java... (I am becoming an F# fan myself)
PS: Can someone add the tag automatic-plagiarism-detection? I can't.
Read this ACM paper by Kolak and Schilit, the Google researchers who developed Popular Passages. There are also a few relevant slides from this MapReduce course taught by Baldridge and Lease at The University of Texas at Austin.
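As a starting point for experimenting, here is a minimal Python sketch of one common way to find text shared between documents: word n-gram shingling with an inverted index. It is an illustration of the general idea only and is not claimed to be the method from the paper. The function names (`shingles`, `shared_shingles`, `popular_passages`), the tokenizer, and the parameters `n` and `min_docs` are all assumptions made for this example.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9']+")  # crude tokenizer, an assumption for this sketch


def shingles(text, n=5):
    """Yield (position, shingle) pairs; a shingle is a tuple of n consecutive words."""
    words = WORD.findall(text.lower())
    for i in range(len(words) - n + 1):
        yield i, tuple(words[i:i + n])


def shared_shingles(docs, n=5, min_docs=2):
    """Index every shingle by the documents containing it; keep the shared ones."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for _, sh in shingles(text, n):
            index[sh].add(doc_id)
    return {sh: ids for sh, ids in index.items() if len(ids) >= min_docs}


def popular_passages(docs, source_id, n=5, min_docs=2):
    """Merge runs of consecutive shared shingles in one document into passages."""
    shared = shared_shingles(docs, n, min_docs)
    words = WORD.findall(docs[source_id].lower())
    passages, start, end = [], None, None
    for i, sh in shingles(docs[source_id], n):
        if sh in shared:
            if start is None:
                start = i          # a new run of shared text begins here
            end = i + n            # the run covers words[start:end]
        elif start is not None:
            passages.append(" ".join(words[start:end]))
            start = None
    if start is not None:          # flush a run that reaches the end of the text
        passages.append(" ".join(words[start:end]))
    return passages


docs = {
    "a": "Intro text here. Plainly this world is not our world and never was. The end.",
    "b": "As Hart wrote, plainly this world is not our world, which matters.",
    "c": "Completely unrelated sentence about parks and vehicles today.",
}
print(popular_passages(docs, "a", n=4))
```

This keeps everything in memory, so it only suits small collections. At the scale of a book corpus the same grouping of shingles by key is the kind of step that is usually distributed, which is where the MapReduce slides come in. Exact shingle matching is also brittle against OCR errors and small edits, so a real system would need some tolerance for near matches.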