NLP in Python: cleaning code out of text

I need to strip useless code and exception stack traces from a large number of text files before doing some text analysis. For example:

start-text: 7001

Add a working set
Search for something in that working set
Remove the working set
Search via context menu

==>

Log: Mon Dec 17 17:23:54 GMT+01:00 2001 4 org.eclipse.ui 0 java.util.ConcurrentModificationException

java.util.ConcurrentModificationException
    at java.util.AbstractList$Itr.checkForComodification(AbstractList.java(Compiled
Code))
    at java.util.AbstractList$Itr.next(AbstractList.java(Compiled Code))
    at

org.eclipse.jdt.internal.ui.search.JavaSearchSubGroup.fill(JavaSearchSubGroup.java:30)
    at org.eclipse.jdt.internal.ui.search.JavaSearchGroup.fill(JavaSearchGroup.java:51)
    at org.eclipse.jdt.internal.ui.actions.ContextMenuGroup.add(ContextMenuGroup.java:25)
    at
org.eclipse.jdt.internal.ui.packageview.PackageExplorerPart.menuAboutToShow(PackageExplorerPart.java:498)
    at org.eclipse.jface.action.MenuManager.fireAboutToShow(MenuManager.java:220)
    at org.eclipse.jface.action.MenuManager.handleAboutToShow(MenuManager.java:253)
    at org.eclipse.jface.action.MenuManager.access$0(MenuManager.java:250)
    at org.eclipse.jface.action.MenuManager$1.menuShown(MenuManager.java:280)

<==

end-text: 7001

or:

start-text: 7019

20011211 Ran the following compilation unit under the debugger with the breakpoint indicated. To get Windows to hit the breakpoint, you have to have the right dl and run an accessibility client. If you cannot replicate this problem with a simpler little example, I can walk you through the steps to do this. The only thing different about this CU is that it contains a non-public class as well as a public class. When I hit the breakpoint in the debugger, I got a dialog that told me that it can't find the source for the non-public class. The dialog is very persistent - I have told it OK and Cancel, but it keeps coming back. Even if I switch to the Java perspective, I still get the nagging dialog . If I kill the process, the dialog does not come back. But the point is that the debugger should be able to see the source for this class - it is right in my eclipse workspace. It isn't even hidden in some jar somewhere - it's very visible. I suspect that it's the non-public class thing that is confusing the source lookup. If it helps any, I will attach the dialog. Here's the code:

==>

package test;

import org.eclipse.swt.*;
import org.eclipse.swt.graphics.*;
import org.eclipse.swt.widgets.*;
import org.eclipse.swt.layout.*;
import org.eclipse.swt.events.*;
import org.eclipse.swt.internal.win32.*;
import org.eclipse.swt.internal.ole.win32.*;
import org.eclipse.swt.ole.win32.*;

public class AccessibilityTest {
    static Display display;
    static Shell shell;
    static FakeWidget fakeWidget;

    public static void main(String[] args) {
        display = new Display();
        shell = new Shell(display);
        shell.setLayout(new GridLayout());
        shell.setText("Accessibility Test");

        fakeWidget = new FakeWidget(shell, SWT.MULTI);
        fakeWidget.setLayoutData(new GridData(GridData.FILL_BOTH));
        shell.setSize(140, 110);
        shell.open();
        while (!shell.isDisposed()) {
            if (!display.readAndDispatch())
                display.sleep();
        }
    }
}



private static GUID IIDFromString(String lpsz) {
    char[] buffer = (lpsz + "\0").toCharArray();
    GUID lpiid = new GUID();
    if (COM.IIDFromString(buffer, lpiid) == COM.S_OK)
        return lpiid;
    return null;
}

<==

end-text: 7019

The results must be:

start-text: 7001

Add a working set
Search for something in that working set
Remove the working set
Search via context menu

end-text: 7001

and

start-text: 7019

end-text: 7019

In the cases above, the useless text is the code between "==>" and "<==" (the arrows are not in the actual text). I am using Python now, but I need a tool that removes all code and exceptions from the text. Does one exist? I think it would be useless and wrong to do NLP on such dirty text.

Solution

This is a non-trivial problem, and there is no predefined solution because it depends on your data. However, several approaches exist for separating text (natural language, NL) from code, although none of them is guaranteed to work 100% of the time.

Here is my suggestion:

First, check whether some kind of formatting separates the code from the NL (such as GitHub's Markdown), and compile appropriate regular expressions to detect the code. I used the following regexes to clean up issues extracted from GitHub:

import re

leading_whitespace_pattern = re.compile(r"^( {4,}|\t( |\t)*).*?$", re.MULTILINE)  # indented code lines
backtick_pattern = re.compile(r"```.*?```", re.DOTALL)  # fenced code blocks
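A minimal sketch of how these two patterns could be applied to an issue body. The helper name `strip_markdown_code` and the sample issue are made up for illustration, and the backtick pattern is written with a `{3}` quantifier, which matches the same text as the version above:

```python
import re

# Same patterns as above, repeated so the snippet runs on its own.
leading_whitespace_pattern = re.compile(r"^( {4,}|\t( |\t)*).*?$", re.MULTILINE)
backtick_pattern = re.compile(r"`{3}.*?`{3}", re.DOTALL)


def strip_markdown_code(text):
    """Hypothetical helper: drop fenced blocks first, then indented lines."""
    text = backtick_pattern.sub("", text)
    text = leading_whitespace_pattern.sub("", text)
    # Collapse the runs of blank lines left behind by the removals.
    return re.sub(r"\n{3,}", "\n\n", text).strip()


fence = "`" * 3  # a Markdown code fence
issue = (
    "The search crashes.\n"
    "\n"
    f"{fence}\n"
    "java.util.ConcurrentModificationException\n"
    f"{fence}\n"
    "\n"
    "    at java.util.AbstractList$Itr.next(AbstractList.java)\n"
    "\n"
    "Please fix."
)
cleaned = strip_markdown_code(issue)
print(cleaned)
```

Removing the fenced blocks first matters: otherwise the indentation pattern could eat part of a fenced block and leave an unmatched fence behind.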

And here are some more for issues extracted from Redmine:

code_pattern = re.compile(r"<pre>.*?</pre>", re.DOTALL)  # <pre> blocks
at_pattern = re.compile(r"@.*?@")  # inline code wrapped in @...@
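Applied to a ticket, the Redmine patterns might be used like this. This is only a sketch: `strip_redmine_code` and the sample ticket are hypothetical. One caveat worth checking on your own data is that `@.*?@` also matches the text between two e-mail addresses on the same line.

```python
import re

# Same patterns as above, repeated so the snippet runs on its own.
code_pattern = re.compile(r"<pre>.*?</pre>", re.DOTALL)
at_pattern = re.compile(r"@.*?@")


def strip_redmine_code(text):
    """Hypothetical helper: remove <pre> blocks, then inline @code@ spans."""
    text = code_pattern.sub("", text)
    return at_pattern.sub("", text)


ticket = (
    "Calling @fill()@ fails:\n"
    "<pre>\n"
    "java.util.ConcurrentModificationException\n"
    "</pre>\n"
    "See log."
)
cleaned_ticket = strip_redmine_code(ticket)
print(cleaned_ticket)
```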

If that does not work for you, things get tricky. You will either have to develop more regexes that match all the lines of code that may occur in your data, or use more advanced approaches. Bacchelli et al. did a lot of research on this topic and used different techniques with good results. However, I am unsure whether they published their implementation:

A. Bacchelli, M. D’Ambros, and M. Lanza, “Extracting Source Code from E-Mails,” in 18th IEEE International Conference on Program Comprehension (ICPC 2010), 2010, pp. 24–33.
A. Bacchelli, A. Cleve, M. Lanza, and A. Mocci, “Extracting Structured Data from Natural Language Documents with Island Parsing,” in 26th IEEE/ACM International Conference on Automated Software Engineering (ASE 2011), 2011, pp. 476–479.
N. Bettenburg, B. Adams, A. E. Hassan, and M. Smidt, “A Lightweight Approach to Uncover Technical Artifacts in Unstructured Data,” in 19th IEEE International Conference on Program Comprehension (ICPC 2011), 2011, pp. 185–188.
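
If you go the "more regexes" route for unformatted reports like the ones in the question, a line-by-line filter is one possible starting point. The sketch below is a naive heuristic, not the method from any of the papers above. The pattern list and the names `CODE_LINE_PATTERNS`, `looks_like_code` and `strip_code_lines` are assumptions for illustration, tuned only to Java-like code and stack traces. It will miss some lines (for example a wrapped fragment such as "Code))") and may drop NL lines that happen to end in a semicolon.

```python
import re

# Hypothetical heuristics for Java-like code and stack traces; tune on your data.
CODE_LINE_PATTERNS = [
    # Stack frame "at pkg.Class.method(", or a bare "at" left by line wrapping.
    re.compile(r"^\s*at(\s+[\w.$]+\(|\s*$)"),
    # Fully qualified exception or error name, e.g. java.util.FooException.
    re.compile(r"\b(\w+\.)+\w*(Exception|Error)\b"),
    # Wrapped stack frame without "at": pkg.Class.method(File.java:30)
    re.compile(r"^\s*[\w.$]+\([\w.$]+\.java(:\d+)?\)\s*$"),
    # Line ending like a statement or a block delimiter.
    re.compile(r"[;{}]\s*$"),
]


def looks_like_code(line):
    """Return True if any heuristic pattern matches the line."""
    return any(p.search(line) for p in CODE_LINE_PATTERNS)


def strip_code_lines(text):
    """Keep only the lines that do not look like code or stack traces."""
    kept = [line for line in text.splitlines() if not looks_like_code(line)]
    return "\n".join(kept)


report = "\n".join([
    "start-text: 7001",
    "Add a working set",
    "Search via context menu",
    "Log: Mon Dec 17 17:23:54 GMT+01:00 2001 4 org.eclipse.ui 0 "
    "java.util.ConcurrentModificationException",
    "java.util.ConcurrentModificationException",
    "    at java.util.AbstractList$Itr.next(AbstractList.java(Compiled Code))",
    "    at",
    "org.eclipse.jdt.internal.ui.search.JavaSearchSubGroup.fill(JavaSearchSubGroup.java:30)",
    "import org.eclipse.swt.*;",
    "public class AccessibilityTest {",
    "        display = new Display();",
    "}",
    "end-text: 7001",
])
print(strip_code_lines(report))
```

Classifying whole lines keeps the patterns simple and easy to extend one at a time, but it cannot remove code that is embedded in the middle of a sentence.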

Good luck!