Improving Javadoc regex

I'm currently using this fragment in a Python script to detect Javadoc comments:

import re

# This regular expression matches Javadoc comments.
pattern = r'/\*\*(?:[^*]|\*(?!/))*\*/'
# Here's how it works:
# /\*\*    matches the leading '/**' ('*' is a metacharacter, so it must be escaped)
# (?:      starts a non-capturing group to match one comment character
#  [^*]    matches any non-asterisk character...
#  |       or...
#  \*      any asterisk...
#   (?!/)  that's not followed by a slash (negative lookahead)
# )        end non-capturing group
# *        matches any number of these non-terminal characters
# \*/      matches the closing '*/' (again, have to escape '*')
comments = re.findall(pattern, large_string_of_java_code)

This regex doesn't work perfectly. I'm okay with it not matching Unicode escape sequences (e.g., the comment /** a */ can be written as \u002f** a */). The main problem that I have is that it will yield a false positive on a comment like this:

// line comment /** not actually a javadoc comment */

and will probably break on comments like this:

// line comment /** unfinished "Javadoc comment"
// regex engine is still searching for closing slash

I tried using a negative lookbehind for ^.*//, but, according to the Python docs,

…the contained pattern must only match strings of some fixed length.

So that doesn't work.

I also tried starting from the beginning of the line, something like this:

pattern = r'^(?:[^/]|/(?!/))*(the whole regex above)'

but I couldn't get this to work.

Are regular expressions appropriate for this task? How can I get this to work?

If regex isn't the right tool, I'm happy to use any lightweight-ish built-in Python 2 module.

Solution

If you need exact results and you're working with Java code, you're likely better off integrating with javadoc (or doxygen). This question may help: How to extract JavaDoc comments from the source files

If you don't need exact results, regular expressions should work well enough for most cases if you proceed in stages: first eliminate the confusing parts (// line comments and non-Javadoc /* */ comments), then look for Javadoc comments. You also have to decide how to deal with Javadoc delimiters that happen to be embedded in string literals. At that point the problem is really one of lexical analysis rather than plain pattern matching. Maybe the staged approach is enough for your application?
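
As a rough illustration of the staged idea, here is a minimal sketch that folds the stages into a single pass: one alternation matches string literals, line comments, and ordinary block comments alongside Javadoc comments, and only the Javadoc matches are kept. The names TOKEN and find_javadoc are made up for this example, and the sketch assumes plain Java source without Unicode escapes or multi-line text blocks. It uses only the built-in re module and should run on Python 2 as well as Python 3.

```python
import re

# At each position the alternatives are tried left to right. Whichever
# construct starts first consumes its whole extent, so comment delimiters
# that appear inside a string or another comment are skipped over.
TOKEN = re.compile(r'''
    (?P<javadoc>/\*\*(?!/)(?:[^*]|\*(?!/))*\*/)   # Javadoc comment (but not the empty comment /**/)
  | /\*(?:[^*]|\*(?!/))*\*/                       # ordinary block comment
  | //[^\n]*                                      # line comment, up to the end of the line
  | "(?:[^"\\\n]|\\.)*"                           # string literal, honouring backslash escapes
  | '(?:[^'\\\n]|\\.)*'                           # char literal, honouring backslash escapes
''', re.VERBOSE)


def find_javadoc(source):
    """Return Javadoc comments that are not inside strings or other comments."""
    return [m.group('javadoc')
            for m in TOKEN.finditer(source)
            if m.group('javadoc')]


sample = ('// line comment /** not actually a javadoc comment */\n'
          '/** real doc */\n')
print(find_javadoc(sample))  # ['/** real doc */']
```

Because the line-comment alternative swallows everything up to the newline, an unfinished /** inside a // comment can no longer start a match. This is still an approximation of the Java lexer, not a replacement for javadoc or doxygen.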