
Parse a markdown file in bash to get all indented lines and their position in the file


I am trying to get all indented lines in a markdown file in bash. I need their position in the file in order to be able to later extract or insert them again at their original position.

Below is an example of a markdown file from which I want to get all indented lines.

# Example bloc code

This is a bloc code

    function display_results() {
        awk '{print $0; system("sleep .5");}' $1
        rm $1
    }

This code displays results.

below an other example of bloc code

    echo "------------------------------------------"
    echo "              TEST RESULTS"
    echo "------------------------------------------"

Or just one line:

    System.out.println("foo");

blablablab

Because I want the position of each block, I parse the file line by line and check whether each line is indented by using a regex.

Note: It is mentioned here that regex is not the right tool for extracting code blocks, because code blocks can be nested. I don't have to handle that use case; getting only plain code blocks like those in the example above will be sufficient.

My code is:

# One of the regex I have tested
regex='^[[:blank:]]+'  # Does not match any line

while read line; do
  # Try to find indented lines by using regex
  if [[ $line =~ $regex ]]; then
      echo "INDENTED: $line"
  else
      echo "TEXT: $line"
  fi
done < $testFile

where $testFile is the markdown file that I parse.

So far, the best regexes that I have written (based on this answer and this one) match only some of the lines, not all of them.

With the following regexes, for example, I get some of the indented lines but not all:

regexblank="[^a-zA-Z#]+[[:blank:]]"
regexspace="[^a-zA-Z#]+[[:space:]]"
blank="[^a-zA-Z#]+[[:blank:]]"

With the regexes above, the result is:

TEXT: # Example bloc code
TEXT:
TEXT: This is a bloc code
TEXT:
INDENTED: function display_results() {
INDENTED: awk '{print main.sh; system("sleep .5");}'
TEXT: rm
TEXT: }
TEXT:
TEXT: This code displays results.
TEXT:
TEXT: below an other example of bloc code
TEXT:
TEXT: echo "------------------------------------------"
INDENTED: echo "              TEST RESULTS"
TEXT: echo "------------------------------------------"
TEXT:
TEXT: Or just one line:
TEXT:
TEXT: System.out.println("foo");
TEXT:
TEXT: blablablab

As you can see, in all three regexes I have to specify that the line must not begin with a letter or a #; otherwise some lines, such as the title, are detected as indented.

Using awk as follows gives me all the indented lines:

awk '/^(\t|\s)+/' $mdFile

But awk works only on a file, and I need to have the position of each block.

How can I parse a file and get all the lines that are indented? As I explained, I am trying with regex, but any solution that gives me the indented lines and their respective positions in the file would be great.

You can find the code and all the regexes that I wrote here.


Solution

  • Look at what $line contains on each iteration of the loop:

    $ cat infile
    line
        indented
    line
    $ while read line; do echo "<$line>"; done < infile
    <line>
    <indented>
    <line>
    

    This is because of this behaviour of read (emphasis mine):

    One line is read from the standard input [...], split into words as described above in Word Splitting, and the first word is assigned to the first name, [...]

    To prevent that, set IFS to the empty string (and add -r for good measure to avoid backslash interpretation):

    $ while IFS= read -r line; do echo "<$line>"; done < infile
    <line>
    <    indented>
    <line>
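
Building on the IFS= read -r fix above, the following is a minimal sketch of how the indented lines could be reported together with their line numbers, which is the "position" the question asks for. The function name list_indented and the LINENO:LINE output format are illustrative choices, not something from the original post. The sketch assumes that "indented" means a line starting with at least one space or tab followed by a non-blank character, so whitespace-only lines are not reported.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name and output format are assumptions for illustration).
# Prints "LINENO:LINE" for every line of the file that is indented and non-empty.
list_indented() {
  local file=$1 line n=0
  # Leading blanks (spaces or tabs) followed by at least one non-blank character.
  local regex='^[[:blank:]]+[^[:blank:]]'

  # IFS= keeps leading whitespace, -r keeps backslashes literal.
  # The "|| [[ -n $line ]]" part also handles a last line without a trailing newline.
  while IFS= read -r line || [[ -n $line ]]; do
    n=$((n + 1))
    if [[ $line =~ $regex ]]; then
      printf '%d:%s\n' "$n" "$line"
    fi
  done < "$file"
}
```

Called as list_indented infile on the three-line infile shown above, this prints 2:    indented. Because the line number is kept next to the content, the lines can later be put back at their original position. Grouping consecutive indented lines into blocks is left out of the sketch.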