Tags: regex, command-line, grep, whitespace

Grep not recognizing white space


I have a file (the first chapter of Harry Potter) with large amounts of white space. For example:

 CHAPTER ONE
  The Boy Who Lived
   M r and Mrs Dursley, of number four, Privet Drive, were
   proud to say that they were perfectly normal, thank
   you very much. They were the last people you’d expect to be
   involved in anything strange or mysterious, because they just
   didn’t hold with such nonsense.
    Mr Dursley was the director of a fi rm called Grunnings,
    which made drills. He was a big, beefy man with hardly
    any neck, although he did have a very large moustache.
    Mrs Dursley was thin and blonde and had nearly twice the
    usual amount of neck, which came in very useful as she spent
    so much of her time craning over garden fences, spying on the
    neighbours. The Dursleys had a small son called Dudders and

My objective, while learning command line tools, is to first identify the lines with grep and then remove the leading white space, as follows:

 CHAPTER ONE
The Boy Who Lived
M r and Mrs Dursley, of number four, Privet Drive, were
proud to say that they were perfectly normal, thank
you very much. They were the last people you’d expect to be
involved in anything strange or mysterious, because they just
didn’t hold with such nonsense.
Mr Dursley was the director of a fi rm called Grunnings,
which made drills. He was a big, beefy man with hardly
any neck, although he did have a very large moustache.
Mrs Dursley was thin and blonde and had nearly twice the
usual amount of neck, which came in very useful as she spent
so much of her time craning over garden fences, spying on the
neighbours. The Dursleys had a small son called Dudders and

I'm trying to identify the lines with multiple white spaces using grep. So far I've tried the following (among others):

$ grep "(\s){2,}" file
$ grep "(\ ){2,}" file
$ grep "([[:space:]]){2,}" file
$ grep "[[:space:]]{2,}" file

None of these has produced any matches. I've confirmed in Vim that the white space is there, and I've checked each of those patterns on regex101.com. I've also run grep " " file (and variations), which correctly outputs every line containing white space.

What is the correct syntax for this query?


Solution

  • Given:

    cat file
     CHAPTER ONE
      The Boy Who Lived
       M r and Mrs Dursley, of number four, Privet Drive, were
       proud to say that they were perfectly normal, thank
       you very much. They were the last people you’d expect to be
       involved in anything strange or mysterious, because they just
       didn’t hold with such nonsense.
        Mr Dursley was the director of a fi rm called Grunnings,
        which made drills. He was a big, beefy man with hardly
        any neck, although he did have a very large moustache.
        Mrs Dursley was thin and blonde and had nearly twice the
        usual amount of neck, which came in very useful as she spent
        so much of her time craning over garden fences, spying on the
        neighbours. The Dursleys had a small son called Dudders and
    

    Your best bet is to use sed to delete the leading spaces:

    sed -E 's/^[[:blank:]]{2,}//' file
     CHAPTER ONE
    The Boy Who Lived
    M r and Mrs Dursley, of number four, Privet Drive, were
    proud to say that they were perfectly normal, thank
    you very much. They were the last people you’d expect to be
    involved in anything strange or mysterious, because they just
    didn’t hold with such nonsense.
    Mr Dursley was the director of a fi rm called Grunnings,
    which made drills. He was a big, beefy man with hardly
    any neck, although he did have a very large moustache.
    Mrs Dursley was thin and blonde and had nearly twice the
    usual amount of neck, which came in very useful as she spent
    so much of her time craning over garden fences, spying on the
    neighbours. The Dursleys had a small son called Dudders and
    

    Or with awk:

    awk '{sub(/^[[:blank:]]{2,}/,"")} 1' file
    # same output
    

    If you only want to identify those lines that have 2 or more spaces at the beginning with grep:

    grep -E '^[[:blank:]]{2,}' file
    

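    The issue you were having is that grep and sed use Basic Regular Expressions (BRE) by default. In BRE, an unescaped { or ( is a literal character, so "[[:space:]]{2,}" looks for a whitespace character followed by the literal text {2,}. Use the -E option to switch to Extended Regular Expressions (ERE), where {2,} is an interval and ( ) is a group. Alternatively, stay in BRE and escape them as \{2,\} and \( \).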
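The BRE/ERE difference can be checked on a small throwaway file. This is a minimal sketch: the sample lines and the variable names are made up for illustration, and only standard grep options (-c, -E) are used.

```shell
# Hypothetical sample: 1, 2 and 3 leading spaces, one unindented line,
# and one line with a double space in the middle.
sample=$(mktemp)
printf ' CHAPTER ONE\n  The Boy Who Lived\n   proud to say\nno indent\ntwo  inner spaces\n' > "$sample"

# BRE (grep's default): the interval braces must be backslash-escaped.
bre_count=$(grep -c '^[[:blank:]]\{2,\}' "$sample")

# ERE (-E): the braces are special without escaping.
ere_count=$(grep -c -E '^[[:blank:]]{2,}' "$sample")

# BRE with unescaped braces: {2,} is matched literally, so nothing matches.
# grep exits non-zero on "no match", hence the "|| true".
literal_count=$(grep -c '^[[:blank:]]{2,}' "$sample" || true)

# The pattern from the question, with -E added: two or more whitespace
# characters anywhere in the line, not just at the start.
anywhere_count=$(grep -c -E '[[:space:]]{2,}' "$sample")

echo "$bre_count $ere_count $literal_count $anywhere_count"
rm -f "$sample"
```

The first two counts should agree, and the third should be zero, which is the behaviour the question ran into.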

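    The main difference between BRE and ERE is which characters are special on their own: in BRE, the characters ?, +, {, |, ( and ) are literal when unescaped, while in ERE they are special when unescaped.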

    awk uses ERE as a default.
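Because sed also defaults to BRE, the substitution can be written without -E by escaping the braces. The sketch below compares the two spellings on a made-up three-line sample (the file and variable names are illustrative only):

```shell
# Hypothetical sample with 1, 2 and 4 leading spaces.
sample=$(mktemp)
printf ' CHAPTER ONE\n  The Boy Who Lived\n    Mr Dursley was\n' > "$sample"

# ERE form, as used in the answer above.
ere_out=$(sed -E 's/^[[:blank:]]{2,}//' "$sample")

# Equivalent BRE form: no -E, braces escaped.
bre_out=$(sed 's/^[[:blank:]]\{2,\}//' "$sample")

# Both strip runs of two or more leading blanks; a single leading
# space (as on the first line) is left alone.
printf '%s\n' "$bre_out"
rm -f "$sample"
```

Which form to use is mostly a matter of taste; -E tends to be easier to read once a pattern contains several groups or intervals.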