Tags: bash, sed, while-loop, sh, subshell

Linux shell: remove lines from a file based on the contents of another file


Let's consider two text files, 'main_list' and 'ignore_list'. For each line in ignore_list, I want to remove every line in main_list that starts with that string.

Basically, something doable with sed and a while loop.

E.g.

while read line; do echo ^$line; sed -i "/^$line/d" ./main_list; done < ./ignore_list

As an improvement, I wanted to build the sed pattern first and then run sed only once:

while read line; do
    if [ $SED_PATTERN="" ]; then 
      SED_PATTERN="^$line"
    else
      SED_PATTERN=$SED_PATTERN"\|^$line"
    fi
  done < ./ ignore_list
echo $SED_PATTERN
sed -i "/$SED_PATTERN/d" ./main_list

Unfortunately, because of the subshell used by the while loop, it does not work.

The question "A variable modified inside a while loop is not remembered" and https://mywiki.wooledge.org/BashFAQ/024 give useful explanations and workarounds, but I haven't yet managed to get one working in a simple way.
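For reference, a minimal sketch of the pattern-building loop in plain sh. It rests on two assumptions worth checking: in busybox ash, dash, and bash, a loop fed by redirection (`done < file`) runs in the current shell, and only a loop at the end of a pipe runs in a subshell; and the test `[ $SED_PATTERN="" ]` expands to a single non-empty word, which `[` treats as true, so the else branch is never reached. The sketch also swaps the `\|` of the original for `sed -E` with a plain `|`, and uses a scratch directory with a trimmed copy of the sample data purely for illustration.

```shell
#!/bin/sh
# Illustrative sample data in a scratch directory (not the real files).
dir=$(mktemp -d) && cd "$dir" || exit 1
printf '%s\n' 'ebtables|2.0.11-4|ebtables|2.0.11-4' \
              'edgeonboarding-config|0.1|edgeonboarding-config|0.1' \
              'ethtool|1:5.9-1|ethtool|1:5.9-1' > main_list
printf '%s\n' 'edgeonboarding-config' 'ethtool' > ignore_list

SED_PATTERN=""
# Redirection (not a pipe) feeds the loop, so SED_PATTERN survives "done".
while IFS= read -r line; do
    [ -n "$line" ] || continue            # skip empty lines
    if [ -z "$SED_PATTERN" ]; then        # quoted test with an explicit operator
        SED_PATTERN="^$line"
    else
        SED_PATTERN="$SED_PATTERN|^$line"
    fi
done < ./ignore_list

printf '%s\n' "$SED_PATTERN"
# Delete matching lines in one pass; -E makes "|" an alternation.
[ -n "$SED_PATTERN" ] && sed -E -i "/$SED_PATTERN/d" ./main_list
```

The lines of ignore_list are still used as regular expressions here, so names containing characters such as `+` or `.` would need escaping first.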

Ideally, I want to use the sh shell (the script will run in a GitLab pipeline with a simple Alpine image).

Any ideas for keeping it simple before I move to a Python script (and use a fat image instead of Alpine; in the meantime, I can also use an image with bash)?

Maybe an approach other than sed and the while loop?

Thanks.

Edit: some more context about the content of both files. I am dealing with a list of Debian packages installed by a build step. The main_list is the output of a dpkg-query command (see below), so it should not contain overly fancy characters. The ignore_list contains the packages I want to ignore in a later post-processing step: internal components that are not relevant for my output.

Here is a small extract of both files:

main_list

e2fsprogs|1.46.2-2|e2fsprogs|1.46.2-2
ebtables|2.0.11-4|ebtables|2.0.11-4
edgeonboarding-config|0.1|edgeonboarding-config|0.1
efibootguard|0.13+cip|efibootguard|0.13+cip
ethtool|1:5.9-1|ethtool|1:5.9-1

for the ignore_list

edgeonboarding-config

You can generate the main_list on any Debian-based Linux system by running

dpkg-query -f '${source:Package}|${source:Version}|${binary:Package}|${Version}\n' -W > main_list

and for the ignore_list, just pick a few strings from the main_list (the beginnings of lines).

EDIT2: anyway, my initial idea with a while loop is not necessary. I just need

  • one sed command over ignore_list to replace each line $myline and its trailing newline with ^$myline|
  • set the output as SED_PATTERN
  • and then run another sed command: sed -i "/$SED_PATTERN/d" ./main_list
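The three steps above can be sketched as follows. This is only one possible reading: it uses `paste` to join the lines, because sed works line by line and does not see the newline it would have to replace, and it uses `sed -E` so that `|` acts as alternation. The scratch directory and the trimmed sample data are assumptions for illustration.

```shell
#!/bin/sh
# Illustrative sample data in a scratch directory (not the real files).
dir=$(mktemp -d) && cd "$dir" || exit 1
printf '%s\n' 'ebtables|2.0.11-4|ebtables|2.0.11-4' \
              'edgeonboarding-config|0.1|edgeonboarding-config|0.1' \
              'ethtool|1:5.9-1|ethtool|1:5.9-1' > main_list
printf '%s\n' 'edgeonboarding-config' 'ethtool' > ignore_list

# Step 1 and 2: drop empty lines, prefix each line with ^,
# then join all lines with | into a single alternation.
SED_PATTERN=$(sed -e '/^$/d' -e 's/^/^/' ignore_list | paste -s -d '|' -)
printf '%s\n' "$SED_PATTERN"

# Step 3: one sed run over main_list (skipped if the pattern is empty,
# since an empty regex is an error or reuses the previous one).
[ -n "$SED_PATTERN" ] && sed -E -i "/$SED_PATTERN/d" ./main_list
```

As written, entries are treated as regular expressions and an entry containing `/` would break the sed address, so this is only safe for plain package names.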

Solution

  • Using any POSIX awk given the input/output you've recently added to your question:

    awk -F'|' '
        NR==FNR {
            sub(/[[:space:]]+$/,"")
            ign[$0]
            next
        }
        !($1 in ign)
    ' ignore_list main_list
    

    That does a literal full-string comparison against just the first |-separated field of each line.

    If you were to use sed and/or grep for this, you'd need to escape all possible regexp metachars in ignore_list first; see "Is it possible to escape regex metacharacters reliably with sed".
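    Since awk writes to standard output rather than editing in place, a small usage sketch may help. It assumes a scratch directory and trimmed sample data, and the ignore entry `ebt` is a made-up prefix added only to show that partial names do not match.

    ```shell
    #!/bin/sh
    # Illustrative sample data in a scratch directory (not the real files).
    dir=$(mktemp -d) && cd "$dir" || exit 1
    printf '%s\n' 'ebtables|2.0.11-4|ebtables|2.0.11-4' \
                  'edgeonboarding-config|0.1|edgeonboarding-config|0.1' \
                  'ethtool|1:5.9-1|ethtool|1:5.9-1' > main_list
    printf '%s\n' 'edgeonboarding-config' 'ebt' > ignore_list

    # Filter into a temporary file, then replace main_list only on success.
    awk -F'|' '
        NR==FNR { sub(/[[:space:]]+$/,""); ign[$0]; next }
        !($1 in ign)
    ' ignore_list main_list > main_list.tmp && mv main_list.tmp main_list

    cat main_list
    ```

    The ebtables and ethtool lines remain, because `ebt` is not equal to the whole first field of any line.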


    Original answer before you showed us sample input/output:

    Using any POSIX awk (untested due to no sample input/output provided):

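    awk '
        # First file (ignore_list): store each line, minus trailing whitespace.
        NR==FNR {
            sub(/[[:space:]]+$/,"")
            ign[$0]
            next
        }
        # Second file (main_list): skip lines that start with any stored string.
        {
            for ( str in ign ) {
                if ( index($0,str) == 1 ) {
                    next
                }
            }
            print    # no stored string is a prefix, so keep the line
        }
    ' ignore_list main_list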
    

    That does a literal substring comparison against just the start of each line.

    If you were to use sed and/or grep for this, you'd need to escape all possible regexp metachars in ignore_list first; see "Is it possible to escape regex metacharacters reliably with sed".
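For comparison, a hedged sketch of the escaping route with grep. It assumes extended regular expressions (`grep -E`) and escapes the characters that are special in them before anchoring each entry with `^`. The file name `ignore_patterns`, the scratch directory, and the `g++` and `gzip` lines are made up for illustration, chosen only to exercise the escaping.

```shell
#!/bin/sh
# Illustrative sample data in a scratch directory (not the real files).
dir=$(mktemp -d) && cd "$dir" || exit 1
printf '%s\n' 'e2fsprogs|1.46.2-2|e2fsprogs|1.46.2-2' \
              'ethtool|1:5.9-1|ethtool|1:5.9-1' \
              'g++|0.0|g++|0.0' \
              'gzip|0.0|gzip|0.0' > main_list
printf '%s\n' 'g++' 'ethtool' > ignore_list

# Drop empty lines, backslash-escape ERE metacharacters,
# then anchor each entry at the start of the line.
sed -e '/^$/d' \
    -e 's/[][\.*^$+?(){}|]/\\&/g' \
    -e 's/^/^/' ignore_list > ignore_patterns

# Keep only the lines that match none of the patterns.
grep -v -E -f ignore_patterns main_list > main_list.tmp
# grep exits 1 when every line was filtered out; only >1 is a real error.
[ $? -le 1 ] && mv main_list.tmp main_list

cat main_list
```

Without the escaping step, `^g++` would be read as "one or more g" and would also remove the gzip line. The awk versions avoid the issue entirely because they never treat the entries as regular expressions.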