bash, substring, string-length

Need help with string manipulation in a bash script


I'm not used to the syntax of bash scripts. I'm trying to read a file and, for each line, keep only the part of the string before the delimiter '/', then write it to a new file if the word has a particular length. I've downloaded a dictionary, but its format doesn't meet my needs. Since there are 84000 words, I don't really want to manually remove what comes after the '/' for each word. I thought it would be easy, and I followed a couple of ideas from similar questions on this site, but I seem to be missing something because it still doesn't work. I can't get the length right. The file Test_Input.txt contains one word per line. Here's the code:

#!/usr/bin/bash
filename="Test_Input.txt"
while read -r line
do
    sub= echo $line | cut -d '/' -f1
    length= echo ${#sub}
    if $length >= 4 && $length <= 10;
        then echo $sub >> Test_Output.txt
    fi
done < "$filename"

Solution

  • Several items:

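    1. I assume that you actually used backquotes in the assignments, as in sub=`echo $line | cut -d '/' -f1`, and not literally sub= echo $line | cut -d '/' -f1. The literal form runs echo with an empty sub in its environment and leaves sub unset in the script, so it would certainly have failed. Alternatively, you can use $(), as in sub=$(echo $line | cut -d '/' -f1)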
    2. The conditions in an if clause need to be enclosed in single or double brackets, [ ] or [[ ]], like this: if [[ $length -ge 4 ]] && [[ $length -le 10 ]]; Without the brackets, if tries to run the value of $length as a command.
    3. Which brings me to the next point: <= and >= are not supported inside [[ ]], and < and > compare strings rather than numbers there. Just use -ge for "greater than or equal" and -le for "less than or equal".
    4. If a line does not contain any / characters, sub will contain the whole line in your version. This might not be what you want, so I'd advise also adding the -s flag to cut, which makes cut skip lines that have no delimiter.
    5. You don't need somevar=$(echo $someothervar). Just use somevar=$someothervar. The same goes for the length: length=${#sub} is enough.
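
    As a small illustration of points 1 and 3, here is a sketch that assumes bash. The sample line apple/MS is made up, and the variable names string_cmp and numeric_cmp exist only for this demonstration:

```shell
#!/usr/bin/env bash
# Point 1: capture a command's output with $( ), with no space after '='.
line="apple/MS"                      # made-up sample line
sub=$(echo "$line" | cut -d '/' -f 1)
echo "sub=$sub"

# Point 3: inside [[ ]], '<' compares strings, so "9" sorts after "10".
if [[ 9 < 10 ]]; then string_cmp="true"; else string_cmp="false"; fi

# -lt compares the operands as integers.
if [[ 9 -lt 10 ]]; then numeric_cmp="true"; else numeric_cmp="false"; fi

echo "string comparison: $string_cmp, numeric comparison: $numeric_cmp"
```

    The string comparison reports false even though 9 is less than 10 numerically, which is why the numeric operators are the safer choice for lengths.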

    Here's a version that works:

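    #!/usr/bin/env bash
    filename="Test_Input.txt"
    # IFS= and -r keep each line exactly as it appears in the file
    while IFS= read -r line
    do
        # -s makes cut skip lines without a '/', which leaves sub empty
        sub=$(echo "$line" | cut -s -d '/' -f 1)
        length=${#sub}
        if [[ $length -ge 4 ]] && [[ $length -le 10 ]]
            then echo "$sub" >> Test_Output.txt
        fi
    done < "$filename"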
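
    If you would rather avoid starting echo and cut for each of the 84000 lines, the same filtering can probably be done with bash's own parameter expansion and an arithmetic test. This is only a sketch: the function name filter_words and the sample words are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the part of each input line before the first '/'
# when that part is 4 to 10 characters long.
filter_words() {
    local line sub
    while IFS= read -r line; do
        [[ $line == */* ]] || continue   # like cut -s: skip lines without a '/'
        sub=${line%%/*}                  # drop the first '/' and everything after it
        # (( )) evaluates arithmetic, so >= and <= work as expected here
        if (( ${#sub} >= 4 && ${#sub} <= 10 )); then
            printf '%s\n' "$sub"
        fi
    done
}

# Made-up sample data standing in for Test_Input.txt.
result=$(printf '%s\n' 'cat/S' 'house/MS' 'plain' 'extraordinarily/X' | filter_words)
printf '%s\n' "$result"   # prints: house
```

    With the real file, the call would look like filter_words < Test_Input.txt > Test_Output.txt.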
    

    Of course, you could also just use sed:

    sed -n -r '/^[^/]{4,10}\// s;/.*$;;p' Test_Input.txt > Test_Output.txt
    

    Explanation:

    • -n Don't print anything unless explicitly marked for printing.
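    • -r Use extended regular expressions. This is the GNU sed spelling; -E is the equivalent option accepted by both GNU and BSD sed.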
    • /<searchterm>/ <operation> Search for lines that match the search term, and perform the operation on them:
      • Searchterm is: ^[^/]{4,10}\/ From the beginning of the line, there should be between 4 and 10 non-slash characters, followed by a slash.
      • Operation is: s;/.*$;; followed by the p flag. Replace everything from the first slash to the end of the line with nothing, then print the result. The s command uses ; as its delimiter so that the slash needs no escaping.
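
    To try the idea on a few lines before running it on the whole dictionary, here is a sketch with made-up sample words. It uses a slightly different form than the command above: a single s command with ; as the delimiter and a capture group, so no slash has to be escaped, and -E instead of -r.

```shell
#!/usr/bin/env bash
# Made-up sample data standing in for Test_Input.txt.
input=$(printf '%s\n' 'cat/S' 'house/MS' 'plain' 'extraordinarily/X')

# Keep the 4 to 10 non-slash characters before the first '/', drop the rest,
# and print only the lines where the substitution succeeded.
result=$(printf '%s\n' "$input" | sed -n -E 's;^([^/]{4,10})/.*$;\1;p')
printf '%s\n' "$result"   # prints: house
```

    Lines that are too short, too long, or have no slash do not match the pattern, so the p flag never fires for them and they are left out of the output.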