Search code examples
regexbashcsvawkcut

Separating text with commas out of CSV?


I'm trying to write some RHEL security hardening automation scripts, and I've got a CSV file that I'm trying to generate the information into readable content. Here's what I have so far...

#!/bin/bash

# loop through the file
while read line; do
        # get all of the content
        vulnid=`echo $line | cut -d',' -f1`
        ruleid=`echo $line | cut -d',' -f2`
        stigid=`echo $line | cut -d',' -f3`
        title=`echo $line | cut -d',' -f4`
        discussion=`echo $line | cut -d',' -f5`
        check=`echo $line | cut -d',' -f6`
        fix=`echo $line | cut -d',' -f7`

        # Format the content

        echo "########################################################"
        echo "# Vulnerability ID: $vulnid"
        echo "# Rule ID: $ruleid"
        echo "# STIG ID: $stigid"
        echo "#"
        echo "# Rule: $title"
        echo "#"
        echo "# Discussion:"
        echo "# $discussion"
        echo "# Check:"
        echo "# $check"
        echo "# Fix:"
        echo "# $fix"
        echo "########################################################"
        echo "# Start Check"
        echo
        echo "# Start Remediation"
        echo
        echo "########################################################"

done < STIG.csv

The problem that I'm having with this is that my text in the CSV contains commas. This is actually perfectly fine and in accordance with the IETF standard (https://www.rfc-editor.org/rfc/rfc4180#page-2 Section 2.4). However, as you can imagine, the cut command doesn't look ahead to see if there's a trailing space after the comma (as you would normally have in English). This is causing all of my fields to get messed up, and I can't figure out how to get this all working right.

Now, I have a feeling that there's some magical regex I can use like ',![:blank:]', but I'll be damned if I know how to utilize it. I'm used to using cut just because it's quick and dirty, but perhaps someone has a better suggestion using maybe awk or sed. This is mainly to generate the bulk structure of my program which repeats itself and is a TON of commentary.

An added note, this has to run on a clean install of RHEL6. I would write this in Ruby, Python, whatever. However, most of those are extra packages which have to be installed. The environment which this script would be deployed is where machines don't have any internet access or extra packages. Python 2.6 is on CentOS6 by default, but RHEL6 (I think). Otherwise, trust me, I'd be writing this whole thing in Ruby.

Here's a sample of the CSV:

V-38447,SV-50247r1_rule,RHEL-06-000519,The system package management tool must verify contents of all files associated with packages.,The hash on important files like system executables should match the information given by the RPM database. Executables with erroneous hashes could be a sign of nefarious activity on the system.,"The following command will list which files on the system have file hashes different from what is expected by the RPM database. # rpm -Va | grep '$1 ~ /..5/ && $2 != 'c''If there is output, this is a finding.","The RPM package management system can check the hashes of installed software packages, including many that are important to system security. Run the following command to list which files on the system have hashes that differ from what is expected by the RPM database: # rpm -Va | grep '^..5'A 'c' in the second column indicates that a file is a configuration file, which may appropriately be expected to change. If the file that has changed was not expected to then refresh from distribution media or online repositories. rpm -Uvh [affected_package]OR yum reinstall [affected_package]"

Also, if anyone is curious, this whole project is out on GitHub.


Solution

  • All the comments on your question are good ones. There is no support for CSV built in to bash, so if you don't want to use a language like Python, Ruby, Erlang or even Perl, you have to roll your own.

    Note that while awk can use commas as a field separator, it also does not properly support CSV with commas embedded in quoted fields. You can hack together a solution with a pattern, as Håkon suggested.

    But you don't need to do this in awk; you can do it just as well in bash alone and avoid calling an external tool. How about something like this?

    #!/bin/bash
    
    nextfield () {
      case "$line" in
        \"*)
          value="${line%%\",*}\""
          line="${line#*\",}"
          ;;
        *)
          value="${line%%,*}"
          line="${line#*,}"
          ;;
      esac
    }
    
    # loop through the file
    while read line; do
    
      # get the content
      nextfield; vulnid="$value"
      nextfield; ruleid="$value"
      nextfield; stigid="$value"
      nextfield; title="$value"
      nextfield; discussion="$value"
      nextfield; check="$value"
      nextfield; fix="$value"
    
      # format the content
      printf "########################################################\n"
      printf "# Vulnerability ID: %s\n" "$vulnid"
      printf "# Rule ID: %s\n# STIG ID: %s\n#\n" "$ruleid" "$stigid"
      printf "# Rule: %s\n" "$title"
      printf "#\n# Discussion:\n"
      fmt -w68 <<<"$discussion" | sed 's/^/#   /'
      printf "# Check:\n"
      fmt -w68 <<<"$check" | sed 's/^/#   /'
      printf "# Fix:\n"
      fmt -w68 <<<"$fix" | sed 's/^/#   /'
      printf "########################################################\n"
      printf "# Start Check\n\n"
      printf "# Start Remediation\n\n"
      printf "########################################################\n"
    
    done < STIG.csv
    

    The speed advantage will be tremendous, if you're doing a lot of these.

    Note the improved formatting, courtesy of fmt. This kind of kills the speed benefit of avoiding the call to an external program, but it sure makes your output easier to read. :)