bash scripting text file manipulation questions


I have a text file that looks like the one below. The fields are separated by spaces, and the number of spaces varies, so consecutive spaces should be treated as a single delimiter. I also want to convert the date into MySQL timestamp format.

   889468    216 -rw-r--r--   1 wls1     wls1       217868 Nov  1 00:42 /home/wls1/1800WLS610Entry_20191031194242110_C0NTRA.jpg
  2889469    228 -rw-r--r--   1 wls1     wls1       231092 Nov  1 01:21 /home/wls1/1800WLS610Entry_20191031202145570_FPP3360.jpg
  2889471    196 -rw-r--r--   1 wls1     wls1       197452 Nov  1 01:55 /home/wls1/1800WLS610Entry_20191031205544650_0NLY.jpg
  2889470    196 -rw-r--r--   1 wls1     wls1       199512 Nov  1 01:55 /home/wls1/1800WLS610Entry_20191031205544720_C0NTRACT.jpg
  2889472    236 -rw-r--r--   1 wls1     wls1       240152 Nov  1 01:57 /home/wls1/1800WLS610Entry_20191031205719060_KSK6973.jpg
  2889473    232 -rw-r--r--   1 wls1     wls1       236876 Nov  1 01:57 /home/wls1/1800WLS610Entry_20191031205748650_KSK6973.jpg
  2889474    224 -rw-r--r--   1 wls1     wls1       229292 Nov  1 04:22 /home/wls1/1800WLS610Entry_20191031232239000_0NLY.jpg
  2889475    228 -rw-r--r--   1 wls1     wls1       230476 Nov  1 04:28 /home/wls1/1800WLS610Entry_20191031232853120_0NLY.jpg
  2889477    224 -rw-r--r--   1 wls1     wls1       228708 Nov  1 04:31 /home/wls1/1800WLS610Entry_20191031231809320_C0NTRACT.jpg
  2889476    216 -rw-r--r--   1 wls1     wls1       219104 Nov  1 04:31 /home/wls1/1800WLS610Entry_20191031233143530_CTP75.jpg

I need to extract the full path of the file, the timestamp, and the username of the owner, so that the resulting file looks like the one below. The delimiter should be a single tab character, and the date field should be converted into a MySQL timestamp.

/home/wls1/1800WLS610Entry_20191031194242110_C0NTRA.jpg     wls1    2019-11-01 00:42:00
/home/wls1/1800WLS610Entry_20191031202145570_FPP3360.jpg    wls1    2019-11-01 01:21:00
/home/wls1/1800WLS610Entry_20191031205544650_0NLY.jpg       wls1    2019-11-01 01:55:00
/home/wls1/1800WLS610Entry_20191031205544720_C0NTRACT.jpg   wls1    2019-11-01 01:55:00
/home/wls1/1800WLS610Entry_20191031205719060_KSK6973.jpg    wls1    2019-11-01 01:57:00
/home/wls1/1800WLS610Entry_20191031205748650_KSK6973.jpg    wls1    2019-11-01 01:57:00
/home/wls1/1800WLS610Entry_20191031232239000_0NLY.jpg       wls1    2019-11-01 04:22:00
/home/wls1/1800WLS610Entry_20191031232853120_0NLY.jpg       wls1    2019-11-01 04:28:00
/home/wls1/1800WLS610Entry_20191031231809320_C0NTRACT.jpg   wls1    2019-11-01 04:31:00
/home/wls1/1800WLS610Entry_20191031233143530_CTP75.jpg      wls1    2019-11-01 04:31:00

To accomplish the above, I have been trying to use cat and cut as such:

cat text.txt | cut -d ' ' -f 12,25,27,28,29

I vary the argument to the -f option to tell cut which columns I want, but I see that it won't treat consecutive spaces as a single delimiter.

The above cat/cut statement yields the following:

1 217868  1 00:42
wls1 Nov 1 01:21 /home/wls1/1800WLS610Entry_20191031202145570_FPP3360.jpg
wls1 Nov 1 01:55 /home/wls1/1800WLS610Entry_20191031205544650_0NLY.jpg
wls1 Nov 1 01:55 /home/wls1/1800WLS610Entry_20191031205544720_C0NTRACT.jpg
wls1 Nov 1 01:57 /home/wls1/1800WLS610Entry_20191031205719060_KSK6973.jpg
wls1 Nov 1 01:57 /home/wls1/1800WLS610Entry_20191031205748650_KSK6973.jpg
wls1 Nov 1 04:22 /home/wls1/1800WLS610Entry_20191031232239000_0NLY.jpg
wls1 Nov 1 04:28 /home/wls1/1800WLS610Entry_20191031232853120_0NLY.jpg
wls1 Nov 1 04:31 /home/wls1/1800WLS610Entry_20191031231809320_C0NTRACT.jpg
wls1 Nov 1 04:31 /home/wls1/1800WLS610Entry_20191031233143530_CTP75.jpg

So, the above is a step in the right direction.

But notice the top line? Its leading number is one digit shorter than in the other lines, so it is padded with an extra space and every field number shifts by one. Also, I am unsure how to rearrange the columns and reformat the timestamp.

Thanks in advance for your help!


Solution

  • If you want to start with the provided file text.txt, please try the following:

    # Map month abbreviations to month numbers.
    declare -A m2n=([Jan]=1 [Feb]=2 [Mar]=3 [Apr]=4 [May]=5 [Jun]=6 [Jul]=7 [Aug]=8 [Sep]=9 [Oct]=10 [Nov]=11 [Dec]=12)
    
    while IFS= read -r line; do
        # The path starts at column 73 in this fixed-width listing.
        fname="$(cut -c 73- <<< "$line")"
        # Word splitting collapses runs of spaces:
        # ary[4]=owner, ary[7]=month, ary[8]=day, ary[9]=HH:MM.
        read -r -a ary <<< "$line"
        # The listing has no year, so the current year is used.
        date=$(printf "%04d-%02d-%02d" "$(date +%Y)" "${m2n[${ary[7]}]}" "${ary[8]}")
        time="${ary[9]}:00"
        # Tabs separate the columns; a space joins date and time into one timestamp.
        printf "%s\t%s\t%s %s\n" "$fname" "${ary[4]}" "$date" "$time"
    done < "text.txt"
    

    Result:

    /home/wls1/1800WLS610Entry_20191031194242110_C0NTRA.jpg wls1    2019-11-01 00:42:00
    /home/wls1/1800WLS610Entry_20191031202145570_FPP3360.jpg        wls1    2019-11-01 01:21:00
    /home/wls1/1800WLS610Entry_20191031205544650_0NLY.jpg   wls1    2019-11-01 01:55:00
    /home/wls1/1800WLS610Entry_20191031205544720_C0NTRACT.jpg       wls1    2019-11-01 01:55:00
    /home/wls1/1800WLS610Entry_20191031205719060_KSK6973.jpg        wls1    2019-11-01 01:57:00
    /home/wls1/1800WLS610Entry_20191031205748650_KSK6973.jpg        wls1    2019-11-01 01:57:00
    /home/wls1/1800WLS610Entry_20191031232239000_0NLY.jpg   wls1    2019-11-01 04:22:00
    /home/wls1/1800WLS610Entry_20191031232853120_0NLY.jpg   wls1    2019-11-01 04:28:00
    /home/wls1/1800WLS610Entry_20191031231809320_C0NTRACT.jpg       wls1    2019-11-01 04:31:00
    /home/wls1/1800WLS610Entry_20191031233143530_CTP75.jpg  wls1    2019-11-01 04:31:00
    

    Note that the columns are not visually aligned because the filenames vary in length.
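
    On the cut attempt in the question: cut -d ' ' counts every single space as a field boundary, so runs of spaces produce empty fields and the field numbers shift from line to line. awk's default field splitting treats any run of blanks as one separator, so the same extraction could be sketched roughly as below. The function name to_tsv and its optional year argument are illustrative assumptions, and the sketch assumes that the paths contain no spaces.

```shell
#!/usr/bin/env bash
# Sketch: extract path, owner, and a MySQL-style timestamp from the listing.
# Assumptions: paths contain no spaces; the year is passed in by the caller
# (defaults to the current year). to_tsv is a hypothetical helper name.
to_tsv() {
    local year="${1:-$(date +%Y)}"
    awk -v year="$year" '
        BEGIN {
            # Map month abbreviations to numbers: Jan -> 1, ..., Dec -> 12.
            n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names)
            for (i = 1; i <= n; i++) month[names[i]] = i
        }
        NF >= 11 {
            # Fields: 5 = owner, 8 = month, 9 = day, 10 = HH:MM, 11 = path.
            printf "%s\t%s\t%04d-%02d-%02d %s:00\n", $11, $5, year, month[$8], $9, $10
        }
    '
}

# Usage: to_tsv 2019 < text.txt
```

    Because awk numbers fields after collapsing the blanks, the shorter leading number on the first line no longer shifts the columns.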

    One potential problem with the bash script above is the year. The listing does not include it, so the script uses the current year; you may need to add a conditional branch, especially when the listing spans a year boundary.
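
    One possible shape for such a branch is sketched below. It rests on an assumption that the original does not state: the listing is recent and contains no future dates, so an entry whose month is later than the current month is taken to be from the previous year. The function name infer_year and its arguments are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: guess the year for a listing entry that only shows month and day.
# Assumption: entries are never in the future, so a month later than the
# current month is taken to belong to the previous year.
# infer_year is a hypothetical helper; its arguments are the entry month (1-12),
# the current year, and the current month (1-12).
infer_year() {
    local entry_month=$1 now_year=$2 now_month=$3
    if (( entry_month > now_month )); then
        echo $(( now_year - 1 ))
    else
        echo "$now_year"
    fi
}

# Example call with today's date; 10# forces base 10 so that "08" and "09"
# are not rejected as invalid octal numbers:
# infer_year 11 "$(date +%Y)" "$(( 10#$(date +%m) ))"
```

    In the loop, the result could replace the "$(date +%Y)" argument of printf. The sketch does not cover entries older than one year.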

    If you can go back to the original files and run find directly on them (GNU find is needed for -printf), please try this instead:

    find /home/wls1 -type f -name "*.jpg" -printf "%p\t%u\t%TY-%Tm-%Td %TH:%TM:%.2TS\n"
    

    which will give you the desired output.
    Hope this helps.