UNIX: How to cut columns from the right when not all fields are the same length


I have a list of data and I need to cut certain characters out of certain columns.

Here is the list:

JCG2380 GREEN, JULIE C          JR-II BISS CPSC BS   INFO TECH  XXX/XXX-9445
JAG1936 GREEN, JOE A.           SO-I  BISS CPSC BS   INFO TECH  XXX/XXX-7993
ACG4636 GREEN, ADAM C.          JR-II BISS CPSC BS   COMP SCI   XXX/XXX-0437
SPG1696 GREEN, SEAN P.          JR-I  BISS CPSC BS   COMP SCI   XXX/XXX-2398
SEG8835 GREEN, SHAWN E.         FR-II BISS CPSC BS   COMP SCI   XXX/XXX-7149
MCGo599 GREEN, MICHAEL C.       JR-I  BISS CPSC BS   COMP SCI   XXX/XXX-OOOO
GJG1887 GREEN, GREGORY J.       SO-II BISS CPSC BS   INFO TECH  XXX/XXX-4354
NGG5479 GREEN, NICHOLAS G       JR-I  BISS CPSC BS   INFO TECH  XXX/XXX-8268
ZTG7190 GREEN, ZACHARY T.       FR-II BISS CPSC BS   INFO TECH  XXX/XXX-1298
AXG9097 GREEN, ALEXANDER        SO-I  BISS CPSC BS   INFO TECH  XXX/XXX-0313
RJG6624 GREEN, ROBERT J.        SO-II BISS CPSC BS   COMP SCI   XXX/XXX-ZOZI
MWG1990 GREEN, MATTHEW W        SO-II BISS CPSC BS   INFO TECH  XXX/XXX-0581

The problem here is that not all the fields are the same size. Notice how Alexander Green (3rd from the bottom) does not have a middle initial. Because awk splits on whitespace, his line has one field fewer than the others, so the same field numbers don't work on every line. My idea is to count from the right side of each line instead, so that the varying number of fields on the left won't mess everything up.

So how can I use the cut command to start at the right-most column and cut back 7 columns?


Solution

  • You can use cut, since your data has fixed-width fields.

    Here is what I got with the OCR'd text:

    $ cut -c 33-51,73-77 input
    JR-II BISS CPSC BS 9445
    SO-I  BISS CPSC BS 7993
    JR-II BISS CPSC BS 0437
    JR-I  BISS CPSC BS 2398
    FR-II BISS CPSC BS 7149
    JR-I  BISS CPSC BS OOOO
    SO-II BISS CPSC BS 4354
    JR-I  BISS CPSC BS 8268
    FR-II BISS CPSC BS 1298
    SO-I  BISS CPSC BS 0313
    SO-II BISS CPSC BS ZOZI
    SO-II BISS CPSC BS 0581
    

    and to match the requirement you wrote in a comment:

    "Exactly what I'm trying to do is get the first character out of the columns that start (from the top entry) with JR, BISS, CPSC, INFO. Then I need the last 4 digits from the phone numbers on the right."

    $ cut -c 32-33,38-39,43-44,48-49,64,73-77 input
     J B C B 9445
     S B C B 7993
     J B C B 0437
     J B C B 2398
     F B C B 7149
     J B C B OOOO
     S B C B 4354
     J B C B 8268
     F B C B 1298
     S B C B 0313
     S B C B ZOZI
     S B C B 0581
    

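    You'll need to adjust the ranges for your actual data. Note that in the OCR'd text 48-49 picks up the B of BS; if you want the first letter of the INFO TECH / COMP SCI column, as your comment describes, use 53-54 instead.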
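
If you would rather count from the right-hand end of each line, as the question originally asked, cut itself cannot do that, but two common workarounds exist. The sketch below is only an illustration: the function names last_chars and last_chars_awk are made up for this example, and the first one assumes the rev utility (from util-linux) is installed. Both also assume the lines have no trailing whitespace, otherwise the count from the right will be off.

```shell
# Hypothetical helper names, not part of the original answer.

# Print the last N characters of every line on stdin:
# reverse each line, cut N characters from the left, reverse back.
# Assumes `rev` is available (it is not a POSIX utility).
last_chars() {
    rev | cut -c "1-$1" | rev
}

# Same result using only awk: start N characters before the end of the line.
last_chars_awk() {
    awk -v n="$1" '{ print substr($0, length($0) - n + 1) }'
}

# Usage with the file called "input" in the commands above:
#   last_chars 4 < input
#   last_chars_awk 4 < input
```

Because both versions measure from the end of the line, they are unaffected by a missing middle initial or any other width change further to the left.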