Answer: Thanks to Jerry Jeremiah, I have the solution. The end result is this:
grep -E '^\S{8} \S' test.lst | awk -F';' '{print substr($1,1,35)gensub("[[:space:]]+"," ","g",substr($1,36));}'
It requires gawk to be installed, since gensub is a gawk extension.
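To make the grep stage of that pipeline easier to see in isolation, here is a minimal sketch. It assumes the filter is meant to keep only lines that begin with an eight-character address, a single space, and then more data. The function name keep_listing_lines and the sample lines are hypothetical, and \S has been swapped for the POSIX class [^[:space:]] because \S is a GNU extension.

```shell
# Keep only lines that start with 8 non-whitespace characters,
# one space, and then another non-whitespace character.
# keep_listing_lines is a hypothetical helper name, not part of the original answer.
keep_listing_lines() {
  grep -E '^[^[:space:]]{8} [^[:space:]]' "$@"
}

# Hypothetical sample: one listing line, one line with no address, one header line.
printf '%s\n' \
  '00000214 6600        bne.s SkipSetup' \
  '         4E71        nop' \
  'Page 1' |
  keep_listing_lines
```

Only the first sample line should pass the filter, since the other two do not start with eight non-whitespace characters.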
Original Question: I have a file whose output I want to sanitise and then diff, but I'm having problems coming up with a working regex to do what I want.
Basically, I want to ignore the first 36 characters, then start at the first non-whitespace character after that, replace every run of multiple whitespace characters with a single space, strip any line comment off the end (these start with a ;), and remove any trailing whitespace.
I just can't figure out how to get a pattern that works while ignoring those first 36 characters. Any time I use a capture group like (\S*([^\s]\s+))* it will only ever return the last match.
This is an example of the code I'm grepping into sed:
00000000 =00A00000 z80_ram: equ $A00000 ; start of Z80 RAM
00000000 =00A000EA z80_dac3_pitch: equ $A000EA
00000000 =00A01FFD z80_dac_status: equ $A01FFD
00000000 =00A01FFF z80_dac_sample: equ $A01FFF
00000000 =00A02000 z80_ram_end: equ $A02000 ; end of non-reserved Z80 RAM
00000000 =00A10001 z80_version: equ $A10001
00000000 =00A10002 z80_port_1_data: equ $A10002
00000000 =00A10008 z80_port_1_control: equ $A10008
00000000 =00A1000A z80_port_2_control: equ $A1000A
00000000 =00A1000C z80_expansion_control: equ $A1000C
00000000 =00A11100 z80_bus_request: equ $A11100
00000000 =00A11200 z80_reset: equ $A11200
00000000 =00A04000 ym2612_a0: equ $A04000
00000000 =00A04001 ym2612_d0: equ $A04001
00000000 =00A04002 ym2612_a1: equ $A04002
00000000 =00A04003 ym2612_d1: equ $A04003
00000000 =00A14000 security_addr: equ $A14000
00000214 6600 bne.s SkipSetup ; Skip the VDP and Z80 setup code if port A, B or C is ok...?
00000216 4BFA 0000 lea SetupValues(pc),a5 ; Load setup values array address.
0000021A 4C9D 00E0 movem.w (a5)+,d5-d7
0000021E 4CDD 1F00 movem.l (a5)+,a0-a4
00000222 1029 EF01 move.b -$10FF(a1),d0 ; get hardware version (from $A10001)
00000226 0200 000F andi.b #$F,d0
0000022A 6700 beq.s SkipSecurity ; If the console has no TMSS, skip the security stuff.
0000022C 237C 5345 4741 2F00 move.l #'SEGA',$2F00(a1) ; move "SEGA" to TMSS register ($A14000)
The output I want is this:
00000000 =00A00000 z80_ram: equ $A00000
00000000 =00A000EA z80_dac3_pitch: equ $A000EA
00000000 =00A01FFD z80_dac_status: equ $A01FFD
00000000 =00A01FFF z80_dac_sample: equ $A01FFF
00000000 =00A02000 z80_ram_end: equ $A02000
00000000 =00A10001 z80_version: equ $A10001
00000000 =00A10002 z80_port_1_data: equ $A10002
00000000 =00A10008 z80_port_1_control: equ $A10008
00000000 =00A1000A z80_port_2_control: equ $A1000A
00000000 =00A1000C z80_expansion_control: equ $A1000C
00000000 =00A11100 z80_bus_request: equ $A11100
00000000 =00A11200 z80_reset: equ $A11200
00000000 =00A04000 ym2612_a0: equ $A04000
00000000 =00A04001 ym2612_d0: equ $A04001
00000000 =00A04002 ym2612_a1: equ $A04002
00000000 =00A04003 ym2612_d1: equ $A04003
00000000 =00A14000 security_addr: equ $A14000
00000214 6600 bne.s SkipSetup
00000216 4BFA 0000 lea SetupValues(pc),a5
0000021A 4C9D 00E0 movem.w (a5)+,d5-d7
0000021E 4CDD 1F00 movem.l (a5)+,a0-a4
00000222 1029 EF01 move.b -$10FF(a1),d0
00000226 0200 000F andi.b #$F,d0
0000022A 6700 beq.s SkipSecurity
0000022C 237C 5345 4741 2F00 move.l #'SEGA',$2F00(a1)
You may use awk like:
awk -F';' '{a=substr($1,1,35); b=substr($1,36); gsub("[[:space:]]+"," ",b);print a b;}' file > outfile
See an online awk demo.
Details
-F';' - field separator set to ;
a=substr($1,1,35) - set an a variable equal to the substring of Field 1 spanning chars 1 to 35
b=substr($1,36) - set a b variable equal to the substring of Field 1 from char 36 to the end
gsub("[[:space:]]+"," ",b) - replace all chunks of 1 or more whitespace chars with a single regular space char in the b variable only
print a b - print concatenated a and b variable values.
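The question also asks for trailing whitespace to be removed, and collapsing the whitespace in front of a ; comment still leaves one space at the end of the line. A minimal sketch of one way to handle that, using only sub, gsub and substr so that plain awk should be enough, might look like the following. The function name clean_listing and the sample line are hypothetical, and it assumes ; never appears inside a string literal in the source column.

```shell
# clean_listing is a hypothetical helper name; it reads files given as
# arguments, or standard input when none are given.
clean_listing() {
  awk -F';' '{
    a = substr($1, 1, 35)          # fixed-width prefix, kept verbatim
    b = substr($1, 36)             # source text up to the first ;
    gsub(/[[:space:]]+/, " ", b)   # collapse each whitespace run to one space
    sub(/ $/, "", b)               # drop the single trailing space, if any
    print a b
  }' "$@"
}

# Hypothetical sample line: a 35-column prefix followed by the source text.
printf '%-35s%s\n' '00000214 6600' 'bne.s   SkipSetup    ; Skip the VDP' |
  clean_listing
```

The sub call only needs to match a single space because gsub has already reduced any trailing run to one. Whitespace inside the first 35 columns is left untouched, which keeps the columns aligned for diffing.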