unix sed ldif

Manipulating a huge text file to fetch occurrences of a particular field


I have a huge text file in the following format, and I want to count how many times each value of the department field occurs. Each section has a field called department:. The result should be a CSV file like the one shown in the Expected output section. I would appreciate a solution that uses sed, head/tail, or awk. The file is really huge, about 50,000+ lines, so an efficient method is much appreciated.

Input format:


# Person1 Perosn2, AADDC Users, dummydata.somecompany.com
dn: CN=Person1 Perosn2,OU=AADDC Users,DC=dummydata,DC=somecompany,DC=com
objectClass: top
department: 234ABC
name: Person1 Perosn2
objectGUID:: MbCDVZpKbEWRxDUA5iN5IA==
userPrincipalName: [email protected]
objectCategory: CN=Person,CN=Schema,CN=Configuration,DC=dummydata,DC=somecompany
 ,DC=com
dSCorePropagationData: 16010101000000.0Z
lastLogonTimestamp: 132173602593105876
preferredLanguage: en-US
msDS-AzureADMailNickname: abcdef


# Person1 Perosn2, AADDC Users, dummydata.somecompany.com
dn: CN=Person1 Perosn2,OU=AADDC Users,DC=dummydata,DC=somecompany,DC=com
objectClass: top
department: 234ABC
name: Person1 Perosn2
objectGUID:: MbCDVZpKbEWRxDUA5iN5IA==
userPrincipalName: [email protected]
objectCategory: CN=Person,CN=Schema,CN=Configuration,DC=dummydata,DC=somecompany
 ,DC=com
dSCorePropagationData: 16010101000000.0Z
lastLogonTimestamp: 132173602593105876
preferredLanguage: en-US
msDS-AzureADMailNickname: abcdef

# Person3 Perosn4, AADDC Users, dummydata.somecompany.com
dn: CN=Person1 Perosn2,OU=AADDC Users,DC=dummydata,DC=somecompany,DC=com
objectClass: top
department: XYZ012
name: Person1 Perosn2
objectGUID:: MbCDVZpKbEWRxDUA5iN5IA==
userPrincipalName: [email protected]
objectCategory: CN=Person,CN=Schema,CN=Configuration,DC=dummydata,DC=somecompany
 ,DC=com
dSCorePropagationData: 16010101000000.0Z
lastLogonTimestamp: 132173602593105876
preferredLanguage: en-US
msDS-AzureADMailNickname: abcdef


Expected output

234ABC,2
XYZ012,1

What I did:

I used this command to grep the file: grep '^department: *' file.txt

However, I am not sure whether the expected output can be produced with a single command such as sed or grep.


Solution

  • Could you please try the following.

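    awk '
    BEGIN{
      OFS=","                  # join output fields with a comma (CSV)
    }
    {
      gsub(/\r/,"")            # strip carriage returns left by Windows line endings
    }
    /^department:/{            # only lines that start with the department attribute
      string=$NF               # the value is the last whitespace-separated field
      if(!val[string]++){      # count it; on first sight also record its position
        b[++count]=string
      }
    }
    END{
      for(i=1;i<=count;i++){   # print in order of first appearance
        print b[i],val[b[i]]
      }
    }
    '  Input_file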
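
For comparison, the same count can be sketched as a pipeline of standard tools, since the question asks about grep and sed as well. This is only a minimal sketch and not part of the original answer: count_departments is a hypothetical helper name, and it assumes each department value is stored as plain text on a single line (not base64-encoded as department:: and not folded onto a continuation line).

```shell
#!/bin/sh
# count_departments is a hypothetical helper name used for illustration only.
# Usage: count_departments file.txt > departments.csv
count_departments() {
  # 1. keep only the lines that begin with the department attribute
  # 2. strip the attribute name, trailing spaces and any carriage return
  # 3. sort so that equal values are adjacent, then count each run
  # 4. turn the "  2 234ABC" lines printed by uniq -c into "234ABC,2"
  grep '^department:' "$1" |
    sed -e 's/^department: *//' -e 's/[[:space:]]*$//' |
    sort |
    uniq -c |
    awk '{ count = $1; sub(/^ *[0-9]+ /, ""); print $0 "," count }'
}
```

Unlike the awk script above, this prints the departments in sorted order rather than in order of first appearance, and it keeps values that contain spaces intact. It has to sort every department line, so for a very large file the single awk pass is likely the cheaper option. Values that themselves contain a comma would still need CSV quoting, which this sketch does not attempt.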