Batch to include/exclude text between nth one character and nth different character in all lines


I want to do this:

Use PowerShell to include/exclude text between nth one character and nth different character in all lines

but without PowerShell, that is, using Batch files (.bat, .cmd) from Command Prompt.

I have a text file with lots of lines like these:

BALL - A 5122-ABCD-STH-PC2016/A 5122 : It's a duplicate.
CIRCLE - B 612-DEFGH-STH-LAPTOP2005/B 612 : It's a duplicate.

What I want to do with a batch file is extract the text between the 3rd space and the 3rd hyphen, excluding the delimiters. The 3rd hyphen comes after the 3rd space. This should be done for all lines, like this:

5122-ABCD
612-DEFGH

The sed, awk and cut equivalents are:

cat file.txt | sed -E 's/^([^ ]*[ ]){3}//' | sed -E 's/(^([^-]*[-]){1}[^-]*).*/\1/'
cat file.txt | awk -F' ' '{print $4}' | awk -F'-' '{print $1 "-" $2}'
cat file.txt | cut -d ' ' -f 4- | cut -d '-' -f -2

But ideally I would pass in the number for the nth occurrence of the space and the hyphen instead of hard-coding it, like the answer by @SantiagoSquarzon for PowerShell:

$file = $Args[0]
$text = Get-Content $file
$text | Select-String '(?m)(?<=^(?:\S+\s){3})(?:[^-]+-){1}[^-]+' -AllMatches |
ForEach-Object { $_.Matches.Value } |
Sort-Object -Unique

Solution

  • The following batch file can be used for this task:

    @echo off
    setlocal EnableExtensions DisableDelayedExpansion
    if "%~1" == "" echo INFO: "%~nx0" must be called with a file name as argument.& goto EndBatch
    (for /F "usebackq tokens=4" %%G in ("%~1") do for /F "tokens=1,2 delims=-" %%H in ("%%G") do echo %%H-%%I)>"%TEMP%\%~n0_1.tmp"
    for %%G in ("%TEMP%\%~n0_1.tmp") do if %%~zG == 0 del "%TEMP%\%~n0_1.tmp" & goto EndBatch
    set "SortedFile=%TEMP%\%~n0_2.tmp"
    set "ResultsFile=%~2"
    if not defined ResultsFile set "ResultsFile=Results.txt"
    %SystemRoot%\System32\sort.exe "%TEMP%\%~n0_1.tmp" /O "%SortedFile%"
    setlocal EnableDelayedExpansion
    set "LastLine="
    (for /F "usebackq delims=" %%G in ("!SortedFile!") do (
        if not "!LastLine!" == "%%G" echo %%G
        set "LastLine=%%G"
    ))>"!ResultsFile!"
    endlocal
    del "%TEMP%\%~n0_?.tmp"
    :EndBatch
    endlocal
    

    The first two command lines completely define the required execution environment, which is:

    1. command echo mode turned off and
    2. command extensions enabled and
    3. delayed variable expansion disabled.

    The third command line verifies that the batch file was called with an argument string, which is interpreted as the name of the file to process. If the batch file is run without an argument string, an informational message is output for the user and the batch file exits after executing the command endlocal, which restores the initial execution environment.

    The fourth line does the main job. First, cmd.exe creates a temporary file in the directory for temporary files, named after the batch file with _1 appended and the file extension .tmp. This file is kept open while the two FOR loops process the text file line by line and write the data of interest to the temporary file.

    The outer FOR command line opens the text file whose name is passed to the batch file as the first argument string and processes it line by line. Empty lines are ignored by FOR. Each non-empty line is split into substrings using the normal space and the horizontal tab as string delimiters, which is the default for FOR. The option tokens=4 instructs FOR to assign the fourth space/tab-separated string to the specified loop variable G. It is also important to know that no line in the text file should begin with a semicolon after 0 or more leading spaces/tabs, as FOR would ignore such a line too because eol=; is the default end-of-line definition.

    The inner FOR processes the string assigned to the loop variable G by splitting it into substrings using the hyphen character as the delimiter, as specified with the option delims=-. The option tokens=1,2 instructs FOR to assign the first hyphen-separated string to the specified loop variable H and the second one to the next loop variable according to the ASCII table, which is I here.

    The two strings assigned to the loop variables H and I are output with a - between them to standard output, which cmd.exe redirects to the first temporary file. That file is written through a buffer to reduce the number of file system accesses.
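
    The question also asks for the occurrence counts to be configurable rather than hard-coded. The following is a minimal sketch of one way to do that, not part of the batch file above. The variable names SpaceCount, FieldCount, Token, Rest and Result are hypothetical. The token number for the outer FOR is computed from SpaceCount, and a FOR /L loop moves one hyphen-separated string per iteration from Rest to Result. FOR /F treats consecutive delimiters as one, and because delayed expansion is enabled for the whole loop here, lines containing ! are not processed correctly.

    ```batch
    @echo off
    setlocal EnableExtensions EnableDelayedExpansion
    rem Hypothetical parameters, not part of the batch file above:
    rem SpaceCount = number of space/tab separated strings to skip,
    rem FieldCount = number of hyphen separated strings to keep.
    set "SpaceCount=3"
    set "FieldCount=2"
    set /A Token=SpaceCount+1
    for /F "usebackq tokens=%Token%" %%G in ("%~1") do (
        set "Rest=%%G"
        set "Result="
        rem Move one hyphen separated string from Rest to Result per iteration.
        for /L %%N in (1,1,%FieldCount%) do for /F "tokens=1* delims=-" %%H in ("!Rest!") do (
            if defined Result (set "Result=!Result!-%%H") else set "Result=%%H"
            set "Rest=%%I"
        )
        if defined Result echo !Result!
    )
    endlocal
    ```

    With SpaceCount=3 and FieldCount=2, the two sample lines from the question should produce 5122-ABCD and 612-DEFGH, the same as the hard-coded tokens=4 and tokens=1,2 options.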

    The fifth line checks the file size of the first temporary file, which is always created. If the file size is 0, the passed argument string did not reference an existing file, or cmd.exe failed to open the file for reading, or no data of interest could be found in the file at all. In this case the temporary file is deleted and batch file execution continues with endlocal to restore the initial execution environment.

    Otherwise the lines in the first temporary file are sorted alphabetically, not by the number to the left of the hyphen, and the sorted lines are written to a second temporary file, which is also created in the directory for temporary files.
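
    The difference between alphabetical and numeric order can be seen with the two sample values from the question. This is only an illustrative sketch, and the file name SortDemo.tmp is a hypothetical name chosen for the demonstration:

    ```batch
    @echo off
    rem Write the two sample values to a hypothetical temporary file.
    (
        echo 612-DEFGH
        echo 5122-ABCD
    )>"%TEMP%\SortDemo.tmp"
    rem sort.exe compares the lines as text, character by character.
    %SystemRoot%\System32\sort.exe "%TEMP%\SortDemo.tmp"
    del "%TEMP%\SortDemo.tmp"
    ```

    The line 5122-ABCD is output before 612-DEFGH because the character 5 sorts before the character 6, although the number 612 is less than 5122.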

    The file name of the results file, with or without a path, can be passed to the batch file as the second argument string. Results.txt is used if the batch file is called without a file name for the results file.

    Delayed variable expansion is now enabled, as required by the next FOR loop, which removes duplicate lines. Please note that no line with data of interest should contain !, as otherwise the code as written would interpret the exclamation mark as the beginning/end of a delayed expanded variable reference, resulting in wrong processing of the line.

    The second temporary file with the sorted lines is now processed line by line with another FOR loop. Each entire line is assigned to the loop variable G, as long as it does not begin with a semicolon, because the option delims= defines an empty list of string delimiters, which turns off the line splitting behavior.

    The current line is output only if the line assigned to the loop variable G is not equal, in a case-sensitive comparison, to the last line assigned to the environment variable LastLine. Then the current line is assigned to the environment variable LastLine, which leaves the value of LastLine unchanged in case of a duplicate line.

    The output of the last FOR loop is written into the results file. If the batch file is called without an explicitly specified results file name, this is the file Results.txt created in the current working directory, which can be any directory, as defined by the process that started cmd.exe for processing the batch file. As before, cmd.exe opens the output file before the FOR loop starts and keeps it open for all the buffered write operations done during the execution of the FOR loop.

    Using >>Results.txt inside a FOR loop is not advisable. The reason is that cmd.exe must then, on each loop iteration:

    1. open the file, creating it if it does not already exist,
    2. seek to the end of the file,
    3. write the output of the command echo to the end of the file,
    4. close the file.

    That causes many more file system accesses during the loop iterations than the method used here, which opens the output file once before running the FOR loop, always creating it anew, keeps it open the whole time, writes the data output by echo to the file through a buffer, and finally closes the file just once, flushing the buffer to the file. The posted solution also prevents other processes, such as anti-virus applications, from opening the results file to scan it for potential threats, because cmd.exe keeps the file open during all loop iterations with shared access denied for other processes. With >>Results.txt inside the FOR loop, an anti-virus application can open the results file between a file close and the next file open to scan its contents for threats, which prevents cmd.exe from opening the file again to append the next output.
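
    The two redirection styles can be compared side by side with a small sketch like the following. The file names AppendDemo.tmp and BlockDemo.tmp and the loop count of 1000 are assumptions chosen for the demonstration, not part of the solution above:

    ```batch
    @echo off
    setlocal EnableExtensions DisableDelayedExpansion
    rem Hypothetical file names used only for this comparison.
    set "AppendFile=%TEMP%\AppendDemo.tmp"
    set "BlockFile=%TEMP%\BlockDemo.tmp"
    del "%AppendFile%" 2>nul
    rem The file is opened and closed again on each of the 1000 iterations.
    for /L %%I in (1,1,1000) do >>"%AppendFile%" echo Line %%I
    rem The file is opened once before the loop and closed once after it.
    (for /L %%I in (1,1,1000) do echo Line %%I)>"%BlockFile%"
    rem Both files should have the same contents, which FC can confirm.
    %SystemRoot%\System32\fc.exe "%AppendFile%" "%BlockFile%"
    del "%AppendFile%" "%BlockFile%"
    endlocal
    ```

    Both loops write the same lines. The only difference is how often the output file is opened and closed, which is what makes the second form preferable for large inputs.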

    The previous execution environment with disabled delayed expansion is restored with execution of command endlocal.

    Finally, the two temporary files created by the batch file in the directory for temporary files are deleted, and the batch file exits with one more execution of endlocal to restore the initial execution environment.

    To understand the commands used and how they work, open a command prompt window, execute the following commands there, and read the displayed help pages for each command entirely and carefully.

    • call /?
    • del /?
    • echo /?
    • endlocal /?
    • for /?
    • goto /?
    • if /?
    • set /?
    • setlocal /?
    • sort /?

    Read this answer for details about the commands SETLOCAL and ENDLOCAL.