
Windows PowerShell: Replace an expression in a text file only in lines that begin with a certain expression


I am working with Windows PowerShell and have very limited programming experience :( I have a text file (a LaTeX .bib file) that contains ~30,000 entries of the following form:

@incollection{Abdulla.2022,
 author = {Abdulla, Sara M.},
 title = {{Underground anti-woman and incel movements and their connections to sexual assault}},
 pages = {3601--3626},
 publisher = {{Springer International}},
 editor = {Geffner, Robert and White, Jacquelyn W. and Hamberger, L. Kevin and Rosenbaum, Alan and Vaughan-Eden, Viola and Vieth, Victor I.},
 booktitle = {{Handbook of Interpersonal Violence and Abuse Across the Lifespan}},
 year = {2022},
 address = {Berlin and Oxford and New York},
 doi = {10.1007/978-3-319-89999-2_198},
 file = {Abdulla 2022 - Underground anti-woman and incel movements:Attachments/Abdulla 2022 - Underground anti-woman and incel movements.pdf:application/pdf}
}

In the line starting with address = {, I need to replace " and " with ", ", so that the line reads address = {Berlin, Oxford, New York}. However, I do not want to replace " and " in the title or editor lines.

I know I'd simply use something like

(Get-Content lit.bib -Raw).Replace(' and ', ', ') | Out-File -encoding UTF8 lit.bib

to replace it in all lines, but how can I manage to do it only in lines starting with address = {? (Most likely an if-statement, but I don't know how to do it exactly.)

Thanks a lot!


Solution

    • Read the lines of the file one by one.

    • Match only the address lines of interest and perform the replacement only on them.

    (Get-Content -ReadCount 0 lit.bib) |
      ForEach-Object {
        if ($_ -match '^\s*address\b') { # line of interest
          $_ -replace ' and ', ', '
        }
        else { # other lines
          $_  # pass through
        }
      } | 
      Set-Content -Encoding utf8 -LiteralPath lit.bib
    

    Note:

    • -ReadCount 0 makes Get-Content read the file content at once, as a whole, into a single array (rather than streaming its content, i.e. sending each line to the pipeline one by one, as it is being read).

      • This noticeably improves performance,[1] but also requires enclosing the call in (...), the grouping operator, so as to force enumeration of the single array object that is output, so that ForEach-Object can act on each line.

      • In your case, enclosure in (...) would also be required without -ReadCount 0, given the intent to write back to the input file, which requires that the file be read in full, up front.

    • Regex ^\s*address\b is used with the -match operator to match only the lines of interest, i.e. lines that start with (^), zero or more (*) whitespace characters (\s), followed by the word address, ending at a word boundary (\b).

      • If you know that all lines of interest start exactly with " address = " (one leading space), you can use literal matching, $_.StartsWith(' address = '), which is both conceptually simpler and faster; a middle ground is to use a wildcard expression with the -like operator:
        $_ -like ' address =*'
    • Because it is more PowerShell-idiomatic, the regex-based -replace operator is used above in lieu of the .NET string type's .Replace() method; however, the latter is faster, due to only performing literal replacements.

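      • A fundamental difference between .NET methods and PowerShell's operators is case sensitivity: the operators, including -match and -replace, are case-insensitive by default, whereas .NET string methods such as .Replace() and .StartsWith() are case-sensitive; to make the operators case-sensitive, use their c-prefixed variants (-cmatch, -creplace).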
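
The matching and replacement alternatives mentioned in the notes above can be compared on a couple of in-memory lines. This is only a sketch: the two sample lines are shortened, illustrative versions of the entry in the question, and the variable names ($lines, $byRegex, and so on) are made up for this example.

```powershell
# Sample lines modeled on the question's bib entry (illustrative data only).
$lines = @(
  ' editor = {Geffner, Robert and White, Jacquelyn W.},'
  ' address = {Berlin and Oxford and New York},'
)

# Three ways to detect an address line; each is $true only for the second line.
$byRegex    = $lines | ForEach-Object { $_ -match '^\s*address\b' }
$byLiteral  = $lines | ForEach-Object { $_.StartsWith(' address = ') }
$byWildcard = $lines | ForEach-Object { $_ -like ' address =*' }

# Literal replacement with the .NET method, applied to address lines only.
$result = $lines | ForEach-Object {
  if ($_.StartsWith(' address = ')) { $_.Replace(' and ', ', ') } else { $_ }
}

# Case sensitivity: -match ignores case by default, -cmatch does not.
$insensitive = ' Address = {Berlin}' -match  '^\s*address\b'
$sensitive   = ' Address = {Berlin}' -cmatch '^\s*address\b'
```

Here $result keeps the editor line unchanged and turns the address line into " address = {Berlin, Oxford, New York},". Note that the StartsWith() variant would silently skip a line written as " Address = ", whereas the default -match and -like variants would still catch it.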

    [1] In direct comparison, Get-Content lit.bib is much slower than Get-Content -ReadCount 0 lit.bib; in the case at hand, that advantage is diminished by the need to enumerate the array afterwards, i.e. to again send its elements one by one through the pipeline. However, the streaming scenario (without -ReadCount 0) is slowed down by another aspect: each line being read is decorated with ETS (PowerShell-specific) instance properties, which is costly - see GitHub issue #7537 for a discussion.
    Thus, when speed of execution matters, (Get-Content -ReadCount 0 ...) | ... is still preferable to
    Get-Content ... | ...
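
If you want to check the difference on your own machine, a rough comparison along these lines should do. It is only a sketch under stated assumptions: the temporary file, its 5,000 identical lines, and the variable names are invented for the test, and the actual timings will vary with the PowerShell version, the hardware, and the file size, so no expected numbers are shown.

```powershell
# Build a throwaway test file (hypothetical data; the line count is arbitrary).
$tmp = [System.IO.Path]::GetTempFileName()
1..5000 |
  ForEach-Object { ' address = {Berlin and Oxford and New York},' } |
  Set-Content -LiteralPath $tmp

# Streaming: each line is sent through the pipeline as it is read.
$streaming = Measure-Command {
  Get-Content -LiteralPath $tmp | ForEach-Object { $_ -replace ' and ', ', ' }
}

# -ReadCount 0: the file is read into one array, which (...) then enumerates.
$readAll = Measure-Command {
  (Get-Content -ReadCount 0 -LiteralPath $tmp) | ForEach-Object { $_ -replace ' and ', ', ' }
}

'{0:N0} ms streaming, {1:N0} ms with -ReadCount 0' -f $streaming.TotalMilliseconds, $readAll.TotalMilliseconds

# Clean up the temporary file.
Remove-Item -LiteralPath $tmp
```

Measure-Command discards the output of the script block and returns only the elapsed time, so nothing is written back to a file here; run it a few times before drawing conclusions, since a single measurement can be noisy.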