bash perl ubuntu sed postfix-mta

Find and replace URLs in postfix files - Linux/Ubuntu


I want to monitor a specific folder. Every new file in this folder should be scanned for URLs, and any URL whose domain is not on a defined whitelist should be rewritten.

Example:

blabla http://www.black.com/green/yellow.html blabla
sdfsdfsdfsdf http://www.white.com/red.html

Whitelist:

http://www.white.com

Result:

blabla httx://www.black.com/green/yellow.html blabla
sdfsdfsdfsdf http://www.white.com/red.html

What I have tried so far is iwatch with this XML:

<?xml version="1.0" ?>
<!DOCTYPE config SYSTEM "/etc/iwatch/iwatch.dtd" >
<config>
  <guard email="root@localhost" name="IWatch"/>
  <watchlist>
    <title>URL_Filter</title>
    <contactpoint email="admin@test.com" name="Administrator"/>
    <path type="single" syslog="on" alert="off" events="create" exec="sed -i 's/http/httx/g' %f">/var/test</path>
  </watchlist>
</config>

So with iwatch I can watch the folder "/var/test" for new files, and with the sed command I can replace every "http" with "httx". But I have no idea how to add a whitelist so that some URLs are not replaced.

--- edit --- Additional information: I want to rewrite all incoming Postfix mails so that they contain no clickable links, except for links to domains on the whitelist. The goal is to protect against phishing mails.

Return-Path: <example@gmail.com>
X-Original-To: example@test.de
Delivered-To: example@test.de
Received: from mail-lf0-x236.google.com (mail-lf0-x236.google.com [IPv6:2a00:1450:4010:c07::236])
        by xxxxxxx.hosteurope.de (Postfix) with ESMTPS id D255223CB59
        for <example@test.de>; Mon, 11 Apr 2016 14:44:10 +0200 (CEST)
Received: by mail-lf0-x236.google.com with SMTP id c126so154788483lfb.2
        for <example@test.de>; Mon, 11 Apr 2016 05:39:20 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20120113;
        h=mime-version:date:message-id:subject:from:to;
        bh=WwH+NIkCWDEoIkwbeCI4pf0jP0ya/ctbQ81pUsA4G7s=;
        b=ZS3Uo/cpVGNw3k38Js2+/DxVda0y2136oy4D4hsR0G25x2UjhyVU/yUcPl6qEdxt8i
         CQXZHQbaf8pzCdDaSq4VL9RC/sIgZy3PQzj6Cyrp3WTi6SMmQ65NwNBWLVGnpPcuzNW1
         IGC5N3rjj96ndYUAxia/tTcBX7ajS3Tw9Mc8yIaO13hSXMUCrTDIFZNzHR1ib7tLDpmX
         6EVyFhquhIfJVOhcuPgWUUxHly/FmZ++ucoHR0Yozj+dc1GJ6/ZYzUAPdGICelDY7ieG
         nvA7KH6+v6/zoWlbfkO9BmGzAPs6M4LGHilOjpMf/09Z2oMiV/WRDxe0WrCebQptpm2c
         xHPg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20130820;
        h=x-gm-message-state:mime-version:date:message-id:subject:from:to;
        bh=WwH+NIkCWDEoIkwbeCI4pf0jP0ya/ctbQ81pUsA4G7s=;
        b=hAOSzKjertcsQIT/PHoZKsiKxLba8gaKOCmyNg7nmiPJjCWqobNvM5nf3sZP1Xhysi
         gGdvk9mmMugII8dsjc7mRhDkbCT1QKVz/0UBQ+CaP6sK7kGdWfdarphGgzUGA6Il5JZi
         lP4DpEQHUpG1wJ1r+dN2f+UT8tyfIwapXwo3g7FnkPLxmCq9CeqJeRlagL6vAacon8z7
         CjdTHB7fzEtYToSp+cDi3+yK4zS9p4rwF4H4Ds3bJqwM/PrcFJW0YYncDHdra5TwYf6U
         K6VRX19iUhQT4kTVFCtoNW9SU8Ri+Rc5VfvVTKRh4KwZ2uW5x8y07ucB0vZcAQdEnms4
         AWnQ==
X-Gm-Message-State: AD7BkJJEDmk9P+Kzcn1MT4lQxpU1aYU6x8uABSpohCbT7EeOFAXjT1y6n3sFcRj7tcfWc6eBAOL6bJ78jvVOlQ==
MIME-Version: 1.0
X-Received: by 10.112.63.196 with SMTP id i4mr8426739lbs.93.1460378359811;
 Mon, 11 Apr 2016 05:39:19 -0700 (PDT)
Received: by 10.114.66.51 with HTTP; Mon, 11 Apr 2016 05:39:19 -0700 (PDT)
Date: Mon, 11 Apr 2016 14:39:19 +0200
Message-ID: <CADF5gVU+C4BZCSFSiWeiBipBnDu5jTU+FVmLJbSQSbtMM9JZcQ@mail.gmail.com>
Subject: test
From: Example <example@gmail.com>
To: example@test.de
Content-Type: multipart/alternative; boundary=001a1133d4405fd878053034d55a
X-Scanned-By: MIMEDefang 2.71 on 5.38.258.144

--001a1133d4405fd878053034d55a
Content-Type: text/plain; charset=UTF-8

http://www.example.com
http://www.white.com

--001a1133d4405fd878053034d55a
Content-Type: text/html; charset=UTF-8

<div dir="ltr"><div><a href="http://www.example.com">http://www.example.com</a><br></div><a href="http://www.white.com">http://www.white.com</a><br></div>

--001a1133d4405fd878053034d55a--

Solution

  • I just realized that the bash script is unnecessary: the following one-liner does the job, although it is rather cryptic to read:

    Input data:

    $ cat data
    sdfsdfsdfsdf http://www.whitedomain.com/red.html
    bla http://www.black.com/green/yellow.html blabla
    sdfsdfsdfsdf http://www.white.com/red.html
    $ cat whitelist 
    http://www.white.com
    http://www.whitedomain.com
    $
    

    Final Output:

    $ sed -r '/'"$(sed -r 's/\\/\\\\/g;s/\//\\\//g;s/\^/\\^/g;s/\[/\\[/g;s/'\''/'\'"\\\\"\'\''/g;s/\]/\\]/g;s/\*/\\*/g;s/\$/\\$/g;s/\./\\./g' whitelist | paste -s -d '|')"'/! s/http/httx/g' data
    sdfsdfsdfsdf http://www.whitedomain.com/red.html
    bla httx://www.black.com/green/yellow.html blabla
    sdfsdfsdfsdf http://www.white.com/red.html
    $
    

    Explanation:

    The inner command substitution produces a regex, which the outer sed command uses to decide which lines to skip:

    $ sed -r 's/\\/\\\\/g;s/\//\\\//g;s/\^/\\^/g;s/\[/\\[/g;s/'\''/'\'"\\\\"\'\''/g;s/\]/\\]/g;s/\*/\\*/g;s/\$/\\$/g;s/\./\\./g' whitelist | paste -s -d '|'
    http:\/\/www\.white\.com|http:\/\/www\.whitedomain\.com
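
The chain of `s` commands escapes `\ / ^ [ ' ] * $ .` one character at a time. Because the outer sed runs with `-r`, the characters `+ ? ( ) { } |` are also special there, so a whitelist entry containing, say, a `?` query string would not be matched literally. A minimal sketch of a more compact escaping step is shown below; the function names `escape_ere` and `build_whitelist_regex` are made up for this example, and it escapes all of those characters with a single bracket expression.

```shell
# escape_ere: backslash-escape every ERE metacharacter, plus the "/" that the
# outer sed uses as its address delimiter, on each line of standard input.
# (Hypothetical helper name; a comma is used as the s-command delimiter.)
escape_ere() {
    sed 's,[][\\.^$*+?(){}|/],\\&,g'
}

# build_whitelist_regex FILE: escape each whitelist entry and join the
# entries with "|" so they form one alternation.
build_whitelist_regex() {
    escape_ere < "$1" | paste -s -d '|' -
}

# Demo with the two whitelist entries from the answer.
printf '%s\n' 'http://www.white.com' 'http://www.whitedomain.com' \
    | escape_ere | paste -s -d '|' -
```

For the two whitelist entries used here, the demo prints the same regex as the longer command.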
    

    Flow:

    1. The inner command escapes the regex metacharacters in each whitelist entry, and paste joins the escaped entries with | to form an alternation.
    2. The outer sed command uses that regex as an address: on every line that matches none of the whitelist entries (note the !), it replaces http with httx. The test is per line, so a line that contains a whitelisted URL is left completely untouched, including any other URLs on it.
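
The same per-line flow can be sketched without building a regex at all, which sidesteps the escaping step. The version below is an alternative, not the one-liner from the answer: it uses awk's `index()` for plain substring matching. The function name `defang_lines` and its argument order are assumptions made for this example.

```shell
# defang_lines WHITELIST [FILE...]
# Print the input, changing "http" to "httx" on every line that contains
# none of the whitelist entries as a literal substring.
# If WHITELIST cannot be read, no entry is loaded and every line is rewritten.
defang_lines() {
    whitelist=$1
    shift
    awk -v wl="$whitelist" '
        # Load the whitelist once, skipping empty lines.
        BEGIN { while ((getline line < wl) > 0) if (line != "") allow[line] = 1 }
        {
            keep = 0
            for (w in allow) if (index($0, w) > 0) { keep = 1; break }
            if (!keep) gsub(/http/, "httx")   # same substitution as s/http/httx/g
            print
        }
    ' "$@"
}

# Usage, with the files from the answer:
#   defang_lines whitelist data
```

Run on the `data` and `whitelist` files shown earlier, this produces the same three output lines as the sed one-liner. Like the one-liner, it decides per line, not per URL.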

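    Edit1: Since sed is line-oriented, input that has several URLs on one line must first be split into separate lines, for example by inserting a newline before every < (the \n in the replacement requires GNU sed):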

    $ cat data1 
    <div dir="ltr"><div><a href="http://www.white.com">http://www.white.com</a><br></div><a href="http://www.example.com">http://www.example.com</a><br></div>
    $ cat whitelist 
    http://www.white.com
    http://www.whitedomain.com
    $ sed 's/</\n</g' data1 | sed -r '/'"$(sed -r 's/\\/\\\\/g;s/\//\\\//g;s/\^/\\^/g;s/\[/\\[/g;s/'\''/'\'"\\\\"\'\''/g;s/\]/\\]/g;s/\*/\\*/g;s/\$/\\$/g;s/\./\\./g' whitelist | paste -s -d '|')"'/! s/http/httx/g'
    
    <div dir="ltr">
    <div>
    <a href="http://www.white.com">http://www.white.com
    </a>
    <br>
    </div>
    <a href="httx://www.example.com">httx://www.example.com
    </a>
    <br>
    </div>
    $
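
Splitting at every `<` changes the line structure of the message, and a line that mixes a whitelisted and a non-whitelisted URL is still left untouched. A possible alternative is to decide per URL instead of per line. The sketch below does that with awk's `match()`. It rests on several assumptions: the function name `defang_urls` is invented for this example, a URL is taken to be `http://` or `https://` followed by any run of characters other than space, tab, `"`, `<` and `>` (a simplification), and a URL counts as whitelisted only if it equals a whitelist entry or continues it with `/`.

```shell
# defang_urls WHITELIST [FILE...]
# Rewrite each non-whitelisted URL from http... to httx..., leaving line
# breaks and whitelisted URLs unchanged.
defang_urls() {
    whitelist=$1
    shift
    awk -v wl="$whitelist" '
        # Load the whitelist once, skipping empty lines.
        BEGIN { while ((getline line < wl) > 0) if (line != "") allow[line] = 1 }

        # A URL is allowed if it equals a whitelist entry or extends it with "/".
        function allowed(url,    w) {
            for (w in allow)
                if (url == w || index(url, w "/") == 1) return 1
            return 0
        }

        {
            out = ""
            rest = $0
            # Walk the line URL by URL (simplified URL pattern, see lead-in).
            while (match(rest, /https?:\/\/[^ \t"<>]+/)) {
                url = substr(rest, RSTART, RLENGTH)
                out = out substr(rest, 1, RSTART - 1)
                if (!allowed(url)) sub(/^http/, "httx", url)
                out = out url
                rest = substr(rest, RSTART + RLENGTH)
            }
            print out rest
        }
    ' "$@"
}

# Usage, with the files from the answer:
#   defang_urls whitelist data1
```

With this prefix rule, `http://www.white.com.evil.com/` is rewritten even though it starts with a whitelisted string, whereas plain substring matching would let it through. To hook it into iwatch, a small wrapper script that writes the result to a temporary file and moves it over `%f` could be named in the `exec` attribute; that wiring is not shown here and would need testing.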