I want to monitor a specific folder. Every new file in this folder should be scanned for URLs. A URL should be rewritten if its domain is not on a defined whitelist.
Example:
blabla http://www.black.com/green/yellow.html blabla
sdfsdfsdfsdf http://www.white.com/red.html
Whitelist:
http://www.white.com
Result:
blabla httx://www.black.com/green/yellow.html blabla
sdfsdfsdfsdf http://www.white.com/red.html
What I have tried so far is iwatch with this XML configuration:
<?xml version="1.0" ?>
<!DOCTYPE config SYSTEM "/etc/iwatch/iwatch.dtd" >
<config>
<guard email="root@localhost" name="IWatch"/>
<watchlist>
<title>URL_Filter</title>
<contactpoint email="admin@test.com" name="Administrator"/>
<path type="single" syslog="on" alert="off" events="create" exec="sed -i 's/http/httx/g' %f">/var/test</path>
</watchlist>
</config>
So with iwatch I can watch the folder "/var/test" for new files, and with the sed command I can replace every "http" with "httx". But I have no idea how to add a whitelist so that some URLs are not replaced.
--- edit --- Additional information: I want to rewrite all incoming Postfix mails so that they contain no clickable links, except for links to domains on the whitelist. The goal is to protect against phishing mails.
Return-Path: <example@gmail.com>
X-Original-To: example@test.de
Delivered-To: example@test.de
Received: from mail-lf0-x236.google.com (mail-lf0-x236.google.com [IPv6:2a00:1450:4010:c07::236])
by xxxxxxx.hosteurope.de (Postfix) with ESMTPS id D255223CB59
for <example@test.de>; Mon, 11 Apr 2016 14:44:10 +0200 (CEST)
Received: by mail-lf0-x236.google.com with SMTP id c126so154788483lfb.2
for <example@test.de>; Mon, 11 Apr 2016 05:39:20 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
d=gmail.com; s=20120113;
h=mime-version:date:message-id:subject:from:to;
bh=WwH+NIkCWDEoIkwbeCI4pf0jP0ya/ctbQ81pUsA4G7s=;
b=ZS3Uo/cpVGNw3k38Js2+/DxVda0y2136oy4D4hsR0G25x2UjhyVU/yUcPl6qEdxt8i
CQXZHQbaf8pzCdDaSq4VL9RC/sIgZy3PQzj6Cyrp3WTi6SMmQ65NwNBWLVGnpPcuzNW1
IGC5N3rjj96ndYUAxia/tTcBX7ajS3Tw9Mc8yIaO13hSXMUCrTDIFZNzHR1ib7tLDpmX
6EVyFhquhIfJVOhcuPgWUUxHly/FmZ++ucoHR0Yozj+dc1GJ6/ZYzUAPdGICelDY7ieG
nvA7KH6+v6/zoWlbfkO9BmGzAPs6M4LGHilOjpMf/09Z2oMiV/WRDxe0WrCebQptpm2c
xHPg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
d=1e100.net; s=20130820;
h=x-gm-message-state:mime-version:date:message-id:subject:from:to;
bh=WwH+NIkCWDEoIkwbeCI4pf0jP0ya/ctbQ81pUsA4G7s=;
b=hAOSzKjertcsQIT/PHoZKsiKxLba8gaKOCmyNg7nmiPJjCWqobNvM5nf3sZP1Xhysi
gGdvk9mmMugII8dsjc7mRhDkbCT1QKVz/0UBQ+CaP6sK7kGdWfdarphGgzUGA6Il5JZi
lP4DpEQHUpG1wJ1r+dN2f+UT8tyfIwapXwo3g7FnkPLxmCq9CeqJeRlagL6vAacon8z7
CjdTHB7fzEtYToSp+cDi3+yK4zS9p4rwF4H4Ds3bJqwM/PrcFJW0YYncDHdra5TwYf6U
K6VRX19iUhQT4kTVFCtoNW9SU8Ri+Rc5VfvVTKRh4KwZ2uW5x8y07ucB0vZcAQdEnms4
AWnQ==
X-Gm-Message-State: AD7BkJJEDmk9P+Kzcn1MT4lQxpU1aYU6x8uABSpohCbT7EeOFAXjT1y6n3sFcRj7tcfWc6eBAOL6bJ78jvVOlQ==
MIME-Version: 1.0
X-Received: by 10.112.63.196 with SMTP id i4mr8426739lbs.93.1460378359811;
Mon, 11 Apr 2016 05:39:19 -0700 (PDT)
Received: by 10.114.66.51 with HTTP; Mon, 11 Apr 2016 05:39:19 -0700 (PDT)
Date: Mon, 11 Apr 2016 14:39:19 +0200
Message-ID: <CADF5gVU+C4BZCSFSiWeiBipBnDu5jTU+FVmLJbSQSbtMM9JZcQ@mail.gmail.com>
Subject: test
From: Example <example@gmail.com>
To: example@test.de
Content-Type: multipart/alternative; boundary=001a1133d4405fd878053034d55a
X-Scanned-By: MIMEDefang 2.71 on 5.38.258.144
--001a1133d4405fd878053034d55a
Content-Type: text/plain; charset=UTF-8
http://www.example.com
http://www.white.com
--001a1133d4405fd878053034d55a
Content-Type: text/html; charset=UTF-8
<div dir="ltr"><div><a href="http://www.example.com">http://www.example.com</a><br></div><a href="http://www.white.com">http://www.white.com</a><br></div>
--001a1133d4405fd878053034d55a--
Just realized the bash script is unnecessary. The following one-liner does the job, although it is really cryptic to read:
Input data:
$ cat data
sdfsdfsdfsdf http://www.whitedomain.com/red.html
bla http://www.black.com/green/yellow.html blabla
sdfsdfsdfsdf http://www.white.com/red.html
$ cat whitelist
http://www.white.com
http://www.whitedomain.com
$
Final Output:
$ sed -r '/'"$(sed -r 's/\\/\\\\/g;s/\//\\\//g;s/\^/\\^/g;s/\[/\\[/g;s/'\''/'\'"\\\\"\'\''/g;s/\]/\\]/g;s/\*/\\*/g;s/\$/\\$/g;s/\./\\./g' whitelist | paste -s -d '|')"'/! s/http/httx/g' data
sdfsdfsdfsdf http://www.whitedomain.com/red.html
bla httx://www.black.com/green/yellow.html blabla
sdfsdfsdfsdf http://www.white.com/red.html
$
Explanation:
The inner command substitution prints a regex, which the outer sed command uses to decide which lines to leave untouched:
$ sed -r 's/\\/\\\\/g;s/\//\\\//g;s/\^/\\^/g;s/\[/\\[/g;s/'\''/'\'"\\\\"\'\''/g;s/\]/\\]/g;s/\*/\\*/g;s/\$/\\$/g;s/\./\\./g' whitelist | paste -s -d '|'
http:\/\/www\.white\.com|http:\/\/www\.whitedomain\.com
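Flow:
1. The inner sed escapes the regex special characters in each whitelist entry, and its output is piped to paste, which joins the entries with | to form an alternation.
2. The outer sed filters out the lines that contain none of the whitelist domains and, on those lines only, substitutes http with httx.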
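The flow above can also be written out as a small shell function, which may be easier to read than the one-liner. This is only a sketch under a few assumptions: the name neutralize_urls is made up for illustration, the escaping uses one bracket expression instead of the chain of substitutions, blank whitelist lines are dropped, and sed is expected to accept -E (equivalent to -r in GNU sed).

```shell
# Hypothetical helper (name invented for this sketch):
#   neutralize_urls WHITELIST_FILE DATA_FILE
# Prints DATA_FILE with http -> httx on every line that matches no whitelist entry.
neutralize_urls() {
    whitelist=$1
    data=$2

    # Step 1: drop blank lines, put a backslash in front of every character
    # that is special in an extended regex (and in front of "/", the address
    # delimiter used below), then join all entries with "|".
    pattern=$(sed -e '/^$/d' -e 's#[][\\/.^$*+?(){}|]#\\&#g' "$whitelist" |
        paste -s -d '|' -)

    # With an empty whitelist there is nothing to protect: rewrite everything.
    if [ -z "$pattern" ]; then
        sed 's/http/httx/g' "$data"
        return
    fi

    # Step 2: on lines that do NOT match the alternation, rewrite http to httx.
    sed -E "/$pattern/"'!s/http/httx/g' "$data"
}
```

Called as `neutralize_urls whitelist data` with the two files shown earlier, this should print the same three lines as the one-liner. It keeps the same limitation: a line is left untouched as soon as it contains any whitelisted URL.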
Edit1: Since sed is line-oriented, you will have to split the data into separate lines of text first, like this:
$ cat data1
<div dir="ltr"><div><a href="http://www.white.com">http://www.white.com</a><br></div><a href="http://www.example.com">http://www.example.com</a><br></div>
$ cat whitelist
http://www.white.com
http://www.whitedomain.com
$ sed 's/</\n</g' data1 | sed -r '/'"$(sed -r 's/\\/\\\\/g;s/\//\\\//g;s/\^/\\^/g;s/\[/\\[/g;s/'\''/'\'"\\\\"\'\''/g;s/\]/\\]/g;s/\*/\\*/g;s/\$/\\$/g;s/\./\\./g' whitelist | paste -s -d '|')"'/! s/http/httx/g'
<div dir="ltr">
<div>
<a href="http://www.white.com">http://www.white.com
</a>
<br>
</div>
<a href="httx://www.example.com">httx://www.example.com
</a>
<br>
</div>
$
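Splitting at every < changes the layout of the mail, and a line that mixes a whitelisted and a non-whitelisted URL is still skipped as a whole. A possible alternative, sketched here and not part of the approach above, works per URL: first neutralize every http:// and https:// scheme, then restore only the URLs that start with a whitelist entry. The function name neutralize_per_url and the "boundary" rule (the host must be followed by /, a quote, <, >, whitespace, or the end of the line) are assumptions made for this sketch, and every whitelist entry is assumed to start with http:// or https://.

```shell
# Hypothetical helper (name invented for this sketch):
#   neutralize_per_url WHITELIST_FILE DATA_FILE
neutralize_per_url() {
    whitelist=$1
    data=$2

    # What may follow a whitelisted prefix for it to count as a match.
    # This stops http://www.white.com.evil.com from being restored.
    boundary="\\/|[\"'<>[:space:]]|\$"

    # Build a sed script: one command that neutralizes every scheme, then one
    # "restore" command per whitelist entry.
    script=$(
        printf '%s\n' 's#http(s?)://#httx\1://#g'
        sed 's#[][\\/.^$*+?(){}|]#\\&#g' "$whitelist" |
        while IFS= read -r entry || [ -n "$entry" ]; do
            [ -n "$entry" ] || continue
            # "${entry#http}" is the escaped entry without its leading "http",
            # e.g. ":\/\/www\.white\.com".
            printf 's/httx(%s)(%s)/http\\1\\2/g\n' "${entry#http}" "$boundary"
        done
    )

    sed -E -e "$script" "$data"
}
```

Unlike the line-based filter, this only touches the scheme of each URL and leaves the rest of the file as it is, so the sed 's/</\n</g' step is not needed. It is deliberately conservative: a whitelisted host followed by a port or a query string without a path is not restored.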