Search code examples
mysqlsedbbcodephpbb

Trying to get `sed `to fix a phpBB board - but I cannot get my regexp to work


I have an ancient phpBB3 board which has gone through several updates over its 15+ years of existence. Sometimes, in the distant past, such updates would partially fail, leaving all sorts of 'garbage' in the BBCode. I'm now trying to do a 'simple' regexp to match a particular issue and fix it.

What happened was the following... during a database update, long long ago, BBCode tags were, for some reason, 'tagged' with a pseudo-attribute — allegedly for the database updating script to figure out each token that required updating, I guess. This attribute was always a 8-char-long alphanumeric string, 'appended' to the actual BBCode with a semicolon, like this:

[I]something in italic[/I]
...
[I:i9o7y3ew]something in italic[/I:i9o7y3ew]

Naturally, phpBB doesn't recognise this as valid BBCode, and just prints the whole text out.

The replacement regexp is actually very basic:

s/\[(\/?)(.+):[[:alnum:]]{0,8}\]/[\1\2]/gim

You can see a working example on regex110.com (where capture groups use $1 instead of \1). The example given there includes a few examples from the actual database itself. [i] is actually the simplest case; there are plenty of others which are perfectly valid but a bit more complex, thus requiring a (.+) matcher, such as [quote=\"Gwyneth Llewelyn\":2m80kuso].

As you can see from the example on regex110.com, this works :-)

Why doesn't it work under (GNU) sed? I'm using version 4.8 under Linux:

$ sed -i.bak -E "s/\[(\/?)(.+):[[:alnum:]]+\]/[\1\2]/gim" table.sql

Just for the sake of the argument, I tried using [A-Za-z0-9]+ instead of [[:alnum:]]+; I've even tried (.+) (to capture the group and then just discard it)

None produced an error; none did any replacements whatsoever.

I understand that there are many different regexp engines out there (PCRE, PCRE2, Boost, etc. and so forth) so perhaps sed is using a syntax that is inconsistent with what I'm expecting...?

Rationale: well, I could have done this differently; after all, MySQL has built-in regexp replacements, too. However, since this particular table is so big, it takes eternities. I thought I'd be far better off by dumping everything to a text file, doing the replacements there, and importing the table again. There is a catch, though: the file is 95 MBytes in size, which means that most tools I've got (e.g. editors with built-in regexp search & replace) will fail with such a huge exception. One notable exception is good old emacs, which has no trouble with such large files. Alas, emacs cannot match anything, so I thought I'd give sed a try (it should be faster, too). sed takes also close to a minute or so to process the whole file — about the same as emacs, in fact — and has the same result, i.e. no replacements are being made. It seems to me that, although the underlying technology is so different (pure C vs. Emacs-LISP), both these tools somehow rely on similar algorithms... both of which fail.

My understanding is that some libraries use different conventions to signal literal vs. metacharacters and quantifiers. Here is an example from an instruction manual for vim: http://www.vimregex.com/#compare

Indeed, contemporary versions of sed seem to be able to handle two different kinds of conventions (thus the -E flag). The issue I have with my regexp is that I find it very difficult to figure out which convention to apply. Let's start with what I'm used to from PHP, Go, JavaScript and a plethora of other regexp implementations, which use the convention that metacharacters & quantifiers do not get backslashed (while literals do).

Thus, \[(\/?)(.+):[[:alnum:]]+\] presumes that there are a few literal matches for [, ], /, and only these few cases require backslashes.

Using the reverse convention — i.e. literals do not get backslashed, while metacharacters and some quantifies do — this would be written as:

[\(/\?\)\(\.\+\):\[\[:alnum:\]\]\+]

Or so I would think.

Sadly, sed also rejects this with an error — and so do vim and emacs, BTW (they seem to use a similar regexp library, or perhaps even the same one).

So what is the correct way to write my regexp so that sed accepts it (and does what I intend it to do)?


UPDATE

I have since learned that, in the database, phpBB, unlike I assumed, does not store BBCode (!) but rather a variant of HTML (some tags are the same, some are invented on the spot). What happens is that BBCode gets translated into that pseudo-HTML, and back again when displaying; that, at least, explains why phpBB extensions such as Markdown for phpBB — but also BBCode add-ons! — can so easily replace, partially or even totally, whatever is in the database, which will continue to work (to a degree!) even if those extensions get deactivated: the parsed BBCode/Markdown is just converted to this 'special' styling in the database, and, as such, will always be rendered correctly by phpBB3, no matter what.

On other words, fixing those 'broken' phpBB tags requires a bit more processing, and not merely search & replace with a single regexp.

Nevertheless, my question is still pertinent to me. I'm not really an expert with regexps but I know the basics — enough to make my life so much easier! — and it's always good to understand the different 'dialects' used by different platforms.

Notably, instead of using egrep and/or grep -E, I'm fond of using ugrep instead. It uses PCRE2 expressions (with the Boost library), and maybe that's the issue I'm having with the sed engine(s) — the different engines speak different regular expressions dialect, and converting from one grep variant to a different one might not be useful at all (because some options will not 'translate' well enough)...


Solution

  • Using sed

    • (\[[^:]*) - Retain everything up to but not including the next semi colon after a opening bracket within the parenthesis which can later be returned with back reference \1
    • [^]]* - Exclude everything else up to but not including the next closing bracket
    $ sed -E 's/(\[[^:]*)[^]]*/\1/g' table.sql
    [I]something in italic[/I]
    ...
    [I]something in italic[/I]