c++design-patterns filter decorator proxy-pattern

Design for customizable string filter

Suppose I've tons of filenames in my_dir/my_subdir, formatted in a some way:

data11_7TeV.00179691.physics_Egamma.merge.NTUP_PHOTON.f360_m796_p541_tid319627_00
data11_7TeV.00180400.physics_Egamma.merge.NTUP_PHOTON.f369_m812_p541_tid334757_00
data11_7TeV.00178109.physics_Egamma.merge.D2AOD_DIPHO.f351_m765_p539_p540_tid312017_00

For example data11_7TeV is the data_type, 00179691 the run number, NTUP_PHOTON the data format.

I want to write an interface to do something like this:

dataset = DataManager("my_dir/my_subdir").filter_type("data11_7TeV").filter_run("> 00179691").filter_tag("m = 796");                     
// don't to the filtering, be lazy
cout << dataset.count();                          // count is an action, do the filtering
vector<string> dataset_list = dataset.get_list(); // don't repeat the filtering
dataset.save_filter("file.txt", "ALIAS");         // save the filter (not the filenames), for example save the regex
dataset2 = DataManagerAlias("file.txt", "ALIAS"); // get the saved filter
cout << dataset2.filter_tag("p = 123").count();

I want lazy behaviour, for example no real filtering has to be done before any action like count or get_list. I don't want to redo the filtering if it is already done. I'm just learning something about design pattern, and I think I can use:

an abstract base class AbstractFilter that implement filter* methods
factory to decide from the called method which decorator use
every time I call a filter* method I return a decorated class, for example:

AbstractFilter::filter_run(string arg) {
    decorator = factory.get_decorator_run(arg);  // if arg is "> 00179691" returns FilterRunGreater(00179691)
    return decorator(this);
}

proxy that build a regex to filter the filenames, but don't do the filtering

I'm also learning jQuery and I'm using a similar chaining mechanism.

Can someone give me some hints? Is there some place where a design like this is explained? The design must be very flexible, in particular to handle new format in the filenames.

Solution

I believe you're over-complicating the design-pattern aspect and glossing over the underlying matching/indexing issues. Getting the full directory listing from disk can be expected to be orders of magnitude more expensive than the in-RAM filtering of filenames it returns, and the former needs to have completed before you can do a count() or get_list() on any dataset (though you could come up with some lazier iterator operations over the dataset).

As presented, the real functional challenge could be in indexing the filenames so you can repeatedly find the matches quickly. But, even that's unlikely as you presumably proceed from getting the dataset of filenames to actually opening those files, which is again orders of magnitude slower. So, optimisation of the indexing may not make any appreciable impact to your overall program's performance.

But, lets say you read all the matching directory entries into an array A.

Now, for filtering, it seems your requirements can generally be met using std::multimap find(), lower_bound() and upper_bound(). The most general way to approach it is to have separate multimaps for data type, run number, data format, p value, m value, tid etc. that map to a list of indices in A. You can then use existing STL algorithms to find the indices that are common to the results of your individual filters.

There are a lot of optimisations possible if you happen to have unstated insights / restrictions re your data and filtering needs (which is very likely). For example:

if you know a particular filter will always be used, and immediately cuts the potential matches down to a manageable number (e.g. < ~100), then you could use it first and resort to brute force searches for subsequent filtering.

Another possibility is to extract properties of individual filenames into a structure: std::string data_type; std::vector<int> p; etc., then write an expression evaluator supporting predicates like "p includes 924 and data_type == 'XYZ'", though by itself that lends itself to brute-force comparisons rather than faster index-based matching.

I know you said you don't want to use external libraries, but an in-memory database and SQL-like query ability may save you a lot of grief if your needs really are at the more elaborate end of the spectrum.