
What is the TREC 2006 Spam Track Public Corpora Format?


link to original dataset
I have downloaded this dataset, The TREC 2006 Public Corpus -- 75MB (trec06p.tgz). Here is the folder structure:

.
└── trec06p/
    ├── data
    ├── data-delay
    ├── full
    ├── full-delay
    ├── ham25
    ├── ham25-delay
    ├── ham50
    ├── ham50-delay
    ├── spam25
    ├── spam25-delay
    ├── spam50
    └── spam50-delay

Some questions:

  1. What is the delay for? (e.g. data-delay, full-delay)
  2. What does full mean in this case? (is it just the labels?)
  3. What is the difference between HAM and ham in the full-delay subfolder?
  4. Why is the data-delay folder empty?
  5. Is there any special way to parse the contents in the data folder?

Solution

  • Disclaimer

    Before reading the answer, please note that since I did not participate in the TREC06 task and am not the data creator/provider, I can only make educated guesses about the questions you have on the dataset.


    Educated Guesses

    First, reading the task overview paper helps: https://trec.nist.gov/pubs/trec16/papers/SPAM.OVERVIEW16.pdf =)

    Next, the right download link for future readers is https://plg.uwaterloo.ca/~gvcormac/treccorpus06/

    And now, some summary:

    • The TREC 2006 Spam Track dataset is "a set of chronologically ordered email messages" presented to a spam filter for classification
    • Four different forms of user feedback are modeled:
      • immediate feedback
        • the gold standard for each message is communicated to the filter immediately following classification;
      • delayed feedback
        • the gold standard is communicated to the filter sometime later (or potentially never), so as to model a user reading email from time to time and perhaps not diligently reporting the filter’s errors;
      • partial feedback
        • the gold standard for only a subset of email recipients is transmitted to the filter, so as to model the case of some users never reporting filter errors;
      • active on-line learning
        • the filter is allowed to request immediate feedback for a certain quota of messages which is considerably smaller than the total number.
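
    The feedback forms above all share the same basic loop: the filter classifies a message, and the gold label is (eventually, partially, or on request) fed back to it. As a toy sketch of the immediate-feedback variant, assuming hypothetical `filter_predict` / `filter_train` stand-ins (these are not track APIs):

    ```python
    # Toy sketch of the immediate-feedback loop: classify each message,
    # then hand the gold label straight back to the filter.

    def filter_predict(message):
        # Hypothetical stand-in for a real spam filter's classify step.
        return 'spam' if 'buy now' in message else 'ham'

    def filter_train(message, gold_label):
        pass  # a real filter would update its model here

    stream = [('buy now!!!', 'spam'), ('lunch at noon?', 'ham')]
    correct = 0
    for message, gold in stream:
        guess = filter_predict(message)   # classification first...
        correct += (guess == gold)
        filter_train(message, gold)       # ...then immediate gold feedback
    assert correct == 2
    ```

    The delayed and partial variants differ only in when (or whether) the `filter_train` call happens.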

    Q: How are the above forms of feedback represented by the files in the dataset?

    A: All the actual textual data are found in the trec06p/data/**/* files

    /trec06p
      /data
        /000
           /000
           ...
           /299
        ...
        /126
           /000
           /021
    

    As for the rest of the directories, they are just indices pointing to the subsets used to emulate the different forms of evaluation.

    Q: What does full mean in this case? (is it just the labels?)

    • trec06p/full/index: The index file that points to all the data points in trec06p/data/**/*
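
    Each index file is a plain-text list of "label path" pairs, which is why the parsing code later simply splits each line in two. The line below is an illustrative example of that format, not copied from the corpus:

    ```python
    # Parse one "label path" index line into a (label, data_id) pair.
    line = 'spam ../data/000/000\n'
    label, path = line.strip().split()
    data_id = tuple(path.split('/')[-2:])  # (subdirectory, filename)
    assert (label, data_id) == ('spam', ('000', '000'))
    ```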

    Q: What is the delay for? (e.g. data-delay, full-delay)

    • trec06p/full-delay/index: The index that points to the delayed-feedback evaluation
      • trec06p/ham*-delay/index: The indices that point to only the non-spam labelled emails in the delayed-feedback evaluation
      • trec06p/spam*-delay/index: The indices that point to only the spam labelled emails in the delayed-feedback evaluation

    So essentially, the unique entries of trec06p/ham*-delay/index + trec06p/spam*-delay/index = trec06p/full-delay/index
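
    That union claim can be checked with a few lines of code. This is a minimal sketch: the index lines below are made-up examples of the "label path" format, not actual corpus entries.

    ```python
    def parse_index_lines(lines):
        """Return {data_id: label} from 'label path' index lines."""
        labels = {}
        for line in lines:
            label, path = line.strip().split()
            data_id = tuple(path.split('/')[-2:])
            labels[data_id] = label.lower()
        return labels

    # Illustrative stand-ins for the ham*-delay, spam*-delay and full-delay indices.
    ham_lines = ['ham ../data/000/002', 'HAM ../data/000/005']
    spam_lines = ['spam ../data/000/000', 'SPAM ../data/000/001']
    full_lines = ham_lines + spam_lines

    ham = parse_index_lines(ham_lines)
    spam = parse_index_lines(spam_lines)
    full = parse_index_lines(full_lines)

    # The union of the ham-only and spam-only indices covers the full index.
    assert set(ham) | set(spam) == set(full)
    ```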

    Q: Why is the data-delay folder empty?

    For this, I don't have an answer... Got to ask the data provider/creator.

    Q: Is there any special way to parse the contents in the data folder?

    Now that's the fun coding part =)

    Let's step back a little and think about what we essentially have:

    • A list of emails in trec06p/data/**/*
    • The spam/ham labels for each email in trec06p/full/index
    • The Spam/SPAM/Ham/HAM labels for a subset of emails in trec06p/full-delay/index

    So...

    import pandas as pd
    from tqdm import tqdm

    from lazyme import find_files


    data_rows = {}

    # Assuming you're in the `trec06p` directory.
    # P/S: you can use any other file-listing function;
    # I just use lazyme.find_files because I find it convenient.
    for fn in tqdm(find_files('./data/**/*')):
        if fn.endswith('.DS_Store'):
            continue
        # Note that not all files are in a utf8/ascii charset,
        # so you'll have to read them in binary to store them.
        # Also note: THIS CAN BE DANGEROUS IF THERE ARE EXECUTABLES IN THE DATA!!!
        # Assuming that there aren't.
        with open(fn, 'rb') as fin:
            data_id = tuple(fn.split('/')[-2:])
            data_rows[data_id] = fin.read()

    full_labels = {}

    with open('./full/index') as fin:
        for line in tqdm(fin):
            label, fn = line.strip().split()
            data_id = tuple(fn.split('/')[-2:])
            full_labels[data_id] = label


    full_delay_labels = {}

    with open('./full-delay/index') as fin:
        for line in tqdm(fin):
            label, fn = line.strip().split()
            data_id = tuple(fn.split('/')[-2:])
            # You'll notice that some labels are repeated per data point,
            # but they are exactly the same... -_-
            if data_id in full_delay_labels:
                assert label.lower() == full_delay_labels[data_id].lower()
            full_delay_labels[data_id] = label.lower()
    
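    If you'd rather not depend on lazyme, the same file listing can be done with the standard library's glob (a sketch; the throwaway directory tree built here only mimics the trec06p/data/<dir>/<file> layout for illustration):

    ```python
    import glob
    import os
    import tempfile

    # Build a tiny throwaway tree that mimics trec06p/data/<dir>/<file>.
    root = tempfile.mkdtemp()
    os.makedirs(os.path.join(root, 'data', '000'))
    for name in ('000', '001'):
        with open(os.path.join(root, 'data', '000', name), 'wb') as fout:
            fout.write(b'Subject: hello\n\nbody')

    # '**' with recursive=True walks the subdirectories, like find_files does.
    paths = sorted(glob.glob(os.path.join(root, 'data', '**', '*'), recursive=True))
    # Keep only regular files (glob also returns the intermediate directories).
    paths = [fn for fn in paths if os.path.isfile(fn)]
    data_ids = [tuple(fn.split(os.sep)[-2:]) for fn in paths]
    assert data_ids == [('000', '000'), ('000', '001')]
    ```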

    Q: What is the difference between HAM, Ham, SPAM and Spam labels in the trec06p/*-delay/index

    If we look carefully at the if data_id in full_delay_labels: assert label.lower() == full_delay_labels[data_id].lower() line, we see that the caps and the non-caps labels are the same.

    Q: So why is there a difference?

    A: Not sure; best to ask the data provider/creator.

    Q: Is there a difference between the labels from trec06p/full-delay/index and trec06p/full/index?

    There doesn't seem to be any:

    >>> any(full_labels[data_id] != full_delay_labels[data_id] for data_id in full_labels)
    False
    

    Q: How do I just read it into a pandas dataframe?

    Given what we know above:

    import pandas as pd
    from tqdm import tqdm
    
    
    from lazyme import find_files
    
    
    data_rows = {}
    
    for fn in tqdm(find_files('./data/**/*')):
        if fn.endswith('.DS_Store'):
            continue
        with open(fn, 'rb') as fin:
            data_id = tuple(fn.split('/')[-2:])
            data_rows[data_id] = fin.read()
    
    full_labels = {}
    
    with open('./full/index') as fin:
        for line in tqdm(fin):
            label, fn = line.strip().split()
            data_id = tuple(fn.split('/')[-2:])
            full_labels[data_id] = label
            
    df = pd.DataFrame({'binary': pd.Series(data_rows), 'label': pd.Series(full_labels)})
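
    Since the raw bytes are kept in the 'binary' column, one pragmatic way to get a text column without guessing charsets is to decode with latin-1, which maps every byte value to a codepoint and therefore never raises (though non-Latin text may be mis-rendered). A sketch with toy rows standing in for the real corpus:

    ```python
    import pandas as pd

    # Toy rows standing in for the real corpus (illustration only).
    data_rows = {('000', '000'): b'Subject: hi\n\nhello',
                 ('000', '001'): b'Subject: $$$\n\nbuy now'}
    full_labels = {('000', '000'): 'ham', ('000', '001'): 'spam'}

    df = pd.DataFrame({'binary': pd.Series(data_rows), 'label': pd.Series(full_labels)})

    # latin-1 decodes any byte sequence, so this never raises.
    df['text'] = df['binary'].str.decode('latin-1')
    assert df.loc[('000', '001'), 'label'] == 'spam'
    ```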
    

    Q: But the input columns are still binaries, can I somehow guess the encoding?

    Not really; it's pretty hard/messy to guess the encoding of a binary file, but you can try this (though not all files specify charset=... in the content):

    import re, mmap

    def find_charset(fn):
        """Find the first charset=... declaration in the raw file, if any."""
        with open(fn, 'rb') as f:
            view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            match = re.search(br'charset\=([!-~\s]{5,})\n', view)
            if match is None:
                return None  # no charset=... declaration in this file
            return re.split(';|,|\n',
                            match.group(1).decode('utf8'))[0].strip('"').strip("'")
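
    Alternatively, since these files are emails, the standard library's email parser can extract the declared charset from the Content-Type header for you. A sketch; the raw message below is a made-up example, not a real corpus entry:

    ```python
    import email

    # Let the stdlib email parser read the Content-Type header's charset.
    raw = (b'From: someone@example.com\n'
           b'Content-Type: text/plain; charset="iso-8859-1"\n'
           b'\n'
           b'hello')
    msg = email.message_from_bytes(raw)
    # get_content_charset() returns the charset parameter, lower-cased,
    # or None if the message doesn't declare one.
    assert msg.get_content_charset() == 'iso-8859-1'
    ```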