
Can similarity detection techniques be applied to a text document represented as a big array of ASCII-encoded integers?


I would like to detect similarities between files. One way of doing this is to encode the file, first to reduce the input space for the similarity algorithm and second to obtain more accurate results, by taking into account only the informative features of the document. A common approach is to transform the file into a vector space representation using tf-idf weighting, which scales up terms that are very informative and scales down frequent terms. My question is whether this can be done on a document whose text representation is not preserved. Suppose, for instance, that the document is first transformed into a big integer array where each character is represented by its ASCII value.


Solution

  • The encoding of your document as a byte array is not a problem since, as @Ivan Koblik pointed out, text is always encoded as numeric data. Your task is a standard document similarity detection problem. I suggest the following steps:

    Pre-processing

    • decoding;
    • tokenization;
    • case folding;
    • removing stop words;
    • stemming.
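    The steps above can be sketched in plain Python. The stop-word list and the suffix-stripping "stemmer" here are deliberately tiny illustrations; a real pipeline would use a proper stop-word set and stemmer (e.g. NLTK's PorterStemmer):

    ```python
    import re

    # Illustrative stop-word list; real pipelines use a much larger set.
    STOP_WORDS = {"the", "a", "an", "is", "was", "were", "of", "and", "to", "in"}

    def naive_stem(token):
        # Hypothetical, very crude stemmer: strips a few common suffixes.
        for suffix in ("ing", "ed", "es", "s"):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[: -len(suffix)]
        return token

    def preprocess(raw_bytes):
        text = raw_bytes.decode("ascii")                      # decoding
        tokens = re.findall(r"[a-zA-Z]+", text)               # tokenization
        tokens = [t.lower() for t in tokens]                  # case folding
        tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
        return [naive_stem(t) for t in tokens]                # stemming

    print(preprocess(b"The cats were chasing the mice"))
    ```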

    Generating features

    • as you pointed out, tf-idf might be a good one.
    • a set of weighted features constitutes a high-dimensional vector.
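    As a minimal sketch of the feature-generation step, tf-idf can be computed from scratch over tokenized documents; terms that appear in every document get weight zero, while rarer terms are scaled up:

    ```python
    import math
    from collections import Counter

    def tfidf_vectors(docs):
        """docs: list of token lists; returns one {term: weight} dict per document."""
        n = len(docs)
        df = Counter()                 # document frequency per term
        for doc in docs:
            df.update(set(doc))
        vectors = []
        for doc in docs:
            tf = Counter(doc)
            vectors.append({
                term: (count / len(doc)) * math.log(n / df[term])
                for term, count in tf.items()
            })
        return vectors

    docs = [["apple", "banana", "apple"], ["banana", "cherry"]]
    vecs = tfidf_vectors(docs)
    # "banana" occurs in every document, so its idf (and weight) is zero.
    ```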

    Fingerprinting

    Using simhash, you can transform the high-dimensional vector into an f-bit fingerprint, e.g., f = 64. This is exactly the technique used by Google for near duplicate web page detection.

    We maintain an f-dimensional vector V, each of whose dimensions is initialized to zero. A feature is hashed into an f-bit hash value. These f bits (unique to the feature) increment/decrement the f components of the vector by the weight of that feature as follows: if the i-th bit of the hash value is 1, the i-th component of V is incremented by the weight of that feature; if the i-th bit of the hash value is 0, the i-th component of V is decremented by the weight of that feature. When all features have been processed, some components of V are positive while others are negative. The signs of components determine the corresponding bits of the final fingerprint.
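    The update rule described above can be implemented directly. A minimal sketch, using a truncated SHA-1 digest as an arbitrary choice of f-bit feature hash:

    ```python
    import hashlib

    def simhash(weighted_features, f=64):
        """weighted_features: {feature: weight}, e.g. a tf-idf vector."""
        v = [0.0] * f
        for feature, weight in weighted_features.items():
            # f-bit hash of the feature (truncated SHA-1, an arbitrary choice)
            h = int.from_bytes(hashlib.sha1(feature.encode()).digest()[: f // 8], "big")
            for i in range(f):
                if (h >> i) & 1:
                    v[i] += weight   # i-th hash bit is 1: increment by the weight
                else:
                    v[i] -= weight   # i-th hash bit is 0: decrement by the weight
        # The sign of each component determines the corresponding fingerprint bit.
        fingerprint = 0
        for i in range(f):
            if v[i] > 0:
                fingerprint |= 1 << i
        return fingerprint
    ```

    Documents with similar weighted feature vectors end up with fingerprints that differ in only a few bit positions.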

    However, in your case, you might find that other similarity functions, such as cosine similarity or Euclidean distance, perform better. Try to find the best similarity function for your problem if simhash does not work well for you.
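    For example, cosine similarity can be computed directly on the sparse tf-idf vectors (here represented as {term: weight} dicts), skipping the fingerprinting step entirely:

    ```python
    import math

    def cosine_similarity(a, b):
        """a, b: sparse {term: weight} vectors."""
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot / (norm_a * norm_b)
    ```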

    Query for similar documents

    After you generate the fingerprint for each document, similar documents should have similar fingerprints, i.e., the Hamming distance between their fingerprints is small. For more details, please refer to near duplicate web page detection.
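    Querying then reduces to comparing bit patterns. A sketch, where `near_duplicates` is a hypothetical helper returning documents within k differing bits:

    ```python
    def hamming_distance(fp1, fp2):
        # Number of bit positions where the two fingerprints differ.
        return bin(fp1 ^ fp2).count("1")

    def near_duplicates(query_fp, fingerprints, k=3):
        # fingerprints: {doc_id: fingerprint}; return ids within k bits of the query.
        return [doc_id for doc_id, fp in fingerprints.items()
                if hamming_distance(query_fp, fp) <= k]
    ```

    (At web scale, Google's scheme avoids the linear scan with permuted fingerprint tables, but for a modest file collection a scan like this is fine.)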

    Edit

    If you cannot do the first step, i.e., decoding, simply counting the occurrences of each unique integer is still doable: apply tf-idf to those unique integers and follow the subsequent steps.
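    In that fallback, each distinct integer value plays the role of a "term". A minimal sketch of the counting step (in Python 3, iterating a bytes object already yields integers):

    ```python
    from collections import Counter

    def byte_value_counts(raw_bytes):
        # Treat each ASCII code as a term and count its occurrences;
        # these counts feed into tf-idf exactly like word counts would.
        return Counter(raw_bytes)

    counts = byte_value_counts(b"abba")
    # counts maps 97 (ord('a')) and 98 (ord('b')) each to 2
    ```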