Search code examples
rustedit-distance

Calculating the distance of a sequence with adjacent transpositions, and no insertion or deletion


How can I calculate the numeric distance of two sequences under the requirement that sequence elements may only be swapped with adjacent elements? Insertion, and deletion are not allowed. The distance function must determine the number of swap operations necessary to reach one string from the other. Furthermore, the distance function is not defined for sequences of different lengths, or if the sequences don't share the same symbols. Lastly, you can assume that there are no duplicates in a sequence.

Based on my research, I would need something like the Damerau–Levenshtein distance, but I don't know how to modify the distance function correctly.

Example

For the sequence ABC, the only two valid transformations with distance 1 are BAC or ACB.

Code

Rust Playground

use std::collections::HashSet;
use std::hash::Hash;

fn dist<T: PartialEq + Eq + Hash>(lhs: &[T], rhs: &[T]) -> Option<usize> {
    // Distance function is not defined for differing lengths
    if lhs.len() != rhs.len() {
        return None;
    }

    let lhs_c: HashSet<_> = lhs.iter().collect();
    let rhs_c: HashSet<_> = rhs.iter().collect();

    // Distance function is not defined if an element occurs more than once in each sequence
    if lhs.len() != lhs_c.len() || rhs.len() != rhs_c.len() {
        return None;
    }

    // Distance function is not defined if the sequences don't share all elements
    if lhs_c != rhs_c {
        return None;
    } 

    // Calculate the edit distance of `lhs` and `rhs` with only adjacent transpositions.
    todo!()
}

#[test]
fn dist_is_none_for_differing_lengths() {
    assert_eq!(None, dist(&["b", "a", "c"], &["a", "b", "c", "d"]));
}

#[test]
fn dist_is_none_with_duplicates() {
    assert_eq!(None, dist(&["b", "a", "c", "a"], &["a", "b", "c", "d"]));
    assert_eq!(None, dist(&["a", "b", "c", "d"], &["a", "b", "c", "c"]));
}

#[test]
fn dist_is_none_with_non_empty_difference() {
    assert_eq!(None, dist(&["a", "b", "c"], &["d", "e", "f"]));
}

#[test]
fn dist_bac_to_abc_is_one() {
    assert_eq!(Some(1), dist(&["b", "a", "c"], &["a", "b", "c"]));
}

#[test]
fn dist_bca_to_bac_is_one() {
    assert_eq!(Some(1), dist(&["b", "c", "a"], &["b", "a", "c"]));
}

#[test]
fn dist_cab_to_abc_is_two() {
    assert_eq!(Some(2), dist(&["c", "a", "b"], &["a", "b", "c"]));
}

#[test]
fn dist_cdab_to_abcd_is_four() {
    assert_eq!(Some(4), dist(&["c", "d", "a", "b"], &["a", "b", "c", "d"]));
}

Solution

  • You can create a mapping from the elements to numbers in such a way that when it is applied to one of the sequences, it will be sorted. Then you can apply this mapping to the other sequence. After this, you can calculate the number of swaps required to bubble sort the second sequence: (plaground)

    use std::collections::HashMap;
    use core::hash::Hash;
    use std::collections::HashSet;
    
    fn create_mapping<T: Eq + Hash>(seq: &[T]) -> HashMap<&T, usize> {
        let mut ret = HashMap::new();
        for (i, elem) in seq.iter().enumerate() {
            ret.insert(elem, i);
        }
        ret
    }
    
    fn dist<T: Eq + Hash>(lhs: &[T], rhs: &[T]) -> Option<usize> {
        // Distance function is not defined for differing lengths
        if lhs.len() != rhs.len() {
            return None;
        }
        let lhs_c: HashSet<_> = lhs.iter().collect();
        let rhs_c: HashSet<_> = rhs.iter().collect();
        // Distance function is not defined if an element occurs more than once in each sequence
        if lhs.len() != lhs_c.len() || rhs.len() != rhs_c.len() {
            return None;
        }
        // Distance function is not defined if the sequences don't share all elements
        if lhs_c != rhs_c {
            return None;
        } 
        // Calculate the edit distance of `lhs` and `rhs` with only adjacent transpositions.
        let mapping = create_mapping(lhs);
        let rhs_mapped: Vec<usize> = rhs.iter().map(|e|mapping[e]).collect();
        Some(bubble_sort_length(&rhs_mapped))
    }
    
    fn bubble_sort_length(seq: &[usize]) -> usize {
        let mut ret = 0;
        for i in 0..seq.len() {
            for j in (i+1)..seq.len() {
                if seq[i]>seq[j] {
                    ret += 1;
                }
            }
        }
        ret
    }
    
    #[test]
    fn dist_is_none_for_differing_lengths() {
        assert_eq!(None, dist(&["b", "a", "c"], &["a", "b", "c", "d"]));
    }
    
    #[test]
    fn dist_is_none_with_duplicates() {
        assert_eq!(None, dist(&["b", "a", "c", "a"], &["a", "b", "c", "d"]));
        assert_eq!(None, dist(&["a", "b", "c", "d"], &["a", "b", "c", "c"]));
    }
    
    #[test]
    fn dist_is_none_with_non_empty_difference() {
        assert_eq!(None, dist(&["a", "b", "c"], &["d", "e", "f"]));
    }
    
    #[test]
    fn dist_bac_to_abc_is_one() {
        assert_eq!(Some(1), dist(&["b", "a", "c"], &["a", "b", "c"]));
    }
    
    #[test]
    fn dist_bca_to_bac_is_one() {
        assert_eq!(Some(1), dist(&["b", "c", "a"], &["b", "a", "c"]));
    }
    
    #[test]
    fn dist_cab_to_abc_is_two() {
        assert_eq!(Some(2), dist(&["c", "a", "b"], &["a", "b", "c"]));
    }
    
    #[test]
    fn dist_cdab_to_abcd_is_four() {
        assert_eq!(Some(4), dist(&["c", "d", "a", "b"], &["a", "b", "c", "d"]));
    }
    

    This algorithm will have the time complexity of O(n²) where n is the length of the sequences. There is probably some better way to implement the bubble_sort_length function (see https://www.geeksforgeeks.org/inversion-count-in-array-using-merge-sort/ which claims to have time complexity O(n*log(n)))