kdb, q-lang

Find the count of word pairs in kdb+


I have a file that contains multiple rows of item codes, as follows. There are 1 million rows similar to these:

  1.  123,134,256,345,789.....
  2.  123,256,345,678,789......
   .
   .  

I would like to count, across the whole file, every pair of words/items that occur together in a row, using q in kdb+. That is, any two words that occur in the same row are considered a word pair. For example:

(123,134),(123,256),(134,256),(123,345),(123,789),(134,789) are some of the word pairs in row 1.
(123,256),(123,345),(123,345),(678,789),(345,789) are some of the word pairs in row 2.

Word/item pair count:

  123,134 --- 1
  123,256 --- 2
  345,789 --- 2

I am reading the file using read0, and I have been able to convert each line into a list using vs and count the individual words using count each group. Now I want to find the count of all the word pairs per row in the file.

Thanks in advance for your help


Solution

  • I'm not 100% sure I understand your definition of a word pair. Perhaps you could expand a little if my logic doesn't match what you were looking for.

    In the example below, I've created a 5x5 matrix of symbols for testing. For each row I take the distinct unordered pairs of distinct values, and then I count how many rows each of these pairs appears in.

    Please double check with your own results.

    q)test:5 cut`$string 25?5
    
    q)test
    2 0 1 0 0
    2 4 4 2 0
    1 0 0 3 4
    2 1 1 4 4
    3 0 3 4 0
    
    q)count each group raze {l[where(count'[l:distinct distinct each asc'[x cross x:distinct x]])>1]} each test
    0 2| 2
    1 2| 2
    0 1| 2
    2 4| 2
    0 4| 3
    1 3| 1
    1 4| 2
    0 3| 2
    3 4| 2
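
The same idea can be applied to comma-separated lines like those in the question. The following is only a sketch under some assumptions: `lines` is a hypothetical two-row sample built from the visible part of the question's rows (in practice it would come from read0 on your file), and `pairs` is a helper name introduced here for illustration. Instead of `x cross x`, which builds every ordered combination and then filters, it joins each item only with the items after it in the sorted row, which may matter with 1 million rows.

```q
/ Hypothetical sample standing in for the result of read0 on the real file
lines:("123,134,256,345,789";"123,256,345,678,789")

/ Split each line on commas and cast the item codes to symbols
rows:{`$"," vs x} each lines

/ pairs (illustrative helper): all unordered pairs of distinct items in one row
/ sort the distinct items, then join each item with every item after it
pairs:{x:asc distinct x; raze {x[y],/:(y+1)_x}[x] each til count x}

/ Number of rows in which each pair occurs
r:count each group raze pairs each rows
```

Because each row is de-duplicated and sorted first, a pair is counted at most once per row and (123,256) is treated the same as (256,123). If item order or repeated items within a row should count separately, the helper would need adjusting.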