Search code examples
mysqlunixmemoryawkcross-product

How to do a low RAM full cross join?


I have a hope to perform a full self-cross join on a large data file of points. However, I cannot use programming language to perform the operation because I cannot store it in memory. I would like to find all combinations of points within the set. Below would be an example of my dataset.

x y 
1 9 
2 8 
3 7 
4 6 
5 5 

I would like to cross join on this data to generate 25-row table containing all the combination of points. Would there be a low memory solution? perhaps with awk ?

Thank you,

Nicholas Hayden

P.S. I am a novice programmer.


Solution

  • perhaps in two steps, create a header, column1 and column2 files and join the column1 and column2 and append to header file

    awk 'NR==1{print > "cross"} NR>1 {print $1 > "col1"; print $2 > "col2"}' file
    join -j9 col1 col2 -o1.1,2.1 >> cross
    rm col1, col2
    

    obviously make sure the temp and final file names won't clash with the existing ones.

    Note, the join command on MacOS doesn't have the -j option, so change it to equivalent long form

    join -19 -29 col1 col2 -o1.1,2.1 >> cross
    

    in both alternatives we're asking join to use the non-existent 9th field as the key which matches every line of the first file to every line in the second to generate the cross product of the two files.