Search code examples
algorithmmaxrowcell

How to replace entries with smaller values while keeping order?


What is an efficient algorithm to replace the values in an image while minimizing the largest value and maintaining order?

Background

I have a 8.5Gb image which is represented as a rows and columns.

Suppose we have a smaller version (there are no duplicates in input):

  4, 5, 9,
  2, 3, 7,
  8, 6, 1

I need to replace the entries at each pixel to the smallest positive value possible (greater than zero) in the entire matrix while preserving the row-wise and column-wise ordering. One possible output (duplicates allowed here) is the following and the maximum value is 5 ( I do not believe we can reduce it to 4):

  2, 3, 4,
  1, 2, 3,
  5, 4, 1

The reason it works:

Input: First Row: 4 < 5 < 9 and first Column: 4 > 2 < 8

Output: First Row: 2 < 3 < 4 and First Column 2 > 1 < 5 (column)

The orderings are being maintained. The same for the other rows and columns:

5 > 3 < 6  <=> 3 > 2 < 4
...
...

----------------------------------------- Attempt: My wrong algorithm -----------------------------------------

1. Each row and column will contain unique elements. So start with the first row and assign integers from the range {1, total the number of rows}:

1 2 3
x x x
x x x

The maximum in that row is currently at 3.

2. Go to the next row which is 2,3,7 and again assign numbers in the range {1, total number of rows}. When we assign 1 we look at all the previous rows if there are conflicts. In this case 1 is already present in the previous row. And we need a number which is smaller than 1. So place a zero there (I will offset every entries by on later).

 1 2 3 
 0 1 2
 * * *

The maximum in that row is currently 2.

3. Go to the next row and again fill as above. But 1 already occurred before and we need a number larger than the first and second rows: So, try 2. The next number needs to be larger than 2 and 1 (column) and smaller than 2 (row). That is a huge problem. I need to change too many cells each time.


Solution

  • For severe clarity, I'll add 10 to each of your values.

     Input      Ordering     
    14 15 19     - - -
    12 13 17     - - -
    18 16 11     - - -
    

    Consider each of the values in order, smallest to largest. Each element receives an ordering value that is the smallest integer available at that location. "Available" means that the assigned number is larger than any in the same row or column.

    11 and 12 aren't in the same row or column, so we can assign both of those immediately.

     Input      Ordering     
    14 15 19     - - -
    12 13 17     1 - -
    18 16 11     - - 1
    

    When we consider 13, we see that it is in the same row with a 1, so it must have the next larger value:

     Input      Ordering     
    14 15 19     - - -
    12 13 17     1 2 -
    18 16 11     - - 1
    

    14 has the same problem, being above a 1:

     Input      Ordering     
    14 15 19     2 - -
    12 13 17     1 2 -
    18 16 11     - - 1
    

    Continue this process for each number. Take the maximum of the orderings in that number's row and column. Add 1 and assign that ordering.

     Input      Ordering     
    14 15 19     2 3 -
    12 13 17     1 2 -
    18 16 11     - 4 1
    
     Input      Ordering     
    14 15 19     2 3 4
    12 13 17     1 2 3
    18 16 11     5 4 1
    

    There's a solution. The "dominance" path 18 > 16 > 15 > [14 or 13] > 12 demonstrates that 5 is the lowest max value.


    You can also solve this by converting the locations to a directed graph. Nodes in the same row or column have an edge connecting them; the edge is directed from the smaller to the larger. It will be sufficient to order the values and merely connect the adjacent values: given 14->15 and 15->19, we don't need 14->19 as well.

    Add a node 0 with label 0 and an edge to each node that has no other input edges.

    Now follow a typical labeling iteration: any node with all its inputs labeled receives a label that is one more than the largest of its inputs.

    This is the same algorithm as the above, but the correctness and minimalism are much easier to see.

    14 -> 15 -> 19
    12 -> 13 -> 17
    11 -> 16 -> 18
    12 -> 14 -> 18
    13 -> 15 -> 16
    11 -> 17 -> 19
    0 -> 11
    0 -> 12
    

    Now, if we shake out the topology of this, starting on the left, we get:

    0  11  13  17
       12  14  15  16  18
                   19
    

    This makes the numbering obvious: each node is labeled with the length of its longest path from the start node.


    Your memory problem should be edited into your question proposal, or given as a new question. You have non-trivial dependencies along rows and columns. If your data do not fit into memory, then you may want to make a disk-hosted data base to store your pre-processed data. For instance, you could store the graph as a list of edges keyed by dependencies:

    11  none
    12  none
    13  12
    14  12
    15  13, 14
    16  11, 15
    17  11, 13
    18  14, 16
    19  15, 17
    

    You haven't described the shape of your data. At the very worst, you should be able to build this graph data base with one pass to do the rows, and then one pass per column -- or multiple columns in each pass, depending on how many you can fit into memory at once.

    Then you can apply the algorithm to the items int he data base. You can speed it up if you keep in memory, not only all nodes with no dependencies, but another list with few dependencies -- "few" being dependent on your memory availability.

    For instance, make one pass over the data base to grab every cell with 0 or 1 dependencies. Put the independent nodes in your "active" list; as you process those, add nodes only from the "1-dependency" list as they're freed up. Once you've exhausted those sub-graphs, then make a large pass to (1) update the data base; (2) extract the next sets of nodes with 0 or 1 dependency.

    Let's look at this with the example you gave. First, we make a couple of lists from the original graph:

    0-dep  11, 12
    1-dep  13 (on 12), 14 (on 12)
    

    This pass is trivial: we assign 1 to cells 11 and 12; 2 to cells 13 and 14. Now update the graph:

    node dep done (assigned values) 15 none 2, 2 16 15 1 17 none 1, 2 18 16 2 19 15, 17

    Refresh the in-memory lists:

    0-dep  15, 17
    1-dep  16 (on 15), 18 (on 16)
    

    On this pass, both 15 and 17 depend on a node with value 2, so they are both assigned 3. Resolving 15 frees node 16, which gets value 4. This, in turn, frees up node 18, which gets the value 5.

    In one final pass, we now have node 19 with no outstanding dependencies. it's maximum upstream value is 3, so it gets the value 4.

    In the worst case -- you can't even hold all independent nodes in memory at once -- you can still grab as many as you can fit, assign their values in an in-memory pass, and return to the disk for more to process.

    Can you handle the data manipulations from here?