Tags: pyarrow, apache-arrow

ChunkedArray.Index on secondary column of table not working | ArrowTypeError


I am trying to implement a shortest-path algorithm using pyarrow (first for unweighted graphs, then for weighted graphs).

I am stuck on the step where I need to check whether the target node is among the neighbors of the current node.

My data has three columns: node, neighboring nodes, and visited. The node column contains the name of each node in the graph. The neighboring nodes column contains an array of the names of the nodes that are directly connected to that node. The visited column contains a boolean that indicates whether the node has already been visited during a traversal.

In my example, I set the start node to 12160432. To obtain its neighboring nodes, I used the pc.filter function to retrieve the filtered table for that node (filtered_graph below).

The next step is to check whether we have reached the target node; if not, I have to check the neighbors of the neighbors of the current node.

To check whether the target is in the array, I wanted to use the ChunkedArray.index function as follows:

filtered_graph['neighboring_nodes'].index(10000001)

but I got the following error: "ArrowTypeError: Could not convert 10000001 with type int: was not a sequence or recognized null for conversion to list type"

I also tried passing the target as a typed scalar:

target_node = pa.scalar(10000001, type=pa.int64())
filtered_graph['neighboring_nodes'].index(target_node)

but got the same error.

Note: When using the "node" column, the index function works as intended (-1 means the value was not found).

I appreciate any guidance you can offer!


Solution

  • Thank you 0x26res for your response! Following your logic, I also found a neat little trick:

    import pyarrow as pa
    import pyarrow.compute as pc
    
    table = pa.table({'node': [1,4], 'neighbors': [[2,3], [5,6]]})
    
    flat_neighbors = pc.list_flatten(table['neighbors'])
    values_to_check = pa.array([2])
    
    mask = pc.is_in(flat_neighbors, value_set=values_to_check)
    pc.any(mask)  # returns <pyarrow.BooleanScalar: True>