Search code examples
sqlrnetezza

Finding out Neighbors of a Node using SQL


I have the following table in SQL:

CREATE TABLE friendships 
(
    id INT PRIMARY KEY,
    person VARCHAR(50),
    friend VARCHAR(50)
);

INSERT INTO friendships (id, person, friend) VALUES
(1, 'person3', 'person9'),
(3, 'person10', 'person4'),
(4, 'person2', 'person1'),
(5, 'person6', 'person7'),
(7, 'person4', 'person10'),
(8, 'person6', 'person7'),
(10, 'person10', 'person9'),
(11, 'person5', 'person10'),
(12, 'person3', 'person7'),
(13, 'person9', 'person5'),
(14, 'person9', 'person7'),
(15, 'person9', 'person5'),
(16, 'person3', 'person6'),
(17, 'person8', 'person9'),
(18, 'person10', 'person2'),
(19, 'person7', 'person5'),
(20, 'person10', 'person8');

Using R, I can plot the table the data like this:

library(igraph)

friendships = structure(list(person = c("person3", "person10", "person2", "person6", 
"person4", "person6", "person10", "person5", "person3", "person9", 
"person9", "person9", "person3", "person8", "person10", "person7", 
"person10"), friend = c("person9", "person4", "person1", "person7", 
"person10", "person7", "person9", "person10", "person7", "person5", 
"person7", "person5", "person6", "person9", "person2", "person5", 
"person8")), row.names = c(1L, 3L, 4L, 5L, 7L, 8L, 10L, 11L, 
12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L), class = "data.frame")


g <- graph_from_data_frame(df, directed = FALSE)
plot(g, vertex.size=10, vertex.label.cex=0.8)

enter image description here

ONLY using SQL commands, I want to answer the following questions (in the future, generalize to the n-th degree):

  • How many nodes are connected to Person10 by degree2 (i.e. friends of friends,, 8 people)
  • Who are these people that are connected to person10 by degree 2? (i.e. person2, person1, person4, person8, person9, person5, person7, person3)

I am not sure how to do these in SQL - I normally use igraph for these problems.

Here is my attempt to answer the first one:

SELECT COUNT(DISTINCT f2.friend)
FROM friendships f1
JOIN friendships f2 ON f1.friend = f2.person
WHERE f1.person = 'person10' AND f2.friend != 'person10' AND f2.friend NOT IN (
    SELECT friend FROM friendships WHERE person = 'person10'
);

#result
 3

Here is my attempt to answer the second one:

SELECT DISTINCT f2.friend
FROM friendships f1
JOIN friendships f2 ON f1.friend = f2.person
WHERE f1.person = 'person10' AND f2.friend != 'person10' AND f2.friend NOT IN (
    SELECT friend FROM friendships WHERE person = 'person10'
);

#result
1 person5
2 person7
3 person1

However these are not the correct results.

Can someone please show me how to do this correctly? Perhaps a loop/function can be written in R that executes the SQL until termination?

I am using an older version of SQL that doesn't have recursive functions.

Thanks!


Solution

  • Consider this recursive example.
    There condition where person='person10' for select one person, otherwise - find level2 friends for all persons.
    Condition lvl<2 - count friends for level 1 and level 2. You can set any possible level.
    And path column only for clarity, if graph not have cycles.

    with t0 as (
    select person,friend from friendships
    -- where person='person10'
      union
    select friend,person from friendships
    -- where friend='person10'
    )
    , r as(
    select person,friend,1 lvl
      ,cast(concat(person,'-',friend) as varchar(1000))path
    from t0
    union all
    select r.person,t.friend ,r.lvl+1 lvl
      ,cast(concat(path,'-',t.friend) as varchar(1000)) path
    from r
    inner join friendships t on (t.person=r.friend and t.friend<>r.person)
    where lvl<2
    union all
    select r.person,t.person ,r.lvl+1 lvl
      ,cast(concat(path,'-',t.person)  as varchar(1000))path
    from r
    inner join friendships t on (t.friend=r.friend and t.person<>r.person)
    where lvl<2
    )
    ,allfriends as (
    select person,friend ,count(*) qty
    from r
    group by person,friend
    )
    select person,count(*)friend_qty
    from allfriends
    group by person
    

    Result for all person

    person friend_qty
    person1 2
    person10 8
    person2 6
    person3 6
    person4 5
    person5 8
    person6 4
    person7 6
    person8 7
    person9 8

    For person='person10' cte r bypasses the graph along all paths.

    person friend lvl path
    person10 person2 1 person10-person2
    person10 person4 1 person10-person4
    person10 person5 1 person10-person5
    person10 person8 1 person10-person8
    person10 person9 1 person10-person9
    person10 person1 2 person10-person2-person1
    person10 person3 2 person10-person9-person3
    person10 person5 2 person10-person9-person5
    person10 person5 2 person10-person9-person5
    person10 person7 2 person10-person5-person7
    person10 person7 2 person10-person9-person7
    person10 person8 2 person10-person9-person8
    person10 person9 2 person10-person8-person9
    person10 person9 2 person10-person5-person9
    person10 person9 2 person10-person5-person9

    AllFriends for person10 (level up 2) and count of path for friend

    person friend qty
    person10 person1 1
    person10 person2 1
    person10 person3 1
    person10 person4 1
    person10 person5 3
    person10 person7 2
    person10 person8 2
    person10 person9 4

    There 2 path's (10-4),(4-10) counted as 1 path - in first cte we use union (not union all).
    For pair (10,9) there 4 path's (10-9),(10-8-9),(10-5-9) and (10-5-9), on other levels we not exclude paths (union all). Possible left only unique path's by grouping.

    Output for lvl=1

    person friend qty
    person10 person2 1
    person10 person4 1
    person10 person5 1
    person10 person8 1
    person10 person9 1

    Demo here

    Example for SQL server. Virtually unchanged (varchar(1000)->nchar(1000)) suitable for MySQL.

    Unfortunately, I don't know how to do extended recursion in postgresql like (select .. union all select ... union all select ...)
    We can do this in postgresql by making it more difficult join anchor and the table.

    Upd1. If graph have cycles, we mast check - we have already passed this node - is there a node in the path? Such a check depends somewhat on the DBMS used (instr,patindex ...). In this case, the path is mandatory.