Search code examples
sql-serverconditional-statementsprobabilityintersectionindependent-set

How to INTERSECT two columns in SQL Server


I have my data table AC in SQL Server with structure as:

+----------+------------+-------+
| AuthorID | CoAuthorID | Year  |
+----------+------------+-------+
|  677     | 901706     |  2005 |
|  677     | 901706     |  2005 |
|  677     | 901706     |  2005 |
|  1359    | 133112     |  2005 |
|  1359    | 133112     |  2005 |
|  1359    | 133112     |  2005 |
|  1359    | 266386     |  2005 |
|  1359    | 454557     |  2005 |
|  1359    | 454557     |  2005 |
|  1359    | 454557     |  2005 |
|  1359    | 534423     |  2005 |
|  1359    | 534423     |  2005 |
|  1359    | 534423     |  2005 |
|  1359    | 578338     |  2005 |
|  1359    | 721615     |  2005 |
|  1359    | 1016805    |  2005 |
|  1359    | 1016805    |  2005 |
|  1359    | 1016805    |  2005 |
|  1359    | 1361047    |  2005 |
|  1359    | 1361047    |  2005 |
|  1359    | 1361047    |  2005 |
|  1359    | 1361320    |  2005 |
|  1359    | 1361320    |  2005 |
|  1359    | 1361320    |  2005 |
|  1359    | 1395982    |  2005 |
|  1359    | 1395982    |  2005 |
|  1359    | 1395982    |  2005 |
|  1359    | 1412785    |  2005 |
|  1359    | 1412785    |  2005 |
|  1359    | 1412785    |  2005 |
|  1359    | 1412785    |  2005 |
|  ...     |            |       |
|  ...     |            |       |
+----------+------------+-------+

I have to calculate the yearly Conditional Probability of AuthorID given CoAuthorID i.e.

P(AuthorID|CoAuthorID) = P(AuthorID ∩ CoAuthorID) / P(CoAuthorID)

while in year 2005 where it is an intersection operation.

Initially, for instance, AuthorID = 677 and CoAuthorID = 901706 and Year = 2005, I have tried this:

For P(AuthorID):

SELECT COUNT(DISTINCT AuthorID) FROM AC WHERE Year = 2005  

It returns 390 so P(AuthorID) = 1/390

For P(CoAuthorID):

SELECT COUNT(DISTINCT CoAuthorID) FROM AC WHERE AuthorID = 677 AND Year = 2005

It returns 1 so P(CoAuthorID) = 1/1

For P(AuthorID ∩ CoAuthorID):

SELECT * FROM AC WHERE AuthorID = 677 AND Year = 2005
INTERSECT 
SELECT * FROM AC WHERE CoAuthorID = 901706 AND Year = 2005

It returns 1 row as:

AuthorID    CoAuthorID  Year
----------------------------
677         901706      2005

Whereas there are 3 rows in data, it means AuthorID and CoAuthorID co-exists in data 3 times in 2005 means that these two Authors have contributed together 3 times in 2005. So,

  1. What should be the value for P(AuthorID ∩ CoAuthorID)? Should it be 1 or 1/3?
  2. Also are other calculations are correct ?

Thanks!


Solution

  • This formula is somewhat correct:

    P(AuthorID|CoAuthorID) = P(AuthorID ∩ CoAuthorID) / P(CoAuthorID) 
    

    Although you should specify the year constraint as well to be more exact:

    P(AuthorID|CoAuthorID ∩ Year) = P(AuthorID ∩ CoAuthorID|Year)/P(CoAuthorID|Year)
    

    You give a calculation for P(AuthorID) (or actually P(AuthorID|Year)), but that term does not occur in the above formula, so it cannot be useful.

    Secondly, you should not use DISTINCT. If there are many occurrences of the same author, then the probability of that author should be considered higher than that of another, less frequently occurring author. By using DISTINCT you would ignore these different frequencies, and attribute to each author the same probability. This cannot be the intention.

    Instead you can identify these probabilities as follows:

    P(CoAuthorID|Year)

    Use this SQL to get that probability (between 0 and 1):

    SELECT SUM(CASE 
                  WHEN CoAuthorID = 901706 THEN 1 
               END)       AS matching_records,
           COUNT(*)       AS considered_records,
           CAST(SUM(CASE 
                  WHEN CoAuthorID = 901706 THEN 1 
               END) AS FLOAT)
               / COUNT(*) AS probability  
    FROM   AC 
    WHERE  Year = 2005
    

    This returns in fact 3 values, while you only need the third. But it will help in analysing the result to also include the two first values.

    The first value counts every occurrence of the given CoAuthorID in 2005 ("matching_records"), the second counts the total number of records for 2005 ("considered_records") and the third divides the first by the second to arrive at the probability.

    In an extreme case, all of the 2005 records could have this CoAuthorID, and then the probability would return 1.

    P(AuthorID ∩ CoAuthorID|Year)

    Similarly, you can do the following for getting this probability:

    SELECT SUM(CASE 
                  WHEN AuthorID = 677 AND CoAuthorID = 901706 THEN 1 
               END)       AS matching_records,
           COUNT(*)       AS considered_records,
           CAST(SUM(CASE 
                  WHEN AuthorID = 677 AND CoAuthorID = 901706 THEN 1 
               END) AS FLOAT)
           / COUNT(*)     AS probability  
    FROM   AC 
    WHERE  Year = 2005
    

    Notice the pattern that is emerging: in general, the constraint Y in P(X|Y) occurs in the WHERE clause, while X is represented in the CASE WHEN clause.

    P(AuthorID|CoAuthorID ∩ Year)

    Although you could use the results of the previous two queries, you could go for the more direct way, and use the pattern for P(AuthorID|CoAuthorID ∩ Year):

    SELECT SUM(CASE 
                  WHEN AuthorID = 677 THEN 1 
               END)       AS matching_records,
           COUNT(*)       AS considered_records,
           CAST(SUM(CASE 
                  WHEN AuthorID = 677 THEN 1 
               END) AS FLOAT)
           / COUNT(*)     AS probability  
    FROM   AC 
    WHERE  Year = 2005
    AND    CoAuthorID = 901706 
    

    Answer to first question

    What should be the value for P(AuthorID ∩ CoAuthorID)? Should it be 1 or 1/3?

    As you have 3 rows in your data for which the AuthorId, CoAuthorID and Year match, P(AuthorID ∩ CoAuthorID|Year) is 3/num_records, where num_records is the number or records for which Year is 2005.

    Note that INTERSECT has as side-effect that it eliminates duplicates in the result. What I wrote above concerning DISTINCT also applies here: you need to count the duplicates when dealing with probabilities.

    Addendum: Some example results

    Given this data:

    +----------+------------+-------+
    | AuthorID | CoAuthorID | Year  |
    +----------+------------+-------+
    |  677     | 901706     |  2005 |
    |  677     | 901706     |  2005 |
    |  677     | 901706     |  2005 |
    |  1359    | 133112     |  2005 |
    |  1359    | 133112     |  2005 |
    |  1359    | 133112     |  2005 |
    |  1359    | 266386     |  2005 |
    |  1359    | 454557     |  2005 |
    |  1359    | 454557     |  2005 |
    |  1359    | 454557     |  2005 |
    |  1359    | 534423     |  2005 |
    |  1359    | 534423     |  2005 |
    |  1359    | 534423     |  2005 |
    |  1359    | 578338     |  2005 |
    |  1359    | 721615     |  2005 |
    |  1359    | 1016805    |  2005 |
    |  1359    | 1016805    |  2005 |
    |  1359    | 1016805    |  2005 |
    |  1359    | 1361047    |  2005 |
    |  1359    | 1361047    |  2005 |
    |  1359    | 1361047    |  2005 |
    |  1359    | 1361320    |  2005 |
    |  1359    | 1361320    |  2005 |
    |  1359    | 1361320    |  2005 |
    |  1359    | 1395982    |  2005 |
    |  1359    | 1395982    |  2005 |
    |  1359    | 1395982    |  2005 |
    |  1359    | 1412785    |  2005 |
    |  1359    | 1412785    |  2005 |
    |  1359    | 1412785    |  2005 |
    |  1359    | 1412785    |  2005 |
    +----------+------------+-------+
    

    We can look at the following probabilities:

    1. P(AuthorID=1359 | CoAuthorID=1361047 ∩ Year=2005)

    Meaning: What is the probability for AuthorID=1359 when it is a given that CoAuthor=1361047 and Year=2005

    Informally: In 2005, what fraction of collaborations of co-author 1361047 were with author 1359?

    Number of collaborations considered: 3, because this co-author only collaborated three times in 2005

    Number of collaborations that match: 3, because this co-author collaborated three times with this author in 2005 (i.e. did not collaborate with anyone else)

    Probability: 3/3 = 1.

    2. P(AuthorID=1359 ∩ CoAuthorID=1361047 | Year=2005)

    Meaning: What is the probability for a collaboration to be between AuthorID=1359 and CoAuthor=1361047 when it is a given that Year=2005

    Informally: In 2005, what fraction of collaborations were between author 1359 and co-author 1361047?

    Number of collaborations considered: 31, because there were 31 collaborations in 2005

    Number of collaborations that match: 3, because this author and co-author collaborated three times in 2005

    Probability: 3/31.

    3. P(CoAuthorID=1361047 | AuthorID=1359 ∩ Year=2005)

    Meaning: What is the probability for CoAuthor=1361047 when it is a given that AuthorID=1359 and Year=2005

    Informally: In 2005, what fraction of collaborations of author 1359 were with co-author 1361047?

    Number of collaborations considered: 28, because this author collaborated 28 times (with anyone) in 2005

    Number of collaborations that match: 3, because this author collaborated three times with this co-author in 2005

    Probability: 3/28.

    Observation

    Notice how similar the above three predicates sound, but how different they really are. Exact wording is important. Take for instance this phrase:

    The probability of author A and co-author B being together in 2005

    Seeing this from the viewpoint of co-author B it is 100%, because this person did not co-author with any one else (case 1 above). But seeing this from the viewpoint of author A is different, since that person collaborated with many other co-authors also (case 3 above), and so the probability is smaller. But it can even be interpreted from no-one's viewpoint: then one could look at all collaborations in 2005 and see how many times it was between author A and co-author B (case 2). This is again another probability.

    This shows that statements can easily be ambiguous, and care must be taken to be precise: what is assumed as a given? That must appear after the pipe symbol in the P(X|Y) notation.