Search code examples
sql-servert-sqlsql-server-2017dataexplorer

Use an initial query to merge queries across multiple databases?


Using the Data Explorer (SEDE), I would like to find which users have more than 200000 reputation on Stack Overflow, and then find details for any accounts they have on other Stack Exchange sites.

Here is the query which provides the list with this threshold:

Select id, reputation, accountid
From users
Where reputation > 200000

AccountId is the key for all Stack Exchange sites.

I have found this query for aggregating across SEDE databases, but how is it possible to do that based on the dynamic results of the previous/baseline query?

Here is the kind of output I'm aiming for:

id_so, reputation_so, accounted, other_stackexchange_site_name, reputation_othersite, number_of_answers_other_site, number_of_questions_other_site
1, 250000, 23, serverfault, 500, 5, 1
1, 250000, 23, superuser, 120, 1, 0
2, 300000, 21, serverfault, 300, 3, 2
2, 300000, 21, webmasters, 230, 1, 1
3, 350000, 20, NA, NA, NA, NA
#the case with id 3 has an SO profile with reputation but it has no other profile in other Stack Exchange site

Solution

  • To run non-trivial queries across databases, based on an initial query:

    1. Figure out the common key in all databases. In this case it's AccountId (which is a user's Stack-Exchange-wide Id).
    2. Create your initial query to feed that key into a temp table. In this case:

      CREATE TABLE #UsersOfInterest (AccountId INT)
      INSERT INTO  #UsersOfInterest  
          SELECT  u.AccountId
          FROM    Users u
          Where   u.Reputation > 200000
      
    3. Create Another temp table to hold the final results (see below).
    4. Determine the query, to run on each site, that gets the info you want. EG:

      SELECT  u.AccountId, u.DisplayName, u.Reputation, u.Id
              , numQst = (SELECT COUNT(q.Id)  FROM Posts q  WHERE q.OwnerUserId = u.Id  AND q.PostTypeId = 1)
              , numAns = (SELECT COUNT(q.Id)  FROM Posts q  WHERE q.OwnerUserId = u.Id  AND q.PostTypeId = 2)
      FROM    Users u
      WHERE   u.AccountId = ##seAccntId##
      
    5. Use a system query to get the appropriate databases. For the Data Explorer (SEDE), a query of this type:

      SELECT      name
      FROM        sys.databases
      WHERE       CASE    WHEN state_desc = 'ONLINE'
                          THEN OBJECT_ID (QUOTENAME (name) + '.[dbo].[PostNotices]', 'U')
                  END
                  IS NOT NULL
      
    6. Create a cursor on the above query and use it to step through the databases.
      For each database:

      1. Build a query string that takes the query of step 4 and puts it into the temp table of step 3.
      2. Run the query string using sp_executesql.
    7. When the cursor is done, perform the final query on the temp table from step 3.


    Refer to this other answer, for a working template for querying all of the Stack Exchange sites.

    Putting it all together, results in the following query, which you can run live on SEDE:

    -- MinMasterSiteRep: User's must have this much rep on whichever site this query is run against
    -- MinRep: User's must have this much rep on all other sites
    
    CREATE TABLE #UsersOfInterest (
        AccountId       INT NOT NULL
        , Reputation    INT
        , UserId        INT
        , PRIMARY KEY (AccountId)
    )
    INSERT INTO  #UsersOfInterest
        SELECT  u.AccountId, u.Reputation, u.Id
        FROM    Users u
        Where   u.Reputation > ##MinMasterSiteRep:INT?200000##
    
    CREATE TABLE #AllSiteResults (
          [Master Rep]          INT
          , [Mstr UsrId]        NVARCHAR(777)
          , AccountId           NVARCHAR(777)
          , [Site name]         NVARCHAR(777)
          , [Username on site]  NVARCHAR(777)
          , [Rep]               INT
          , [# Ans]             INT
          , [# Qst]             INT
    )
    
    DECLARE @seDbName       AS NVARCHAR(777)
    DECLARE @seSiteURL      AS NVARCHAR(777)
    DECLARE @sitePrettyName AS NVARCHAR(777)
    DECLARE @seSiteQuery    AS NVARCHAR(max)
    
    DECLARE seSites_crsr CURSOR FOR
    WITH dbsAndDomainNames AS (
        SELECT      dbL.dbName
                    , STRING_AGG (dbL.domainPieces, '.')    AS siteDomain
        FROM (
            SELECT      TOP 50000   -- Never be that many sites and TOP is needed for order by, below
                        name        AS dbName
                        , value     AS domainPieces
                        , row_number ()  OVER (ORDER BY (SELECT 0)) AS [rowN]
            FROM        sys.databases
            CROSS APPLY STRING_SPLIT (name, '.')
            WHERE       CASE    WHEN state_desc = 'ONLINE'
                                THEN OBJECT_ID (QUOTENAME (name) + '.[dbo].[PostNotices]', 'U') -- Pick a table unique to SE data
                        END
                        IS NOT NULL
            ORDER BY    dbName, [rowN] DESC
        ) AS dbL
        GROUP BY    dbL.dbName
    )
    SELECT      REPLACE (REPLACE (dadn.dbName, 'StackExchange.', ''), '.', ' ' )  AS [Site Name]
                , dadn.dbName
                , CASE  -- See https://meta.stackexchange.com/q/215071
                        WHEN dadn.dbName = 'StackExchange.Mathoverflow.Meta'
                        THEN 'https://meta.mathoverflow.net/'
                        -- Some AVP/Audio/Video/Sound kerfuffle?
                        WHEN dadn.dbName = 'StackExchange.Audio'
                        THEN 'https://video.stackexchange.com/'
                        -- Ditto
                        WHEN dadn.dbName = 'StackExchange.Audio.Meta'
                        THEN 'https://video.meta.stackexchange.com/'
                        -- Normal site
                        ELSE 'https://' + LOWER (siteDomain) + '.com/'
                END AS siteURL
    FROM        dbsAndDomainNames dadn
    WHERE       (dadn.dbName = 'StackExchange.Meta'  OR  dadn.dbName NOT LIKE '%Meta%')
    
    -- Step through cursor
    OPEN    seSites_crsr
    FETCH   NEXT FROM seSites_crsr INTO @sitePrettyName, @seDbName, @seSiteURL
    WHILE   @@FETCH_STATUS = 0
    BEGIN
        SET @seSiteQuery = '
            USE [' + @seDbName + ']
    
            INSERT INTO #AllSiteResults
                SELECT
                            uoi.Reputation                                                                                  AS [Master Rep]
                            , ''site://u/'' + CAST(uoi.UserId AS NVARCHAR(88)) + ''|'' + CAST(uoi.UserId AS NVARCHAR(88))   AS [Mstr UsrId]
                            , [AccountId] = ''https://stackexchange.com/users/'' + CAST(u.AccountId AS NVARCHAR(88)) + ''?tab=accounts|'' + CAST(u.AccountId AS NVARCHAR(88))
                            , ''' + @sitePrettyName + '''                                                                   AS [Site name]
                            , ''' + @seSiteURL + ''' + ''u/'' + CAST(u.Id AS NVARCHAR(88)) + ''|'' + u.DisplayName          AS [Username on site]
                            , u.Reputation                                                                                  AS [Rep]
                            , (SELECT COUNT(q.Id)  FROM Posts q  WHERE q.OwnerUserId = u.Id  AND q.PostTypeId = 2)          AS [# Ans]
                            , (SELECT COUNT(q.Id)  FROM Posts q  WHERE q.OwnerUserId = u.Id  AND q.PostTypeId = 1)          AS [# Qst]
                FROM        #UsersOfInterest uoi
                INNER JOIN  Users u                ON uoi.AccountId = u.AccountId
                WHERE       u.Reputation > ##MinRep:INT?200##
        '
        EXEC sp_executesql @seSiteQuery
    
        FETCH NEXT FROM seSites_crsr INTO @sitePrettyName, @seDbName, @seSiteURL
    END
    CLOSE       seSites_crsr
    DEALLOCATE  seSites_crsr
    
    SELECT      *
    FROM        #AllSiteResults
    ORDER BY    [Master Rep] DESC, AccountId, [Rep] DESC
    

    It gives results like:

    Query output

    -- where the blue values are hyperlinked.


    Note that a user must have 200 rep on a site for it to be "significant". That's also the rep needed for the site to be included in the Stack Exchange flair.