I'm working on a project where I need to analyze Apache logs using SSAS. I've already loaded the data into a temporary table. I created dimension tables (primary key and attribute_name) and an empty fact table (foreign keys for each dimension table and fact_attribute), and created relations between them. Then I split the data from that table into the dimension tables using
INSERT INTO DimIP (IP) SELECT DISTINCT RemoteHostName FROM tmp
...and so on.
Now I need to populate the fact table with foreign keys, but I have no idea how to do this with a single query. I tried something like this:
INSERT INTO Facts (DimDateID, DimIPID, DimRefererID, DimRequestID, DimStatusCodeID, DimUserAgentID)
SELECT DimDate.ID WHERE (DimDate.Data = tmp.DateTime)
SELECT DimIP.ID WHERE (DimIP.IP = tmp.RemoteHostName)
SELECT DimReferer.ID WHERE (DimReferer.Referer = tmp.Referer)
SELECT DimRequest.ID WHERE (DimRequest.Request = tmp.Request)
SELECT DimStatusCode.ID WHERE (DimStatusCode.StatusCode = tmp.StatusCode)
SELECT DimUserAgent.ID WHERE (DimUserAgent.UserAgent = tmp.UserAgent)
But it doesn't work (it says the insert list contains fewer items than the select list); probably I can't use such syntax.
I tried doing it one by one, like this:
INSERT INTO Facts (DimDateID)
SELECT DimDate.ID WHERE (DimDate.Data = tmp.DateTime)
But sometimes it says that another column can't be NULL (e.g. DimUserAgentID), so the query fails; other times it executes the query and says "806000 rows affected", but nothing is inserted.
I would appreciate your help, because I've already ripped half of the hair from my head and don't know how to populate the fact table with foreign keys from the dimension tables.
I believe what you need to do is reference those other tables in your query. Below I use tmp as the main driver of the query and then attempt to look up the resulting ID based on the logic you provided. Those lookups are done via LEFT OUTER JOINs, which means that where no matching dimension row exists, NULL will go into your fact table (or the insert will fail if that column does not allow NULL, as your DimUserAgentID error suggests). If you'd rather have the row filtered out before it hits the fact table, substitute INNER JOIN for all of the occurrences. I also assumed your tables were all in the dbo schema.
INSERT INTO
    dbo.Facts
(
    DimDateID
,   DimIPID
,   DimRefererID
,   DimRequestID
,   DimStatusCodeID
,   DimUserAgentID
)
SELECT
    DimDate.ID
,   DimIP.ID
,   DimReferer.ID
,   DimRequest.ID
,   DimStatusCode.ID
,   DimUserAgent.ID
FROM
    TMP T
    LEFT OUTER JOIN
        dbo.DimDate
        ON DimDate.Data = T.DateTime
    LEFT OUTER JOIN
        dbo.DimIP
        ON DimIP.IP = T.RemoteHostName
    LEFT OUTER JOIN
        dbo.DimReferer
        ON DimReferer.Referer = T.Referer
    LEFT OUTER JOIN
        dbo.DimRequest
        ON DimRequest.Request = T.Request
    LEFT OUTER JOIN
        dbo.DimStatusCode
        ON DimStatusCode.StatusCode = T.StatusCode
    LEFT OUTER JOIN
        dbo.DimUserAgent
        ON DimUserAgent.UserAgent = T.UserAgent
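Before choosing between LEFT OUTER JOIN and INNER JOIN, it may help to see how many rows would actually come back without a match. The following is only a rough diagnostic sketch that reuses the table and column names from your question; the output column aliases (TotalRows, NoDate and so on) are made up for illustration. It inserts nothing. Note that a tmp row whose value is NULL will also be counted as unmatched, because NULL = NULL does not evaluate to true in a join condition.

```sql
-- Diagnostic only: nothing is inserted.
-- Counts the rows in tmp whose lookup finds no row in each dimension.
-- Assumes each dimension value is unique (it was loaded with SELECT DISTINCT),
-- so the joins do not multiply rows.
SELECT
    COUNT(*) AS TotalRows
,   SUM(CASE WHEN DimDate.ID       IS NULL THEN 1 ELSE 0 END) AS NoDate
,   SUM(CASE WHEN DimIP.ID         IS NULL THEN 1 ELSE 0 END) AS NoIP
,   SUM(CASE WHEN DimReferer.ID    IS NULL THEN 1 ELSE 0 END) AS NoReferer
,   SUM(CASE WHEN DimRequest.ID    IS NULL THEN 1 ELSE 0 END) AS NoRequest
,   SUM(CASE WHEN DimStatusCode.ID IS NULL THEN 1 ELSE 0 END) AS NoStatusCode
,   SUM(CASE WHEN DimUserAgent.ID  IS NULL THEN 1 ELSE 0 END) AS NoUserAgent
FROM
    TMP T
    LEFT OUTER JOIN dbo.DimDate
        ON DimDate.Data = T.DateTime
    LEFT OUTER JOIN dbo.DimIP
        ON DimIP.IP = T.RemoteHostName
    LEFT OUTER JOIN dbo.DimReferer
        ON DimReferer.Referer = T.Referer
    LEFT OUTER JOIN dbo.DimRequest
        ON DimRequest.Request = T.Request
    LEFT OUTER JOIN dbo.DimStatusCode
        ON DimStatusCode.StatusCode = T.StatusCode
    LEFT OUTER JOIN dbo.DimUserAgent
        ON DimUserAgent.UserAgent = T.UserAgent;
```

If every "No..." column comes back as 0, the two join types will load the same rows; any non-zero column points at the dimension that is missing values.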
Finally, it seems you're missing something measurable, unless you're just counting rows in the Facts table.
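If counting rows really is the measure, a quick sanity check in plain SQL might look like the sketch below. It uses only the tables and columns from your question; the aliases F, SC and Hits are mine. It assumes the fact table has already been loaded.

```sql
-- Hits per status code, counted as rows in the fact table.
SELECT
    SC.StatusCode
,   COUNT(*) AS Hits
FROM
    dbo.Facts F
    INNER JOIN dbo.DimStatusCode SC
        ON SC.ID = F.DimStatusCodeID
GROUP BY
    SC.StatusCode
ORDER BY
    Hits DESC;
```

The totals should add up to the number of fact rows that have a status code, which gives you something to compare against once the cube is processed.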