Search code examples
ssisssasolappowerpivotpowerquery

Log files & Power Pivot - split DATETIME out into distinct columns?


I'm importing some IIS logs into Power Pivot to do some analysis using:

LogParser.exe "
SELECT 
EXTRACT_TOKEN(LogFileName, 5, '\\') As LogFile,
LogRow,
to_localtime(to_timestamp(date,time)) as LOG_DTTM,
cs-UserName as ClientUserName,
cs-Method,cs-Uri-Stem as UriStem,
cs-Uri-Query as UriQuery,
sc-Status as Status,
sc-SubStatus as SubStatus,
time-Taken as ElapsedTimeMS,
c-Ip As ClientIP,
s-ComputerName as ComputerName,
s-Ip as ServerIP,
s-Port as Port,
sc-Win32-Status as Win32Status,
cs(User-Agent) as UserAgent 
    INTO IIS_LOG_PROD_STAGING 
FROM somefile.log" -o:SQL -oConnString:"Driver=SQL Server;Server=MY_SERVER_NAME; Database=MY_DATABASE_NAME;Trusted_Connection=yes" -createTable:ON -e:10 -transactionRowCount:-1

...and my question is: Should I be splitting up the discrete parts of my DateTime columns into seperate columns at the database storage level, or should that be left to calculated columns in the PowerPivot model?

Marco Russo seems to recommend at least splitting the DATE out into a separate column:
http://sqlblog.com/blogs/marco_russo/archive/2011/09/01/separate-date-and-time-in-powerpivot-and-bism-tabular.aspx

PowerPivot still reads the column as a DateTime, but hour/minute/seconds disappears and the number of unique values is reduced to the number of distinct days in your data. Making it easier to join with a Calendar table, of course!

That seems to make sense. But if I know I'm going to want to be doing analysis at levels such as HourOfDay, DayOfWeek, DayOfMonth, etc should I be splitting those out into separate database columns as well?


Solution

  • I highly recommend creating a date table and a time table to do this type of analysis. The date table will help with the day of week, day of month, etc. calculations. It allows you to easily do date calculations and categorizations through simple joins. The time dimension will group by hours. I tend to create these tables in my database and pull them into my Power Pivot model from SQL Server. My general thought is row-level calculations are more efficiently done in lower levels (SQL Database) than in the Power Pivot model. They can be done in both, so the location is up to you and the amount of memory and CPU available on the server and on the computer running the Power Pivot model. Since Power Pivot is opened on individual laptops and I can't control those, I like to do a lot of computation in SQL Server. I see you tagged Power Query. There are scripts available to create a date dimension in Power Query without needing a table in SQL Server. I haven't built a time dimension in Power Query yet, but here's a good SQL Server script. The date table is at the date level. The time table goes down to seconds and allows you to easily roll times up by minute, hour, etc.

    Here's the date table from the link:

    CREATE TABLE [dbo].[DimDate] (
        [DateKey] [int] NOT NULL
        ,[Date] [datetime] NOT NULL
        ,[Day] [char](10) NULL
        ,[DayOfWeek] [smallint] NULL
        ,[DayOfMonth] [smallint] NULL
        ,[DayOfYear] [smallint] NULL
        ,[PreviousDay] [datetime] NULL
        ,[NextDay] [datetime] NULL
        ,[WeekOfYear] [smallint] NULL
        ,[Month] [char](10) NULL
        ,[MonthOfYear] [smallint] NULL
        ,[QuarterOfYear] [smallint] NULL
        ,[Year] [int] NULL
        );
    

    And here's the time table:

    create table time_of_day 
    ( 
         time_of_day_key smallint primary key, 
         hour_of_day_24 tinyint,                --0-23, military/European time 
         hour_of_day_12 tinyint,                --1-12, repeating for AM/PM, for us American types 
         am_pm char(2),                         --AM/PM 
         minute_of_hour tinyint,                --the minute of the hour, reset at the top of each hour. 0-59 
         half_hour tinyint,                     --1 or 2, if it is the first or second half of the hour 
         half_hour_of_day tinyint,              --1-24, incremented at the top of each half hour for the entire day 
         quarter_hour tinyint,                  --1-4, for each quarter hour 
         quarter_hour_of_day tinyint,           --1-48, incremented at the tope of each half hour for the entire day 
         string_representation_24 char(5),      --military/European textual representation 
         string_representation_12 char(5)       --12 hour clock representation sans AM/PM 
    ) 
    

    Even if you aren't really creating a dimensional model, these tables can be helpful to have.