Search code examples
sqlapacheapache-drillnosql

Apache Drill mash file metadata with results


My present query on Apache Drill is as follows -

select 
 CAST(columns[0] AS Integer) as ID, 
 to_timestamp(SUBSTR(columns[10],1,14),'YYYYMMddHHmmss') as RecordDt
from dfs.`/data/100Krows/`

What I would like to do is get the modificationTime metadata for each of the .csv files in my folder as a column something like this -

select 
  CAST(columns[0] AS Integer) as ID, 
  to_timestamp(SUBSTR(columns[10],1,14),'YYYYMMddHHmmss') as RecordDt,
  modificationTime as versionStartTime
from dfs.`/data/100Krows/`

The modificationTime metadata can be found when I run the following

 SHOW FILES in dfs.data

Is there a way to achieve this in its current version 1.1 or 1.0?

UPDATE 1

What I am truly looking for is the file metadata included in the result set of select statement against individual files

if my directory structure is as follows -

dir0
 dir1
  samefilename.csv    -- Modified Date - 10/01/2013
    *******DATA******
    Col0,Col1,Col2
    Val1-2013,Val2-2013,Val3-2014
    *******DATA******
 dir2
   samefilename.csv    -- Modified Date - 10/01/2014
    *******DATA******
     Col0,Col1,Col2
     Val1-2014,Val2-2014,Val3-2014
    *******DATA******

Kind of query I want to write -

select 
  name             as filename,  
  modificationTime as versionStartTime,
 Col0,
 Col2,
 Col3
from dfs.`/dir0/`

Expected Result -

 Columns[0]         Columns[1]   Columns[2]        .....
 samefileName.csv   10/01/2013   Val1-2013         .....
 samefileName.csv   10/01/2014   Val1-2014         .....

Solution

  • Dakuji,

    This is currently not exactly possible (as of version 1.1) but here is a proxy for it until this Drill issue that addresses your question is implemented. What I am proposing below is not the exact solution, but with some work can get you there.

    Preconditions and notes:

    1. Have .tbl extension registered under .psv storage configuration.
    2. I do this under dfs.
    3. is a placeholder for the file you name whatever you want.

    Do this from the Drill's SQLLine prompt:

    0: jdbc:drill:zk=local> use dfs;  
    0: jdbc:drill:zk=local> record! <file_name.tbl>  
    0: jdbc:drill:zk=local> SHOW FILES;  
    0: jdbc:drill:zk=local> record!  
    0: jdbc:drill:zk=local> select columns[9] from dfs.`<file_name.tbl>';
    

    There might be some noise in some column entries, but you can further filter that out with a query.

    > 0: jdbc:drill:zk=local> !record <file_name.tbl>
    Recording stopped.
    0: jdbc:drill:zk=local> select columns[9] from dfs.`<\path\<file_name.tbl>`;
    +---------------------------+
    |          EXPR$0           |
    +---------------------------+
    | null                      |
    | null                      |
    |     modificationTime      |
    | null                      |
    |  2013-04-26 23:44:56.0    |
    |  2015-08-12 16:21:39.0    |
    |  2014-04-26 10:09:33.0    |
    |  2015-04-27 22:49:48.0    |
    |  2013-08-24 20:16:29.0    |
    |  2015-08-23 19:26:54.0    |
    |  2014-10-21 16:43:38.0    |
    |  2014-04-26 10:09:35.0    |
    |  2014-10-21 16:43:38.0    |
    |  2012-11-16 11:38:17.0    |
    |  2012-11-16 11:44:27.0    |
    |  2014-10-21 16:42:57.0    |
    |  2014-04-26 10:09:28.0    |
    |  2013-08-24 21:48:39.0    |
    |  2015-08-20 22:27:34.0    |
    +---------------------------+
    19 rows selected (0.082 seconds)