Search code examples
oracle-databaseplsqlexternal-tables

Getting an external table's location and file name?


I'm processing multiple files as part of an external table. Is there any way I can get the file name being processed in external tables and place it in the table?

At the moment the only solution I can find is appending the file name to every record in the flat file which isn't ideal from an efficiency standpoint and involves modifying the original data. Surely external tables know what file is being processed at any time?


Solution

  • I'm not aware of any way to capture the file name within the access parameters. As a workaround, rather than modifying the original files you could use a preprocessor to append the file name on the fly. If you had two files, say file_1.csv containing a,b,1 and file_2.csv containing c,d,2, you could have a small shell script like append_filename.sh:

    #!/bin/bash
    while read line
    do
      printf "%s,%s\n" "${line}" "${1##*/}"
    done < $1
    

    which you can verify does something helpful by calling the script directly:

    $ ./append_filename.sh file_1.csv
    a,b,1,file_1.csv
    

    You can then define your external table to call that via the preprocessor clause, something like:

    create table e42 (
      col1 varchar2(10),
      col2 varchar2(10),
      col3 number,
      filename varchar2(30)
    )
    organization external (
      type oracle_loader
      default directory d42
      access parameters (
        records delimited by newline
        preprocessor 'append_filename.sh'
        fields terminated by ','
      )
      location ('file_1.csv', 'file_2.csv')
    );
    
    Table E42 created.
    

    Then the file name is picked up automatically:

    select * from e42;
    
    COL1       COL2             COL3 FILENAME                     
    ---------- ---------- ---------- ------------------------------
    a          b                   1 file_1.csv                    
    c          d                   2 file_2.csv                    
    

    I've stripped the directory path so you only see the file name - you could keep the full path if you prefer, but that may not be necessary and could reveal OS details to people who can only query the table. Note the security guidelines; I've kept it simple here by using one directory for everything, but you should put the preprocessor somewhere else. And of course this is assuming a Unix-y platform or GNU tools; something similar should be possible with a batch file if you're using Windows.


    This approach reading line-by-line will be relatively slow for large files; with a 1.5 million-row test file appending the file name took about 80 seconds on my platform. Other built-in tools will be faster; this version with sed takes just over a second for the same file:

    #!/bin/bash
    sed -e 's!$!,'"${1##*/}"'!' $1
    

    You could try other alternative such as awk too; you'd probably need to test a few to see what works best (or fast enough) in your environment.