Search code examples
regexsnowflake-cloud-data-platformregex-group

Regex Expression for Snowflake Pattern


I have a list of files in the s3 bucket as below. I want to load only the current date files, for example, I want to get all file_a.csv which are under 2020/06/09, similarly get all file_b.csv under same 2020/06/09

I tried out the syntax but this gets me all files from all dates select metadata$filename from @stage/snflk/ts/(file_format=>CSV_SKIP_HEADER,pattern=>'.*/file_a.*[.]csv')

snflk/ts/ts-prod/2020/06/08/file_a.csv
snflk/ts/ts-prod/2020/06/09/file_a.csv
snflk/ts/ts-prod3/2020/06/08/file_a.csv
snflk/ts/ts-prod3/2020/06/09/file_a.csv
snflk/ts/ts-prod2/2020/06/08/file_a.csv
snflk/ts/ts-prod2/2020/06/09/file_a.csv
snflk/ts/ts-prod/2020/06/08/file_b.csv
snflk/ts/ts-prod/2020/06/09/file_b.csv
snflk/ts/ts-prod3/2020/06/08/file_b.csv
snflk/ts/ts-prod3/2020/06/09/file_b.csv
snflk/ts/ts-prod2/2020/06/08/file_b.csv
snflk/ts/ts-prod2/2020/06/09/file_b.csv

Solution

  • I want to load only the current date files

    .*/file_a.*[.]csv

    this gets me all files from all dates

    The pattern (.*) supplied at the beginning here is a wildcard that will match everything available.

    If you need to limit the date to a constant value, specify the constant values inside the regular expression pattern (i.e. ts-prod.*/2020/06/09/file_a\.csv, ts-prod.*/2020/06/09/file_b\.csv, etc.):

    select
      metadata$filename
    from @stage/snflk/ts/
    (pattern=>'ts-prod.*/2020/06/09/file_a\.csv');
    

    To iteratively test and write your regex patterns, you can use web tools such as Regex101, RegExr, etc.

    P.s. You can also construct the pattern to use the current date dynamically, if you are trying to automate this:

    -- Produces 'ts-prod.*/2020/06/11/file_a\.csv'
    SET curr_dt_a=(
      SELECT 
        'ts-prod.*/' || 
        TO_VARCHAR(CURRENT_DATE(), 'YYYY/MM/DD') || 
        '/file_a\\.csv'
    )
    
    -- Referenced as a SQL variable in the PATTERN option
    SELECT
      metadata$filename
    FROM @stage/snflk/ts/
    (PATTERN=>$curr_dt_a);