Search code examples
pythonsqlpysparkhiveregexp-replace

Join on two table, file_name having extra string, regex to remove string from filename and do the join


I have two tables on which in need to apply join on table_name and file_name respectively. Problem is table_name have some extra string in comparison to file_name in table 2.

Using a regex, how can I remove extra string from table_name to make it compatible for join with file_name of table 2?

TABLE 1:

table_name                                                                 audit_record_count
Immunology_COVID-19_Treatment_202006221630_01.csv                          1260124
Immunology_COVID-19_Trial_Design_202006221630_01.csv                       2173762
Immunology_COVID-19_Planned_Treatment_202006221630_01.csv                  1350135
Immunology_COVID-19_Patient_Characteristic_202006221630_01.csv             2173762
Immunology_COVID-19_Intervention_Type_202006221630_01.csv                  2173762
Immunology_COVID-19_Arm_202006221630_01.csv                                      4
Immunology_COVID-19_Actual_Treatment_202006221630_01.csv                   2173762
Immunology_COVID-19_Publication_202006221630_01.csv                        2173762
Immunology_COVID-19_Outcome_202006221630_01.csv                            2173762
Immunology_COVID-19_Intervention_Type_Factor_202006221630_01.csv           2173762
Immunology_COVID-19_Inclusion_Criteria_202006221630_01.csv                 2173762
Immunology_COVID-19_Curation_202006221630_01.csv                           2173762

TABLE 2:

file_name                           csv_record_count
Treatment                           1260124
Trial_Design                        2173762
Planned_Treatment                   1350135
Patient_Characteristic              2173762
Intervention_Type                   2173762
Arm                                       4
Actual_Treatment                    2173762
Publication                         2173762
Outcome                             2173762
Intervention_Type_Factor            2173762
Inclusion_Criteria                  2173762
Curation                            2173762

What I have tried :

audit_file_df = spark.read.csv(
            f"s3://{config['raw_bucket']}/{config['landing_directory']}/{config['audit_file']}/{watermark_timestamp}*.csv",
            header=False, inferSchema=True) \
            .withColumnRenamed("_c0", "table_name").withColumnRenamed("_c1", "audit_record_count")\
            .selectExpr("regexp_extract(table_name, '^(.(?!(\\\\d{12}_\\\\d{2,4}.csv|\\\\d{12}.csv)))*', 0) AS table_name",'audit_record_count') 
        
        print("audit_file_df :",audit_file_df)
        audit_file_df.show()
        validation_df = audit_file_df.join(schema_validation_df, how='inner', on=audit_file_df['table_name'] == schema_validation_df['file_name']).withColumn("count_match",
                                                                                                     col=col(
                                                                                                         'audit_record_count') == col(
                                                                                                         'csv_record_count'))
        print("Record validation result")
        validation_df.show()

I am able to remove timestamp from table_name but not able to extract file_name to make join condition work.

Addition

Immunology_COVID-19 is not fixed it might change for another file , format for table_name is:

TA_Indication_data_timestamp_nn.csv

Solution

  • Create an additional column in table 1 that contains the data part:

    df = df.withColumn('data', F.regexp_extract(F.col('table_name'), '.*?_.*?_(.*)_\d{12}_\d{2}\.csv', 1))
    

    gives

    +----------------------------------------------------------------+---------+------------------------+
    |table_name                                                      |audit_rec|data                    |
    +----------------------------------------------------------------+---------+------------------------+
    |Immunology_COVID-19_Treatment_202006221630_01.csv               |1260124  |Treatment               |
    |Immunology_COVID-19_Trial_Design_202006221630_01.csv            |2173762  |Trial_Design            |
    |Immunology_COVID-19_Planned_Treatment_202006221630_01.csv       |1350135  |Planned_Treatment       |
    |Immunology_COVID-19_Patient_Characteristic_202006221630_01.csv  |2173762  |Patient_Characteristic  |
    |Immunology_COVID-19_Intervention_Type_202006221630_01.csv       |2173762  |Intervention_Type       |
    |Immunology_COVID-19_Arm_202006221630_01.csv                     |4        |Arm                     |
    |Immunology_COVID-19_Actual_Treatment_202006221630_01.csv        |2173762  |Actual_Treatment        |
    |Immunology_COVID-19_Publication_202006221630_01.csv             |2173762  |Publication             |
    |Immunology_COVID-19_Outcome_202006221630_01.csv                 |2173762  |Outcome                 |
    |Immunology_COVID-19_Intervention_Type_Factor_202006221630_01.csv|2173762  |Intervention_Type_Factor|
    |Immunology_COVID-19_Inclusion_Criteria_202006221630_01.csv      |2173762  |Inclusion_Criteria      |
    |Immunology_COVID-19_Curation_202006221630_01.csv                |2173762  |Curation                |
    +----------------------------------------------------------------+---------+------------------------+
    

    Then you can join the tables using table1.data and table2.file_name and continue with the audit check that you have already given in the question.

    The tricky part of the regexp is to use non-greedy qualifiers, as the data part itself can contain underscore characters.