azure-data-factory

ADF Copy Activity additional column on Source manipulating a value ($$FILEPATH)


Copy Activity with Source == Azure Blob and target == Azure Table. The blobs use a wildcard file path. The $$FILEPATH value can be set as an additional column, but the value cannot be referenced from dynamic content. Or can it? It would be awesome if the $$FILEPATH value could be split (hint hint, using the @split function) and additional columns made from folders at certain depths. I can't find any documentation on how to reference these values from dynamic content. I tried making a variable that references the value, but that fails.

Being able to play with the source columns when previewing is nice, though.


Solution

  • AFAIK, $$FILEPATH in the copy activity only gives the file path as a new column; the copy activity doesn't support manipulating column values during the copy.
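
    For context, the additional column is declared on the copy activity's source, roughly like below. This is a minimal sketch, not a full activity definition; I'm assuming a delimited text source here, and the column name filepath is just an example:

    {
        "source": {
            "type": "DelimitedTextSource",
            "additionalColumns": [
                {
                    "name": "filepath",
                    "value": "$$FILEPATH"
                }
            ]
        }
    }

    The value is stamped onto each row at copy time, which is why it can't be referenced or transformed through dynamic content afterwards.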

    By using Dataflows in ADF, you can achieve your requirement. But Dataflows don't support Table storage as a sink. So the workaround in this case can be: first use the dataflow to manipulate the file path column and write the result to a temporary Blob file, then after the Dataflow activity use a copy activity to copy the temporary Blob file to Table storage.

    In the Dataflow, give your source and set your wildcard path in the source options. Also give a column name to store the file path.

    [screenshot: Dataflow source options with the wildcard path and the file path column name set]

    This will give a result like the one below in the data preview.

    [screenshot: data preview showing the file path column populated for each row]

    You can manipulate the column values in the dataflow using a derived column transformation. Here I used split as an example; a sketch for pulling out folders at specific depths follows the screenshot below.

    [screenshot: derived column transformation splitting the file path column]
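
    If you want separate columns for the folders at certain depths, a derived column along these lines should do it, in place of the single derive line in the dataflow JSON below. This is an untested sketch in the data flow expression language; array indexes are 1-based, index 1 may be the container or an empty segment depending on how your path is rooted, and the column names are just examples:

    source1 derive(folderDepth1 = split(filepath, '/')[1],
         folderDepth2 = split(filepath, '/')[2],
         fileName = split(filepath, '/')[size(split(filepath, '/'))]) ~> derivedColumn1

    Each derived column flows through the temporary file and ends up as its own column in Table storage after the final copy.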

    In the dataflow sink, give a temporary Blob file (use "Output to single file" and specify the filename in the sink settings).

    My dataflow JSON for your reference:

    {
        "name": "dataflow1",
        "properties": {
            "type": "MappingDataFlow",
            "typeProperties": {
                "sources": [
                    {
                        "dataset": {
                            "referenceName": "source_files",
                            "type": "DatasetReference"
                        },
                        "name": "source1"
                    }
                ],
                "sinks": [
                    {
                        "dataset": {
                            "referenceName": "temp_files",
                            "type": "DatasetReference"
                        },
                        "name": "sink1"
                    }
                ],
                "transformations": [
                    {
                        "name": "derivedColumn1"
                    }
                ],
                "scriptLines": [
                    "source(allowSchemaDrift: true,",
                    "     validateSchema: false,",
                    "     ignoreNoFilesFound: false,",
                    "     rowUrlColumn: 'filepath',",
                    "     wildcardPaths:['*']) ~> source1",
                    "source1 derive(filepath = split(filepath, '/')[size(split(filepath, '/'))]) ~> derivedColumn1",
                    "derivedColumn1 sink(allowSchemaDrift: true,",
                    "     validateSchema: false,",
                    "     partitionFileNames:['temp.csv'],",
                    "     skipDuplicateMapInputs: true,",
                    "     skipDuplicateMapOutputs: true,",
                    "     partitionBy('hash', 1)) ~> sink1"
                ]
            }
        }
    }
    

    Use a Dataflow activity to execute this dataflow, and after the Dataflow activity use a copy activity, like below.

    [screenshot: pipeline with the Dataflow activity followed by the copy activity]
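
    For completeness, the pipeline wiring could look roughly like below. This is a hand-written sketch rather than an export from ADF; the target_table dataset name is my placeholder, and I'm assuming a delimited text temp file copied to an Azure Table sink:

    {
        "name": "pipeline1",
        "properties": {
            "activities": [
                {
                    "name": "Dataflow1",
                    "type": "ExecuteDataFlow",
                    "typeProperties": {
                        "dataflow": {
                            "referenceName": "dataflow1",
                            "type": "DataFlowReference"
                        }
                    }
                },
                {
                    "name": "CopyTempToTable",
                    "type": "Copy",
                    "dependsOn": [
                        {
                            "activity": "Dataflow1",
                            "dependencyConditions": [
                                "Succeeded"
                            ]
                        }
                    ],
                    "inputs": [
                        {
                            "referenceName": "temp_files",
                            "type": "DatasetReference"
                        }
                    ],
                    "outputs": [
                        {
                            "referenceName": "target_table",
                            "type": "DatasetReference"
                        }
                    ],
                    "typeProperties": {
                        "source": {
                            "type": "DelimitedTextSource"
                        },
                        "sink": {
                            "type": "AzureTableSink"
                        }
                    }
                }
            ]
        }
    }

    The dependsOn condition makes sure the copy activity only runs after the temporary file has been written successfully.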