Search code examples
sql-serverssissql-server-2016ssis-2016

How can I delete the columns in DataFlow Task in SSIS?


I use SQL Server 2016 and I have a very busy DataFlow task. In my DataFlow task, I use Multicast component for some reason. After creating a new Flow in my DataFlow, I need to delete some of the columns in the new flow because they are useless.

enter image description here

Just for more information, I need to do that because I have more than 200 columns in my flow and I need less than 10 of those columns.

How can I delete the columns in DataFlow Task in SSIS?


Solution

  • You can add an extra component of some sort. However, this will never reduce complexity or improve performance. Just thinking about it, logically, you are adding an additional interface that needs to be maintained. Performance-wise, anything that will eliminate columns means copying one set of rows from one buffer to a whole other buffer. This is called an asynchronous transformation, and it is better described here and here. You can imagine that copying rows is less efficient than updating them in place.

    Here are some recommendations for reducing complexity, which will, in turn, improve performance:

    • Reduce the columns at the source. If you are selecting columns that are not subsequently used in any way, then remove them from the query or uncheck them from the source component. Removing columns in this way removes them from the buffer, which will occupy less memory.
    • Reduce the number of components in the dataflow. Very long dataflows are easy to create, a pain to test and even harder to maintain. Dataflows are expecting a unit of work, i.e. a data stream from here to there with a few things in the middle. This is where dataflows shine, in fact, they protect themselves from complexity with memory limitations and a max number of threads. It is better to divide the work into separate dataflows or stored procs. You could stage the data into a table and read it twice, rather than use a multicast, for example.
    • Use the database. SSIS is as much an orchestration tool as it is a data-moving tool. I have often found that using simple dataflows to stage the data, followed by calls to stored procedures to process the data, always out-performs an all-in-one dataflow.
    • Increase the number of times you write the data. This is completely counter intuitive, but if you process data in smaller sets of operations, it is faster running and easier to test. Given a clean slate, I will often design an ETL to write data from the source to a staging table, perform a cleansing step from the stage table to another, optionally, add a conforming step to combine data from different sources to yet another table and, finally, a last step to load a target table. Note that each source is pushed to its own target table and later combined, leveraging the database. The first and last steps are set up to run fast and avoid locking or blocking at either end.
    • Bulk Load. The prior step really does well, when you insure that bulk loading is happening. This can be a tricky thing, but generally you can get there by using "fast load" in the OLEDB destination and by never using the oledb command. Removing indexes and re-adding them is faster than loading in place (with few exceptions).

    These guidelines will get you headed in the general direction, but do post more questions for tuning specific performance problems.