Microsoft released Data Transformation Services (DTS) along with SQL Server 7.0. It is a powerful, easy-to-use, graphical tool for importing, exporting and transforming data, functionality that is only available as a costly add-on in other RDBMS products like Oracle. Prior to SQL Server 7.0, all we had was BCP, which is powerful indeed, but not as friendly and functionally rich as DTS. DTS was made more robust and dramatically improved with the release of SQL Server 2000, as more and more DBAs and developers started using it for building various data loading and transformation solutions. It wouldn't be an exaggeration to say that DTS is one of the most popular ETL (Extract, Transform and Load) tools currently in use. Not bad for a tool that's been around for only about four years.
I haven't had a chance to work a great deal with SQL Server 7.0 DTS, which is one of the reasons why I haven't published many articles on DTS. But I have been using SQL Server 2000 DTS extensively for the last year and a half, implementing various data import/export and transformation solutions. It was during this period that I stopped using the Import/Export Wizard and started using the DTS Designer to design my packages. Designing packages in the DTS Designer is much better than using the Wizard, as the Designer gives you complete control over the functionality of the package, though it does take some getting used to. It took me a while to get comfortable designing packages, and along the way I learned a lot of tips and tricks and established some best practices within my organization, which I am going to share in this article.
In this article, I will list some of the DTS best practices that I follow, along with some tips and tricks. All this information applies to SQL Server 2000 and, with some exceptions, to SQL Server 7.0. The following points are in no particular order.
DTS Package Naming Conventions:
Design a naming convention for your DTS packages and document it. Make sure all your DBAs and DTS developers adhere to this naming convention while developing DTS packages. Here's the naming convention that I follow:
<Module>: <Package Function>: <Optional Part Number>
For example:
Coupon Allocation: Process Coupons
Coupon Allocation: Allocate Coupons
Coupon Allocation: Publish Coupons: Part 1
Coupon Allocation: Publish Coupons: Part 2
You may ask, why follow a naming convention at all for DTS packages? Here's why: when sorted by name in Enterprise Manager, a good naming convention groups together all the DTS packages that belong to a particular module, as shown in the following picture. This helps you locate the package in question easily when troubleshooting a problem. When your pager goes off at 2 AM, the last thing you want to do is trawl through a huge list of DTS packages to locate the problematic one.
Documentation:
Documentation is as important for DTS as it is for any program. Suppose you design a complex DTS package and move on to your next job. How is the new DBA supposed to maintain your DTS package if there isn't enough information provided? Always create a document that describes the complete flow and functionality of the DTS package, and save it in a central repository like Visual Source Safe (VSS).
The DTS Designer also provides a way of writing comments inside the package itself. If something in the package is not obvious, add a comment near that step by selecting 'Add Text Annotation' from the 'Package' menu. After writing the comment, you can adjust the color, size and other properties of the font, as appropriate for your package. Be sure to save the package immediately after adding the comment; there's a nasty bug in the DTS Designer that forgets to prompt you to save when you close the package after adding a comment. The following picture shows an Execute SQL task along with a comment:
Troubleshooting DTS Packages:
DTS packages, when scheduled as jobs, do not provide much useful information when they fail. When a scheduled DTS package fails, the job history may not always show you the exact error message that caused the failure, and your only option in that case is to re-run the package interactively and see what happens. To avoid this painful situation, always add an error log file to your DTS package. DTS appends all the status/progress information, as well as the complete error messages at the step level, to this file every time the package runs. You can also log this information to SQL Server tables in msdb. Logging to tables is especially useful if you want to generate statistics such as, on average, how much time each step of a critical DTS package takes to complete.
To configure logging for your package, right click on the package in DTS Designer, select 'Package Properties' and go to the 'Logging' tab. In this tab, specify the error file, as well as the SQL Server to which you want to log the package execution. When logging to SQL Server, the information is logged to the sysdtspackagelog and sysdtssteplog tables in the msdb database.
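For example, if you log to SQL Server, a query along the following lines gives you those per-step statistics. The stepname and executiontime (in seconds) columns are what I have seen in msdb.dbo.sysdtssteplog; the table is undocumented, so verify the column names on your own server before relying on this:

/* Rough sketch: average and worst execution time per step,
   across all logged runs. Column names are assumptions; check
   msdb.dbo.sysdtssteplog on your build before using this. */
SELECT sl.stepname,
       COUNT(*)              AS executions,
       AVG(sl.executiontime) AS avg_seconds,
       MAX(sl.executiontime) AS max_seconds
FROM msdb.dbo.sysdtssteplog AS sl
GROUP BY sl.stepname
ORDER BY avg_seconds DESC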
Note that DTS always appends to the error file, so over time you will end up with a huge log file. Consider creating a new log file (or truncating the old one) every time the package runs.
While this type of logging helps you find the exact error messages and identify the worst-performing steps in the package, you will still need your own logging to detect logical data errors. For example, you could write your own messages to a text file, using an ActiveX Script task and the FileSystemObject object, and review them later.
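Here is a bare-bones sketch of such an ActiveX Script task; the log file path and the message are placeholders you would replace with your own:

'Minimal sketch of custom logging from an ActiveX Script task.
'The path C:\DTSLogs\MyPackage.log and the message are placeholders.
Function Main()
    Dim oFSO, oFile
    Set oFSO = CreateObject("Scripting.FileSystemObject")
    'Open (or create) the log file in append mode: 8 = ForAppending
    Set oFile = oFSO.OpenTextFile("C:\DTSLogs\MyPackage.log", 8, True)
    oFile.WriteLine Now & " - Orders staging row count looks suspicious"
    oFile.Close
    Main = DTSTaskExecResult_Success
End Function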
When a scheduled DTS package fails and you need to run it interactively, make sure you run the package on the server itself, by logging into the server directly. There is a lot of confusion about where exactly a DTS package runs; hopefully the following will clear things up. DTS is a client-side application. A DTS package scheduled as a job under SQL Server Agent runs on the server. When you run the package interactively, it runs on the local machine. That is, if you connect to a remote SQL Server using Enterprise Manager on your workstation, any DTS package you run will run on your workstation. This results in slower package execution, as the data flows over the network between your local machine and the server. If you cannot log onto the server directly, then you need to make sure all the files and servers referenced in the DTS package are accessible from your machine as well. You must also copy any UDLs (Universal Data Link files) used by the package onto your local machine, to the same location.
Performance of DTS Packages:
For importing text files into SQL Server tables, use the built-in 'Bulk Insert Task', instead of the usual 'Transform Data Task'. Bulk Insert task is the fastest way for importing text files into SQL Server tables or views. While it is the fastest, you cannot perform data transformations with Bulk Insert task. With smaller text files, you may not notice a big difference in speeds between 'Transform Data Task' and 'Bulk Insert Task', but the difference becomes apparent when the text files start to grow in size. Bulk Insert task will perform much faster when importing from local text files (instead of over the network).
Parallelize your tasks to improve performance of your DTS Package. That is, if you have two tasks that are independent of each other, then run them in parallel, instead of running them serially. Running tasks in parallel will make use of all the processing power of the server, and complete the overall work much quicker. This can be configured using Workflow in DTS Designer. Make sure all the steps involved in the parallel processing are configured to use their own connections. The following pictures show how serial and parallel tasks appear in DTS Designer:
Steps running serially | Steps running in parallel |
First picture: the 'Import Orders' step runs first, then 'Import Products', then 'Create Reports'.
Second picture: both 'Import Orders' and 'Import Products' run in parallel, and when both of them complete, the final step, 'Create Reports', runs.
Similarly, if you have a module that makes use of multiple DTS packages, identify the packages that can run in parallel, independently of each other. After identifying those packages, run them in parallel from another package, using the 'Execute Package' task. DBAs often refer to this as a master package running child packages. Using the 'Execute Package' task, you can even pass data from one package to another using inner package global variables and outer package global variables. See SQL Server Books Online for more information.
When using the 'Transform Data' task to transfer tables from source to destination, try to use a single transformation for all columns, instead of an individual transformation for each column. I have tested this personally and found the single transformation to be much faster, especially with a large number of columns. Note that other transformation types, like the 'ActiveX Script' transformation, tend to be slower than the 'Copy Column' transformation. By default, the 'Transform Data' task creates a new transformation for each column pair. You'll have to delete all those transformations, select all the columns on both the source and the destination side, and click the New button. Then choose the 'Copy Column' transformation to create a single transformation for all the columns. The following pictures show how single and multiple transformations look graphically in a 'Transform Data' task:
Individual transformation for each column | One transformation for all columns |
To achieve high-speed data loads, consider removing indexes, constraints and triggers on the target table. You can temporarily remove or disable these objects and re-enable them after the data load; this dramatically improves data load performance. When using the 'Transform Data' task, make sure the 'Use fast load' check box is checked (the default) in the 'Options' tab. If appropriate for the table, also check 'Table lock' and uncheck 'Check constraints'.
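As an illustration, the disable/re-enable steps can be plain 'Execute SQL' tasks placed before and after the load. The table name below is hypothetical, and note that in SQL Server 2000 indexes have to be dropped and recreated rather than disabled:

/* Before the load (hypothetical table name): stop checking constraints
   and firing triggers. Indexes, if removed, must be dropped/recreated. */
ALTER TABLE dbo.OrderStaging NOCHECK CONSTRAINT ALL
ALTER TABLE dbo.OrderStaging DISABLE TRIGGER ALL

/* ...the fast load runs here... */

/* After the load: re-enable them. Use WITH CHECK CHECK CONSTRAINT ALL
   instead if you want the loaded data re-validated against the constraints. */
ALTER TABLE dbo.OrderStaging CHECK CONSTRAINT ALL
ALTER TABLE dbo.OrderStaging ENABLE TRIGGER ALL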
Sometimes text files contain different kinds of records that eventually end up in different tables. In this situation, I've seen developers processing the text file line by line using ActiveX scripting. This is a very bad idea; handling huge text files one line at a time takes ages to complete. A better approach is to create a table in SQL Server and load the complete file into it quickly, using the 'Bulk Insert' task. This table then serves as a staging table. Once the data is in the table, simple SELECT queries can extract the different kinds of rows into different tables, as sketched below.
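Here is a rough sketch of that pattern; the file path, staging table and column names are made up purely for illustration:

/* Load the raw file into a staging table in one shot
   (the same thing the 'Bulk Insert' task does) */
BULK INSERT dbo.FeedStaging
FROM 'C:\Feeds\DailyFeed.txt'
WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n')

/* Then split it with set-based queries, assuming a RecordType
   column identifies each kind of row */
INSERT INTO dbo.Orders (OrderID, CustomerID, OrderDate)
SELECT Col1, Col2, Col3
FROM dbo.FeedStaging
WHERE RecordType = 'O'

INSERT INTO dbo.Returns (ReturnID, OrderID, ReturnDate)
SELECT Col1, Col2, Col3
FROM dbo.FeedStaging
WHERE RecordType = 'R'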
In some cases, the text files you receive from data suppliers are not very friendly when it comes to their format. Again, avoid using ActiveX Script tasks to pre-process these files. Instead, use a lower-level language like C or C++ to pre-process the file before it gets imported into SQL Server. Programs written in languages like C perform file I/O much faster than ActiveX scripting.
It is common to end up with huge transaction log growth (and possibly 'out of log space' errors) when performing large update or delete operations from an 'Execute SQL' task. To overcome this problem, consider batched updates or deletes. That is, instead of updating all the rows at once, update chunks of rows and clear the transaction log after processing each chunk. Batched operations also help minimize locking. Here's an example:
/* Limit each UPDATE to batches of 1000 rows */
SET ROWCOUNT 1000
WHILE 1 = 1
BEGIN
    UPDATE Basket
    SET Inactive = 1
    WHERE ItemAddedDate < DATEADD(dd, -30, CURRENT_TIMESTAMP)
        AND Inactive = 0

    IF @@ROWCOUNT = 0
    BEGIN
        /* No more rows qualify, so we are done */
        BREAK
    END
    ELSE
    BEGIN
        /* CHECKPOINT will work for databases in simple recovery mode.
           Replace it with a BACKUP LOG command if your database is in full recovery mode */
        CHECKPOINT
    END
END
/* Restore the default row count behavior */
SET ROWCOUNT 0
The 'Data Driven Query' task (DDQ) is not always the right solution for processing and applying changes between two tables. Stay away from DDQs if you can, as they tend to be very slow because they process one row at a time (this can easily be verified using Profiler/SQL Trace). Changes can be processed just as easily with set-based SQL statements run directly from 'Execute SQL' tasks, as shown below:
/*Delete rows from target, that are not there in the source table*/
DELETE t
FROM dbo.Target AS t
WHERE NOT EXISTS
(
SELECT 1
FROM dbo.Source AS s
WHERE s.title_id = t.title_id
)
/*Insert rows from source, that are not already there in target table*/
INSERT INTO dbo.Target
SELECT *
FROM dbo.Source AS s
WHERE NOT EXISTS
(
SELECT 1
FROM dbo.Target AS t
WHERE t.title_id = s.title_id
)
/*Update matching rows between source and target.
(Matching primary keys, but different non-key values)*/
UPDATE t
SET title = s.title,
type = s.type,
pub_id = s.pub_id,
price = s.price,
advance = s.advance,
royalty = s.royalty,
ytd_sales = s.ytd_sales,
notes = s.notes,
pubdate = s.pubdate
FROM dbo.Target AS t INNER JOIN dbo.Source AS s
ON t.title_id = s.title_id
WHERE
(
(t.title <> s.title)
OR (t.type <> s.type)
OR (COALESCE(t.pub_id, '') <> COALESCE(s.pub_id, ''))
OR (COALESCE(t.price, -1) <> COALESCE(s.price, -1))
OR (COALESCE(t.advance, -1) <> COALESCE(s.advance, -1))
OR (COALESCE(t.royalty, -1) <> COALESCE(s.royalty, -1))
OR (COALESCE(t.ytd_sales, -1) <> COALESCE(s.ytd_sales, -1))
OR (COALESCE(t.notes, '') <> COALESCE(s.notes, ''))
OR (t.pubdate <> s.pubdate)
)
Notice the use of COALESCE in the WHERE clause of the above UPDATE. If you have NULLable columns, you must use the COALESCE or ISNULL function to convert NULLs to some comparable value, because a direct comparison between a NULL and a non-NULL value will not evaluate to true.
If you do end up using the 'Data Driven Query' task, be sure to use a separate database connection for the lookup, to ensure the read and write operations of the data pump are not serialized by sharing the same connection for performing lookups.
DTS Packages and Security:
It is important to secure your DTS packages, so that unauthorized users can neither run nor modify them. When saving DTS packages to SQL Server, you get the option to specify a user password and an owner password. Setting a user password ensures that only those who know the password can execute the package. Once the owner password is set, no one can modify the package without knowing this password.
It is very likely that some of the files you generate using DTS contain confidential information; in that case, pay special attention to where these files are being written. Output them to a secure location and make sure only authorized users have access to those files and folders. This can be controlled by setting NTFS permissions at the operating system (Windows) level.
Deployment, Maintenance, Usability and Portability of DTS Packages:
Very often, DTS packages are developed in the development environment, then moved to the test environment, and finally end up in production. This involves moving DTS packages from one SQL Server to another, and there are several methods for doing so. One of the most common is to save the DTS package as a structured storage file (.dts file), move the file to the target server, open it from Enterprise Manager and save it to that SQL Server. Other methods include copying the contents of msdb..sysdtspackages from one server to another, or using the DTS object model to load a package and save it to a different SQL Server. These methods take care of moving the packages between the servers, but what about the file locations and server names that are referenced within the DTS packages? Manually editing the connection properties is a tedious and error-prone task. Not to worry, we do have some options that make moving packages between different environments easier.
The built-in 'Dynamic Properties' task lets you store all the variable parameters (such as file names, SQL Server names, FTP locations and so on) separately, and assigns them to your package tasks at runtime. When you move the package to a new environment, you only have to update the 'Dynamic Properties' task to suit that environment, without having to touch any of the other tasks. The 'Dynamic Properties' task provides several ways of storing these parameters. You can store the values as constants within the task itself. You can use an initialization file (.ini file), in which case you don't have to touch the package at all after moving it to a new environment; you just edit the .ini file using Notepad or another text editor of your choice. You can also store the values in global variables. As you will have realized by now, a 'Dynamic Properties' task lets you control the package centrally, without having to edit every connection and file path manually.
Important Note: When using the 'Dynamic Properties' task, configure the workflow so that it executes as the first step of the package. Otherwise you may get weird errors like the following:
OleDBProperty 'language' already exists in the collection
UDLs (Universal Data Link files) are similar to DSNs in that they store the information required to connect to a data source, except that UDLs are used to connect to OLE DB data sources. Creating a UDL is easy: just create a blank file with a .udl extension and then double-click it to set the connection properties. UDLs can be used by many of the built-in DTS tasks for their connection settings. A major advantage of UDLs is that when you want a set of DTS tasks to point to a new server, you simply edit the UDL file. When working with UDLs, make sure the same UDLs are present on the server as well as on your development machine, at the same location.
Similarly, global variables can be used to maintain and share a piece of information between different steps of a package and between different packages. When using global variables, check the 'Explicit Global Variables' check box in the 'Global Variables' tab of the 'Package Properties' dialog box. This ensures that every global variable is explicitly declared before use, avoiding bugs and problems in your code.
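For reference, this is the typical way a step reads and writes a declared global variable from an ActiveX Script task; the variable names here are only examples:

'Sketch: reading and writing global variables from an ActiveX Script task.
'"ImportFileName" and "RowsProcessed" are example names that would have to be
'declared in the 'Global Variables' tab when 'Explicit Global Variables' is on.
Function Main()
    Dim sFile
    sFile = DTSGlobalVariables("ImportFileName").Value
    DTSGlobalVariables("RowsProcessed").Value = 1000
    Main = DTSTaskExecResult_Success
End Function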
When you schedule a DTS package using Enterprise Manager (by right-clicking on the package and selecting 'Schedule Package' from the menu), a job gets created with the desired schedule, but the job step contains the DTSRun parameters in encrypted form. Though encrypting the parameters is a good idea from a security point of view, it makes it very difficult to identify which package is being run and with what security credentials. Here's what an encrypted call to DTSRun looks like:
DTSRun /~Z0x4C5CE376C3E8E8EC253514EAA52C01B55663AB4ACE831E7AD242B477A71A0AB06369ED857944CACD88D075
To avoid this, edit the job step and provide plain-text parameters, as shown below. Since the following call to DTSRun uses Windows Authentication, there is no risk of displaying passwords in clear text:
DTSRun /S MySQLServerName /E /N "My DTS Package Name"
I found xp_smtp_sendmail to be extremely reliable and easy to use, for sending emails from within DTS packages. This extended stored procedure can be downloaded and installed from http://SQLDev.net.
In addition to saving DTS packages to SQL Server, also save a copy as a structured storage file (.dts file). This is especially useful, when it comes to version control. You can check these files into Visual Source Safe (VSS). Alternatively, you could also save DTS packages as Visual Basic (VB) files (.bas) and check these files into VSS.
Every time you save a DTS package to SQL Server, a new version is created. So you will see multiple rows for the same DTS package in the msdb..sysdtspackages table, one row for each version. An old DTS package that went through considerable development will have several versions in sysdtspackages. If you don't really need all those older versions, delete them to save space in the msdb database. From within Enterprise Manager, you can right-click on a package and select 'Versions' to see all the versions and delete them selectively.
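If you want to see how many versions have piled up before cleaning them out, a query along these lines against sysdtspackages works. The table is undocumented, so treat the column names as an assumption and verify them on your server:

/* Rough sketch: count saved versions per package.
   Column names (name, createdate) are assumptions; verify them
   against msdb.dbo.sysdtspackages before use. */
SELECT name,
       COUNT(*)        AS versions,
       MIN(createdate) AS oldest_version,
       MAX(createdate) AS newest_version
FROM msdb.dbo.sysdtspackages
GROUP BY name
HAVING COUNT(*) > 1
ORDER BY versions DESC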
Turn on DTS caching to prevent the DTS Designer and the DTS Wizard from enumerating the OLE DB providers, ActiveX scripting engines, DTS tasks and DTS transforms from the registry every time. This improves package load times. DTS caching is turned off by default; the following procedure turns it on:
In Enterprise Manager, right click on the 'Data Transformation Services' node, and select 'Properties' to bring up the 'Package Properties' dialog box. Check the box against 'Turn on cache' to enable caching. You could periodically refresh the cache by clicking on the 'Refresh Cache' button.
Special attention needs to be paid, when it comes to handling files from DTS packages. Here are some quick tips on file handling:
Do not fail on missing data files: Packages expect certain files to be present at a particular path, and some of those files come from different sources (including third parties). If it is acceptable in your environment, do not let the DTS package fail just because a file is missing; there is no point paging a DBA at midnight over something you can live with. Simply check for the file's existence from your package using an 'ActiveX Script' task and continue as usual if the file exists; otherwise disable the remaining steps, as in the sketch below. More complete code samples for this can be found at sqldts.com
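Here is a stripped-down sketch of that idea; the file path and step name are placeholders you would replace with your own:

'Sketch: skip the load steps when the feed file is missing,
'instead of failing the package. Path and step name are placeholders.
Function Main()
    Dim oFSO, oPkg
    Set oFSO = CreateObject("Scripting.FileSystemObject")
    If Not oFSO.FileExists("C:\Feeds\Catalog.txt") Then
        Set oPkg = DTSGlobalVariables.Parent
        'Disable the step(s) that depend on the file
        oPkg.Steps("DTSStep_DTSBulkInsertTask_1").DisableStep = True
    End If
    Main = DTSTaskExecResult_Success
End Function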
Keep a backup of processed files: Imagine a scenario where you get a product catalog from a third-party product supplier. You pull the file from an FTP location, process it into your tables, and delete the downloaded files. Then you realize that some of the products are showing up with a price of 0 and customers are happily purchasing them. You think you received incorrect price information from the supplier, but the supplier thinks there is a bug in your catalog-processing code. You can't prove your point, because your DTS package deleted the files at the end of the process. To avoid such problems, design a mechanism where processed files are moved to a specific location and are deleted from that location only once they are a week (or a month) old.
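One simple way to implement the 'move to an archive folder' part from an ActiveX Script task; the folder and file names are invented for the example:

'Sketch: move a processed file into a dated archive folder instead of
'deleting it. Folder and file names are placeholders.
Function Main()
    Dim oFSO, sArchive
    Set oFSO = CreateObject("Scripting.FileSystemObject")
    'Build a folder name like C:\Feeds\Processed\20040115\
    sArchive = "C:\Feeds\Processed\" & Year(Now) & _
               Right("0" & Month(Now), 2) & Right("0" & Day(Now), 2) & "\"
    If Not oFSO.FolderExists(sArchive) Then oFSO.CreateFolder sArchive
    oFSO.MoveFile "C:\Feeds\Catalog.txt", sArchive
    Main = DTSTaskExecResult_Success
End Function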
FTP is not reliable: Yes, FTP is not reliable. You may not get any errors, yet the file you downloaded could be incomplete. If reliable data is critical in your environment, consider implementing some kind of checksum mechanism that verifies the complete file was downloaded. Some suppliers also add a trailer record at the end of the file for this purpose.
Last but not least: do not display message boxes (using the MsgBox function from ActiveX Script tasks, in VBScript). A message box waits for a user to click OK, which is not possible when the package runs as a scheduled job, so it ends up hanging the job. It is okay to use message boxes during the development and debugging stages, but be sure to remove them before the DTS package goes into production.