What can be done to optimize (minimize the execution time of) bulk insert operations that load large *.csv files into a SQL Server database? Here you may find an excellent article on this subject.
But there is also a simple thing you should consider: autogrowth events. Autogrowth events can slow down many operations executed on a SQL Server database, including bulk inserts, so if we eliminate them the execution time should decrease.
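If you want to verify whether autogrowth events actually occur during a load, the default trace records them. Below is a minimal sketch, assuming the default trace is enabled (it is by default):

-- Read Data File Auto Grow (92) and Log File Auto Grow (93) events from the default trace
DECLARE @trace_path NVARCHAR(260);
SELECT @trace_path = [path] FROM sys.traces WHERE is_default = 1;

SELECT t.StartTime, t.DatabaseName, t.FileName,
    t.Duration / 1000 AS duration_ms,   -- Duration is reported in microseconds
    t.IntegerData * 8 AS growth_KB      -- number of 8 KB pages added, converted to KB
FROM sys.fn_trace_gettable(@trace_path, DEFAULT) AS t
WHERE t.EventClass IN (92, 93)
ORDER BY t.StartTime;
GO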
To test this hypothesis I ran two sets of tests using a *.csv file (4,366,563 rows) with an uncompressed size of 462 MB:
- One test using the default database options (for size and autogrowth), and
- The other test using a database pre-sized as in the second script below: a data file with an initial size of 2 GB and a log file with an initial size of 100 MB. Because the source file uses UTF-8 encoding, the text columns are defined with the NVARCHAR data type, so SQL Server stores two bytes per character (without data compression); a rough size estimate is sketched after this list.
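As a back-of-the-envelope estimate (a rough assumption, not a measurement from these tests), a mostly single-byte 462 MB source file stored in NVARCHAR columns needs roughly twice that space, plus some allowance for row headers, the clustered index and free space:

-- Rough sizing sketch; the 20% overhead factor is an assumption
DECLARE @source_mb DECIMAL(10, 2) = 462;  -- size of the uncompressed *.csv file
DECLARE @overhead  DECIMAL(10, 2) = 1.2;  -- allowance for row/page overhead and free space
SELECT CEILING(@source_mb * 2 * @overhead) AS estimated_data_MB;  -- ~1109 MB
GO

This is why the second test pre-sizes the data file well above the few-megabyte default seen in the first test.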
I used the following scripts:
- First test:

CREATE DATABASE TestBulkInsert
ON PRIMARY (NAME = 'TestBulkInsert_01_Data', FILENAME = 'E:\BD\TestBulkInsert.mdf')
LOG ON (NAME = 'TestBulkInsert_01_Log', FILENAME = 'E:\BD\TestBulkInsert.ldf')
GO
ALTER DATABASE TestBulkInsert
SET RECOVERY SIMPLE
GO
USE TestBulkInsert;
GO
SELECT file_id, name, size * 8 AS [size KB],
CASE
WHEN growth = 0 THEN 'fixed size'
WHEN is_percent_growth = 1 THEN STR(growth, 10, 0) + ' %'
ELSE STR(growth * 8, 10, 0) + ' KB'
END AS growth_description
FROM sys.database_files
GO
/*
file_id  name                    size KB  growth_description
-------  ----------------------  -------  ------------------
1        TestBulkInsert_01_Data  3136     1024 KB
2        TestBulkInsert_01_Log   1024     10 %
*/

- Second test:

CREATE DATABASE TestBulkInsert
ON PRIMARY (NAME = 'TestBulkInsert_01_Data', FILENAME = 'E:\BD\TestBulkInsert.mdf', SIZE = 2GB, FILEGROWTH = 500MB)
LOG ON (NAME = 'TestBulkInsert_01_Log', FILENAME = 'E:\BD\TestBulkInsert.ldf', SIZE = 100MB, FILEGROWTH = 100MB)
GO
ALTER DATABASE TestBulkInsert
SET RECOVERY SIMPLE
GO
USE TestBulkInsert;
GO
SELECT file_id, name, size * 8 AS [size KB],
CASE
WHEN growth = 0 THEN 'fixed size'
WHEN is_percent_growth = 1 THEN STR(growth, 10, 0) + ' %'
ELSE STR(growth * 8, 10, 0) + ' KB'
END AS growth_description
FROM sys.database_files
GO
/*
file_id  name                    size KB  growth_description
-------  ----------------------  -------  ------------------
1        TestBulkInsert_01_Data  2097152  512000 KB
2        TestBulkInsert_01_Log   102400   102400 KB
*/
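If the database already exists, it does not have to be recreated; the same pre-sizing can be applied with ALTER DATABASE ... MODIFY FILE. A short sketch using the logical file names from the scripts above:

-- Pre-size the existing data and log files instead of recreating the database
ALTER DATABASE TestBulkInsert
MODIFY FILE (NAME = 'TestBulkInsert_01_Data', SIZE = 2GB, FILEGROWTH = 500MB);
GO
ALTER DATABASE TestBulkInsert
MODIFY FILE (NAME = 'TestBulkInsert_01_Log', SIZE = 100MB, FILEGROWTH = 100MB);
GO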
The target table was created using the following script (shown here with an assumed table name):

CREATE TABLE [dbo].[BulkInsertTarget] ( -- assumed table name
[ColA] [bigint] NOT NULL PRIMARY KEY,
[ColB] [nvarchar](20) NULL,
[ColC] [nvarchar](4000) NULL
) ON [PRIMARY]
GO
To import the data I used an SSIS package.
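For comparison, here is a plain T-SQL BULK INSERT sketch; the file path, field/row terminators, header-row setting and code page are assumptions, not settings taken from the actual SSIS package:

-- Minimal BULK INSERT sketch (hypothetical file path and format options)
BULK INSERT [dbo].[BulkInsertTarget]   -- assumed table name, see the script above
FROM 'E:\BD\source_file.csv'           -- hypothetical path to the *.csv file
WITH (
    CODEPAGE = '65001',        -- UTF-8 (supported only by newer SQL Server versions)
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    FIRSTROW = 2,              -- assuming the file has a header row
    TABLOCK                    -- table lock; can enable minimally logged inserts when the prerequisites are met
);
GO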
Below are my results:
|                                 | Test 1 | Test 2 | Difference |
|---------------------------------|--------|--------|------------|
| Average execution time (sec.)   | 326    | 222    | 104        |
| Average execution time (mm:ss)  | 5:26   | 3:42   | 1:44       |
| Autogrowth events               | 946    | 0      | 946        |
As you can see, simply eliminating autogrowth events can significantly reduce the execution time when importing large files. This is because each autogrowth event is expensive: the load has to wait while SQL Server allocates and zero-initializes the new space (data-file zeroing can be skipped with Instant File Initialization; log files are always zero-initialized).
Note: It’s worth reading about Instant File Initialization (not used in these tests).
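On SQL Server 2016 SP1 and later you can check whether Instant File Initialization is enabled for the service account directly from T-SQL:

-- 'Y' means new data file space is not zero-initialized on growth (log files are always zero-initialized)
SELECT servicename, instant_file_initialization_enabled
FROM sys.dm_server_services;
GO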