Systems Engineering and RDBMS


Migration/Conversion/Upgrade approaches for RDBMS

Posted by decipherinfosys on March 12, 2007

When upgrading from one version of the application to the next, the schema undergoes changes and upgrade scripts need to be applied in order to perform the schema upgrade. Depending upon the amount of change between the versions, this can range from a minimal change (only some schema objects like procedures, views, functions etc. get upgraded) to a massive one where renames occurred, normalization or de-normalization changes occurred, new objects were created, and existing ones were modified, including the addition of columns, indexes and miscellaneous schema objects.

In the industry, two approaches are commonly adopted for migrating a product from a lower version to a higher version. The two approaches are:

1) Use DDL (Data Definition Language) scripts along with Update and Insert scripts. The DDL scripts are ALTER scripts, CREATE TABLE scripts etc., and the Update/Insert DML (Data Manipulation Language) scripts are used to update/insert data into the tables – this could be configuration data or could be updates that are needed due to the schema changes for the new release. In this scenario, there is no destination schema since the changes are applied directly to the Source Schema itself and that schema is upgraded from one version to the next.

2) Use an ETLM (Extract, Transform, Load and Modify) script to get from the Source Schema to the Destination Schema. The source schema is the older version schema for a product and the destination schema is the final version schema of the product.

Here are the pros and cons of both approaches. The steps below are written with Oracle in mind.

Approach #1 (the DDL approach) uses an ALTER script (or a set of ALTER scripts that are run in sequence) to modify the structures, create new objects etc., and an UPDATE script to update and re-set the data.
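As a minimal sketch of this approach, assuming a hypothetical ORDER_HEADER table (the object names and the value being back-filled are purely illustrative):

ALTER TABLE order_header ADD (order_status VARCHAR2(10));

-- Back-fill the new column in place in the source schema
UPDATE order_header
   SET order_status = 'OPEN'
 WHERE order_status IS NULL;

COMMIT;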

Approach #2 (the ETLM approach): The steps adopted in this approach are as follows:

a) Configuration tables are populated in the Destination Schema. This takes a minimal amount of time.
b) Indexes are dropped for the tables in the Destination Schema. Please note that the Primary Key Indexes are not dropped and the foreign keys are maintained.
c) The tables are changed to be in nologging mode.
d) We use a direct path insert statement (in parallel mode) to load the data into the tables. Note that this insert statement specifies the join conditions against the rest of the configuration tables to populate the data for the child columns. Since the join is from a child table to the parent, at most one matching record will be found for a value. This essentially combines all of those updates into a single statement (see the sketch after this list).
e) Once the data is loaded, the indexes are created in nologging and parallel mode.
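Here is a hedged sketch of steps (b) through (e) for a single table; the schema, table and column names are assumptions made for illustration, and parallel DML has to be enabled at the session level for the parallel hint to take effect:

ALTER SESSION ENABLE PARALLEL DML;

-- Step (c): put the destination table into nologging mode
ALTER TABLE order_detail NOLOGGING;

-- Step (d): direct path, parallel insert; the join to the configuration
-- table resolves the child column value within the same statement
INSERT /*+ APPEND PARALLEL(d, 4) */ INTO order_detail d (order_id, line_no, status_id)
SELECT s.order_id,
       s.line_no,
       c.status_id            -- looked up from the configuration table
  FROM old_schema.order_detail s
  JOIN status_config c
    ON c.status_name = s.status_name;

COMMIT;

-- Step (e): re-create the non-PK indexes once the data is loaded
CREATE INDEX order_detail_n1 ON order_detail (order_id) NOLOGGING PARALLEL 4;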

A brief description of the direct path inserts:

Direct-load INSERT enhances performance during insert operations by formatting and writing data directly into Oracle datafiles, without using the buffer cache. This functionality is similar to that of the Direct Loader utility (SQL*Loader). This is a common design methodology:

a) The tables in question are in NOLOGGING mode all of the time
b) Use the direct path insert to populate them — do not use regular inserts. Do not have any indexes except the index needed to enforce the Primary Key (that way, when loaded, they will not generate undo or redo). We keep the Primary Key so that we do not have to pay for the cost of enabling the constraints later on.
c) Create non-PK indexes on the table. When you add the indexes — do it in PARALLEL with NOLOGGING.
d) After doing these operations, simply back up (hot backup) the affected tablespaces (the tablespaces with the table/index data) and add that to your backup set (a sketch follows below).

That is it.
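As a hedged illustration of step (d), a user-managed hot backup of the affected tablespace could look like the following (the tablespace name is an assumption and the database needs to be in ARCHIVELOG mode; an RMAN tablespace backup works just as well):

ALTER TABLESPACE conv_data BEGIN BACKUP;
-- ... copy the datafiles belonging to CONV_DATA at the OS level ...
ALTER TABLESPACE conv_data END BACKUP;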

A direct path insert writes the data blocks straight into the datafiles, so we are not writing the data through the buffer cache and are only waiting for the physical I/O to complete. Combined with NOLOGGING, very little redo is generated for those blocks. Parallel direct path inserts are faster than filling up the buffer cache and waiting for DBWR to write the blocks out: we do the writing ourselves and wait for that I/O to complete, not DBWR, and because we did not flood the buffer cache or generate the usual redo, we are not waiting on LGWR either.

Advantages of the ETLM (Extract, Transform, Load and Modify) approach (Approach #2) over the DDL and update script methodology (Approach #1) are as follows:

1) The time taken is less, though this largely depends upon the volume of your data. We have done tests on 200 – 500GB schemas and found this approach to be much faster.

2) The original schema remains untouched. So, in case something goes wrong, there is a fallback in place instead of having to restore the old schema, which would take time depending upon the data volume (this can be mitigated by having two copies of the original schema to start with, but that is extra effort and extra disk space).

3) There is no issue of internal fragmentation in the case of the ETLM process – in fact, it gets rid of any existing internal fragmentation. As you are well aware, with the passage of time, fragmentation can occur in a schema due to DML operations. The ALTER commands followed by the UPDATE statements to update the values can create more fragmentation issues as well as row chaining/row migration, which is bad for performance. Such issues do not exist in the ETLM methodology.

4) It is very easy to start from where you left off. Say we have to migrate 1000 tables from Source to Destination using the ETLM process and we hit an error in the migration script for table #551. A CONFIG_CONVERT table can be created with flag columns to keep track of where we are in the process, and the script will pick up right from there. In the other process, if there is an error due to data issues, it is not that straightforward unless proper error handling has been designed.
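One possible shape for such a tracking table, purely as an illustration (the column names below are assumptions, not taken from an actual conversion script):

CREATE TABLE config_convert
(
  table_name       VARCHAR2(30)  NOT NULL,
  load_status      VARCHAR2(10)  DEFAULT 'PENDING' NOT NULL,  -- PENDING / DONE / ERROR
  load_start_time  DATE,
  load_end_time    DATE,
  CONSTRAINT config_convert_pk PRIMARY KEY (table_name)
);

-- The conversion script flags each table as it finishes, so a re-run can skip
-- everything already marked DONE and resume at the table that failed
UPDATE config_convert
   SET load_status = 'DONE',
       load_end_time = SYSDATE
 WHERE table_name = 'ORDER_DETAIL';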

5) The initrans and pctfree settings for the blocks can change between releases of the product. If the Alter + Update script methodology is followed, one would need to extract the data out and re-load it after making these changes, since they do not take effect for existing blocks unless a reload is done. Please note that initrans is not dynamic (maxtrans is, and defaults to 255 in 10g).
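For illustration (the table name is assumed), the new settings can be put in place with an ALTER, but they only apply to blocks formatted after the change; under the Alter + Update approach the existing data would still have to be rebuilt, for example with a table move:

-- Applies only to newly formatted blocks
ALTER TABLE order_detail PCTFREE 20 INITRANS 4;

-- Rebuilding the segment applies the settings to the existing data as well;
-- the table's indexes become UNUSABLE and have to be rebuilt afterwards
ALTER TABLE order_detail MOVE;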

6) Huge update statements get replaced with a direct path insert statement, so instead of firing, say, 20 update statements, one single insert statement (with the proper joins) does the work.

7) Using the ETLM process, there is never an issue of a missing ALTER script (though if the DBA follows the right process, this is less of a concern). This also ties in with the customer experience: customers get a single conversion script with documented steps rather than a bunch of DDL scripts.

8) ALTER statements that add NOT NULL columns, with or without a default value, become cumbersome since they take a long time on large tables. A simple test of adding a NOT NULL column with a default value shows that the time taken is substantial, and the addition of such columns is pretty common when migrating between versions of a product.
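As an example of the kind of statement referred to above (the table and column names are illustrative), on the Oracle releases this post targets the default value is written into every existing row before the NOT NULL constraint is enabled, which is where the time goes:

ALTER TABLE order_detail ADD (conversion_flag CHAR(1) DEFAULT 'N' NOT NULL);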

9) In DB2, where renaming a column is not allowed, one has to drop and re-create it, which means extra code.

Disadvantages of the ETLM approach over the DDL approach:

1) Space requirements increase, since the two-schema approach means one needs more than the equivalent of the original schema's space available.
2) It requires development time to put together the conversion script, plus the testing effort.

Posted in DB2 LUW, Oracle, SQL Server

Truncate vs Delete

Posted by decipherinfosys on March 12, 2007

Both of these commands can be used to remove data from a table.  However, there are significant differences between the two.  The TRUNCATE command is faster because it does not have the resource overhead of logging the individual row deletions in the log.  It also acquires fewer locks, and the only record of the truncation is the page deallocation, so the removed records cannot be selectively restored.  Syntax-wise, you cannot specify a WHERE clause with this command.  Besides being faster (due to minimal logging), another advantage in the case of SQL Server is that it re-sets the IDENTITY value back to its original seed and the deallocated pages can be re-used.  Besides the limitation of not being able to restore the data, other limitations are that it cannot be used on tables involved in replication (or log shipping in the case of SQL Server) and it cannot be used on tables that are referenced by foreign keys.  In addition, this command does not fire delete triggers.

The DELETE command, on the other hand, logs each and every deleted row in the log.  It consumes more database resources and locks.  However, the data can be restored easily, you can specify a WHERE clause, and the triggers are honored.  In the case of SQL Server, you would need to re-seed the identity value using the DBCC CHECKIDENT command once you are done with the delete statement (a sketch follows below).
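As a minimal sketch of the two commands in SQL Server (the staging table and its columns are illustrative):

-- Minimally logged, no WHERE clause allowed, resets the IDENTITY seed
TRUNCATE TABLE stg_orders;

-- Fully logged, honors the WHERE clause and fires the delete triggers
DELETE FROM stg_orders
 WHERE load_date < '20070101';

-- SQL Server only: after the delete, re-seed the identity value manually
DBCC CHECKIDENT ('stg_orders', RESEED, 0);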

TRUNCATE is typically used in data warehousing applications to clear out staging tables during loads.  It is also used in maintenance scripts by DBAs/developers.  In actual application logic, DELETE is the right choice because of the limitations mentioned above for the TRUNCATE command.

Point of caution: The behavior of TRUNCATE differs between Oracle and SQL Server.  In Oracle it is a DDL command and hence performs an implicit commit, which means you cannot roll back a truncate.  In SQL Server that is not the case: as long as the TRUNCATE is issued inside an explicit transaction, it can be rolled back.
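A quick SQL Server illustration of the caution above (the table name is assumed):

BEGIN TRANSACTION;
TRUNCATE TABLE stg_orders;
ROLLBACK TRANSACTION;  -- the truncated rows are back; in Oracle the implicit
                       -- commit of the DDL would have made this impossible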

Posted in Oracle, SQL Server