Tuesday, September 8, 2009
Fact Tables: To Join or Not to Join?
Many times, it seems imperative to join two fact tables in a DW, because the reports (or rather, the queries behind the reports) need it.
In general, it is not a good idea to join two fact tables. The reason: when you inner-join two tables on a condition, only the rows that satisfy the condition are returned. So if you join two fact tables, factA and factB, only those rows from factA that have matching rows in factB are returned; the non-matching rows are dropped (and hence won't show up in reports). This may or may not be consistent with the business rules. For example, a row in factA with no matching row in factB may be a perfectly valid business situation, yet the join will drop it. Effectively, we are eliminating a valid row from the report.
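This row-loss effect is easy to demonstrate with SQLite (factA/factB and their columns are made-up names for illustration):

```python
import sqlite3

# Two tiny fact tables; order_id 3 exists in factA but has no match in factB.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE factA (order_id INTEGER, amount REAL);
    CREATE TABLE factB (order_id INTEGER, shipped_qty INTEGER);
    INSERT INTO factA VALUES (1, 100.0), (2, 250.0), (3, 75.0);
    INSERT INTO factB VALUES (1, 10), (2, 25);
""")

# The inner join silently drops order_id 3 from the result.
rows = conn.execute("""
    SELECT a.order_id, a.amount, b.shipped_qty
    FROM factA a JOIN factB b ON a.order_id = b.order_id
""").fetchall()
print(rows)  # order 3 is gone, though it is a valid factA row
```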
Work-arounds:
- An outer join may be a solution. An outer join returns ALL rows from factA, with null values in the factB fields for non-matching rows. Whether this is acceptable depends on the business logic. For example, reports are often designed to fill in dummy values like 'Unknown' or 'Undefined' when they see a null, and the users may be fine with that. But if you are going to use a field from factB to fetch another field from a third table, this strategy may not work.
- Another approach goes back to the database (ETL) design. The ETL designer should be made aware of the reports that are going to be generated out of the system, and the database should be designed so that the join is not needed in the first place.
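The outer-join workaround can be sketched the same way (same hypothetical factA/factB tables; here the report layer's 'Unknown' substitution is done with COALESCE):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE factA (order_id INTEGER, amount REAL);
    CREATE TABLE factB (order_id INTEGER, shipped_qty INTEGER);
    INSERT INTO factA VALUES (1, 100.0), (2, 250.0), (3, 75.0);
    INSERT INTO factB VALUES (1, 10), (2, 25);
""")

# LEFT OUTER JOIN keeps all factA rows; non-matching rows get NULL for
# factB fields, which the report can map to a dummy value like 'Unknown'.
rows = conn.execute("""
    SELECT a.order_id, a.amount,
           COALESCE(b.shipped_qty, 'Unknown') AS shipped_qty
    FROM factA a LEFT JOIN factB b ON a.order_id = b.order_id
""").fetchall()
print(rows)  # all three factA rows survive
```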
Thursday, August 20, 2009
Reports
In a DW environment, there are various ways to create reports. A report-writing tool like Crystal Reports may be used, and many BI suites (like Business Objects) include tools such as WebI that enable users to create their own reports.
I will discuss report testing in general terms, without getting into the details of any single tool.
Every report has a front end and a back end. The back end is, of course, the DW database; most reports access data from the fact tables.
The front end is the body of the report that the user actually sees: a chart (bar, line, pie, donut, etc.), a diagram (Venn, etc.), or simply a two-dimensional table.
Each report has an associated query (or set of queries) that runs against the database and retrieves rows. The logic for processing and displaying the retrieved data is the other aspect of a report. In a Business Objects environment, it is the Universe that contains all this logic.
Based on the discussion so far, the following are possible test scenarios for reports in a DW environment:
- Queries: As with any software, a report should have a specification document that states which information is to be displayed, and in which format.
Verify that the query (or queries) associated with the report actually retrieves the data it is supposed to retrieve from the database. This test is geared toward the functionality of the report query.
Additionally, if the scope provides for it, the query may be tested for performance (examining the joins, access paths, Explain plan, etc.).
- Output: Output of a report should be tested for the following:
. Report inputs
. Report layout (bar, pie, donut, tabular, etc.)
. Filter conditions (corresponding to the WHERE clause)
. Grouping conditions (corresponding to the GROUP BY clause)
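One way to exercise the filter and grouping checks is to recompute the expected result independently of the report query; a minimal sketch with an invented sales_fact table:

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_fact (region TEXT, amount REAL);
    INSERT INTO sales_fact VALUES
        ('East', 100), ('East', 50), ('West', 200), ('West', -10);
""")

# The report's query: WHERE filters out negatives, GROUP BY sums per region.
report = dict(conn.execute("""
    SELECT region, SUM(amount) FROM sales_fact
    WHERE amount > 0 GROUP BY region
""").fetchall())

# Recompute the same filter and aggregation in plain Python and compare.
expected = defaultdict(float)
for region, amount in conn.execute("SELECT region, amount FROM sales_fact"):
    if amount > 0:
        expected[region] += amount

assert report == dict(expected)
print(report)  # {'East': 150.0, 'West': 200.0}
```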
- Report Contents: This is the most important part of reports testing. As a user, I run a report, I see the data, I am happy. BUT how do I know that what I see is what it should be? In other words, how do I know the data displayed in the report is accurate? A billion-dollar question.
The answer goes back to the ETL process that populated the DW database. If the ETL is designed properly and runs error-free, we can have some level of confidence in the accuracy of the report. That is why it is almost always advisable to perform ETL QA before reports QA. I would go one step further and say that reports should not even be developed before the ETL is signed off. This may sound like paranoia, and may not always be possible in the real world.
There are two aspects to this issue: one, whether the data retrieved by the query is accurate; two, whether the data-processing logic (built into the report) is accurate.
Hence, two sets of tests are required.
- Run the associated query on the DW database. Remember, the data retrieved by the query is RAW data; it is not yet processed. This data needs to be compared to the source data (that is, the source of the ETL). This can be a little tricky, since the source can be almost any data structure. If the source is also a database, the report query can be 'translated' to run on the source database, and the two result sets compared.
Again, remember that the two sets of data (source and target) may not be identical, since the whole ETL logic sits between them. So some level of manual judgement is involved here.
Any discrepancies at this step point to the ETL design (assuming that the report query itself is accurate).
- Once we are sure that the data is accurate, we need to know how it is processed to bring it to the final output format. For example, if a bar chart shows percentages of revenue growth across regions, we should process the raw data ourselves, calculate the required percentages, and compare them with what we see in the report. This way, we can verify the report output.
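For the bar-chart example, the check can be as simple as recomputing the percentages from the raw rows (the revenue figures here are invented):

```python
# Raw revenue per region for two periods, as retrieved by the report query:
# region -> (previous revenue, current revenue)
raw = {"North": (200.0, 230.0), "South": (400.0, 380.0)}

# Recompute growth percentages and compare with what the chart displays.
growth = {
    region: round((curr - prev) / prev * 100, 1)
    for region, (prev, curr) in raw.items()
}
print(growth)  # {'North': 15.0, 'South': -5.0}
```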
Wednesday, August 19, 2009
Test Scenarios for a DW
I will discuss possible test scenarios for each of the components of the DW.
ETL
- Source-to-Target Mapping: Verify that for each field (in each table) in the target DB there is one (and only one) source field, and that corresponding source and target fields have matching data types and lengths. Any mismatch here can result in disasters later on. For example, if a long source field is copied into a shorter target field, truncation may occur.
One common pitfall is the decimal field. If the precision and scale do not match, rounding errors can occur. If the source field has scale = 3 and the target field has scale = 2, the target will round off the third decimal digit, so 5.446 becomes 5.45. Not so accurate.
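The rounding pitfall is easy to reproduce with Python's decimal module (used here as a stand-in for a scale-2 target column):

```python
from decimal import Decimal, ROUND_HALF_UP

source_value = Decimal("5.446")   # source column: scale = 3
# Loading into a scale-2 target rounds off the third decimal digit.
target_value = source_value.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
print(target_value)  # 5.45 -- no longer equal to the source value
```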
Remember that, 'source' can consist of one (or a combination ) of the following: database tables, Access tables, EXCEL sheets, flat files etc. Hence, it is advisable to maintain a source-to-target mapping document.
- ETL Execution: Verify that the ETL execution is error-free. A common practice is for the target database to have two tables: Audit and Error. After the ETL execution, the Audit table stores the following information: rows extracted from the source, rows in the target before loading, and rows in the target after loading. The Error table holds the source rows that didn't make it to the target due to some error.
If there are any rows in the Error table, one has to debug why the ETL dropped them.
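A post-load check against such Audit and Error tables might look like this (the table and column names are assumptions; real schemas vary):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE audit (rows_extracted INT, rows_before INT, rows_after INT);
    CREATE TABLE error (row_data TEXT, reason TEXT);
    INSERT INTO audit VALUES (1000, 5000, 5998);
    INSERT INTO error VALUES ('row#413', 'NULL in NOT NULL column'),
                             ('row#877', 'FK lookup failed');
""")

extracted, before, after = conn.execute("SELECT * FROM audit").fetchone()
errors = conn.execute("SELECT COUNT(*) FROM error").fetchone()[0]

# Loaded rows plus errored rows should account for every extracted row.
loaded = after - before
assert loaded + errors == extracted, "ETL dropped rows without logging them"
print(loaded, errors)  # 998 2
```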
- Row Count Reconciliation: This is the most important test of an ETL. The data in the source and the target has to be consistent. This is a tricky subject, because the data organization on the source and target sides can be significantly different. The target is a DW, comprising fact and dimension tables with foreign-key and referential constraints; the source can be almost anything, and these source configurations differ from project to project (to each its own). So, in general, there has to be a way to verify that the rows in a fact table (or rather, ALL the fact tables) on the target side reconcile with the data on the source side. This depends largely on the logic of the ETL transformations, and that's why I call it white-box testing: the QA person has to know the ETL internals in order to validate the data.
Database
Database tests, in the context of a DW, cover two major aspects: referential-integrity and foreign-key constraints between fact tables and dimension tables.
- Check the referential-integrity constraint: if you try to delete a record from a dimension table for which at least one row exists in a fact table, the database should stop you from doing so.
- Check the foreign-key constraint: if you try to update a key value in a fact table to a value that does not exist in the corresponding dimension table, the database should stop you from doing so.
Exception: In some DW setups, the fact tables may be refreshed daily while the dimension tables are refreshed weekly (or at some other frequency). In such cases, these constraints should be used judiciously.
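Both constraint checks can be scripted against a test database. A sketch with SQLite and hypothetical fact/dim tables, where each illegal operation should raise an IntegrityError:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only if asked
conn.executescript("""
    CREATE TABLE dim_region (region_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_sales (
        region_id INTEGER REFERENCES dim_region(region_id),
        amount REAL);
    INSERT INTO dim_region VALUES (1, 'East');
    INSERT INTO fact_sales VALUES (1, 100.0);
""")

# RI check: deleting a dimension row referenced by a fact row must fail.
try:
    conn.execute("DELETE FROM dim_region WHERE region_id = 1")
    ri_enforced = False
except sqlite3.IntegrityError:
    ri_enforced = True

# FK check: updating a fact key to a nonexistent dim key must fail.
try:
    conn.execute("UPDATE fact_sales SET region_id = 99")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True

print(ri_enforced, fk_enforced)  # True True
```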
I shall cover the Reports Testing Scenarios in the next post.
Wednesday, July 8, 2009
DW and QA Project - I
In the coming posts, I will discuss a data mart from a QA point of view.
As we know, a data mart can be thought of as a smaller data warehouse.
From a QA perspective, the entire DM project can be divided into the following modules:
- ETL
- Database
- Universe
- Reports
The starting point for this project can be the following documents:
- Data mart schema (fact and dim tables, etc.)
- Report specifications
The first task is to come up with a test plan. A single test plan can be created, with one section for each of the four modules mentioned above.
A list of possible test scenarios should be included in the test plan.
After the test plan is created and approved, the next task is to create test cases. That will be discussed in the next post.