Analysis in QlikView is lightening quick, millions of rows can be aggregated and manipulated with very little latency. Getting those millions of records off a server elsewhere on your network – now that can be a different matter. This is where a robust QVD strategy and QlikView Incremental Loads become crucial.
I Want My QVD
It is always pushed as the ‘correct’ way of doing things to have a QVD layer in any QlikView implementation. In practice there are very few scenario’s where it does not make things easier to have a QVD layer. Even if you are not in a situation where your are pre-processing the data once before using these in multiple QlikView apps; the ability to logically segregate data extraction and manipulation from data presentation can make development easier and less prone to errors. The incredible speed that data can be loaded from QVD means that unit development and test of just part of the load process before refreshing presentation apps can be done without lots of waiting around. I’ve blogged on getting data design right previously (Perfect Your Data Model) and on the advantages of Optimised QVD loads – so I won’t repeat myself here.
Suffice to say though, whilst QVD’s are a ‘good idea’ generally, when it comes to Incremental Loads they are essential.
Why Do Incremental Loads?
As BI apps are expected to deal with larger and larger amounts of data the amount of time that it takes to retrieve that data becomes a serious issue. This could be due to shear volume of data or the need for frequent refreshes. Either way, you do not want to be pulling all of the data all of the time. What you want to be able to do is just pull the data that has changed, append to that the data that you stored away previously and then get back to the business of analysing. This will reduce load on the source database, the network infrastructure and your QlikView server.
There are many different ways of implementing incremental loads in QlikView, and the choice of approach should depend entirely on the data that you are extracting.
Here are just a few examples of the approaches you can use.
Date Stamped Append Only Table
The easiest type of data to build an incremental load over is a table where new rows can only be appended to the end and old rows will never change. One scenario with these tables is that a date stamp is written to each row. Examples of these types of table are system logs or records of transactions that can’t be amended – only reversed with subsequent transactions.
If the data volumes here are not massive, a kind of pseudo incremental load can be done, using the fact that QlikView can load multiple files with a wildcard (as I’ve shown in this Video posting). Each time a data extract is done a WHERE statement can be done on the SQL select to only bring back the current day (or month, depending on volumes). When the QVD is written the date can be encoded into the filename, eg. TransactionHistory_201301225.qvd.
When data is then extracted from the QVD it can be loaded with a wildcard, eg. TransactionHistory_*.qvd. This approach is very simple, and avoids having to concatenate QVDs. You do however need to careful to ensure that no data at the end of one period is not missed out as you have moved on to the next period before it is pulled. The best way to ensure this is to loop through multiple periods, for example with each load refresh both the previous days file and the current day (or perhaps just do this until 2am if loading hourly). Data volumes and refresh intervals will naturally inform decisions as to what is the most sensible approach here.
With anything involving dates make sure that you give proper consideration to timezones and daylight saving times – as these can cause problems if you are not careful.
Append Only Table With A Unique Sequential ID
If the table again can only have rows appended and there is a unique sequential ID then a different approach can be used. The sequential ID could be a system generated number added by the database engine, or it could again be a date – if dates are recorded to the millisecond and it is not possible to have more than one transaction in a split second.
The first approach was straight forward, but it did potentially mean data was being pulled many times during the period each QVD was for and it also meant that many QVD files were being created. The multiple files would take more space on disk and load slightly slower than if all data was included in a single file. Both of those negative aspects can be avoided by always pulling data from a specific point in the table.
To use this approach the SQL WHERE statement will only pull records where an ID is greater than the last record that was pulled the previous time. This value can be read from a config file created at the end of the previous run. When all data is read in from the defined point forward a peek statement can be done to find the latest ID (ensure rows are read in ID order when using this approach), this can be put into a variable then a table and finally written to a CSV file with a STORE statement.
A similar approach to the first example could be used when the new data was pulled, so it was written into its own QVD with a suffix on – perhaps the last ID in that batch. A better approach may however be to append the previously saved data to the new data – resulting in a single, growing, QVD. As the previously created QVD will be in exactly the same structure a simple load from QVD will concatenate to the new data creating a single table in memory. This in memory table can then be stored back over the top of the existing QVD. As we are here being very precise and careful with the ID we are loading from and up to we can be confident that the resulting QVD will have each row only once and not be missing any rows.
Again, with this approach you need to ensure your assumptions about the data and the ID are correct and take care with the appending. If the size of the dataset allows it is always good to have a fallback routine that does actually pull all data from source – perhaps just run over the weekend to ensure data consistency.
Free Form Data With Modification of Existing Rows
Unfortunately we often are not living in a simple world where ID’s are sequential and any row once created can not be changed. Often changes to rows can be made at any time and the data we have in our QVD’s can become stale and need to be replaced. This can still be dealt with using a QlikView incremental load, it just needs a bit more work. Two things are however required to allow us to do this, firstly a timestamp or audit table to say something has been changed, and secondly a unique key. If the database provides a unique key then this can be used, if not one may need to be created by concatenating fields until uniqueness can be guaranteed (perhaps combining Operator Code and DateTime for example – if you know an operator can only make one change per second). This key will be used later.
As with all the other incremental loads described the SQL statement used to extract the data will have a WHERE statement on it. This time it will be to return rows where the timestamp is since a specific point in time. Unlike with the previous load where we had to be very precise with the point we were loading from; we can allow ourselves some slack here – perhaps on a daily load you may chose to always pull four days of data, so if the load failed for a couple of days it would still backfill the data that had been missed when it next ran successfully.
When the data has been loaded we will have some brand new rows and some that exist in the QVD we have already built and been adding to previously. We still want to do an append from the QVD to concatenate to the new data – but this time we will explicitly exclude any rows that we have just pulled from the database with a WHERE statement in the QlikView load. The syntax is simply WHERE NOT EXISTS (IDField).
Only rows where the ID is not already found will be loaded from the existing QVD. This is why uniqueness of ID is critical, as even if the ID exists twice in the source it will only be kept once in QlikView. Again, after the data has been combined it can be stored back over the existing QVD.
This approach is perhaps the most robust of the incremental load methods I have described here – and it is certainly the way I have implemented incremental loads most frequently out on client site.
As with any system build, due consideration has to be given to incremental loads. A significant amount of bullet proofing may be required around the solution. For instance – you should code a check to see if the existing QVD is there before attempting to append it, so that the code runs first time before a historical file exists. Make sure there is a plan in place for dealing with database or network failures. A flag at the start of the load that changes it from an incremental load to a full one is a good ‘get out of jail free’ trick, so if there is any concern about the integrity of the data it can be pulled in its entirety from source. If data volumes, or even the existence of historical data beyond a certain point (eg. where logs get purged from the source system but are required in QlikView) mean that a full refresh is impossible it is absolutely critical that QVD files are backed up. The creation of a backup can even be written into the QlikView load script – so there is not reliance on an external process.
Two final points that are important anywhere, but have specific relevance to incremental loads; ensure that loads from QVD’s are Optimised wherever possible (Google this if you are not aware of Optimised vs. Non Optimised loads) and make sure that if you are loading from multiple sources into a single table you ensure that the columns from all sources are identical (if they are similar but not identical a massive synthetic key can be created and things can fall over very quickly).
Hopefully this has given food for thought as you go on and design your own incremental data loads.