Recently I posted a solution to a thread on Qlik Community that I thought would be worth sharing with readers of this blog. The question was around speading up the load times from an increasing number of CSV files arriving in a drop folder. The solution I gave used a couple of helpful techniques. Here I look to take those techniques further.
The CSV Advantage
CSV files are a good way of shifting data between disparate systems. They are completely universal and most data tools have a way of dealing with them. However, if you are used to the speed you get from an optimised QVD load; CSV load times can be a drag. This is due to the fact that CSV files do not benefit from the clever way that QVD files are constructed, where repeated values are not stored on every line.
If you use QlikView a lot, at some point you will have to deal with many CSVs. So, how can you deal with them effectively?
The Basic Solution
In the Community post the question was around an increasing number of dated files. Each file containing rows to append to the existing data. Volumes were large, but not massive.
Given that QVDs load so much quicker than CSVs all we needed to do was load old data from QVD and new data from CSV – creating QVD files from the new CSVs as we go. The slow CSV load for each file would therefore only need to be done once.
The code I suggested to do this was as follows:
for each vFile in FileList('..\Data\*.csv') let vQVD = replace(vFile, '.csv', '.qvd'); if alt(FileSize('$(vQVD)'), 0) = 0 then NextFile: LOAD * FROM $(vFile) (txt, codepage is 1252, no labels, delimiter is ',', msq) ; STORE NextFile INTO $(vQVD) (qvd); DROP TABLE NextFile; end if next Data: LOAD * FROM ..\Data\*.qvd (qvd) ;
The key thing to note about this approach is how quick the file iteration and exists check takes place – this surprised me when I first tried this approach. When running over hundreds of files that have already been processed this routine still takes next to no time. The load from multiple QVDs at the end is optimised, so again is very quick.
The thing that I like about this approach is the simplicity. If, for some reason, you wanted to rebuild some or all of the QVDs you simply delete the QVDs you want to be rebuilt.
Taking It Further
There are many ways in which this approach could be improved upon. Here are a few thoughts on doing this.
I would always add some fields to any load – to make front end processing quicker. This can be done in this code when each CSV is loaded:
NextFile: LOAD 1 as RowCounter, filename() as SourceFile, * FROM $(vFile) (txt, codepage is 1252, no labels, delimiter is ',', msq) ;
If you wanted to reload a CSV file when it’s date has changed since the QVD was last built; the FileSize could be added to with a FileDate check.
if alt(FileSize('$(vQVD)'), 0) = 0 or FileTime('$(vQVD)') < FileTime('$(vFile)') then
If you wanted to be able to delete one of the CSVs and have the data from the corresponding QVD not loaded, you could replace the load * from the QVD replaced with the following code:
for each vFile in FileList('..\Data\*.csv') let vQVD = replace(vFile, '.csv', '.qvd'); Data: LOAD * FROM $(vQVD) (qvd) ; next
If optimisation is critical to you then loading from a single appended QVD will be quicker. This is because duplicate values are only removed within one QVD, not across multiple QVDs – the same data loaded from one QVD will be much quicker than loading from many. In order to achieve this data must be appended from each file (CSV or QVD) and stored into a single QVD. As there is no longer a QVD for each source file, a flag file is required to denote a processed file. The code to implement this approach would be:
LoadDone: LOAD now() as DateLoaded AUTOGENERATE(1) ; for each vFile in FileList('..\Data\*.csv') let vDone = replace(vFile, '.csv', '.done'); if alt(FileSize('$(vDone)'), 0) = 0 or FileTime('$(vDone)') < FileTime('$(vFile)') then Data: LOAD 1 as RowCounter, filename() as SourceFile, * FROM $(vFile) (txt, codepage is 1252, no labels, delimiter is ',', msq) ; STORE LoadDone INTO $(vDone) (txt); end if next if Alt(FileSize('..\Data\Data.qvd'), 0) > 0 then Data: LOAD * FROM ..\Data\Data.qvd (qvd); end if STORE Data INTO ..\Data\Data.qvd (qvd);
Finally, if there is a unique key and the chance of records to be updated in subsequent files then standard Incremental Load techniques can be layered on top, such as WHERE NOT EXISTS:
Data: LOAD * FROM ..\Data\Data.qvd (qvd) WHERE NOT EXISTS(ID) ;
As with any software development, with a bit of thinking through simple solutions can be arrived at that deliver great value. The time savings from this approach could be massive on each load. It could make the difference between having to bring in data overnight to doing frequent reloads during the day.
The Qlik Community is a great resource and you will find inspired solutions from a number of great QlikView developers on there. You can find a list of the responses I have made under my Qlik Community Profile. Add me as a friend to have posts I make added to your recent feed on Qlik Community.