Recently I posted a solution to a thread on Qlik Community that I thought would be worth sharing with readers of this blog. The question was around speeding up the load times from an increasing number of CSV files arriving in a drop folder. The solution I gave used a couple of helpful techniques. Here I look to take those techniques further.
The CSV Advantage
CSV files are a good way of shifting data between disparate systems. They are completely universal and most data tools have a way of dealing with them. However, if you are used to the speed you get from an optimised QVD load, CSV load times can be a drag. This is because CSV files do not benefit from the clever way that QVD files are constructed, where repeated values are not stored on every line. If you use QlikView a lot, at some point you will have to deal with many CSVs. So, how can you deal with them effectively?
The Basic Solution
In the Community post the question was around an increasing number of dated files, with each file containing rows to append to the existing data. Volumes were large, but not massive.
Given that QVDs load so much quicker than CSVs, all we needed to do was load old data from QVD and new data from CSV – creating QVD files from the new CSVs as we go. The slow CSV load for each file therefore only needs to be done once.
The code I suggested to do this was as follows:
for each vFile in FileList('..\Data\*.csv')
let vQVD = replace(vFile, '.csv', '.qvd');
if alt(FileSize('$(vQVD)'), 0) = 0 then
NextFile:
LOAD
*
FROM $(vFile)
(txt, codepage is 1252, no labels, delimiter is ',', msq)
;
STORE NextFile INTO $(vQVD) (qvd);
DROP TABLE NextFile;
end if
next
Data:
LOAD
*
FROM ..\Data\*.qvd (qvd)
;
The key thing to note about this approach is how quickly the file iteration and existence check take place – this surprised me when I first tried it. When running over hundreds of files that have already been processed, this routine still takes next to no time. The load from multiple QVDs at the end is optimised, so again is very quick.
The thing that I like about this approach is its simplicity. If, for some reason, you wanted to rebuild some or all of the QVDs, you simply delete the ones you want rebuilt.
Taking It Further
There are many ways in which this approach could be improved upon. Here are a few thoughts on doing this.
I would always add some fields to any load – to make front end processing quicker. This can be done in this code when each CSV is loaded:
NextFile:
LOAD
1 as RowCounter,
filename() as SourceFile,
*
FROM $(vFile)
(txt, codepage is 1252, no labels, delimiter is ',', msq)
;
If you wanted to reload a CSV file when its date has changed since the QVD was last built, the FileSize check could be extended with a FileTime comparison:
if alt(FileSize('$(vQVD)'), 0) = 0 or FileTime('$(vQVD)') < FileTime('$(vFile)') then
If you wanted to be able to delete one of the CSVs and have the data from the corresponding QVD excluded, you could replace the load * from the QVDs with the following code:
for each vFile in FileList('..\Data\*.csv')
let vQVD = replace(vFile, '.csv', '.qvd');
Data:
LOAD
*
FROM $(vQVD) (qvd)
;
next
If optimisation is critical to you then loading from a single appended QVD will be quicker still. This is because duplicate values are only removed within one QVD, not across multiple QVDs – the same data loaded from one QVD will be much quicker than loading from many. To achieve this, data must be appended from each file (CSV or QVD) and stored into a single QVD. As there is no longer a QVD for each source file, a flag file is required to denote a processed file. The code to implement this approach would be:
LoadDone:
LOAD
now() as DateLoaded
AUTOGENERATE(1)
;
for each vFile in FileList('..\Data\*.csv')
let vDone = replace(vFile, '.csv', '.done');
if alt(FileSize('$(vDone)'), 0) = 0 or
FileTime('$(vDone)') < FileTime('$(vFile)') then
Data:
LOAD
1 as RowCounter,
filename() as SourceFile,
*
FROM $(vFile)
(txt, codepage is 1252, no labels, delimiter is ',', msq)
;
STORE LoadDone INTO $(vDone) (txt);
end if
next
if Alt(FileSize('..\Data\Data.qvd'), 0) > 0 then
Data:
LOAD
*
FROM ..\Data\Data.qvd (qvd);
end if
STORE Data INTO ..\Data\Data.qvd (qvd);
Finally, if there is a unique key and records may be updated in subsequent files, then standard Incremental Load techniques can be layered on top, such as WHERE NOT EXISTS:
Data:
LOAD
*
FROM ..\Data\Data.qvd (qvd)
WHERE NOT EXISTS(ID)
;
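For the EXISTS check to do its job, the new rows must be loaded first so their key values are in memory before the historic QVD is read. A minimal sketch of the full pattern, assuming a key field called ID and a new file called NewData.csv (both names are examples – use your own):

// Load the latest rows first, so their ID values are in memory
Data:
LOAD *
FROM [..\Data\NewData.csv]
(txt, codepage is 1252, no labels, delimiter is ',', msq)
;
// Then append only historic rows whose ID has not already been seen
Concatenate (Data)
LOAD *
FROM [..\Data\Data.qvd] (qvd)
WHERE NOT EXISTS(ID)
;
STORE Data INTO [..\Data\Data.qvd] (qvd);

A handy side effect is that a WHERE clause containing only a single-field EXISTS keeps the QVD load optimised.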
Conclusions
As with any software development, a bit of thinking things through can lead to simple solutions that deliver great value. The time savings from this approach could be massive on each load. It could make the difference between having to bring in data overnight and doing frequent reloads during the day.
The Qlik Community is a great resource and you will find inspired solutions from a number of great QlikView developers on there. You can find a list of the responses I have made under my Qlik Community Profile. Add me as a friend to have posts I make added to your recent feed on Qlik Community.
Great advanced scripting example that got me thinking about incremental loads of text files. Rob Wunderlich recently posted about the buffer command and I remembered that a buffer (incremental) command also existed. I didn’t know if it worked on CSV files, but after some testing it apparently does. So if you have a simple requirement to load additional rows from a single CSV file, without changing or deleting previously loaded rows, this could also be a nice, quick fix. Of course, your solution is much more robust.
Karl
Great insights! Thanks for sharing them with us.
Thanks for your comment Rene – glad you find the solutions I post useful.
Hi Steve
Really a good document to read, and it helped me apply this in a real-time scenario
Thanks
Kushal
Hi Kushal, many thanks for your comment, glad you found the post useful. This is a technique I use regularly when coming up against text based data sources.
Hi Steve,
Thank you for this, it’s really helpful and indeed incredible how fast the loads are.
I have however come across a small problem where the csv files contain “()” brackets in the filename:
eg. Test 123 (Production).csv
When that’s the case the following error comes up
Unknown file format specifier:Production
I am currently trying to come up with a workaround for this so will share when I do.
Regards
Marek
Hi Marek,
Thanks for your comment. You will want to add square brackets around each instance of $(vQVD) or $(vFile). I’m usually very careful to avoid punctuation or spaces in filenames, so don’t need to worry about square brackets. In your case though you will need to add the square brackets in.
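Applied to the loop in the post, only the file references change – something like this (a sketch of just the affected lines):

if alt(FileSize('$(vQVD)'), 0) = 0 then
  NextFile:
  LOAD *
  // square brackets stop spaces and () in the name being misread
  FROM [$(vFile)]
  (txt, codepage is 1252, no labels, delimiter is ',', msq)
  ;
  STORE NextFile INTO [$(vQVD)] (qvd);
  DROP TABLE NextFile;
end if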
Hi again,
Thanks so much for the quick and “obvious” suggestion! I have been looking at the problem from a completely wrong angle :)
Solution now works great!
Hi Steve – That’s a Nice article, as USUAL!
Thanks, Pravin. Glad you like the posts.
Just posted some additional code on QlikCommunity, for merging the QVDs created from the routines above into Monthly and Master QVDs: https://community.qlik.com/thread/170103
Enjoy.
I can’t seem to get the basic run of this to work. It keeps telling me that the system cannot find the qvd file in the data: section. What am I doing wrong?
Hi James – have you used the Debug feature in the Script editor to step through each line of code to see what is happening? My best guess is that the CSV folder is not where the code is expecting it to be, so no data is being loaded from the CSV (as it would not be found by the IF statement), no QVD is therefore written so there is no QVD to load. If you are using the path ..\Data\*.csv you would need to have an app in a folder, which has a Data folder at the same level as it. You could try putting the CSV files in the same folder as the QVW and changing all instances of ..\Data\ to .\, to see if that works?
Hi Steve,
I need to write the QVDs into a separate folder. How can I achieve this please?
Thanks
SB
You can construct the output path in code, and this can include variable folders as well as filenames. Set the variables for the various parts in the loop and then use them in the store, something like this:
STORE CurrentFile INTO [..\QVDRoot\$(thisFolder)\$(currentFile)_$(dateSuffix).qvd];
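For instance, the variables used in that STORE could be built up inside the loop something like this (a sketch – the folder name and date logic here are only examples, derive them to suit your own naming rules):

// inside the file loop, before the STORE
let thisFolder  = 'Sales';                                    // example – could be derived from vFile
let currentFile = subfield(subfield(vFile, '\', -1), '.', 1); // file name without path or extension
let dateSuffix  = date(today(), 'YYYYMMDD');
STORE CurrentFile INTO [..\QVDRoot\$(thisFolder)\$(currentFile)_$(dateSuffix).qvd];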
Hi, nice info you have there, but I have an issue where I want to load each Excel file in a folder (all Excel) and store each of them in another folder (all QVD). I think your basic solution works for this, but I can’t figure out how to store the QVDs to another folder. I need to use the same file name, so I need the replace function.
Hope you can help me.
Thanks.
The simplest way is probably to have two Replace statements: the first to replace the extension and the second to replace the file path. The other way is to split the file name part from the path and then concatenate on the output path. Hope that helps.
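As a sketch of the first option (the folder names and sheet name here are examples):

for each vFile in FileList('..\Excel\*.xlsx')
  // first Replace swaps the extension, second swaps the folder
  let vQVD = replace(vFile, '.xlsx', '.qvd');
  let vQVD = replace(vQVD, '\Excel\', '\QVD\');
  if alt(FileSize('$(vQVD)'), 0) = 0 then
    NextFile:
    LOAD * FROM [$(vFile)] (ooxml, embedded labels, table is Sheet1);
    STORE NextFile INTO [$(vQVD)] (qvd);
    DROP TABLE NextFile;
  end if
next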
Hi Steve,
This looks like what I am looking to do in QS. However, I cannot get the script to work.
Any thoughts? I am new to coding in QS.
Hi,
You are going to need to give me a little more information than that, regarding how far you have got so far and what error you are seeing. Please share some more detail and I will be happy to help.
Hi Steve,
Will this code work with Qlik Sense? Thanks
Hi,
The approach, and the majority of the code will work in Sense. The only thing that needs to change is that the references to drive letters, UNC or relative paths need to change to valid lib:// connections. Give a shout if you get stuck.
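For example, with a folder data connection named Data created in the Sense app (the connection name here is an assumption), the basic loop becomes:

// 'Data' is the name of a folder data connection in the Sense app
for each vFile in FileList('lib://Data/*.csv')
  let vQVD = replace(vFile, '.csv', '.qvd');
  if alt(FileSize('$(vQVD)'), 0) = 0 then
    NextFile:
    LOAD * FROM [$(vFile)] (txt, codepage is 1252, no labels, delimiter is ',', msq);
    STORE NextFile INTO [$(vQVD)] (qvd);
    DROP TABLE NextFile;
  end if
next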
Hi Steve,
Would it be possible to load in the files from multiple folders instead of one folder? I have hundreds of sub folders that contain further sub folders where I need to get all the CSVs from.
All the folders have one parent folder.
Is it possible to search all the sub folders in the parent folder and pull in all the CSVs?
thanks
Hi,
The quick answer is, Yes, you can. As well as the FileList function there is also a DirList function, which works in a similar way but lists sub folders. You should then be able to create a nested loop that finds all files in all folders. Good luck!
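As folders can be nested to any depth, a recursive subroutine is the neatest way to walk the whole tree. A sketch (the parent folder path is an example):

sub ScanFolder(vRoot)
  // process all CSVs in the current folder
  for each vFile in FileList('$(vRoot)\*.csv')
    let vQVD = replace(vFile, '.csv', '.qvd');
    if alt(FileSize('$(vQVD)'), 0) = 0 then
      NextFile:
      LOAD * FROM [$(vFile)] (txt, codepage is 1252, no labels, delimiter is ',', msq);
      STORE NextFile INTO [$(vQVD)] (qvd);
      DROP TABLE NextFile;
    end if
  next
  // then recurse into each sub folder
  for each vDir in DirList('$(vRoot)\*')
    call ScanFolder(vDir)
  next
end sub

call ScanFolder('..\ParentFolder')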