Whenever you are given a new data set to consume in QlikView, the first thing you will want to do is take a ‘quick look’ at what information you have been given. To do this quickly, I have a few QlikView objects that I can copy and paste into any document. You can download these from QlikCommunity or read on to find out how to create your own.
System Tables
There are a few system tables in all QlikView documents that allow you to interrogate what your data model looks like. To see these you need to enable the Show System Fields checkbox, which you will find on the Fields tab of the Sheet Properties dialog. When this checkbox is ticked, some new fields appear, prefixed with a dollar symbol; the two used in this QlikView document are $Field and $Table. This is also the way you can expose fields that you have hidden with HidePrefix. These new fields can be used in any document as they are, but we are going to do a bit more with them. Be warned, however, that they are data islands, so you need to be careful when mixing them with fields from the rest of your data model in expressions.
Picking A Single Field
When profiling our data we are going to look at a single field from the data model at a time. The field shown is the first ‘available’ field alphabetically: if no selections have been made this is across all fields, or a single field can be selected from the $Field list box.
In order to go from the many possible values in the $Field list to just the first, we can use the MinString() aggregation to obtain that value:
=minstring($Field)
This expression is used in the caption of a list box that shows each distinct value in the selected field. The list box itself has an expression as its data source rather than a single field. The expressions you can have in a list box are much like the calculated dimensions in charts. If the result of the expression is not a field name then selections can sometimes behave a bit strangely, but here we are simply deriving a field name, so all is well. The code required is:
=[$(=minstring($Field))]
You will note it is similar to the caption expression – we are just expanding it with a dollar-sign expansion and wrapping it in square brackets (in case there is a space in our field name). The result looks like a standard list box:
One of the useful things about this list box is that it allows you to make selections that are reflected through the rest of the document. In this way you can have a tab where a user can see a full list of values in any field and select them.
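For readers more comfortable reading general-purpose code, the idea behind the caption and list box expressions can be sketched outside QlikView. This plain-Python sketch uses a made-up sample table; it simply mirrors what MinString($Field) and the dollar-sign expansion achieve, and is not part of the QlikView document itself:

```python
# Illustrative sketch (plain Python, not QlikView). The data model is
# represented as a dict of field name -> column values; the sample
# fields and values are invented for this example.
data_model = {
    "Country": ["UK", "France", "UK", "Germany"],
    "Amount": [10, 20, 30, 40],
    "Year": [2013, 2013, 2014, 2014],
}

# MinString($Field) equivalent: the first field name alphabetically
first_field = min(data_model)

# The calculated list box then shows each distinct value of that field
distinct_values = sorted(set(data_model[first_field]))

print(first_field, distinct_values)
```

With this sample data the first field alphabetically is Amount, and the "list box" shows its four distinct values.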
Finding Some Statistics
Once we have found the field we want and have a distinct list of the field values we will want to find out a bit more about the field. The most basic thing to do is to count how many times each of the values occurs in the data:
Whilst you should generally try to avoid Count() aggregations where possible (they are more expensive than summing a pre-built counter field), Count is the most generic way to find the frequency of each of the values, so the expression for the above chart is:
=count([$(=minstring($Field))])
The dimension is the same as for the list box. If you have built the data model yourself and can rely on there being a counter field (which will be more efficient), then you could sum the counter field instead.
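The frequency chart is simply a count of how often each distinct value occurs in the chosen field. As a hedged illustration (again in plain Python, with made-up values, not QlikView syntax):

```python
from collections import Counter

# Illustrative sketch: count how many times each distinct value
# occurs in a field, as the frequency chart does. Sample values
# are invented for this example.
values = ["UK", "France", "UK", "Germany", "UK"]

frequency = Counter(values)

print(frequency.most_common())  # [('UK', 3), ('France', 1), ('Germany', 1)]
```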
Digging A Bit Deeper
Counting values is just one of the ways in which you can find out more about the contents of the field. I’ve picked a few simple measures that give you a bit of a flavour about what is in there. The values I have chosen for my profiler are:
The expressions are all relatively straightforward; the field name is simply inserted into them with a dollar-sign expansion. For instance, Max Length is calculated with this expression:
=max(len([$(=minstring($Field))]))
I tend to stick with the eleven metrics above, but when Barry Harmsen over at the QlikFix showed me the document he had created for the same purpose, he had many more expressions in there. Feel free to add more to the list above, but bear in mind there is a trade-off with performance as you add expressions.
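To make the kind of metrics involved concrete, here is a small Python sketch computing a handful of per-field statistics of the sort described above. The metric names and sample data are my own illustrative choices, not the exact eleven used in the QlikView document:

```python
# Illustrative sketch of a few typical profiling metrics for one field.
# None stands in for a null value; the sample data is invented.
values = ["Leeds", "York", None, "Hull", "York", ""]

non_null = [v for v in values if v is not None]
metrics = {
    "Total Rows": len(values),
    "Distinct Values": len(set(non_null)),
    "Null Count": sum(1 for v in values if v is None),
    "Empty Strings": sum(1 for v in non_null if v == ""),
    "Max Length": max(len(v) for v in non_null),
    "Min Length": min(len(v) for v in non_null),
}

print(metrics)
```

Each metric is cheap on its own, but just as in QlikView, every extra statistic is another pass over the data, which is where the performance trade-off comes from.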
Download The QlikView Data Profiler
From the details given in this article you should be able to build your own data profiling page, with components you can drop into any QlikView document. If you would prefer a pre-canned QlikView download, you can get it from QlikCommunity, where I have uploaded my own document:
https://community.qlik.com/t5/Member-Articles/QlikView-App-Generic-Data-Profiler/ta-p/1483687
If you like the app then please acknowledge this with a positive rating on the Community. Please also see the other documents I have uploaded under my QlikCommunity account.
As always I appreciate any feedback – so feel free to use the comment box below.
Hi Steve,
Thanks for the mention.
I don’t really use that many more stats than you do, but I just calculate a few that I find useful such as cardinality and information density. Haven’t figured out how to calculate subset ratio yet :)
One thing that is different in my version is that I calculate the stats over all fields instead of one field at a time; see this screenshot: https://dl.dropboxusercontent.com/u/6795646/MastersSummit/Profile.png
It’s a beast of an expression, for example for getting the row count for each field:
pick(match([$Field], $(=chr(39) & concat(TOTAL DISTINCT $Field, chr(39) & ‘,’ & chr(39)) & chr(39))),
$(=concat(TOTAL DISTINCT ‘Count([‘ & $Field & ‘])’, ‘,’)))
… but it works :)
Cheers,
Barry
It’s an excellent way of getting an overview in a single exportable view. I’m sure I will find myself using it before long! The viewing of counts over combinations of up to five fields is also nice. I would just worry about that list getting very long and performance lagging on larger data sets. As you said at the Masters Summit, yours is a tab for developers only – who know what they are doing.
Oh, and sorry about your formatting!
You really have to fix processing of returns, my nicely paragraphed comment looks like a rambling mess :)
Hi Steve,
Do you use these stats for all your projects? How does this help you?
JV
biexperience.wordpress.com
Hi JV, the information I get from the Profiler is usually most useful during the data discovery part of a project – deciding which columns are important, etc. Sometimes the data profiler tab remains useful beyond that, perhaps as a means for the user to make selections when it is simply not possible to slim the number of fields in the data model down to a sensible number. As it only takes seconds to copy and paste the objects, I tend to do this on new documents almost automatically.
Hi Steve, very cool post – it is amazing how many people skip the data profiling step and pay for the consequences later. But if you have not checked out this open source project yet, you are missing a lot :) It is called Data Cleaner, and it is a good idea to use a data profiling tool like Data Cleaner before you even load data into QV. It is amazing how many features that tool provides, and it is now integrated with Pentaho.
This particular tool looks like it is different from the QlikView analytic tool. This is a Data Profiler.
I have been evaluating Data Cleaner and x88 software’s Pandora for Data Profiling.
Is there any info on features comparison between Qlik Data Profiler and say Data Cleaner or any other data profiling tool?
This is an app that people can use inside QlikView for basic data profiling. QlikView is not a data profiling app in itself. That said, apps like mine and others like it mean QlikView can fulfil this task. It is also reasonably simple to build your own custom apps for data profiling.
[…] quick web search brought me to Steve Dark’s wonderful website. He tackles the issue by copying his data profiling objects to existing QlikView applications, thus […]
Thank you Steve, I would leave a positive rating but I’m not sure where… fantastic!
Thanks Jo. Glad you like the app. Where there are apps, Likes on Qlik Community are always welcome!
[…] data. For this it is convenient to use the Data Profiler solution developed by Steve Dark. Obviously, when […]
[…] we will use the ready-made Data Profiler tool from quickintelligence.co.uk. It will help us to see all the null values in […]
[…] taking as a reference the excellent work done by Steve Dark (vid qlikintelligence.co.uk), I have adapted the profiling functionality to my […]
Hey, great tool – it’s been really helpful for me at work.
A question regarding what Barry mentioned about calculating across fields:
is it possible to share some of the expressions used?
The one he posted is not working for me and I can’t seem to figure out how to do a cross-field calculation.
Say I want to calculate the number of nulls across all fields, not just one at a time.
Thank you again for any help.
Hi Abed,
Barry’s code was mangled by WordPress: you need to replace each of the quote marks with a proper straight single quote, as WordPress has changed them to open and close (‘curly’) quotations.
In order for this to give you a total count of nulls you need to have a table with $Field as the dimension, and this as the expression and the total mode as Sum of Rows:
pick(match([$Field], $(=chr(39) & concat(TOTAL DISTINCT $Field, chr(39) & ',' & chr(39)) & chr(39))),
$(=concat(TOTAL DISTINCT 'sum(if(Isnull([' & $Field & ']), 1, 0))', ',')))
Personally, I would parse each field in the load script in order to get to those kinds of stats, to avoid heavy processing in the front end. For a reasonably sized app though the above should work just fine.
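For anyone trying to follow the logic of that expression rather than the QlikView syntax, the cross-field idea can be sketched in plain Python: one pass that returns the null count for every field at once, rather than one field at a time. The sample table below is invented for illustration:

```python
# Illustrative sketch of the cross-field null count, with None standing
# in for a null value. The table and its fields are made up.
table = {
    "Customer": ["A", None, "C", "D"],
    "Region": ["North", "South", None, None],
    "Sales": [100, 200, 300, None],
}

null_counts = {
    field: sum(1 for v in rows if v is None)
    for field, rows in table.items()
}

print(null_counts)  # {'Customer': 1, 'Region': 2, 'Sales': 1}
```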
Hope it helps.