Sometimes you need to create test data to ensure that a Qlik function behaves the way you expect it should. Recently I had just that requirement to prove that the Correl function gave good results over semi-predictable data. This post shows how you can create data on the fly for these kind of purposes.
Data Without a Source
Typically a Qlik application will load its data from a data source, such as a database, API or flat file. Sometimes you just want to create something on the fly. The AUTOGENERATE function is a great way of doing this, and many of the applications on my Qlik Examples page use just this functionality. The function gives a specified number of rows of data that you can then work from. To create a table with a single field with values from 1 to 1,000:
LOAD RowNo() as Value AUTOGENERATE(1000);
This may be useful in and of itself, but it is more likely you will need to do more. Read on to find out some of the things you can do.
Well, That’s A Bit Random
Often you will want to create some random values, and for this the rand() function is what you require. This function will return a number which falls somewhere between 0 and 1. Whilst at first this sounds a bit limited, with some very basic maths you can create many kinds of values.
So, to get a random integer between 0 and 999 you could do the following:
LOAD floor(rand()*1000) as Value AUTOGENERATE(1000);
If you wanted to get a relatively random number of dimension values from a list you can multiply the random value by the number of values you have, round up to the nearest integer and then use a pick() function to choose which dimension value to pick:
LOAD pick(ceil(rand()*5),'Ben','Ian','Tom','Paul','Olly') as Person AUTOGENERATE(1000);
Note with this approach though you will get roughly the same number of rows for each of the values, due to how random numbers work. If you want to have less regular number of rows you will need to introduce some bias.
If you want to generate random codes that look like they could be real (but are not) you can use the fact that the chr() function picks an ascii character based on a number. To get a code that looks like ABC123 you would have:
LOAD chr(65+floor(rand()*26)) & chr(65+floor(rand()*26)) & chr(65+floor(rand()*26)) & num(ceil(rand()*999), '000') as PersonID AUTOGENERATE(1000);
Given the number of possible values that this could generate you would most likely get all unique values. If you would prefer to guarantee that there are no duplicates you could use rowno() instead of rand(). If you wanted to cause more duplicates you could reduce the number of characters and the multipliers on each of the random values, perhaps reducing the letters to A to D and the digits to all be multiples of 9:
LOAD chr(65+floor(rand()*4)) & chr(65+floor(rand()*4)) & num(ceil(rand()*99)*9, '000') as DupedID AUTOGENERATE(1000);
If you want to have data shown on a line chart you can generate dates, using the fact that dates are just numbers under the bonnet and we can therefore do maths on them. Again, if you want one row per date then use the rowno() function, or if you want more randomness around the rows per day use rand():
LOAD date(today()-(rowno()-1), 'DD MMM YYYY') as Date, date(today()-floor(rand()*90), 'DD MMM YYYY') as [Date Rolling 3m] AUTOGENERATE(1000);
Creating dates in this way has been particularly useful in some of the demo apps I’ve created, as a simple reload will make the demo look fresh and current as all dates are in a period leading up to the current date.
That’s just a few examples of generating test data using AUTOGENERATE, but I am sure you can think of many more.
But Does It All Correlate?
I mentioned at the top that I had a recent requirement to test the correl() function, and generated some data to do this. For those who have note come across the statement before, it returns the aggregated correlation coefficient for two data sets. Or put into English, a value is returned which is close to 1 if there is a very strong positive correlation, close to -1 if there is a strong negative correlation and close to zero if there is very little correlation between two sets of values.
The function is called like this:
correl(aggr(sum(X), Dim), aggr(sum(Y1), Dim))
This is where Dim is a dimension value in the data and the two aggr statements create two datasets with a row for each dimension value. The sum(X) and sum(Y) expressions would usually be more complex calculations, but in this case it is simply the numbers generated in the load script.
The load script to make this expression work is as follows:
LOAD 'Row ' & num(RowNo(), '#,##0') as Dim, RowNo() as X, RowNo() as Y1 AUTOGENERATE(1000);
Because we are using the row number for both the X and Y1 values there is an absolute positive correlation and the function returns 1, as it should.
For all subsequent tests we keep the Dim dimension and the X value, which always gives us a point along the bottom of the chart. For a negative correlation the value of Y can be calculated as:
1000 - RowNo() as Y2
So now when X is 0 Y is 1000 and vice versa. This is an absolute negative correlation, and the correl function returns -1, again as it should.
The rest of the values for Y which I tested for where as follows:
Rand() * 1000 as Y3
A totally random value, which gives a correlation coefficient that is very close to zero and a plot that is full of randomly placed dots.
RowNo() * Rand() as Y4
A random value between 0 and 1 multiplied by the X value, so that the Y value is random, but never greater than Y. This gives a correlation coefficient that is around 0.5 and a chart where the bottom right half is filled. This could be inverted, but I did not try that value.
(RowNo() * (Rand()-0.5)) + 500
A value that deviates from the centre of the chart on the Y axis by a random number that is plus or minus half of the value of Y. As the values fall above and below the line equally the correlation is very close to zero.
RowNo() + ((Rand()-0.5)*(RowNo()/1)) as Y6
This value is similar to the X=Y chart we started with, but with a random deviation from the line in either direction, up to half of the Y value. The divide by 1 is left in there as I was experimenting with having the deviation greater or lesser by dividing the row number. Correlation coefficient is approaching 1, but is not there.
500 + ((rand()*10)-5) as Y7
This value has a similar Y value for every value of X. I started off having it exactly the same for each X value, but the correl function requires some variance in the number or the spread will be zero causing a divide by zero and no correlation coefficient number. A random deviation of up to 5 in either direction provides this difference, but that deviation is small enough that the correlation coefficient is very close to zero.
500 + ((rand()*10)-5) as X2
The final expression is a bit different to all the others, as I wanted values from 1 to 1000 for all Y values in order (I use Y1 from above for this) and then have a calculation for the X value that is close to the same value, but off by a small margin. You will note the expression for X2 is identical to Y7, and I could have used the same field in the correl function, but it felt wrong to have two values labeled Y in the chart.
The resultant scatter charts and correlation coefficients for all of the derived values above can be seen in this screenshot (click to expand).
The entire load script for the Correl application looks like this:
LOAD 'Row ' & num(RowNo(), '#,##0') as Dim, RowNo() as X, RowNo() as Y1, 1000 - RowNo() as Y2, Rand() * 1000 as Y3, RowNo() * Rand() as Y4, (RowNo() * (Rand()-0.5)) + 500 as Y5, RowNo() + ((Rand()-0.5)*(RowNo()/1)) as Y6, 500 + ((rand()*10)-5) as Y7, 500 + ((rand()*10)-5) as X2 AUTOGENERATE(1000);
And the entire application can be downloaded from Qlik Community here:
Hopefully that has given you some inspiration what you can do when you need to generate some test data. I would be very interested to hear of other thoughts you might have on the topic, so please do use the comments box below to let me know.