There are myriad different data sources out there, and it is possible to load any of them into Qlik. Some are easier than others, sure, but there is always a way. There are good articles and posts on some of the trickier sources, but today I want to show you how you can bring in CSV data from GitHub. This means you can pull up to date, crowdsourced and governed data directly into your applications, just by following these simple steps.
Introducing The CSV Data
Recently I was asked to be involved with the Data On The Earth project. This is an initiative to get data that can help everyone, from policy makers to concerned individuals, make more informed decisions about environmental issues. Naturally, I was very happy to help.
The first thing I wanted to do was to present some ideas over some data that had relevance to the job in hand. The site Our World In Data seemed like a good place to start.
There I found data which related to Co2 emissions, hosted on GitHub, you can find more about these data here. The data are by Country and Year, so a bit of extra metadata on the countries would allow some demographic filtering. I found this in two further CSV files, also conveniently hosted on GitHub.
The data are not perfect, as some of the figures are not reported by all countries in every year and countries change over time. There are things that could be done to fix this up a bit (rolling forward last known figures, for instance), but I have not done that here. If you are planning to lobby a government over their emissions (or something) please check your facts and don’t rely on what is presented in these data alone!
Creating a Connection
The code in this article is for the load script of Qlik Sense. You can do similar with the REST connector in QlikView, but the syntax is a bit different, this post should give you pointers should you wish to build your own load script.
As with any load script in Sense you first need to have your connection created. Rather than create a separate connection for each data source I load I tend to create a generic connection of each type required and modify it using WITH CONNECTION. You may already have the required connection set up if you have previously followed my tutorials on Loading Sense Session Logs or Loading RSS Feeds, if not go to the Data Load Editor and click Create New Connection.
Give your connection the name GenericGET and point it to a placeholder, such as https://jsonplaceholder.typicode.com/posts, ensure the Method is set to GET and leave everything else as defaults.
Click on Test Connection, then Close and finally (if all is good) Create.
Building the Load Script
Now, onto building the load script. First we need to do some set up, to connect to our connection and set up some variables:
LIB CONNECT TO 'GenericGET'; let vCompleteYears = 'Y'; // Y/N for all years or only 1990 - 2016 where data is quite complete let vExcludeYears = if(upper(left(vCompleteYears, 1)) = 'Y', 'AND (year >= 1990 AND year <= 2016)', ''); set vFix = if([$1] = '', '$3', [$1]) as [$2]; let v1B = 1000000000; let v1M = 1000000;
The first variable allows us to turn on or off data for years which are not that complete, the second puts the code to implement that into a variable if switched on. Next we have a variable which includes parameters that we will use later on. If you have not used them before you should see this post on Using Parameters with Qlik Variables. Finally a couple of variables to remove the risk of typing the wrong number of zeros later on.
Now we are set up, we can do the main part of the load.
Emissions: LOAD iso_code & year as CountryYear, iso_code as [Country ISO], country as Country, year as Year, co2 as CO2, consumption_co2 as [CO2 Consumption], trade_co2 as [CO2 Trade], co2_per_unit_energy as [CO2 Per Unit Energy], cement_co2 as [CO2 Cement], coal_co2 as [CO2 Coal], flaring_co2 as [CO2 Flaring], gas_co2 as [CO2 Gas], oil_co2 as [CO2 Oil], total_ghg as [Total GHG], methane as Methane, nitrous_oxide as [Nitrous Oxide], population as Population, gdp as GDP WHERE iso_code <> '' and index(iso_code, '_') = 0 // Remove rolled up values $(vExcludeYears) ; SQL SELECT "iso_code", "country", "year", "co2", "consumption_co2", "trade_co2", "co2_per_unit_energy", "cement_co2", "coal_co2", "flaring_co2", "gas_co2", "oil_co2", "total_ghg", "methane", "nitrous_oxide", "population", "gdp" FROM CSV (header on, delimiter ",", quote """") "CSV_source" WITH CONNECTION (URL "https://raw.githubusercontent.com/owid/co2-data/master/owid-co2-data.csv") ;
As with a database load this load will use the last opened connection. The CSV on the FROM line is specifying the basic format that we are loading from this connection, the exact format of the CSV is specified in the brackets – just as it would be if you were loading the CSV from a Folder Connection. The CSV_Source relates to which data source at the connection you wish to load – with a CSV this will only ever be CSV_source, but with JSON data there may be many named sets of data for you to choose from. The WITH CONNECTION then defines the URL we are loading from.
To explore these options further create another connection, pointing to that URL and look at the Select Data dialog for that connection.
Atop of the REST load we have a Preceding Load, to tidy the fieldnames and remove data we don’t want. There are rows with data rolled up to regional levels, but we are going to do that ourselves in Qlik so need to remove the roll ups to avoid double counting. Note that when doing a WHERE in a preceding load all of the data still has to come down from the REST source first, so you may want to look if you can pass parameters to the URL you are loading from to drop some data, rather than using a WHERE statement.
So, that is the data we require from our main data source. We are going to continue though, and augment and categorise these data some more.
Bringing in more CSV Data
The next table of data is a simple table of country demographics, provided by GitHub, keyed on the ISO code for the country.
Region: LOAD [ISO3166-1-Alpha-3] as [Country ISO], $(vFix(Developed / Developing Countries, Developed, Not Known)), $(vFix(Region Name, Region, Other)), $(vFix(Sub-region Name, Sub Region, Other)) WHERE EXISTS ([Country ISO], [ISO3166-1-Alpha-3]) ; SQL SELECT "ISO3166-1-Alpha-3", "Developed / Developing Countries", "Region Name", "Sub-region Name" FROM CSV (header on, delimiter ",", quote """") "CSV_source" WITH CONNECTION (URL "https://raw.githubusercontent.com/datasets/country-codes/master/data/country-codes.csv");
This file has many more columns than we need, but we are cherry picking the few that we want. As you can see, the code for connecting to a CSV source is almost identical to above – just the URL is changing. In the Preceding load the vFix variable we created at the start is being used to replace any missing values with Not Known or Other. A WHERE EXISTS ensures we only get rows which relate to countries we have emissions data for.
Using a lookup table like this from a repository like GitHub in your Qlik apps means that your application will benefit from improvements to that dataset over time, rather than having a CSV or Excel spreadsheet languishing on your server which will never see an update.
Where In The World Is That?
As we are looking at country data it would be great to plot that on a map, and another GitHub source has just that information available for us. These data are provided and maintained by Tadas Tamošauskas in a personal repo.
Here we are going to LEFT JOIN the data onto the existing region table, bringing the latitude/longitude data into the same table as the regions. You will see why in a bit.
LEFT JOIN (Region) LOAD *, GeoMakePoint(lat,long) as [Country Point] ; LOAD purgechar([Alpha-3 code], ' "') as [Country ISO], purgechar([Latitude (average)], ' "') as lat, purgechar([Longitude (average)], ' "') as long WHERE EXISTS ([Country ISO], purgechar([Alpha-3 code], ' "')) ; SQL SELECT "Alpha-3 code", "Latitude (average)", "Longitude (average)" FROM CSV (header on, delimiter ",", quote """") "CSV_source" WITH CONNECTION (URL "https://gist.githubusercontent.com/tadast/8827699/raw/f5cac3d42d16b78348610fc4ec301e9234f82821/countries_codes_and_coordinates.csv");
This now lets us plot emissions data on a map by country. Our second data source though gave us Region and Sub Region names. It would be good if we could also plot by those regions. There is probably a data source out there which has SVG data for each region, but that could take a while to find and will not necessarily map to our existing region names.
We can apply a simple hack to generating points at each of the regions though.
RegionPoint: LOAD Region, GeoMakePoint(avg(lat),avg(long)) as [Region Point] RESIDENT Region GROUP BY Region; SubRegionPoint: LOAD [Sub Region], GeoMakePoint(avg(lat),avg(long)) as [Sub Region Point] RESIDENT Region GROUP BY [Sub Region];
Here, a couple of RESIDENT loads from the existing Region table can be grouped to give one row per Region and Sub Region. For each of these rows the average latitude and average longitude can be calculated. These two values can then be changed to a point that can be easily used on a map. Note that some points may appear in the ocean, but that is the joy of using averages.
Finding Latest Data and Grouping Figures
The figures from our initial dataset are great for trending over time as they are provided on an annual basis. They are not so great for categorical data and providing filters though, as a country could move from one bucket to another year to year. For these filters we only want to use the latest figures. Here we hit another minor snag though, in that the last year in which data is available differs by country.
We can deal with this by finding the latest year for each country, using a RESIDENT, GROUP BY and MAX to create a composite key.
LatestPop: LOAD [Country ISO] & Year as LatestPop ; LOAD [Country ISO], max(Year) as Year RESIDENT Emissions WHERE Population <> '' GROUP BY [Country ISO];
That composite key can then be used in the WHERE EXISTS of a subsequent RESIDENT load to only bring the latest data.
Population: LOAD Country, Population as [Population Latest], dual(num(Round(Population, 100000) / $(v1M), '#,##0.0') & 'M', Round(Population, 100000)) as [Population 100K], dual(num(Round(Population, $(v1M)) / $(v1M), '#,##0') & 'M', Round(Population, $(v1M))) as [Population 1M], dual(num(Round(Population, $(v1M)0) / $(v1M), '#,##0') & 'M', Round(Population, $(v1M)0)) as [Population 10M] RESIDENT Emissions WHERE EXISTS (LatestPop, CountryYear);
This is a useful technique, and one I use frequently in live projects. As well as getting the exact number for the latest year for each country we can also put things into buckets. This can be done with a massive IF statement or for equal buckets an aggregation. Here I have used a Round statement, but you could use a Floor or Ceil if you prefer. The Round function in Qlik is more flexible than in most languages as, rather than just rounding to a number of decimal places, you can specify what grouping you wish to go to, whether that is more or less than one.
The same technique can then be used for GDP, then there is just a little bit of housekeeping to do at the end of our script.
LatestGDP: LOAD [Country ISO] & Year as LatestGDP ; LOAD [Country ISO], max(Year) as Year RESIDENT Emissions WHERE GDP <> '' GROUP BY [Country ISO]; GDP: LOAD Country, GDP as [GDP Latest], dual(num(Round(GDP, $(v1B)), '#,##0') / $(v1B) & 'B', Round(GDP, $(v1B))) as [GDP 1B], dual(num(Round(GDP, $(v1B)0), '#,##0') / $(v1B) & 'B', Round(GDP, $(v1B)0)) as [GDP 10B], dual(num(Round(GDP, $(v1B)00), '#,##0') / $(v1B) & 'B', Round(GDP, $(v1B)00)) as [GDP 100B] RESIDENT Emissions WHERE EXISTS (LatestGDP, CountryYear); DROP FIELDS lat, long, LatestPop, LatestGDP;
So, that is it. Put all those bits of script together and you should have a working load script to bring in emissions and demographic data from three separate online CSV sources.
Now you just need a quick way to analyse those data. Fortunately we have your back there.
Viewing The Data In The Instant Sense Application
To make it quick for us at Quick Intelligence to produce Proof Of Concepts for potential clients I created the Instant Sense Application. This application allows you to load any data source, and a spreadsheet of metadata about that data source, and then explore the data over twelve sheets of user configurable visualisations and tables.
We have made this application available to all, at no cost, to make it quick and easy for others to showcase what Qlik Sense is capable of.
The application can be downloaded from here:
Details of how the application has been build can be found here:
If you already have the application, or if you just want the script above, you can download the script and meta-data here:
If you use the application and build more expressions that you feel will be useful for others, please share them in the comments below. As ever, I’m happy to receive questions and comments from readers, and will get back to these as soon as I am able.