Introducing the African Data Hub CKAN Repository (complete)
An overview of the ADH data repository - what it is, how to use it, and how the African Data Hub engages with it.
- Introduction
- Dataset vs Resource
- Metadata
- Table 1: Metadata description
- Types of Data Available on the ADH data repository
- Finding and accessing data
- Uploading Data
Introduction
The African Data Hub (ADH) seeks to support and promote quality data-driven journalism in Africa by providing newsrooms, researchers, and the general public with easy access to quality African data. African data is typically difficult to find, stored in unwieldy formats, and often out of date. ADH is working to remedy this by actively seeking out interesting and useful African datasets, converting them to more accessible formats, updating them, creating combined datasets where possible, and storing them all on our open-source, online CKAN data repository.
CKAN is an open-source data management system used by hundreds of organisations around the world, including the national governments of the USA, Canada, Singapore, Australia and others. We use it to host any data we find that we believe may serve our mandate of promoting quality data-driven journalism in Africa.
Dataset vs Resource
When using CKAN, it is important to understand the difference between the system's definition of a dataset vs a resource. A dataset is a collection of related data resources, while a resource is a single file. It may be useful to think of a dataset as a folder on your computer and a resource as a file in that folder. When posting data on CKAN, you first need to create a dataset and fill in the required metadata. Then you can add resources either by uploading files, or linking them via URL.
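To make the folder-and-file analogy concrete, here is a minimal sketch of how a dataset is shaped when read through CKAN's Action API, where `package_show` returns a dataset as a JSON object containing a `resources` list. The dictionary below is hand-written for illustration: the dataset name, resource names and URLs are hypothetical and are not real ADH entries.

```python
# Illustrative only: a hand-written dictionary shaped like the "result" part of
# a CKAN package_show response. All names and URLs are hypothetical.
dataset = {
    "name": "example-health-indicators",
    "title": "Example Health Indicators",
    "resources": [
        {"name": "Indicators 2020", "format": "CSV",
         "url": "https://ckan.example.org/files/indicators-2020.csv"},
        {"name": "Indicators 2020", "format": "XLSX",
         "url": "https://ckan.example.org/files/indicators-2020.xlsx"},
        {"name": "Source report", "format": "PDF",
         "url": "https://ckan.example.org/files/report.pdf"},
    ],
}


def describe(dataset):
    """Return one summary line for the dataset and one line per resource."""
    lines = [f"{dataset['title']} ({len(dataset['resources'])} resources)"]
    for resource in dataset["resources"]:
        lines.append(f"  - {resource['name']} [{resource['format']}]")
    return lines


print("\n".join(describe(dataset)))
```

The dataset carries the shared metadata (the "folder"), while each entry in `resources` is a single file or link (the "files").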
Metadata
The importance of complete and correct metadata cannot be overstated. Metadata provides context for a given dataset which allows a potential user to understand what it is about, where it came from, how it was created and how it can be used. The following table provides details on the required metadata.
%%html
<h3><b>Table 1: </b>Metadata description</h3>
<p> </p>
<style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-j1i3{border-color:inherit;position:-webkit-sticky;position:sticky;text-align:left;top:-1px;vertical-align:top;
will-change:transform}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
@media screen and (max-width: 767px) {.tg {width: auto !important;}.tg col {width: auto !important;}.tg-wrap {overflow-x: auto;-webkit-overflow-scrolling: touch;}}</style>
<div class="tg-wrap"><table class="tg">
<thead>
<tr>
<th class="tg-j1i3"><b>Metadata</b></th>
<th class="tg-j1i3"><b>Description</b></th>
</tr>
</thead>
<tbody>
<tr>
<td class="tg-0pky">Title</td>
<td class="tg-0pky">A descriptive title</td>
</tr>
<tr>
<td class="tg-0pky">Description</td>
<td class="tg-0pky">Brief description of what is in the data: is this the industry standard, how does it compare to similar datasets, and is this data hosted or used elsewhere?
Include a summary of the methodology (with a link to the methodology where applicable) and a link to an example analysis (where applicable).
</td>
</tr>
<tr>
<td class="tg-0pky">Tags</td>
<td class="tg-0pky">Main themes</td>
</tr>
<tr>
<td class="tg-0pky">Licence</td>
<td class="tg-0pky"><b>NB!!</b> The licence this data is shared under. Use <a href="https://chooser-beta.creativecommons.org/">this resource</a> if you're unsure.
</td>
</tr>
<tr>
<td class="tg-0pky">Organisation</td>
<td class="tg-0pky">ADH organisation that sourced/uses this data</td>
</tr>
<tr>
<td class="tg-0pky">Visibility</td>
<td class="tg-0pky">Can be set to public/private (ADH members only)</td>
</tr>
<tr>
<td class="tg-0pky">Source</td>
<td class="tg-0pky"><b>NB!!</b> Link to where the data was found</td>
</tr>
<tr>
<td class="tg-0pky">Version</td>
<td class="tg-0pky">Version number</td>
</tr>
<tr>
<td class="tg-0pky">Author</td>
<td class="tg-0pky">Name of person/entity that produced the data</td>
</tr>
<tr>
<td class="tg-0pky">Author email</td>
<td class="tg-0pky">Contact email for person/entity who produced the data</td>
</tr>
<tr>
<td class="tg-0pky">Maintainer</td>
<td class="tg-0pky">Name of ADH member responsible for this dataset</td>
</tr>
<tr>
<td class="tg-0pky">Maintainer email</td>
<td class="tg-0pky">Contact email for ADH member responsible for this dataset</td>
</tr>
<tr>
<td class="tg-0pky">Groups</td>
<td class="tg-0pky">Project this data is used for</td>
</tr>
<tr>
<td class="tg-0pky">Data format</td>
<td class="tg-0pky">Filled in automatically</td>
</tr>
</tbody>
</table></div>
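As a rough aid for checking completeness before publishing, the sketch below maps each row of Table 1 to a field name and reports which ones are still empty. The field names are those of CKAN's default dataset schema (for example `notes` for the description and `url` for the source); that the ADH instance uses these defaults unchanged is an assumption, and the draft dataset is hypothetical.

```python
# Table 1 labels mapped to field names from CKAN's default dataset schema.
# Assumption: the ADH instance uses these default names unchanged.
FIELD_FOR_LABEL = {
    "Title": "title",
    "Description": "notes",
    "Tags": "tags",
    "Licence": "license_id",
    "Organisation": "owner_org",
    "Visibility": "private",
    "Source": "url",
    "Version": "version",
    "Author": "author",
    "Author email": "author_email",
    "Maintainer": "maintainer",
    "Maintainer email": "maintainer_email",
    "Groups": "groups",
}


def missing_metadata(dataset):
    """Return the Table 1 labels whose fields are absent or empty."""
    missing = []
    for label, field in FIELD_FOR_LABEL.items():
        value = dataset.get(field)
        # False is a valid value for "private", so test for emptiness explicitly.
        if value is None or value == "" or value == []:
            missing.append(label)
    return missing


# Hypothetical, partially completed dataset.
draft = {
    "title": "Example Health Indicators",
    "notes": "Hypothetical dataset used to illustrate the check.",
    "tags": [{"name": "health"}],
    "license_id": "cc-by",
    "private": False,
    "url": "",
}
print(missing_metadata(draft))
```

Data format is left out of the check because, as Table 1 notes, it is filled in automatically.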
Types of Data Available on the ADH data repository
The data is available in a wide variety of formats, from spreadsheets (e.g. XLSX, CSV) to geographic data (e.g. SHP, GeoJSON), images (e.g. GeoTIFF, PNG) and even documents such as PDFs. Some datasets include the same data in different formats under different resources. It is therefore not always necessary to download an entire dataset; instead, look only for the formats that you are comfortable with or able to use. Different formats of the same data also often differ in file size. For a tutorial on exploring a dataset with different formats of the same data, see here ENTER PUBLISHED LINK HERE.
Many countries tend to release their data in PDF format. Unfortunately, data in PDF form is usually difficult to work with. As such, when we come across PDF data that we believe could be useful to an African data journalist, we try to extract the data and present it in CSV or XLSX format, which is much easier to work with. Whenever we do this, we include links to the original PDF documents so that anyone who wishes to use the extracted data can verify its authenticity and correctness.
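Because one dataset can offer the same data in several formats, it can help to pick a resource by format rather than downloading everything. The following is a minimal sketch under the assumption that each resource is a dictionary with `format` and `url` keys, as in CKAN's API output; the `pick_resource` helper and the sample resources are hypothetical.

```python
def pick_resource(resources, preferred_formats):
    """Return the first resource whose format matches the earliest preference."""
    for wanted in preferred_formats:
        for resource in resources:
            # The format may be missing or empty, so fall back to "".
            if (resource.get("format") or "").upper() == wanted.upper():
                return resource
    return None


# Hypothetical resources belonging to a single dataset.
resources = [
    {"name": "Boundaries", "format": "SHP",
     "url": "https://ckan.example.org/files/boundaries.zip"},
    {"name": "Boundaries", "format": "GeoJSON",
     "url": "https://ckan.example.org/files/boundaries.geojson"},
    {"name": "Source report", "format": "PDF",
     "url": "https://ckan.example.org/files/report.pdf"},
]

choice = pick_resource(resources, ["geojson", "csv"])
print(choice["url"] if choice else "No resource in a preferred format")
```

Listing the preferred formats in order means the original PDF is only chosen when it is asked for explicitly or placed last in the list.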
Finding and accessing data
Our data is organised in terms of datasets, organisations, groups and tags. Datasets have already been described above.
Organisations
Each organisation that is part of ADH is represented by an organisation on CKAN. Any data that was found, produced or used by a particular ADH partner can be found in that partner's organisation.
Groups
Groups indicate the data category, for example Health data, Economic data etc. A dataset may belong to more than one group. Groups also serve as folders for all datasets used in a particular data tool or project. Finally, if several different datasets have all been sourced from the same place, then all of those datasets are placed in a group named after that source. See, for example, the Humanitarian Data Exchange and the accompanying blog post ENTER PUBLISHED LINK HERE.
Tags
Tags identify interesting characteristics of the data. For example, if the data contains gender information, it is tagged with gender; if it is geographical data, it is tagged with geodata.
Search
Searching for datasets on the ADH data repository is done in two ways: by typing search terms into the search bar found at the top of almost every page, and by filtering the resulting list. Entering a search term causes CKAN to look for matching terms in the titles, descriptions, locations and tags of datasets, much like searching for something on Google. The resulting list can be further refined using the filter options on the left side of the search results. You can filter by organisation, group, tag, format, and licence.
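The same search-then-filter pattern is available programmatically through CKAN's `package_search` action, which takes a free-text `q` parameter and an `fq` filter query. The sketch below only builds the request URL and does not contact any server. The base URL is a placeholder, and the filter field names (`res_format`, `tags`) are CKAN's standard search fields, assumed here to be unchanged on the ADH instance.

```python
from urllib.parse import urlencode

# Hypothetical base URL; replace with the address of the ADH data repository.
BASE_URL = "https://ckan.example.org"


def search_url(term, filters=None, rows=10):
    """Build a package_search URL for a search term plus optional filters."""
    params = {"q": term, "rows": rows}
    if filters:
        # fq takes field:"value" pairs separated by spaces.
        params["fq"] = " ".join(
            f'{field}:"{value}"' for field, value in filters.items()
        )
    return f"{BASE_URL}/api/3/action/package_search?{urlencode(params)}"


print(search_url("maternal health", {"res_format": "CSV", "tags": "gender"}))
```

Opening the resulting URL in a browser, or fetching it with any HTTP client, would return the matching datasets as JSON.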
Once you have narrowed your search down to a manageable list of datasets, click on one to see whether it contains any resources that you wish to use. You can download any resource by clicking on download in the explore dropdown button to the right of the resource. As noted above, pay attention to the format of the resource that you wish to download and ensure that you download it in the correct format.
Uploading Data
CKAN controls who can upload or make changes to existing data through user permissions. There are three levels of user permissions, with administrator being the highest, followed by editor and member. These permissions are assigned at both the organisation and group level. Only administrators and editors are able to upload or edit data on CKAN. Members are able to view and download the data. There is also a system administrator who has control over the entire system. Visitors to the site, that is, people who do not have login credentials, are only able to view and download public data, while users with login credentials are able to view and download both public and private datasets.
Uploading data to the ADH CKAN platform is straightforward. First, click on the datasets tab at the top of the screen and then click on the add datasets button, as seen in Figure 1.
You will be taken to a webpage like the one shown in Figure 2. Now fill in all of the metadata as described in Table 1.
Once you have filled in all of the required metadata fields, click the Next: Add Data button near the bottom right of your screen, as seen in Figure 3.
Give your data file a descriptive name and a good description and then either upload your file or link it via URL. Click on Save & add another if you have more files to upload, and when you have completed your upload, click on the Add button, seen here in Figure 4.
If you need to edit or remove a file, click on the Manage button near the top right of your screen, as shown in Figure 5.
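The steps above can also be carried out through CKAN's Action API, using `package_create` to create the dataset and `resource_create` to attach a resource by URL. The sketch below prepares the two requests without sending them. The base URL, API token, organisation name and dataset details are all placeholders, and whether ADH members are issued API tokens is an assumption.

```python
import json
from urllib.request import Request

# Hypothetical values; replace with the real repository address and your token.
BASE_URL = "https://ckan.example.org"
API_TOKEN = "replace-with-your-token"


def build_action_request(action, payload):
    """Prepare (but do not send) a POST request to a CKAN Action API endpoint."""
    return Request(
        url=f"{BASE_URL}/api/3/action/{action}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": API_TOKEN, "Content-Type": "application/json"},
        method="POST",
    )


# Step 1: create the dataset and fill in its metadata (see Table 1).
create_dataset = build_action_request("package_create", {
    "name": "example-health-indicators",  # URL-friendly identifier
    "title": "Example Health Indicators",
    "owner_org": "example-organisation",
    "notes": "Hypothetical dataset created for illustration.",
})

# Step 2: add a resource to that dataset by linking to it via URL.
link_resource = build_action_request("resource_create", {
    "package_id": "example-health-indicators",
    "name": "Indicators 2020",
    "url": "https://example.org/data/indicators-2020.csv",
})

print(create_dataset.get_method(), create_dataset.full_url)
```

Passing either request to `urllib.request.urlopen` would submit it. Uploading a file rather than linking to one requires a multipart form request, which this sketch does not cover.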