Data preparation

This section gives some ideas how the raw data in the monitoring database looks like.

  • The data has been extracted from ActivityInfo by using the ActivityInfo API and pre-processed to make it ready for the analysis.

  • The most of data extraction and cleaning are done beforehand (please see R/ folder in the repository especially take a close look at etl.R and etl-methods.R files). If you want to download the raw data, you must have an access for it, that can be done by sourcing the etl.R file.

  • Data dictionary shows what the columns mean in the data:

Column name Description
databaseId the internal ActivityInfo id for databases
databaseName the name of databases visible to users
folderId the internal ActivityInfo id for folders
folderName the name of folders visible to users
formId the internal ActivityInfo id for forms
formName the name of forms visible to users (they are called as Sectors in this database specific)
subFormId the internal ActivityInfo id for the sub-forms where the records are kept
subFormName the name of the sub-forms visible to users
Month month information for each record as YYYY-MM format e.g. 2019-05
code schema code
question question label indicated by the field code
response response given by users
required a boolean value to check whether a question is required for saving a record
type internal type for the code. The available types in the data are ‘Quantity’, ‘Narrative’ and ‘Enumerated
partnerName the name of reporting partners. The name of implementing partners can be extracted from the data (from code column)
canton the canton name of a record
province the province name of a record
description the description field further explaining what a question can mean
reportingUsers a list-column containing unique user information reporting to a particular record
  • The cells are represented as NA when fields not exists or not applicable.

  • Please see the ActivityInfo documentation for more information about how the information is structured.

Automatic merging on duplicate entities

The duplicated entities found in the records are cleaned as much as possible during the data cleaning process.

For instance, as of August 2019, there are two Implementing partners1 with almost the same name are found in the data. They are merged into one because they imply the same partner.

This is fixed by using one of the “approximate string matching” algorithms that is readily provided by Open Refine2.

## [1] "Federación de Mujeres de Sucumbios"

  1. See the Partners section for more information about what the term means.

  2. And the R package called refinr successfully ported it to R: https://cran.r-project.org/package=refinr