Data preparation
This section gives an idea of what the raw data in the monitoring database looks like.
The data has been extracted from ActivityInfo using the ActivityInfo API and pre-processed to make it ready for analysis.
Most of the data extraction and cleaning is done beforehand (please see the `R/` folder in the repository, and take a close look at the `etl.R` and `etl-methods.R` files in particular). If you want to download the raw data, you must have access to it; this can be done by sourcing the `etl.R` file, as sketched below.
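A minimal sketch of that step, assuming the working directory is the repository root and that access to the ActivityInfo API has already been configured:

```r
# A minimal sketch: run the extraction and cleaning steps defined in etl.R.
# Assumes the working directory is the repository root and that
# ActivityInfo API credentials are already set up.
source("R/etl.R")
```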
The data dictionary below shows what the columns mean in the data:
Column name | Description |
---|---|
databaseId | the internal ActivityInfo id for databases |
databaseName | the name of databases visible to users |
folderId | the internal ActivityInfo id for folders |
folderName | the name of folders visible to users |
formId | the internal ActivityInfo id for forms |
formName | the name of forms visible to users (they are called Sectors in this specific database) |
subFormId | the internal ActivityInfo id for the sub-forms where the records are kept |
subFormName | the name of the sub-forms visible to users |
Month | month information for each record in YYYY-MM format, e.g. 2019-05 |
code | schema code |
question | question label indicated by the field code |
response | response given by users |
required | a boolean value indicating whether a question is required for saving a record |
type | internal type of the code. The available types in the data are ‘Quantity’, ‘Narrative’ and ‘Enumerated’ |
partnerName | the names of reporting partners. The names of implementing partners can be extracted from the data (from the code column) |
canton | the canton name of a record |
province | the province name of a record |
description | the description field further explaining what a question means |
reportingUsers | a list-column containing unique information about the users reporting to a particular record |
The cells are represented as `NA` when a field does not exist or is not applicable. Please see the ActivityInfo documentation for more information about how the information is structured.
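As a quick illustration of the dictionary above, the columns can be inspected once the data is loaded. The sketch below assumes the prepared data sits in a data frame called `monitoring_data` (a hypothetical name; the actual object is created by the scripts in the `R/` folder):

```r
library(dplyr)

# Overview of the columns listed in the data dictionary
glimpse(monitoring_data)

# `type` only takes the values 'Quantity', 'Narrative' and 'Enumerated'
monitoring_data %>%
  count(type)

# Fields that do not exist or are not applicable are stored as NA
monitoring_data %>%
  filter(is.na(folderId)) %>%
  distinct(formName)
```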
Automatic merging of duplicate entities
Duplicate entities found in the records are cleaned up as much as possible during the data cleaning process.
For instance, as of August 2019, two Implementing partners[^1] with almost the same name are found in the data. They are merged into one because they refer to the same partner.
This is fixed by using one of the “approximate string matching” algorithms readily provided by Open Refine[^2].
```r
example_duplicate_partners <- c(
  "Federación de Mujeres de Sucumbios",
  "Federación de mujeres de sucumbíos"
)
example_collision <- refinr::key_collision_merge(example_duplicate_partners)
unique(example_collision)
```

```
## [1] "Federación de Mujeres de Sucumbios"
```
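Applied to the whole dataset, the same idea could look like the sketch below, which again assumes a hypothetical `monitoring_data` data frame with the partnerName column described above; the actual cleaning happens in the scripts under `R/`:

```r
# A sketch only, not the exact code used in the R/ scripts:
# collapse near-duplicate partner names onto one canonical spelling.
monitoring_data$partnerName <- refinr::key_collision_merge(monitoring_data$partnerName)
```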
[^1]: See the Partners section for more information about what the term means.
[^2]: The R package called refinr successfully ported it to R: https://cran.r-project.org/package=refinr