Data preparation
This section gives an idea of what the raw data in the monitoring database looks like.
The data has been extracted from ActivityInfo using the ActivityInfo API and pre-processed to make it ready for analysis.
Most of the data extraction and cleaning is done beforehand (please see the `R/` folder in the repository, and take a close look at the `etl.R` and `etl-methods.R` files in particular). If you want to download the raw data, you must have access to it; this can be done by sourcing the `etl.R` file, as sketched below.
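A minimal sketch of that step, assuming the working directory is the repository root and that access to the ActivityInfo API has already been configured:

```r
# A minimal sketch: run the extraction and cleaning steps defined in etl.R.
# Assumes the working directory is the repository root and that
# ActivityInfo API credentials are already set up.
source("R/etl.R")
```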
The data dictionary below shows what the columns mean in the data:
Column name | Description |
---|---|
databaseId | the internal ActivityInfo id for databases |
databaseName | the name of databases visible to users |
folderId | the internal ActivityInfo id for folders |
folderName | the name of folders visible to users |
formId | the internal ActivityInfo id for forms |
formName | the name of forms visible to users (they are called Sectors in this specific database) |
subFormId | the internal ActivityInfo id for the sub-forms where the records are kept |
subFormName | the name of the sub-forms visible to users |
Month | month information for each record in YYYY-MM format, e.g. 2019-05 |
code | schema code |
question | question label indicated by the field code |
response | response given by users |
required | a boolean value indicating whether a question is required for saving a record |
type | internal type of the code. The available types in the data are ‘Quantity’, ‘Narrative’ and ‘Enumerated’ |
partnerName | the names of reporting partners. The names of implementing partners can be extracted from the data (from the code column) |
canton | the canton name of a record |
province | the province name of a record |
description | the description field further explaining what a question means |
reportingUsers | a list-column containing unique information about the users reporting to a particular record |
The cells are represented as `NA` when a field does not exist or is not applicable. Please see the ActivityInfo documentation for more information about how the information is structured.
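As a quick illustration of the dictionary above, the columns can be inspected once the data is loaded. The sketch below assumes the prepared data sits in a data frame called `monitoring_data` (a hypothetical name; the actual object is created by the scripts in the `R/` folder):

```r
library(dplyr)

# Overview of the columns listed in the data dictionary
glimpse(monitoring_data)

# `type` only takes the values 'Quantity', 'Narrative' and 'Enumerated'
monitoring_data %>%
  count(type)

# Fields that do not exist or are not applicable are stored as NA
monitoring_data %>%
  filter(is.na(folderId)) %>%
  distinct(formName)
```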
Automatic merging of duplicate entities
Duplicate entities found in the records are cleaned up as much as possible during the data cleaning process.
For instance, as of August 2019, two Implementing partners[^1] with almost the same name are found in the data. They are merged into one because they refer to the same partner.
This is fixed by using one of the “approximate string matching” algorithms readily provided by Open Refine[^2].
```r
example_duplicate_partners <- c(
  "Federación de Mujeres de Sucumbios",
  "Federación de mujeres de sucumbíos"
)
example_collision <- refinr::key_collision_merge(example_duplicate_partners)
unique(example_collision)
```

```
## [1] "Federación de Mujeres de Sucumbios"
```
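Applied to the whole dataset, the same idea could look like the sketch below, which again assumes a hypothetical `monitoring_data` data frame with the partnerName column described above; the actual cleaning happens in the scripts under `R/`:

```r
# A sketch only, not the exact code used in the R/ scripts:
# collapse near-duplicate partner names onto one canonical spelling.
monitoring_data$partnerName <- refinr::key_collision_merge(monitoring_data$partnerName)
```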
[^1]: See the Partners section for more information about what the term means.
[^2]: The R package called refinr successfully ported it to R: https://cran.r-project.org/package=refinr