Textual data preparation

The textual data preparation process usually involves five steps:

1. Tokenization

Tokenization means splitting a text into tokens, the units of text considered meaningful. A token can be a word (as it most often is), a group of words (such as a bigram), or even a sentence, depending on the level of analysis.

Tokenization is later followed by stemming (step 5), which brings words (nouns/verbs) back to their base or infinitive form so that we capture the essence of each word.
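As a rough illustration of the three token levels mentioned above (word, bigram, sentence), here is a minimal Python sketch using only the standard library. The function names and the sample sentence are hypothetical, and the regular expressions are a simplification rather than the tokenizer used in the analysis.

```python
import re

def word_tokens(text):
    """Split text into word tokens (runs of letters, digits or underscores)."""
    return re.findall(r"\w+", text)

def bigrams(tokens):
    """Pair each token with its successor to form two-word tokens."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def sentence_tokens(text):
    """Naive sentence split: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# Hypothetical sample text, not taken from the survey data.
sample = "La entrega fue realizada. Se complementa con otra respuesta."
words = word_tokens(sample)
print(words[:4])                  # ['La', 'entrega', 'fue', 'realizada']
print(bigrams(words)[0])          # La entrega
print(len(sentence_tokens(sample)))  # 2
```

Which level to use depends on the analysis: word tokens suit frequency counts, while bigrams keep short phrases together.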

2. Strip punctuation

Punctuation is usually not needed in text analysis and only adds noise, unless the researcher wants to tokenize the text on a punctuation-based unit such as sentences.
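A minimal Python sketch of punctuation stripping follows. It relies on Unicode character categories rather than an ASCII-only list, on the assumption that Spanish marks such as ¿ and ¡ should also be removed; the function name and sample string are hypothetical.

```python
import unicodedata

def strip_punctuation(text):
    """Replace every Unicode punctuation character with a space,
    then collapse repeated whitespace."""
    cleaned = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )
    return " ".join(cleaned.split())

print(strip_punctuation("¿Cuál es la respuesta? ¡Gracias!"))
# Cuál es la respuesta Gracias
```

Replacing punctuation with a space instead of deleting it avoids gluing together words that were separated only by a comma or a dash.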

3. Convert text into lowercase

Once the text is converted to lowercase, the words respuesta and Respuesta, for instance, are no longer treated as different words.
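The effect on word counts can be seen in a small Python sketch. The token list is hypothetical and only meant to show how case variants collapse into one entry.

```python
from collections import Counter

# Hypothetical tokens containing case variants of the same word.
tokens = ["Respuesta", "respuesta", "RESPUESTA", "Entrega"]

raw_counts = Counter(tokens)                       # case-sensitive counts
lower_counts = Counter(t.lower() for t in tokens)  # counts after lowercasing

print(len(raw_counts))            # 4
print(lower_counts["respuesta"])  # 3
```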

4. Exclude stopwords & numbers

Stop words are the most common words in a language, which contribute little to the analysis. They occur very frequently in the text and carry little meaning on their own. Stop words include articles (el/la), conjunctions (y), pronouns (yo/tú, etc.) and so on.

In text mining, this step is usually performed after the text has been converted to lowercase, so that the stop-word list does not need to include both lowercase and sentence-case versions of each word.

We imported a list of Spanish stop words (an alternative to the stopwords_es list from the corpus³ package) and performed a filtering join on the tokenized text that returns only the tokens not listed in the stop words.

The response tokens originally had 68,626 rows. After the stop words were filtered out, the number of rows dropped to 34,912, a decrease of about 49% (roughly 51% of the tokens remain).
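As a quick check of the row counts quoted above, the share of tokens removed and the share kept can be computed directly. The variable names are illustrative; the two counts are the ones given in the text.

```python
rows_before = 68626  # token rows before stop-word filtering (from the text)
rows_after = 34912   # token rows after stop-word filtering (from the text)

removed = rows_before - rows_after
share_removed = removed / rows_before
share_kept = rows_after / rows_before

print(f"removed: {removed} rows ({share_removed:.1%})")  # removed: 33714 rows (49.1%)
print(f"kept: {rows_after} rows ({share_kept:.1%})")     # kept: 34912 rows (50.9%)
```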

It is also possible to add custom words to the list, e.g. ACNUR if certain organization names are not wanted in the results, or violencia if a word is overused and adds no further explanation.
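The filtering in this step can be sketched in Python as an anti-join between the tokens and a stop-word set. The short stop-word list, the custom additions and the sample tokens are all hypothetical, and dropping purely numeric tokens with `isdigit()` is an assumption about how the number exclusion named in the step title might be done.

```python
# Hypothetical stop-word list; a real analysis would load a full Spanish list.
STOPWORDS = {"el", "la", "y", "yo", "tú", "de", "que"}

# Hypothetical project-specific additions, following the examples in the text.
CUSTOM_STOPWORDS = {"acnur", "violencia"}

def filter_tokens(tokens, stopwords):
    """Anti-join: keep lowercased tokens that are absent from `stopwords`;
    purely numeric tokens are dropped as well."""
    kept = []
    for token in tokens:
        lowered = token.lower()
        if lowered in stopwords or lowered.isdigit():
            continue
        kept.append(lowered)
    return kept

tokens = ["La", "entrega", "de", "ACNUR", "y", "la", "respuesta", "2019"]
print(filter_tokens(tokens, STOPWORDS))
# ['entrega', 'acnur', 'respuesta']
print(filter_tokens(tokens, STOPWORDS | CUSTOM_STOPWORDS))
# ['entrega', 'respuesta']
```

Lowercasing before the lookup is what allows the stop-word set to hold only one case variant of each word.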

5. Perform stemming

Stemming is a process that removes the suffixes (and sometimes prefixes) of words and brings them to their base form. The “Hunspell” stemmer is used (from the hunspell⁴ package), which provides more precise stemming behavior than the other stemmers available in the R ecosystem.

From this point onwards, we use the stemmed words instead of the raw tokens, because stemming groups the inflected forms of a word together and therefore gives more informative counts.

After stemming, the words look like this:

word	word_stem
complementa	complementar
entrega	entregar
realizada	realizar
meses	mesar
anteriores	anterior
...	...
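To show how stemmed words replace raw tokens in later counts, here is a minimal Python sketch of a lookup-based stemmer. The lookup table simply copies the example rows above and stands in for a real dictionary; it is not the Hunspell stemmer, and the function name and sample tokens are hypothetical.

```python
from collections import Counter

# Toy lookup table copied from the example rows above; NOT the Hunspell dictionary.
STEMS = {
    "complementa": "complementar",
    "entrega": "entregar",
    "realizada": "realizar",
    "meses": "mesar",
    "anteriores": "anterior",
}

def stem(word, stems=STEMS):
    """Return the recorded stem, or the word itself when no stem is known."""
    return stems.get(word, word)

# Hypothetical tokens; words missing from the table pass through unchanged.
tokens = ["entrega", "realizada", "entrega", "respuesta"]
stem_counts = Counter(stem(t) for t in tokens)
print(stem_counts["entregar"])  # 2
```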

³ Patrick O. Perry (2017). corpus: Text Corpus Analysis. R package version 0.10.0. https://CRAN.R-project.org/package=corpus
⁴ Jeroen Ooms (2018). hunspell: High-Performance Stemmer, Tokenizer, and Spell Checker. R package version 3.0. https://CRAN.R-project.org/package=hunspell