Textual data preparation
There are usually five steps involved in the textual data preparation process:
1. Tokenization
Tokenization means splitting a text into tokens, the meaningful units of the text. Depending on the level of analysis, a token can be a single word (and often it is), a group of words (such as a bigram), or even a whole sentence.
Stemming, which brings words (nouns/verbs) back to their base or infinitive forms so that we can get at the essence of each word, is performed as the final step (see step 5 below).
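As a quick sketch of this step, assuming a hypothetical data frame `responses` with an `id` and a `text` column, word-level tokenization could look like the following. The tidytext package used here is one common choice in the R ecosystem; the text itself does not name a tokenizer:

```r
library(dplyr)
library(tidytext)

# Hypothetical input: one row per free-text response
responses <- tibble(
  id   = 1:2,
  text = c("La entrega complementa la respuesta.",
           "Respuesta realizada en meses anteriores.")
)

# unnest_tokens() emits one token (word) per row; by default it also
# lowercases the text and strips punctuation, covering steps 2 and 3
tokens <- responses %>%
  unnest_tokens(word, text)
```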
2. Strip punctuation
Punctuation is often not required in text analysis (unless a researcher wants to tokenize the text into specific units such as sentences); otherwise, punctuation marks only create noise.
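In base R this can be done with a regular expression; the Unicode punctuation class `\p{P}` also catches the Spanish inverted marks that ASCII-only classes can miss (a generic sketch, not taken from the text):

```r
x <- "¡Hola! ¿Cómo estás, amigo?"

# \p{P} is the Unicode punctuation class, so ¡ and ¿ are removed too
gsub("\\p{P}", "", x, perl = TRUE)
#> [1] "Hola Cómo estás amigo"
```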
3. Convert text into lowercase
Once the text is converted to lowercase, the words respuesta and Respuesta, for instance, are no longer treated as different words.
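In R this is a one-liner; `stringr::str_to_lower()` is a locale-safe alternative to base `tolower()` for accented text (again a generic sketch):

```r
tolower("Respuesta")
#> [1] "respuesta"

# stringr lowercasing handles accented characters reliably
# regardless of the system locale
stringr::str_to_lower("RESPUESTA RÁPIDA")
#> [1] "respuesta rápida"
```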
4. Exclude stopwords & numbers
Stop words are the most common words in a language, words that contribute no significant information to the analysis. They are spread so evenly throughout the text that they carry little meaning on their own. Stop words include articles (el/la), conjunctions (y), pronouns (yo/tú, etc.), and so on.
In text mining, this step is usually performed after the text has been converted to lowercase, so that the stop-word list does not have to include both lowercase and sentence-case versions of each word.
We have imported a list of Spanish stop words (source here) as an alternative to the stopwords_es list from the corpus³ package, and performed a filtering join that returns only those tokens from the textual data that are not listed among the stop words.
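A sketch of the filtering join, assuming the tokens live in a data frame `tokens` with a `word` column and the stop-word list in a data frame `stopwords_es` with a matching `word` column (hypothetical names):

```r
library(dplyr)

# anti_join() is dplyr's filtering join: it keeps only the rows of
# `tokens` whose `word` has no match in `stopwords_es`
tokens_clean <- tokens %>%
  anti_join(stopwords_es, by = "word")
```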
The tokenized responses originally contained 68,626 rows. After filtering out the stop words, the number of rows decreased to 34,912, so only about 51% of the original tokens remain.
It is also possible to add custom words to the list, e.g. ACNUR if organization names are not desired in the results, or violencia if a word is so overused that it adds no further explanation.
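Extending the same filtering join with custom words might look like this (a sketch; the two words below are the examples from the text):

```r
library(dplyr)

# Hypothetical custom additions (already lowercase, see step 3)
custom_stopwords <- tibble(word = c("acnur", "violencia"))

tokens_clean <- tokens_clean %>%
  anti_join(custom_stopwords, by = "word")
```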
5. Perform stemming
Stemming is a process that removes the suffixes (and sometimes prefixes) of words and brings them back to their base form. The Hunspell stemmer is used (from the hunspell⁴ package), as it provides more precise stemming behavior than the other stemmers available in the R ecosystem.
From this point onwards, we will use the stemmed words instead of the raw tokenized words, because the stemmed words give us better information.
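A minimal sketch of this step with hunspell, assuming a Spanish dictionary (es_ES) is installed on the system. `hunspell_stem()` returns a list of candidate stems per word; picking the last candidate is a simplifying heuristic here, not something specified in the text:

```r
library(hunspell)

words <- c("complementa", "entrega", "realizada", "meses", "anteriores")

# hunspell_stem() returns a list of candidate stems per word; an
# empty vector means the dictionary proposes no stem
stems <- hunspell_stem(words, dict = dictionary("es_ES"))

# Heuristic (an assumption, not from the text): take the last
# candidate, falling back to the word itself when none is found
word_stem <- mapply(function(w, s) if (length(s)) s[length(s)] else w,
                    words, stems)
```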
After stemming, the words look like this:
| word        | word_stem    |
|-------------|--------------|
| complementa | complementar |
| entrega     | entregar     |
| realizada   | realizar     |
| meses       | mesar        |
| anteriores  | anterior     |
| ...         | ...          |
3. Patrick O. Perry (2017). corpus: Text Corpus Analysis. R package version 0.10.0. https://CRAN.R-project.org/package=corpus
4. Jeroen Ooms (2018). hunspell: High-Performance Stemmer, Tokenizer, and Spell Checker. R package version 3.0. https://CRAN.R-project.org/package=hunspell