Jerid Francom bio photo

Jerid Francom

Associate Professor of Spanish and Linguistics
Romance Languages
Wake Forest University

Curriculum vitae

Email Twitter Github Stackoverflow Last.fm

I’ve been working on a Shiny web app to visualize the results from a text classification algorithm. Specifically, the app aims to classify a particular document in Spanish as either approximating the usage from one of three Spanish subvarieties: Argentine, Mexican, and Peninsular.

In addition to an overall classification, and returning the corresponding probability score, I also want to be able to see how the individual features in the text contribute to the classification. After applying the classification algorithm to a loaded document, the feature probabilities look something like this:

  aceptar ahora ándale aprovecha arreglarse azul baja bolsa casa casa
Argentina 41.87 30.06 0.47 22.27 28.41 34.04 27.77 27.94 34.9 34.9
Mexico 28.8 34.36 97.16 49.92 63.7 36.01 33.92 41.75 27.38 27.38
Spain 29.32 35.58 2.37 27.81 7.89 29.95 38.31 30.31 37.72 37.72

My first whack at a visualization of these feature probabilities was to create a function markup.text(results = , document = ) to take the words in the document and match them with their corresponding feature scores and use theses scores to apply a color gradient in HTML markup from most and least indicative of the assigned class.

Example HTML output for one word:

<span style = "background-color: rgb(74,255,000);">chavo</span>

Words in the document but not in the model are marked in grey.

Here are the results we are looking for:

se trata de un chavo que pasa a la casa de su novia a recogerla . su suegro lo recibe muy contento . el chavo está vestido de smokin y la chica lo saluda , también con un vestido azul muy elegante , pero le indica que la espere porque va a terminar de arreglarse . el papá de la novia saca una pequeña licorera de metal y se la da al novio , quien lo mira extrañado y le dice con señas que no la quiere aceptar . el papá insiste y la mete en la bolsa del saco del novio . ahora , el papá saca un enorme churro y se lo ofrece , como diciéndoleándale , aprovecha de una vez “ . el chavo , con una sonrisa nerviosa , nuevamente le dice que no , pero el papá mete también el cigarro de mariguana en el saco del chavo . finalmente , el papá saca un preservativo y se lo ofrece al novio , quien se resiste , pero el papá hace un gesto como diciendomás vale prevenir que lamentar “ . cuando baja la novia y saluda al chavo , el papá de inmediato mete la mano al saco del novio y le enseña a la chava la licorera , el churro de mariguana y el preservativo . ella queda muy asombrada y se enoja con su novio . por supuesto , el chavo trata de explicar , pero el papá lo corre de inmediato de la casa y manda a la chica de regreso a su recámara , con un gesto de regaño , como diciendo “ ¡ mira con qué clase de personas sales ! “ . ella se sube muy triste y enojada y el señor se queda muy contento , haciendo un gesto de “ ¡ victoria , lo conseguí ! “ .

How did I get here:

  • Prep the original document for comparison with the model by creating a text vector with lowercased words and isolated punctuation.

Original document

Llega un chico vestido muy elegante como que va a un baile de graduación o algo así. Llega a la casa de su novia, o amiga, y el papá de la chica lo recibe. Antes de que baje la chica por las escaleras el señor, papá de la chica, le da una botellita de alcohol, un gallo y un condón. El chico está muy apenado y el papá hace gestos de como “no te preocupes, soy bien buena onda, conmigo no hay pedo” y le pone todas las cosas en el bolsillo. En ese instante baja la hija ya vestida para ir al baile y antes de irse el papá los para y le revisa los bolsillos al chico y encuentra todo lo que él le había puesto. Hace un pancho y corre al chico de su casa y le dice a su hija que se regrese a su cuarto.
# Prep document
clean.document <- document %>%
        tolower %>%
        gsub(pattern = '([[:punct:]])', replacement = " \\1 ", x = .) %>%
        strsplit(split = " ") %>%
        unlist
  • From the MNB results file, find information about the class selected and the scores for all classes. Take the feature probability scores and calculate the difference between the next closest class and the class selected to create a more telling relative score. When a feature probability matches for all classes, this feature is not in the model. When a relative score equals the median score it is marked as an OOV (out-of-vocabulary) item.
# Get information on classes
class <- results$classification # class selected
num.classes <- nrow(results$feature.probs) # Get the number of classes

# Retrieve scores for given class
oov.score <- round(100/num.classes, 2) # Get the out-of-vocabulary score (uniform prob.)
class.scores <- results$feature.probs[rownames(results$feature.probs) == class, ] # Scores for class selected
median.scores <- results$feature.probs %>% apply(2, median) # Find median score from all classes
median.scores[median.scores == oov.score] <- NA # Mark OOV items
scores <- class.scores - median.scores # Get the relative difference between selected class and next nearest class
  • Now, a loop runs through and compares the words in the model with the words of the text. When a word appears in the text that is in the model, an RGB number is assigned on a gradient from green to red corresponding to most and least indicative of the word for the class selected. Words not in the model are not included in the gradient. HTML markup is added to the words using the RGB specifications.
# Sequence through the document marking up words based on conditional probs
tagged.document <- character() # init `tagged.document`
for(i in seq(clean.document)) {
        if(clean.document[i] %in% names(scores)) { # Find 'words' and add markup
                score <- scores[names(scores) == clean.document[i]][1]
                if (is.na(score)) {
                        rgb <- paste0("255,255,255") # white
                } else if (score >= 0) {
                        x <- round(abs(((score/100)*255)-255), 0) # relative score
                        rgb <- paste0(x, ",255,000")
                } else {
                        x <- round(abs(((score/100)*255)+255), 0) # relative score
                        rgb <- paste0("255,", x, ",000")
                }
                tagged.document <-
                        c(tagged.document,
                          paste0('<span style = "background-color: rgb(',
                                 rgb,
                                 ');">',
                                 clean.document[i],
                                 '</span>'))
        } else { # other units, leave them alone
                tagged.document <- c(tagged.document, paste(" ", clean.document[i], " "))
        }
}
  • Finally the resulting vector is flattend, and returned as a single character vector.
# Return `tagged.text` as running text
tagged.document %>% paste(collapse = "", sep = "")
return(tagged.document)

Returning to the Shiny application, in the server.R script a reactive input renderUI({}) pulls the data from an uploaded file and pullls the results from the text classification algorithm on this data. These two pieces of data are fed to the markup.text() function above and the results are exported as output to the text1 variable as HTML.

output$text1 <- renderUI({
                data <- dataInput()
                results <- dataResults()

                if (is.null(input$file))
                        return(NULL)
                HTML(markup.text(results = results,
                            document = data))
        })

In the ui.R script, the output text1 information is returned within the body within a box() element using the htmlOutput() function. Here you can also see the infoBoxOutput() functions that return the overall probability scores for all classes

body <- dashboardBody(
        tabItems(
                tabItem(tabName = "text",
                        fluidRow(
                                infoBoxOutput("classification1"),
                                infoBoxOutput("classification2"),
                                infoBoxOutput("classification3")
                        ),
                        fluidRow(
                                box(width = 12,
                                    valstatus = "primary",
                                    solidHeader = TRUE,
                                    title = "Classification results",
                                    htmlOutput("text1", inline = TRUE),
                                    footer = textOutput("model"))
                        )
                )
        )
)

The results look like this.

center

The next step is to try to make each feature a link to a keyword-in-context interface showing the usage of this linguistic item in context.

sessionInfo()
## R version 3.2.1 (2015-06-18)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] graphics  grDevices utils     datasets  methods   stats     base     
## 
## other attached packages:
## [1] pander_0.5.2  knitr_1.11    ggplot2_1.0.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.0      digest_0.6.8     MASS_7.3-43      grid_3.2.1      
##  [5] plyr_1.8.3       gtable_0.1.2     formatR_1.2      magrittr_1.5    
##  [9] evaluate_0.7.2   scales_0.2.5     stringi_0.5-5    reshape2_1.4.1  
## [13] proto_0.3-10     tools_3.2.1      stringr_1.0.0    munsell_0.4.2   
## [17] colorspace_1.2-6