About stopwords, in machine life and in literature

The practical part of introductions to text analysis with R usually starts to the sound of the hunting horn. "Tokenization" is the watchword, and "elimination" the first goal. The analyst does not care about punctuation (remove_punct = TRUE)! The analyst has to lighten his or her bag of words, and then notices the "stopwords", function words that connect ideas and set them in motion: "then" and "when", "how" and "because". Any content? No! Throw them away!

It might be true that programs for data analysis have been developed for, and are mostly used by, marketing experts who do not really care about moving ideas and usually end up with some "sentiment analysis": Feeling good? OK!

It is also true that in AI, in fields such as Natural Language Understanding, caution is the dominant rule. Keep the stopwords, you never know!

There are good reasons for doing so. AI should include every word that touches human beings, maybe even those that, strangely enough, we find in philosophical works. Stopwords, to tell the truth.

Take the concluding sentence of the first paragraph of Hegel's "Science of Logic": "Being, the indeterminate immediate is in fact nothing, and neither more nor less than nothing".

Having eliminated stopwords and punctuation with

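    > library("quanteda")
    > word_chain <- "Being, the indeterminate immediate is in fact nothing, and neither more nor less than nothing"
    > tokens(word_chain, remove_punct = TRUE) |>
    +   tokens_remove(pattern = stopwords("en"))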

only a sad remainder is left:

    [1] "indeterminate" "immediate"
    [3] "fact"          "nothing"
    [5] "neither"       "less"
    [7] "nothing"

The subject "being" has vanished, but "nothing" remained. Nonsense.
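One way to act on this, sketched here under assumptions: trim the stopword list before removing anything, so that the words carrying the argument survive. The vector `keep` is a hypothetical choice made for this one sentence, not a recommendation; `tokens()`, `tokens_remove()` and `stopwords()` are the quanteda functions already used above.

```r
library("quanteda")

word_chain <- "Being, the indeterminate immediate is in fact nothing, and neither more nor less than nothing"

# Hypothetical list of words we refuse to throw away for this sentence
keep <- c("being", "is", "more", "nor", "than")

# Standard English list minus the words we want to protect
trimmed_stopwords <- setdiff(stopwords("en"), keep)

toks <- tokens(word_chain, remove_punct = TRUE) |>
  tokens_remove(pattern = trimmed_stopwords)

# Plain character vector of the surviving tokens
kept_tokens <- as.list(toks)[[1]]
print(kept_tokens)
```

With a trimmed list like this, the subject of Hegel's sentence stays in the token stream, while words such as "the" are still dropped.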

Philosophy is all about stopwords. But stopwords shape not only concepts; they also shape the sound, i.e. the aesthetic aspect, of language.

Manzoni's famous novel "The Betrothed" starts with the words "quel ramo del lago di Como che volge a mezzogiorno": "that branch of Lake Como that turns towards noon". Stopword removal with quanteda turns the English version into:

    [1] "branch"  "Lake"    "Como"    "turns"
    [5] "towards" "noon"

"Towards" has not been eliminated. In Italian the elimination procedure works no better:

    [1] "quel"        "ramo"        "lago"
    [4] "Como"        "volge"       "mezzogiorno"

Here we are keeping "quel", a function word which is probably considered old-fashioned and therefore has not been included in the list of stopwords.
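Whether a particular word is on a given list can be checked directly. The following is a minimal sketch, assuming the default lists that quanteda's `stopwords()` returns for "en" and "it"; the result depends on the installed version of the list, so no output is claimed here.

```r
library("quanteda")

# TRUE means the word is on the list and would be removed
towards_listed <- "towards" %in% stopwords("en")
quel_listed <- "quel" %in% stopwords("it")

print(towards_listed)
print(quel_listed)
```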

Neglecting the inconsistencies, we should notice a particular trait of Manzoni's first words: function words and content words take turns: F C F C ...
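The alternation can be made visible with a few lines of base R. This is only a sketch: `fc_pattern` is a hypothetical helper, and the short list of Italian function words is hand-picked for this one line rather than taken from any published stopword list.

```r
# Label each word as function word ("F") or content word ("C").
# `function_words` is supplied by the caller; nothing here is tied to a
# particular stopword list.
fc_pattern <- function(text, function_words) {
  # Split on anything that is not a letter or an apostrophe
  words <- strsplit(tolower(text), "[^[:alpha:]']+")[[1]]
  words <- words[nzchar(words)]
  labels <- ifelse(words %in% tolower(function_words), "F", "C")
  paste(labels, collapse = " ")
}

# Hand-picked Italian function words for this one line (an assumption)
italian_function_words <- c("quel", "del", "di", "che", "a")

manzoni <- "quel ramo del lago di Como che volge a mezzogiorno"
pattern <- fc_pattern(manzoni, italian_function_words)
print(pattern)
# [1] "F C F C F C F C F C"
```

Passing a ready-made list such as `stopwords("it")` as `function_words` would give a different pattern wherever that list disagrees with the hand-picked one, as it does for "quel" in the output above.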

An Irish example? The beginning of "Finnegans Wake":

"riverrun, past Eve and Adam’s, from swerve of shore to bend of bay". 

    [1] "riverrun" "past"     "Eve"
    [4] "Adam's"   "swerve"   "shore"
    [7] "bend"     "bay"

"Past" is treated as a "content word". Well.

There is a reason why people remember the beginnings of certain literary works. It is a question of rhythm, and maybe marketing experts could take some interest, sooner or later.
