
about stopwords, in machine life and in literature

The practical part of introductions to Text Analysis with R usually starts to the sound of the hunting horn. "Tokenization" is the watchword, and "elimination" the first goal. The analyst does not care about punctuation (remove_punct = TRUE)! The analyst has to reduce the burden of his or her word sack, and there he or she notices the "stopwords", functional words that connect and move ideas, words such as "then" and "when", "how" and "because". Any content? No! Throw them away!

It might be true that programs for Data Analysis have been developed for, and are mostly used by, marketing experts who do not really care about moving ideas, and usually end up with some "sentiment analysis": Feelin good? Ok!

It is also true that in AI, in fields like Natural Language Understanding, caution is the dominant rule. Keep the stopwords, you never know!

There are good reasons for doing so. AI should include every word that touches human beings, maybe even those that, strangely enough, we find in philosophical works. Stopwords, to tell the truth.

Take the concluding sentence of the first paragraph of Hegel's "Science of Logic": "Being, the indeterminate immediate is in fact nothing, and neither more nor less than nothing".

Having eliminated stopwords and punctuation with

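    > library("quanteda")
    > word_chain <- "Being, the indeterminate immediate is in fact nothing, and neither more nor less than nothing"
    > tokens(word_chain, remove_punct = TRUE) %>%
    +     tokens_remove(pattern = stopwords("en"))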

only a sad remainder is left:

    [1] "indeterminate" "immediate"
    [3] "fact"          "nothing"
    [5] "neither"       "less"
    [7] "nothing"

The subject "being" has vanished, but "nothing" remained. Nonsense.

Philosophy is all about stopwords. But it is not only concepts: the sound, i.e. the aesthetic aspect of language, is also influenced by stopwords.

Manzoni's famous novel "The Betrothed" starts with the words: "quel ramo del lago di Como che volge a mezzogiorno": "that branch of Lake Como that turns towards noon". "Remove stopwords" (Quanteda) turns the English version into:

    [1] "branch"  "Lake"    "Como"    "turns"
    [5] "towards" "noon"

"Towards" has not been eliminated ... and in Italian the elimination procedure does not work any better:

    [1] "quel"        "ramo"        "lago"
    [4] "Como"        "volge"       "mezzogiorno"

Here "quel" is kept, a functional word which is probably considered old-fashioned and has therefore not been included in the list of stopwords.
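If one wanted to patch the list by hand, something like the following might do. It is only a sketch: the sentence variable and the names custom_stopwords, toks_it and remaining are made up for this illustration, and adding "quel" is my own choice, not a quanteda default.

```r
library(quanteda)

sentence_it <- "quel ramo del lago di Como che volge a mezzogiorno"

# Check whether "quel" is on the default Italian list
# (the output above suggests it is not)
"quel" %in% stopwords("it")

# Extend the list by hand; "quel" is added as a personal choice
custom_stopwords <- c(stopwords("it"), "quel")

toks_it <- tokens(sentence_it, remove_punct = TRUE)
toks_it_clean <- tokens_remove(toks_it, pattern = custom_stopwords)

# Flatten the single-document tokens object to a plain character vector
remaining <- as.character(toks_it_clean)
print(remaining)
```

The same trick, with stopwords("en") and "towards", would presumably tidy up the English version as well.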

Neglecting the inconsistencies, we should notice a particular trait of Manzoni's first words: functional and content words take turns: F C F C ...
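The alternation can be made visible with a few lines of base R. This is a rough sketch, not a linguistic analysis: the list of functional words below is hand-picked for this one sentence (an assumption of mine, not quanteda's list), and the variable names are invented for the example.

```r
# Hand-picked functional words for this sentence only (an assumption)
function_words <- c("quel", "del", "di", "che", "a")

sentence <- "quel ramo del lago di Como che volge a mezzogiorno"
words <- strsplit(sentence, " ", fixed = TRUE)[[1]]

# Tag each word: F = functional, C = content
tags <- ifelse(tolower(words) %in% function_words, "F", "C")
pattern <- paste(tags, collapse = " ")
print(pattern)

# TRUE when no two neighbouring words carry the same tag
strictly_alternating <- all(tags[-1] != tags[-length(tags)])
print(strictly_alternating)
```

With a different list of functional words the pattern would of course come out differently, which is exactly the problem with stopword lists.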

An Irish example? The beginning of "Finnegans Wake":

"riverrun, past Eve and Adam’s, from swerve of shore to bend of bay". 

    [1] "riverrun" "past"     "Eve"
    [4] "Adam's"   "swerve"   "shore"
    [7] "bend"     "bay"

"Past" taken as a "content word". Well.

There is a reason why people remember the beginnings of certain literary works. It is a question of rhythm, and maybe marketing experts could take some interest, sooner or later.
