
Co-occurrences with keyword lists

Playing with Rammstein texts in R can be fun when the outcomes are unexpected and you get plots like this:

Gut, better. We might even see:

How do we manage to make "das Gute" appear?


From general to keyword plots

In the plot based on all the co-occurrences in Rammstein song texts, we can identify a center on the left-hand side:

The meaningful central terms seem to be

"Lust", "Deutschland", "Liebe", "ich", "du", "kalt" and "gut". Choosing these words as keys, we can plot a new picture.



What do we see? There is a clear "ich" - "du" axis. "Liebe" appears together with "ich" and "du" as well. But "Lust" seems to be linked mainly to "ich", and so does "kalt".
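The step from the general plot to a keyword plot can be sketched on a few invented lines instead of the real song texts: build the co-occurrence matrix, keep only the chosen keys, and hand the result to textplot_network() from quanteda.textplots. The toy texts and the object names are assumptions for illustration, not the settings behind the plots in this post.

```r
library(quanteda)
library(quanteda.textplots)

# Toy stand-in for the song texts (invented lines, not real lyrics)
songs <- c(s1 = "ich will dich du bist kalt",
           s2 = "ich bin gut du bist gut liebe ist kalt",
           s3 = "lust und liebe ich und du")
toks <- tokens(songs)

# The keywords chosen from the centre of the general plot
keys <- c("lust", "deutschland", "liebe", "ich", "du", "kalt", "gut")

# Count co-occurrences within the same document, then keep only the keywords
fcm_all  <- fcm(toks, context = "document")
fcm_keys <- fcm_select(fcm_all, pattern = keys, selection = "keep")

# Draw the keyword network with the default edge threshold
textplot_network(fcm_keys)
```

Keys that never occur in the texts (here "deutschland") simply drop out of the selection.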

Into the contexts

Finding "kalt" surprising, even at the border of a net with "ich", "du", "Liebe" and "Lust", I searched for the occurrences of this adjective. While the noun "Kälte" does not appear at all, a KWIC search gives 32 occurrences of "kalt". We can transform the immediate context (window = 10) into a matrix, then count and plot the co-occurrences.

The connection between "ich" and "kalt" looks interesting and might be contrasted with "ihr" and "heiß", but we are talking about very low absolute frequencies.

       feature frequency rank docfreq group
1         kalt       122    1      19   all
2          ich        47    2      25   all
3         heiß        20    3       5   all
4           du        18    4       9   all
5          ihr        18    4       7   all
6         seid        18    4       5   all
7  deutschland        16    7       4   all
8        amour        12    8       3   all
9         ende        10    9       7   all
10        lust        10    9      10   all

Anyway, it might be interesting to draw similar context plots for "ich", "du", "Liebe" and "Lust".

The context of the "personal pronoun" "ich" (in fact it is a deictic, i.e. an indicating, term according to Coseriu) does not seem to reveal any surprises: "ich", "du", "Lust", "Liebe" again.

But where does the adjective "gut" come from? 
Focussing on words of interest, the importance of "gut" becomes clearer:

Partly, this may be caused by the repetitions in the famous song "Du riechst so gut". But the link between "ich" and "gut" seems to be stronger. Indeed, the 43 appearances of "gut" are distributed among 17 different songs. Without "riechst" (31 occurrences!), within the context of "gut", we get: 
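Dropping one dominant word from a keyword context can be sketched like this, again on invented lines. The window size of 5 and all object names are assumptions; the idea is only to show where dfm_remove() fits between the KWIC search and the counting.

```r
library(quanteda)

# Toy stand-in for the song texts (invented lines)
songs <- c(a = "du riechst so gut ich bin gut",
           b = "ich bin so gut zu dir")
toks <- tokens(songs)

# Context of "gut": up to five tokens on either side of each hit
kw <- kwic(toks, pattern = "gut", window = 5)

# Treat every left and right context as a small document and count the words
ctx_dfm <- dfm(tokens(c(kw$pre, kw$post)))

# Remove the word that dominates because of one single song
ctx_dfm <- dfm_remove(ctx_dfm, pattern = "riechst")
topfeatures(ctx_dfm)
```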

Maybe a topic analysis would be helpful here. But a first simple and fast attempt does not look really helpful. The output looks like this:
      topic1     topic2    topic3      topic4   
 [1,] "ich"      "tut"     "ich"       "ich"    
 [2,] "du"       "ich"     "du"        "puppe"  
 [3,] "ihr"      "hört"    "gut"       "ab"     
 [4,] "kind"     "schreit" "riechst"   "kopf"   
 [5,] "schonste" "tu"      "steig"     "ja"     
 [6,] "blut"     "weh"     "geh"       "reiß"   
 [7,] "giftig"   "leid"    "hinterher" "geht"   
 [8,] "dass"     "himmel"  "finde"     "beiß"   
 [9,] "seh"      "gesicht" "gleich"    "hals"   
[10,] "ach"      "ende"    "warte"     "dam-dam"

Topic analysis should be done more extensively. 

Reading words like "ach", "schrei", "giftig" and "Leid", we might consider a sentiment analysis. The triangle "ich", "du" and "gut" probably does not give the definitive analytical insight.

For now, a context search for "Lust" and "Liebe" has to be enough. 

The complete Lust plot

But since we are looking at its context, the word "Lust" itself can be left out. With the keywords only:

In (very low) numbers:
Feature co-occurrence matrix
          ich   du  kalt  gut
    ich 33477  259  2590  259

With the keyword "Liebe", again "ich" is central:

If I had lemmatized the tokens, "lieben" and "liebe" would have been one point. 
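A hand-made version of such a lemmatization can be sketched with tokens_replace(). The two-entry mapping below is an assumption for illustration only; a real lemma list for German would be far longer, and nothing like it was applied to the plots in this post.

```r
library(quanteda)

# Toy lines with three inflected forms of the same verb
toks <- tokens(c(x = "ich liebe dich",
                 y = "wir lieben uns",
                 z = "was sich liebt"))

# Hypothetical mini lemma table: inflected form -> common form
forms  <- c("lieben", "liebt")
lemmas <- c("liebe", "liebe")

# Replace each form by its lemma, so that all three are counted as one feature
toks_lemma <- tokens_replace(toks, pattern = forms, replacement = lemmas,
                             valuetype = "fixed")
counts <- colSums(dfm(toks_lemma))
counts[["liebe"]]
```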
Without "Lust" and "Liebe":


Nice, is it not? In numbers:
features   du  ihr    ich   gut  kalt
    du    179  352   1179    18    72
    ihr     0  214    804    12    56
    ich     0    0  23710  3973  1697
    gut     0    0      0   225   154
    kalt    0    0      0         224
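A note on reading such a matrix: the zeros below the diagonal are not observations. By default fcm() stores each pair only once, in the upper triangle (tri = TRUE). A small sketch on invented lines, with assumed object names, shows the difference to the full symmetric matrix.

```r
library(quanteda)

# Toy stand-in for the context tokens (invented lines)
toks <- tokens(c(a = "ich du gut", b = "du ich kalt", c = "ich gut"))
keys <- c("du", "ich", "gut", "kalt")

# Default: every pair is stored once, the other half of the matrix is filled with zeros
fcm_tri  <- fcm_select(fcm(toks, context = "document"), pattern = keys)

# tri = FALSE: the full symmetric matrix, every pair appears twice
fcm_full <- fcm_select(fcm(toks, context = "document", tri = FALSE), pattern = keys)

m_tri  <- as.matrix(fcm_tri)
m_full <- as.matrix(fcm_full)
m_tri
m_full
```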

In the field of "Liebe", "ich" is "gut", not "du".
In the field of "Lust", "ich" is "kalt", while "du" appears only rarely. Is this Rammstein?

A last glimpse at "du": 207 entries. Is "Du hast" still dominating? Only 13 occurrences of "du" are found in that famous song.
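How the hits of a keyword are spread over the songs can be checked by tabulating the document names of a KWIC search. This is a sketch on invented lines; the song names and counts are made up and only show the mechanics.

```r
library(quanteda)

# Toy stand-in for the song texts (invented lines and names)
toks <- tokens(c(song_a = "du hast du hast mich",
                 song_b = "ich und du",
                 song_c = "kalt"))

# Every hit of "du" is one row, with the name of the song it comes from
kw <- kwic(toks, pattern = "du", window = 3)

nrow(kw)                          # total number of hits
per_song <- table(kw$docname)     # hits per song
per_song
```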

There is something interesting to be investigated. For now, the "du" context without "du":


In numbers?
features    ich   ihr    gut  kalt   heiß
    ich   17023  2960  17760  6845  10915

In the field of "du", "ich" becomes "gut" and "heiß". Reflect. 





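Technically
Everything here is done with the R package quanteda, see Benoit et al.

library(quanteda)            # tokens(), kwic(), dfm()
library(quanteda.textstats)  # textstat_frequency() (in quanteda >= 3)
library(seededlda)           # textmodel_lda(), terms()
library(magrittr)            # the %>% pipe

context search
# toks_fund is the tokens object of the song texts, built beforehand
kontext1 <- kwic(toks_fund, "kalt", case_insensitive = TRUE, valuetype = "glob", window = 10)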

context into matrix
# words after the keyword
dopo1 <- kontext1$post
dopom1 <- quanteda::tokens(dopo1) %>%
  dfm()

# words before the keyword
pre1 <- kontext1$pre
prem1 <- quanteda::tokens(pre1) %>%
  dfm()

# now bind the two matrices and, because of overlapping columns, compress
umfeld <- cbind(dopom1, prem1) %>%
  dfm_compress()

relative frequency
clean_rel <- dfm_weight(umfeld, scheme = "prop")
textstat_frequency(clean_rel, n=10)

Topic analysis trial

# try every number of topics from 3 to 17 and print the top ten terms of each topic
for (k in 3:17) {
  fund_lda <- textmodel_lda(umfeld, k)
  print(terms(fund_lda, 10))
}

