
Rammstein read by the machine (3): Liebe lieben

As we have seen, love ("liebe") is, once the stopwords are eliminated, the most frequent feature in Rammstein texts. Among nouns denoting elementary states and matters, like "Wasser", "Sonne", "Licht" and "Lust", "Liebe" looks surprising. Usually, we do not consider love an elementary feeling. Rammstein, it seems, do.

But still, we have to be cautious: we do not even know whether the feature "liebe" corresponds to a noun or to a verb.

With a case-sensitive context search we obtain:

"Liebe"   28 occurrences

"liebe"   18 occurrences

In "Was ich liebe" (2019): "ich liebe nicht, dass ich was liebe"; in "OK" (2019): "ich liebe dich, wenn du mich lässt". In the same song, we hear "was sich liebt, das darf sich lecken". These are various forms of the same verb, counted separately. Checking them one by one, we get 0 occurrences of "liebst" (second person) and 13 of "liebt" (third person). But to make sense of these numbers, we need a comparison to other words, not in their individual surface forms, but in some basic form. Which one? How?
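The counting itself can be sketched in base R, independently of any package; the token vector here is a made-up stand-in, not the real corpus:

```r
# Toy token vector standing in for the real Rammstein tokens (illustrative only)
toks <- c("Liebe", "liebe", "liebt", "Liebe", "lieben", "liebt")

sum(toks == "Liebe")            # case-sensitive: capitalized forms only -> 2
sum(toks == "liebt")            # exact third-person verb form          -> 2
sum(tolower(toks) == "liebe")   # case-insensitive total for "liebe"    -> 3
```

The case-insensitive total is simply the sum of the case-sensitive counts, which is why 28 + 18 matches the 46 occurrences counted before.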

The first procedure usually proposed in R text analysis is called "stemming". Given how we understand stems in German, we would expect reductions like "gesehen" to "sah", or "gebrochenem" to "brach". But stemming in data science essentially means cutting off certain endings.

The features "liebe", "wildes", "tier", "gebrochenem" and "herzen" are simply transformed to "lieb", "wild", "tier", "gebroch" and "herz". The Snowball stemmer does not even recognize -t as a conjugation ending, leaving "liebt" untouched. This looks pretty useless for the purpose of text analysis; the textstat_frequency command indeed gives the same result as before.
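This behavior is easy to verify directly with SnowballC (assuming the package is installed); the stems in the comment are the ones reported above:

```r
library(SnowballC)

# German Snowball stemming only strips declension endings like -e, -es, -em, -en
wordStem(c("liebe", "wildes", "gebrochenem", "herzen", "liebt"),
         language = "german")
# -> "lieb" "wild" "gebroch" "herz" "liebt"   ("liebt" stays untouched)
```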

On its own, a computer program cannot derive a stem like "brach" or "brechen" from "gebrochen".

What we need has the form of a dictionary: entries that let the program connect surface features to underlying base forms. This way, we can reduce the number of different words to meaningful units. This procedure is called lemmatization. Under the name UDPipe, Charles University in Prague offers a lemmatizing and parsing program for German.
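The dictionary idea can be sketched as a plain lookup table in base R; this is a toy dictionary for illustration, while UDPipe itself uses a trained statistical model rather than a fixed table:

```r
# Toy lemma dictionary: surface form -> base form (illustrative only)
lemma_dict <- c(sah = "sehen", gebrochen = "brechen", liebt = "lieben")

features <- c("sah", "liebt", "gebrochen")
unname(lemma_dict[features])
# -> "sehen" "lieben" "brechen"
```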

For the first verse of the song "Adieu", the UDPipe test page returns:

# text = Nur der Tod währt alle Zeit
1  Nur    Nur    ADV   ADV    _                                                      3  advmod  _  TokenRange=0:3
2  der    der    DET   ART    Case=Nom|Gender=Masc|Number=Sing|PronType=Art          3  det     _  TokenRange=4:7
3  Tod    Tod    NOUN  NN     Gender=Masc|Number=Sing|Person=3                       4  nsubj   _  TokenRange=8:11
4  währt  währen VERB  VVFIN  Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin  0  root    _  TokenRange=12:17
5  alle   all    DET   PIDAT  Case=Acc|Gender=Fem|Number=Sing|Person=3               6  det     _  TokenRange=18:22
6  Zeit   Zeit   NOUN  NN     Gender=Fem|Number=Sing|Person=3                        4  obj     _  SpacesAfter=\r\n|TokenRange=23:27
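Each token line of this CoNLL-U output is tab-separated, so its columns can be pulled apart with base R; here we extract the lemma, which sits in the third column, from the line for "währt":

```r
# One CoNLL-U token line (copied from the UDPipe output, tabs as separators)
conllu_line <- paste("4", "währt", "währen", "VERB", "VVFIN",
                     "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
                     "0", "root", "_", "TokenRange=12:17", sep = "\t")

fields <- strsplit(conllu_line, "\t")[[1]]
fields[3]   # the lemma column -> "währen"
```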

A lot of material for textual analysis.

According to UDPipe, the most frequent lemmata in our Rammstein text collection are:

   feature  frequency  rank  docfreq  group
1    sehen         60     1       60    all
2      gut         57     2       57    all
3    gehen         51     3       51    all
4   kommen         44     4       44    all
5     mann         43     5       43    all
6    immer         42     6       42    all
7       ja         42     6       42    all
8   lieben         40     8       40    all
9     lieb         40     8       40    all
10   schön         40     8       40    all
11    kalt         38    11       38    all
12   licht         38    11       38    all
13    lust         37    13       37    all
14  sichen         36    14       36    all
15    herz         36    14       36    all

Still, the table reveals some problems. "lieb" and "lieben" seem to be counted as two separate lemmata. And a lemma "sichen" looks quite strange.

Looking up the rows with which(x$lemma=="sichen"), we find that the corresponding token is always "gesicht": the model guessed a verb lemma for a noun it did not know. Clever program, too little knowledge.
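On the annotation data frame, such a known error can simply be patched by hand before counting; the two-row data frame below is a toy stand-in for the as.data.frame(udpipe_annotate(...)) result:

```r
# Toy annotation data frame with the mislemmatized token (illustrative only)
x <- data.frame(token = c("Gesicht", "liebt"),
                lemma = c("sichen", "lieben"))

which(x$lemma == "sichen")                   # rows carrying the bogus lemma
x$lemma[x$lemma == "sichen"] <- "gesicht"    # patch the lemma by hand
x$lemma
# -> "gesicht" "lieben"
```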

A similar check for "lieb" and "lieben" shows that the latter indeed refers to the verb "lieben" and the former to the noun. Taken together as one lemma "lieb", this means 80 occurrences of the love element.
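Merging the two rows is a one-liner on the frequency table; the column names follow the textstat_frequency output above, and the table here is rebuilt by hand from it:

```r
# The two love rows from the lemma frequency table above
freq <- data.frame(feature = c("lieben", "lieb"),
                   frequency = c(40, 40))

sum(freq$frequency[freq$feature %in% c("lieb", "lieben")])
# -> 80
```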

First place for love, both as a verb and as noun. 

It is also evident that verbs numerically dominate Rammstein texts.


Technically

Stemming with SnowballC

library("SnowballC")

library("quanteda")

library("readtext")

basis <- corpus(readtext("/Users/Achim/Desktop/Rammstein/*.txt",
                         docvarsfrom = "filenames"))

toks_fund <- tokens(basis, remove_punct = TRUE, remove_numbers = TRUE,
                    remove_symbols = TRUE) %>%
  tokens_remove(pattern = stopwords("de"))

# SnowballC::wordStem expects a character vector; for a tokens object,
# quanteda's wrapper does the stemming token by token
tokens_wordstem(toks_fund, language = "german")



Lemmatizing with UDPipe

Straka, Milan (2020, February 28). UDPipe German: Morphosyntactic Analysis of Raw Text. Charles University. [Software (Tool/Service)].

library("udpipe")

library("quanteda")

library("quanteda.textstats")

library("readtext")

udmodel <- udpipe_download_model(language = "german")

udmodel <- udpipe_load_model(file = udmodel$file_model)

basis <- readtext("/Users/Achim/Desktop/Rammstein/*.txt",
                  docvarsfrom = "filenames")

basis <- corpus(basis)

toks_fund <- tokens(basis, remove_punct = TRUE, remove_numbers = TRUE,
                    remove_symbols = TRUE) %>%
  tokens_remove(pattern = stopwords("de")) %>%
  tokens_tolower()

# from: https://bnosac.github.io/udpipe/docs/doc2.html
stoff <- sapply(toks_fund, FUN = function(x){
  x <- gsub(" ", intToUtf8(160), x)  ## replace space with no-break space
  paste(x, collapse = " ")
})

x <- udpipe_annotate(udmodel, x = as.character(stoff), tokenizer = "horizontal")

x <- as.data.frame(x)

lemmata <- dfm(tokens(x$lemma))

textstat_frequency(lemmata, n = 20)





