
Rammstein read by the machine (3): Liebe lieben

As we have seen, love ("liebe") is, once the stopwords are eliminated, the most frequent feature in Rammstein texts. Among nouns denoting elementary states and matters, like "Wasser", "Sonne", "Licht" and "Lust", "Liebe" looks surprising. Usually, we do not consider love an elementary feeling. Rammstein, it seems, do.

But still, we have to be cautious: we do not even know whether the feature "liebe" corresponds to a noun or to a verb.

With a case-sensitive context search we obtain:

"Liebe"   28 occurrences

"liebe"   18 occurrences

In "Was ich liebe" (2019): "ich liebe nicht, dass ich was liebe"; in "OK" (2019): "ich liebe dich, wenn du mich lässt". In the same song, we hear "was sich liebt, das darf sich lecken". These are various forms of the same verb, counted separately. Checking them one by one, we get 0 occurrences of "liebst" (second person) and 13 of "liebt" (third person). But to make sense of these numbers, we need a comparison to other words, not in their individual surface forms, but in some basic form. Which one? How?
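The counting itself can be sketched in base R, independently of any package; the token vector here is a made-up stand-in, not the real corpus:

```r
# Toy token vector standing in for the real Rammstein tokens (illustrative only)
toks <- c("Liebe", "liebe", "liebt", "Liebe", "lieben", "liebt")

sum(toks == "Liebe")            # case-sensitive: capitalized forms only -> 2
sum(toks == "liebt")            # exact third-person verb form          -> 2
sum(tolower(toks) == "liebe")   # case-insensitive total for "liebe"    -> 3
```

The case-insensitive total is simply the sum of the case-sensitive counts, which is why 28 + 18 matches the 46 occurrences counted before.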

The first procedure usually proposed in R text analysis is called "stemming". Given how we understand stems in German, we would expect reductions like "gesehen" to "sah", or "gebrochenem" to "brach". But stemming in data science essentially means cutting off certain endings.

The features "liebe", "wildes", "tier", "gebrochenem" and "herzen" are simply transformed to "lieb", "wild", "tier", "gebroch" and "herz". The Snowball stemmer does not even recognize -t as a conjugation ending, leaving "liebt" untouched. This looks pretty useless for the purpose of text analysis; the textstat_frequency command indeed gives the same result as before.
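This behavior is easy to verify directly with SnowballC (assuming the package is installed); the stems in the comment are the ones reported above:

```r
library(SnowballC)

# German Snowball stemming only strips declension endings like -e, -es, -em, -en
wordStem(c("liebe", "wildes", "gebrochenem", "herzen", "liebt"),
         language = "german")
# -> "lieb" "wild" "gebroch" "herz" "liebt"   ("liebt" stays untouched)
```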

On its own, a computer program cannot derive a stem like "brach" or "brechen" from "gebrochen".

What we need has the form of a dictionary: entries that let the program connect surface features to underlying base forms. This way, we can reduce the number of different words to meaningful units. This procedure is called lemmatization. Under the name UDPipe, Charles University in Prague offers a lemmatizing and parsing program for German.
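The dictionary idea can be sketched as a plain lookup table in base R; this is a toy dictionary for illustration, while UDPipe itself uses a trained statistical model rather than a fixed table:

```r
# Toy lemma dictionary: surface form -> base form (illustrative only)
lemma_dict <- c(sah = "sehen", gebrochen = "brechen", liebt = "lieben")

features <- c("sah", "liebt", "gebrochen")
unname(lemma_dict[features])
# -> "sehen" "lieben" "brechen"
```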

For the first verse of the song "Adieu", the UDPipe test page returns:

# text = Nur der Tod währt alle Zeit
1  Nur    Nur    ADV   ADV    _                                                      3  advmod  _  TokenRange=0:3
2  der    der    DET   ART    Case=Nom|Gender=Masc|Number=Sing|PronType=Art          3  det     _  TokenRange=4:7
3  Tod    Tod    NOUN  NN     Gender=Masc|Number=Sing|Person=3                       4  nsubj   _  TokenRange=8:11
4  währt  währen VERB  VVFIN  Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin  0  root    _  TokenRange=12:17
5  alle   all    DET   PIDAT  Case=Acc|Gender=Fem|Number=Sing|Person=3               6  det     _  TokenRange=18:22
6  Zeit   Zeit   NOUN  NN     Gender=Fem|Number=Sing|Person=3                        4  obj     _  SpacesAfter=\r\n|TokenRange=23:27
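Each token line of this CoNLL-U output is tab-separated, so its columns can be pulled apart with base R; here we extract the lemma, which sits in the third column, from the line for "währt":

```r
# One CoNLL-U token line (copied from the UDPipe output, tabs as separators)
conllu_line <- paste("4", "währt", "währen", "VERB", "VVFIN",
                     "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
                     "0", "root", "_", "TokenRange=12:17", sep = "\t")

fields <- strsplit(conllu_line, "\t")[[1]]
fields[3]   # the lemma column -> "währen"
```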

A lot of material for textual analysis.

According to UDPipe, the most frequent lemmata in our Rammstein text collection are:

   feature  frequency  rank  docfreq  group
1    sehen         60     1       60    all
2      gut         57     2       57    all
3    gehen         51     3       51    all
4   kommen         44     4       44    all
5     mann         43     5       43    all
6    immer         42     6       42    all
7       ja         42     6       42    all
8   lieben         40     8       40    all
9     lieb         40     8       40    all
10   schön         40     8       40    all
11    kalt         38    11       38    all
12   licht         38    11       38    all
13    lust         37    13       37    all
14  sichen         36    14       36    all
15    herz         36    14       36    all

Still, the table reveals some problems. "lieb" and "lieben" seem to be counted as two separate lemmata. And a lemma "sichen" looks quite strange.

Looking up the rows with which(x$lemma=="sichen"), we find that the corresponding token is always "gesicht": the model guessed a verb lemma for a noun it did not know. Clever program, too little knowledge.
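On the annotation data frame, such a known error can simply be patched by hand before counting; the two-row data frame below is a toy stand-in for the as.data.frame(udpipe_annotate(...)) result:

```r
# Toy annotation data frame with the mislemmatized token (illustrative only)
x <- data.frame(token = c("Gesicht", "liebt"),
                lemma = c("sichen", "lieben"))

which(x$lemma == "sichen")                   # rows carrying the bogus lemma
x$lemma[x$lemma == "sichen"] <- "gesicht"    # patch the lemma by hand
x$lemma
# -> "gesicht" "lieben"
```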

A similar check for "lieb" and "lieben" shows that the latter indeed refers to the verb "lieben" and the former to the noun. Taken together as one lemma "lieb", this means 80 occurrences of the love element.
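Merging the two rows is a one-liner on the frequency table; the column names follow the textstat_frequency output above, and the table here is rebuilt by hand from it:

```r
# The two love rows from the lemma frequency table above
freq <- data.frame(feature = c("lieben", "lieb"),
                   frequency = c(40, 40))

sum(freq$frequency[freq$feature %in% c("lieb", "lieben")])
# -> 80
```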

First place for love, both as a verb and as noun. 

It is also evident that verbs numerically dominate Rammstein texts.


Technically

Stemming with SnowballC

library("SnowballC")

library("quanteda")

library("readtext")

basis <- corpus(readtext("/Users/Achim/Desktop/Rammstein/*.txt",
                         docvarsfrom = "filenames"))

toks_fund <- tokens(basis, remove_punct = TRUE, remove_numbers = TRUE,
                    remove_symbols = TRUE) %>%
  tokens_remove(pattern = stopwords("de"))

# SnowballC::wordStem expects a character vector; for a tokens object,
# quanteda's wrapper does the stemming token by token
tokens_wordstem(toks_fund, language = "german")



Lemmatizing with UDPipe

Straka, Milan (2020, February 28). UDPipe German: Morphosyntactic Analysis of Raw Text. Charles University. [Software (Tool/Service)].

library("udpipe")

library("quanteda")

library("quanteda.textstats")

library("readtext")

udmodel <- udpipe_download_model(language = "german")

udmodel <- udpipe_load_model(file = udmodel$file_model)

basis <- readtext("/Users/Achim/Desktop/Rammstein/*.txt",
                  docvarsfrom = "filenames")

basis <- corpus(basis)

toks_fund <- tokens(basis, remove_punct = TRUE, remove_numbers = TRUE,
                    remove_symbols = TRUE) %>%
  tokens_remove(pattern = stopwords("de")) %>%
  tokens_tolower()

# from: https://bnosac.github.io/udpipe/docs/doc2.html
stoff <- sapply(toks_fund, FUN = function(x){
  x <- gsub(" ", intToUtf8(160), x)  ## replace space with no-break space
  paste(x, collapse = " ")
})

x <- udpipe_annotate(udmodel, x = as.character(stoff), tokenizer = "horizontal")

x <- as.data.frame(x)

lemmata <- dfm(tokens(x$lemma))

textstat_frequency(lemmata, n = 20)





