
Rammstein read by the machine (3): Liebe lieben

As we have seen, love ("liebe") is, once stopwords are removed, the most frequent feature in Rammstein's lyrics. Among nouns denoting elementary states and substances, such as "Wasser", "Sonne", "Licht" and "Lust", "Liebe" looks surprising. We do not usually consider love an elementary feeling. Rammstein, it seems, do.

Still, we have to be cautious: we do not even know whether the feature "liebe" corresponds to a noun or to a verb.

A case-sensitive context search yields:

"Liebe"   28 occurrences

"liebe"   18 occurrences

In "Was ich liebe" (2019) we hear "ich liebe nicht, dass ich was liebe", and in "OK" (2019) "ich liebe dich, wenn du mich lässt". The same song also has "was sich liebt, das darf sich lecken". These are different forms of the same verb, and they are counted separately. Checking them one by one gives 0 occurrences of "liebst" (second person) and 13 of "liebt" (third person). To make sense of these numbers, however, we need to compare words not in their individual surface forms but in some basic form. Which one, and how?
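The separate counting of surface forms can be reproduced on a small scale. The following is a minimal sketch in base R, assuming only the three lines quoted above as input rather than the full corpus; the names `zeilen`, `woerter`, `formen` and `anzahl` are made up for this illustration.

```r
# Toy input: only the three lines quoted in this post, not the full corpus.
zeilen <- c(
  "ich liebe nicht, dass ich was liebe",
  "ich liebe dich, wenn du mich lässt",
  "was sich liebt, das darf sich lecken"
)

# Split at whitespace and punctuation; case is deliberately left untouched.
woerter <- unlist(strsplit(zeilen, "[[:space:][:punct:]]+"))
woerter <- woerter[nzchar(woerter)]

# Count each surface form separately, exactly as written (case-sensitive).
formen <- c("Liebe", "liebe", "liebst", "liebt")
anzahl <- vapply(formen, function(f) sum(woerter == f), integer(1))
print(anzahl)
```

Even on these three lines, "liebe" and "liebt" end up in separate bins, although they belong to the same verb.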

The first procedure usually proposed for text analysis in R is called "stemming". Given how stems are understood in German grammar, we would expect a reduction such as "gesehen" to "sah" or "gebrochenem" to "brach". In data science, however, stemming essentially means cutting off some endings.

The features "liebe", "wildes", "tier", "gebrochenem", "herzen" are simply transformed to "lieb", "wild", "tier", "gebroch", "herz". Snowball does not even treat -t as a conjugation ending, so "liebt" is left untouched. For the purposes of this analysis, that is of little use: the textstat_frequency command indeed gives the same result as before.
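A minimal sketch of what the stemmer does to these features, assuming the SnowballC package is installed. It calls wordStem() directly on a plain character vector; the vector `features` just repeats the examples from this post and adds "liebt".

```r
# Example features taken from this post, plus the conjugated form "liebt".
features <- c("liebe", "wildes", "tier", "gebrochenem", "herzen", "liebt")

# Guarded so the script still runs where SnowballC is not installed.
if (requireNamespace("SnowballC", quietly = TRUE)) {
  # wordStem() expects a character vector; "de" selects the German stemmer.
  stems <- SnowballC::wordStem(features, language = "de")
  print(data.frame(feature = features, stem = stems))
}
```

Printing features and stems side by side makes it easy to see which endings are cut and which are not.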

On its own, a computer program cannot derive a stem like "brach" or a base form like "brechen" from "gebrochen".

What we need has the form of a dictionary whose entries let the program connect features to their underlying base forms. In this way, the number of different words can be reduced to a smaller set of meaningful units. This procedure is called lemmatization. Under the name UDPipe, Charles University in Prague offers a lemmatizing and parsing program that supports German.

For the first verse of the song "Adieu" on the UDPipe test page, we get: 

# text = Nur der Tod währt alle Zeit
1 Nur Nur ADV ADV _ 3 advmod _ TokenRange=0:3
2 der der DET ART Case=Nom|Gender=Masc|Number=Sing|PronType=Art 3 det _ TokenRange=4:7
3 Tod Tod NOUN NN Gender=Masc|Number=Sing|Person=3 4 nsubj _ TokenRange=8:11
4 währt währen VERB VVFIN Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root _ TokenRange=12:17
5 alle all DET PIDAT Case=Acc|Gender=Fem|Number=Sing|Person=3 6 det _ TokenRange=18:22
6 Zeit Zeit NOUN NN Gender=Fem|Number=Sing|Person=3 4 obj _ SpacesAfter=\r\n|TokenRange=23:27

A lot of material for textual analysis.

According to UDPipe, the most frequent lemmata in our Rammstein text collection are:

   feature frequency rank docfreq group
1    sehen        60    1      60   all
2      gut        57    2      57   all
3    gehen        51    3      51   all
4   kommen        44    4      44   all
5     mann        43    5      43   all
6    immer        42    6      42   all
7       ja        42    6      42   all
8   lieben        40    8      40   all
9     lieb        40    8      40   all
10   schön        40    8      40   all
11    kalt        38   11      38   all
12   licht        38   11      38   all
13    lust        37   13      37   all
14  sichen        36   14      36   all
15    herz        36   14      36   all

Still, the table reveals some problems. "lieben" and "lieb" appear as two separate entries, and a lemma "sichen" looks quite strange.

With which(x$lemma=="sichen") we find that the token behind "sichen" is always "gesicht"; apparently the model takes it for a past participle. Clever program, too little knowledge.

A similar query for "lieb" and "lieben" shows that the latter indeed refers to the verb "lieben" and the former to the noun. Taken together as one lemma "lieb", this makes 80 occurrences of the love element.
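Both checks can be sketched in base R. The data frame below is a hypothetical stand-in for the annotated `x`: only its `token` and `lemma` columns are mimicked, and the six rows are invented for illustration, so the counts are not the real ones. The column name `lemma_merged` is likewise an assumption of this sketch.

```r
# Hypothetical stand-in for the annotated data frame; rows are invented.
x <- data.frame(
  token = c("gesicht", "gesicht", "liebe", "liebt", "liebe", "herz"),
  lemma = c("sichen", "sichen", "lieb", "lieben", "lieben", "herz"),
  stringsAsFactors = FALSE
)

# Which surface tokens were assigned to the suspicious lemma "sichen"?
sichen_tokens <- table(x$token[x$lemma == "sichen"])
print(sichen_tokens)

# Merge the noun lemma "lieb" and the verb lemma "lieben" into one unit.
x$lemma_merged <- ifelse(x$lemma %in% c("lieb", "lieben"), "lieb", x$lemma)

# Frequency table of the merged lemmata, most frequent first.
lemma_counts <- sort(table(x$lemma_merged), decreasing = TRUE)
print(lemma_counts)
```

Merging by hand like this is a deliberate analytic decision: it gives up the noun/verb distinction that the lemmatizer had drawn.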

First place for love, as a verb and as a noun combined.

The table also suggests that verbs dominate Rammstein's lyrics numerically: "sehen", "gehen" and "kommen" take three of the first four places.
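How many verbs there are can be checked by tabulating parts of speech. A minimal sketch, assuming the annotated data frame has a `upos` column like the fourth field of the UDPipe output for "Adieu" shown earlier. Only the six tokens of that line are typed in by hand here, so the result says nothing about the whole corpus; the names `adieu`, `pos_counts` and `content_lemmata` are made up for this illustration.

```r
# The six tokens of "Nur der Tod währt alle Zeit", copied from the UDPipe output.
adieu <- data.frame(
  token = c("Nur", "der", "Tod", "währt", "alle", "Zeit"),
  lemma = c("Nur", "der", "Tod", "währen", "all", "Zeit"),
  upos  = c("ADV", "DET", "NOUN", "VERB", "DET", "NOUN"),
  stringsAsFactors = FALSE
)

# Number of tokens per part of speech.
pos_counts <- table(adieu$upos)
print(pos_counts)

# Lemmata of the content words only (nouns, verbs, adjectives).
content_lemmata <- adieu$lemma[adieu$upos %in% c("NOUN", "VERB", "ADJ")]
print(content_lemmata)
```

Applied to the full annotated data frame, the same table() call on its `upos` column would give corpus-wide counts.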


Technically

Stemming with SnowballC

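library(SnowballC)
library(quanteda)
library(readtext)
library(magrittr)  # provides %>%

# `basis` is the corpus built with readtext() and corpus() in the UDPipe section below
toks_fund <- tokens(basis, remove_punct = TRUE, remove_numbers = TRUE,
                    remove_symbols = TRUE) %>%
  tokens_remove(pattern = stopwords("de"))

# tokens_wordstem() applies the SnowballC stemmer to every token;
# wordStem() itself expects a plain character vector, not a tokens object
toks_stem <- tokens_wordstem(toks_fund, language = "de")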



Lemmatizing with UDPipe

Straka, Milan (2020, February 28). UDPipe German: Morphosyntactic Analysis of Raw Text. Charles University. [Software (Tool/Service)].

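library(quanteda)
library(quanteda.textstats)  # textstat_frequency() lives here in quanteda >= 3
library(readtext)
library(udpipe)
library(magrittr)            # provides %>%

# download the German model once, then load it
udmodel <- udpipe_download_model(language = "german")
udmodel <- udpipe_load_model(file = udmodel$file_model)

# read the song texts and build a corpus
basis <- readtext("/Users/Achim/Desktop/Rammstein/*.txt",
                  docvarsfrom = "filenames")
basis <- corpus(basis)

toks_fund <- tokens(basis, remove_punct = TRUE, remove_numbers = TRUE,
                    remove_symbols = TRUE) %>%
  tokens_remove(pattern = stopwords("de")) %>%
  tokens_tolower()

# from: https://bnosac.github.io/udpipe/docs/doc2.html
# paste each document's tokens back into one space-separated string
stoff <- sapply(as.list(toks_fund), FUN = function(x) {
  x <- gsub(" ", intToUtf8(160), x)  # replace space with no-break space
  paste(x, collapse = " ")
})

# "horizontal": the input is already tokenized, with tokens separated by spaces
x <- udpipe_annotate(udmodel, x = as.character(stoff), tokenizer = "horizontal")
x <- as.data.frame(x)

# count how often each lemma occurs
lemmata <- dfm(tokens(x$lemma))
textstat_frequency(lemmata, n = 20)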





