
Rammstein read by the machine (3): Liebe lieben

As we have seen, love ("liebe") is, once the stopwords are eliminated, the most frequent feature in Rammstein's texts. Among all the nouns indicating elementary states and matters, like "Wasser", "Sonne", "Licht" and "Lust", "liebe" looks surprising. Usually, we do not consider love an elementary feeling. Rammstein, it seems, do.

But still, we have to be cautious, as we do not even know whether the feature "liebe" corresponds to a noun or to a verb.

With a case-sensitive context search we obtain:

"Liebe"   28 times

 "liebe"   18 occurrences.

In "was ich liebe" (2019): "ich liebe nicht, dass ich was liebe", in "OK" (2019): "ich liebe dich, wenn du mich lässt". In the same song, we hear "was sich liebt, das darf sich lecken". Various forms of the same verb, which are counted separately. We can check them, one by one, and get: 0 "liebst" (second person) and 13 "liebt"(third person). But in order to understand these numbers, here we need a comparison to other words, not in their singular appearances, but in some basic form. Which one? How?

The first procedure usually proposed for text analysis in R is called "stemming". In the way we understand stems in German, we would expect a reduction like "gesehen" to "sah", or "gebrochenem" to "brach". But stemming in data science essentially means cutting off certain endings.

The features "liebe",  "wildes", "tier", "gebrochenem",  "herzen" are simply transformed to "lieb",  "wild",  "gebroch",  "herz".  Snowball does not even recognize -t as conjugation ending, leaving "liebt" untouched. This looks pretty useless for the purpose of text analysis. The textstat_frequency command indeed gives the same result as before. 

On its own, a computer program cannot derive a stem like "brach" or "brechen" from "gebrochen".

What we need should have the form of a dictionary, whose entries let the program find the connection between features and their underlying elements. This way, we could reduce the number of different words to some meaningful units. This procedure is called lemmatization. Under the name UDPipe, Charles University in Prague offers a lemmatizing and parsing program for German.

For the first verse of the song "Adieu", the UDPipe test page gives:

# text = Nur der Tod währt alle Zeit
1  Nur    Nur    ADV   ADV    _                                                      3  advmod  _  TokenRange=0:3
2  der    der    DET   ART    Case=Nom|Gender=Masc|Number=Sing|PronType=Art          3  det     _  TokenRange=4:7
3  Tod    Tod    NOUN  NN     Gender=Masc|Number=Sing|Person=3                       4  nsubj   _  TokenRange=8:11
4  währt  währen VERB  VVFIN  Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin  0  root    _  TokenRange=12:17
5  alle   all    DET   PIDAT  Case=Acc|Gender=Fem|Number=Sing|Person=3               6  det     _  TokenRange=18:22
6  Zeit   Zeit   NOUN  NN     Gender=Fem|Number=Sing|Person=3                        4  obj     _  SpacesAfter=\r\n|TokenRange=23:27

A lot of material for textual analysis.
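The same annotation can be reproduced locally with the udpipe R package; here is a minimal sketch (model download and loading as in the "Technically" section below):

library("udpipe")

udmodel <- udpipe_download_model(language = "german")
udmodel <- udpipe_load_model(file = udmodel$file_model)

x <- udpipe_annotate(udmodel, x = "Nur der Tod währt alle Zeit")
as.data.frame(x)[, c("token", "lemma", "upos", "feats")]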

According to UDPipe, the most frequent lemmata in our Rammstein text collection are:

   feature frequency rank docfreq group
1    sehen        60    1      60   all
2      gut        57    2      57   all
3    gehen        51    3      51   all
4   kommen        44    4      44   all
5     mann        43    5      43   all
6    immer        42    6      42   all
7       ja        42    6      42   all
8   lieben        40    8      40   all
9     lieb        40    8      40   all
10   schön        40    8      40   all
11    kalt        38   11      38   all
12   licht        38   11      38   all
13    lust        37   13      37   all
14  sichen        36   14      36   all
15    herz        36   14      36   all

Still, the table reveals some problems: "lieben" and "lieb" seem to be counted twice, and a lemma "sichen" looks quite strange.

With which(x$lemma=="sichen") we find that the lemma "sichen" always corresponds to the token "gesicht". Clever program, too little knowledge.

A similar request for "lieb" and "lieben" shows that the latter indeed refers to the verb "lieben" and the former to the noun. Taken together as one lemma "lieb", this means 80 occurrences of the love element.
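Both checks in code, assuming x is the annotated data frame produced in the "Technically" section:

# which surface tokens hide behind the strange lemma "sichen"?
unique(x$token[which(x$lemma == "sichen")])   # "gesicht"

# and which ones behind "lieben" and "lieb"?
table(x$token[which(x$lemma == "lieben")])    # the verb forms
table(x$token[which(x$lemma == "lieb")])      # the noun, per the discussion above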

First place for love, both as a verb and as a noun.

It is also evident that verbs numerically dominate Rammstein's texts.


Technically

Stemming with SnowballC

library("SnowballC")
library("quanteda")
library("quanteda.textstats")   # textstat_frequency lives here in quanteda >= 3
library("readtext")

# 'basis' is the corpus of song texts built in the UDPipe section below
toks_fund <- tokens(basis, remove_punct = TRUE, remove_numbers = TRUE,
                    remove_symbols = TRUE) %>%
  tokens_remove(pattern = stopwords("de"))

# quanteda's wrapper around the Snowball stemmer
toks_stem <- tokens_wordstem(toks_fund, language = "german")
textstat_frequency(dfm(toks_stem), n = 20)



Lemmatizing with UDPipe

Straka, Milan (2020, February 28). UDPipe German: Morphosyntactic Analysis of Raw Text. Charles University. [Software (Tool/Service)].

library ("quanteda")

udmodel <- udpipe_download_model(language = "german")

udmodel <- udpipe_load_model(file = udmodel$file_model)

basis <-readtext("/Users/Achim/Desktop/Rammstein/*.txt",

docvarsfrom = "filenames")

basis <- corpus(basis)

toks_fund <- tokens(basis, remove_punct = TRUE, remove_numbers = TRUE, remove_symbol

= TRUE)%>%

tokens_remove(pattern = stopwords("de"))%>%

tokens_tolower()

#from: https://bnosac.github.io/udpipe/docs/doc2.html

stoff <- sapply(toks_fund, FUN=function(x){

x <- gsub(" ", intToUtf8(160), x) ## replace space with no-break-space

paste(x, collapse = " ")

})

x <- udpipe_annotate(udmodel, x = as.character(stoff), tokenizer = "horizontal")

x <- as.data.frame(x)

lemmata <- dfm(tokens(x$lemma))

textstat_frequency(lemmata, n=20)





