As we have seen, once the stopwords are eliminated, "liebe" (love) is the most frequent feature in Rammstein's texts. Among the nouns denoting elementary states and matters, such as "Wasser", "Sonne", "Licht" and "Lust", "Liebe" stands out: we do not usually consider love an elementary feeling. Rammstein, it seems, do.
Still, we have to be cautious, as we do not even know whether the feature "liebe" corresponds to the noun or to the verb.
A case-sensitive context search yields:
"Liebe": 28 occurrences
"liebe": 18 occurrences
In "was ich liebe" (2019): "ich liebe nicht, dass ich was liebe", in "OK" (2019): "ich liebe dich, wenn du mich lässt". In the same song, we hear "was sich liebt, das darf sich lecken". Various forms of the same verb, which are counted separately. We can check them, one by one, and get: 0 "liebst" (second person) and 13 "liebt"(third person). But in order to understand these numbers, here we need a comparison to other words, not in their singular appearances, but in some basic form. Which one? How?
The first procedure usually proposed for text analysis in R is called "stemming". From the way we understand stems in German, we would expect a reduction like "gesehen" to "sah", or "gebrochenem" to "brach". But stemming in data science essentially means cutting off certain endings.
The features "liebe", "wildes", "tier", "gebrochenem" and "herzen" are simply transformed to "lieb", "wild", "tier", "gebroch" and "herz". Snowball does not even recognize -t as a conjugation ending, leaving "liebt" untouched. For the purpose of text analysis this looks rather useless; indeed, the textstat_frequency command gives the same result as before.
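This behaviour can be reproduced directly with SnowballC's stemmer; a minimal sketch on the example words from the text:

```r
library(SnowballC)

# Snowball's German stemmer only cuts off certain endings.
# It does not reduce strong forms like "gebrochenem" to "brach"/"brech-",
# and it does not treat -t as a conjugation ending, so "liebt" survives.
wordStem(c("liebe", "wildes", "tier", "gebrochenem", "herzen", "liebt"),
         language = "german")
# As described above: "lieb" "wild" "tier" "gebroch" "herz" "liebt"
```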
On its own, a computer program cannot derive a stem like "brach" or "brechen" from "gebrochen".
What we need has the form of a dictionary: entries that let the program connect surface features to their underlying elements. This way, the number of distinct words is reduced to meaningful units. The procedure is called lemmatization. Under the name UDPipe, Charles University in Prague offers a lemmatizing and parsing program for German.
For the first verse of the song "Adieu" on the UDPipe test page, we get:
# text = Nur der Tod währt alle Zeit
1  Nur    Nur    ADV   ADV    _                                                       3  advmod  _  TokenRange=0:3
2  der    der    DET   ART    Case=Nom|Gender=Masc|Number=Sing|PronType=Art           3  det     _  TokenRange=4:7
3  Tod    Tod    NOUN  NN     Gender=Masc|Number=Sing|Person=3                        4  nsubj   _  TokenRange=8:11
4  währt  währen VERB  VVFIN  Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   0  root    _  TokenRange=12:17
5  alle   all    DET   PIDAT  Case=Acc|Gender=Fem|Number=Sing|Person=3                6  det     _  TokenRange=18:22
6  Zeit   Zeit   NOUN  NN     Gender=Fem|Number=Sing|Person=3                         4  obj     _  SpacesAfter=\r\n|TokenRange=23:27
A lot of material for textual analysis.
According to UDPipe, the most frequent lemmata in our Rammstein text collection are:
feature frequency rank docfreq group
1 sehen 60 1 60 all
2 gut 57 2 57 all
3 gehen 51 3 51 all
4 kommen 44 4 44 all
5 mann 43 5 43 all
6 immer 42 6 42 all
7 ja 42 6 42 all
8 lieben 40 8 40 all
9 lieb 40 8 40 all
10 schön 40 8 40 all
11 kalt 38 11 38 all
12 licht 38 11 38 all
13 lust 37 13 37 all
14 sichen 36 14 36 all
15 herz 36 14 36 all
Still, the table evidences some problems. "lieben" and "lieb" seem to be counted separately, and a lemma "sichen" looks quite strange.
With which(x$lemma == "sichen") we find that behind the lemma "sichen" there is always the token "gesicht". A clever program, but with too little knowledge.
A similar query for "lieb" and "lieben" shows that the latter indeed refers to the verb "lieben" and the former to the noun "Liebe". Merged into one lemma "lieb", this gives 80 occurrences of the love element.
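These checks and the merging can be sketched on the annotated data frame x returned by udpipe_annotate() (see the code at the end of the post); the counts shown in comments are the ones reported above:

```r
# 'x' is the data.frame obtained via as.data.frame(udpipe_annotate(...)).

# Which token underlies the odd lemma "sichen"?
unique(x$token[x$lemma == "sichen"])   # "gesicht"

# Count noun and verb occurrences of the love element separately ...
sum(x$lemma == "lieb")                 # 40 (noun, "Liebe")
sum(x$lemma == "lieben")               # 40 (verb, "lieben")

# ... then merge them into one lemma before counting again:
x$lemma[x$lemma == "lieben"] <- "lieb"
sum(x$lemma == "lieb")                 # 80 occurrences in total
```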
First place for love, both as a verb and as noun.
It is also evident that verbs numerically dominate Rammstein's texts.
Technically
Stemming with SnowballC
library(SnowballC)
library(quanteda)
library(readtext)
# 'basis' is the corpus built from the lyrics files (see below)
toks_fund <- tokens(basis, remove_punct = TRUE, remove_numbers = TRUE,
                    remove_symbols = TRUE) %>%
  tokens_remove(pattern = stopwords("de"))
toks_stemmed <- tokens_wordstem(toks_fund, language = "german")
Lemmatizing with UDPipe
Straka, Milan (2020, February 28). UDPipe German: Morphosyntactic Analysis of Raw Text. Charles University. [Software (Tool/Service)].
library(udpipe)
library(quanteda)
library(quanteda.textstats)
library(readtext)
udmodel <- udpipe_download_model(language = "german")
udmodel <- udpipe_load_model(file = udmodel$file_model)
basis <- readtext("/Users/Achim/Desktop/Rammstein/*.txt",
                  docvarsfrom = "filenames")
basis <- corpus(basis)
toks_fund <- tokens(basis, remove_punct = TRUE, remove_numbers = TRUE,
                    remove_symbols = TRUE) %>%
  tokens_remove(pattern = stopwords("de")) %>%
  tokens_tolower()
# from: https://bnosac.github.io/udpipe/docs/doc2.html
stoff <- sapply(toks_fund, FUN = function(x) {
  x <- gsub(" ", intToUtf8(160), x)  # replace space with no-break space
  paste(x, collapse = " ")
})
x <- udpipe_annotate(udmodel, x = as.character(stoff), tokenizer = "horizontal")
x <- as.data.frame(x)
lemmata <- dfm(tokens(x$lemma))
textstat_frequency(lemmata, n = 20)