Counting words can be helpful. With a simple R command (tokens_ngrams()), we also got lists of word pairs, triples and so on. But usually we want to know which words appear together within the same document. What about co-occurrences within the same song? Which words appear in the same songs? With the fcm() command in R we obtain a neat little table, more or less like the following.
features   ich    du   gut liebe  kalt bitte sonne
ich       2775  1289   474   550   352    25   143
du           0   920   209   183   279    25    19
gut          0     0    91    19     0     2    20
lust       627     1     2     0   280     0     1
deutschl   209   532     0    76   152     0   171
liebe                        140    55    41    11
kalt                                84    28     4
bitte                                     36     0
sonne                                          166
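To make the counting behind such a table concrete, here is a minimal base R sketch on three invented mini-songs. The texts and the names `songs`, `vocab`, `dtm` and `cooc` are made up for illustration and are not the Rammstein data. The sketch builds a song-by-word count matrix and multiplies it with itself, which is one simple way to get document-level co-occurrence counts. quanteda's fcm() may treat some cells (for example the diagonal) differently, so read this as an approximation of the idea rather than a reimplementation.

```r
# Toy data: three invented mini-songs (assumption, not the real lyrics)
songs <- list(
  s1 = c("ich", "du", "ich", "liebe"),
  s2 = c("du", "kalt"),
  s3 = c("ich", "kalt", "kalt")
)
vocab <- sort(unique(unlist(songs)))

# Song-by-word count matrix: one row per song, one column per word
dtm <- t(vapply(
  songs,
  function(s) tabulate(match(s, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
colnames(dtm) <- vocab

# Word-by-word co-occurrence: for each pair, sum over songs of count(a) * count(b)
cooc <- crossprod(dtm)
print(cooc)
```

Each off-diagonal cell says how strongly two words share songs, and a cell of 0 means the two words never appear in the same song.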
As a plot:

We see a bridge from "ich" to "du", to "lust" and "liebe", and a lighter one to "kalt". This really looks like the lexical core of Rammstein songs.
From the table alone, "liebe" seems to have no connection to "ich" and "du", while "lust" co-occurs with "ich" and "kalt". Reflect! Compare this with the "ich" row, which lists 550 for "liebe".
Note: Some of the above numbers are very high because co-occurrences are calculated within a V*V matrix (V being the number of distinct words). Hence the square of the term frequency is the basis, and very frequent words such as "ich" dominate. For more reliable results, we should calculate relative frequencies.
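The effect described in this note can be sketched on toy numbers. The matrix below is invented, and `raw`, `present`, `n_songs_both` and `rel` are hypothetical names used only here. The sketch contrasts raw products of counts with a presence-based count, and then turns the latter into a relative frequency, namely the share of songs in which both words appear. This is only one possible reading of "relative frequency". quanteda's fcm() also has a `count` argument, but whether one of its options matches this sketch exactly has not been checked here.

```r
# Toy song-by-word counts (invented numbers, not the real lyrics)
dtm <- matrix(
  c(10, 3, 0,
     0, 1, 2,
     4, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("s1", "s2", "s3"), c("ich", "du", "kalt"))
)

# Raw co-occurrence: products of counts, so frequent words dominate
raw <- crossprod(dtm)

# Presence/absence per song, then the number of songs containing both words
present <- (dtm > 0) * 1
n_songs_both <- crossprod(present)

# Relative frequency: share of songs in which both words appear
rel <- n_songs_both / nrow(dtm)

print(raw)
print(round(rel, 2))
```

In the raw matrix the frequent word "ich" produces by far the largest cells, while the relative version puts all pairs on the same 0 to 1 scale.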
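Technically
Packages: quanteda, readtext (in quanteda version 3 and later, textplot_network() lives in the separate package quanteda.textplots)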
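# Assumption: `basis` holds the song texts loaded earlier (e.g. a corpus built via readtext)
# %>% comes from magrittr; load it if it is not already available
# Keep "ich" and "du" in the analysis by dropping them from the German stopword list
my_stopwords <- stopwords("de") %>%
  char_remove(pattern = c("ich", "du"))
# Tokenise; remove punctuation, numbers, symbols and the remaining stopwords
toks_fund <- tokens(basis, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  tokens_remove(pattern = my_stopwords)
# Document-feature matrix, then feature co-occurrence matrix (co-occurrence within documents)
matrixa <- dfm(toks_fund)
toks_zusammen <- fcm(matrixa)
# Restrict the co-occurrence matrix to the 20 top features
feat <- names(topfeatures(toks_zusammen, 20))
fcmat_tokszus_subset <- toks_zusammen %>%
  fcm_select(pattern = feat)
# Set the seed before building the plot so that the network layout is reproducible
set.seed(134)
tplot_zusammen <- textplot_network(fcmat_tokszus_subset)
tplot_zusammen
# file = "" prints the table to the console; as.matrix() converts the sparse fcm first
write.table(as.matrix(fcmat_tokszus_subset), file = "")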