
"Ich" and "du" as stopwords?

Usually, text analysis programs treat "ich" ("I", "me") as a stopword, a function word without special meaning. This is because grammars categorize "ich" as a pronoun. But, as Eugen Coseriu stated in his book "Introduction to the linguistics of texts" (German edition 1985), this does not correspond to the real use of the word.

While "he" in 

"Ralf is tired. He will go to bed soon" 

is a substitute, i.e. pro-noun, for "Ralf", "I" in 

"Ralf is tired. I am going to bed" 

is not. The "I" here is understood as a reference to a second person, the speaker, and not to Ralf. The same is true of "du" ("you"). "Ich" and "du" have a deictic function: they point to somebody. Hence, we should not treat them as function words or eliminate them as stopwords. Especially in literary texts, which have an expressive function (a view not limited to Habermas, 1981), "I", "me" and "you" can be rather important.

With "ich" and "du" removed from the list of quanteda stopwords, the word cloud looks different:


The impression differs from the earlier ones. It looks as if a dialogical space had opened, in which elementary things and feelings stroll around.


Further statistical representations:

As text statistics:

textstat_frequency(matrixa, n=20)

   feature  frequency  rank  docfreq  group
1  ich            523     1       66  all
2  du             207     2       45  all
3  liebe           46     3       11  all
4  mann            45     4       11  all
5  gut             44     5       18  all
6  komm            43     6        9  all
7  ja              43     6       12  all
8  immer           42     8       19  all
9  sonne           42     8       15  all
10 nacht           41    10       18  all
11 licht           39    11       25  all
12 lust            37    12        8  all
13 gesicht         36    13       22  all
14 mehr            36    13       16  all
15 wasser          35    15        9  all
16 nie             34    16       17  all
17 kommt           34    16       14  all
18 zeit            34    16       14  all
19 amerika         34    16        1  all
20 kalt            32    20        9  all


Relative frequency:

textstat_frequency(clean_rel, n=10)

   feature  frequency  rank  docfreq  group
1  ich      5.4455093     1       66  all
2  du       2.3364267     2       45  all
3  liebe    0.5116736     3       11  all
4  immer    0.4783033     4       19  all
5  komm     0.4408699     5        9  all
6  sonne    0.4369036     6       15  all
7  mann     0.4266868     7       11  all
8  licht    0.4250503     8       25  all
9  gut      0.4097395     9       18  all
10 mehr     0.4051275    10       16  all
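The post does not show how clean_rel was built. One plausible way, and this is only an assumption, is to weight the document-feature matrix by proportion with dfm_weight(); textstat_frequency() then adds up the per-document shares. A minimal sketch on two invented toy documents (toy_texts, toy_dfm and toy_rel are illustrative names, not the real corpus):

```r
library(quanteda)
library(quanteda.textstats)

# Two invented toy documents, not the actual song texts
toy_texts <- c(d1 = "ich du ich liebe",
               d2 = "du sonne")

toy_dfm <- dfm(tokens(toy_texts))

# Assumption: relative frequency = count divided by the document's token total
toy_rel <- dfm_weight(toy_dfm, scheme = "prop")

# The reported frequency is the sum of these shares over all documents
rel_freq <- textstat_frequency(toy_rel, n = 3)
rel_freq
```

Under this reading, a word that appears in many short texts can outrank a word with a higher raw count, which may be why the order of the two tables above differs slightly.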




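Technically

This is just what I added, with quanteda, to keep "ich" and "du" out of the stopword list:

library(quanteda)
library(magrittr)  # provides the %>% pipe
my_stopwords <- stopwords("de") %>%
  char_remove(pattern = c("ich", "du"))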
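How the adjusted list feeds into the frequency table is not spelled out here, so the following is only a sketch under assumptions: the two toy lines stand in for the song texts, and the tokens() / tokens_remove() / dfm() pipeline is a common quanteda pattern rather than necessarily the one used for matrixa. It assumes quanteda version 3 or later, where textstat_frequency() lives in quanteda.textstats.

```r
library(quanteda)
library(quanteda.textstats)

# Invented toy lines standing in for the song texts
toy_texts <- c(d1 = "Ich liebe dich und du liebst mich",
               d2 = "Du bist die Sonne und ich bin die Nacht")

# German default stopwords minus "ich" and "du"
my_stopwords <- char_remove(stopwords("de"), pattern = c("ich", "du"))

# Tokenize, lowercase, and drop the remaining stopwords
toks <- tokens(toy_texts, remove_punct = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, pattern = my_stopwords)

# Document-feature matrix in which "ich" and "du" survive
toy_dfm <- dfm(toks)
textstat_frequency(toy_dfm, n = 5)
```

Other pronouns such as "dich" and "mich" are still filtered out, because char_remove() with its default glob matching drops only the exact entries "ich" and "du" from the list.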

 





