Tags: r, dictionary, twitter, nlp, quanteda

Computing relative frequencies based on dictionary


I'd like to examine the Psychological Capital (a construct consisting of four dimensions, namely hope, optimism, efficacy and resiliency) of founders using computer-aided text analysis in R. So far I have pulled tweets from various users into R. The data frame consists of 2130 tweets from 5 different users in different periods and is called before_failure.

I then used the quanteda package to create a corpus, performed tokenization on it, and removed redundant punctuation, numbers and symbols:

#Creating a corpus
before_failure_corpus <- corpus(before_failure, text_field = "text")

#Tokenization, removing punctuation, numbers and symbols, then lowercasing
tok_before_failure <- before_failure_corpus %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>% 
  tokens_tolower()

After that I created a dictionary, also using the quanteda package (the word lists themselves were compiled by other authors examining Psychological Capital):


#Creating Dictionary with quanteda
dict <- dictionary(list(hope = c("Accomplishments", "Achievements", "Approach", "Aspiration", "Aspire", "Aspired",
                                 "Aspirer", "Aspires", "Aspiring", "Aspiringly", "Assurance", "Assurances", "Assure",
                                 "Assured", "Assuredly", "Assuredness", "Assuring", "Assuringly", "Assuringness", "Belief",
                                 "Believe", "Believed", "Believes", "Believing", "Breakthrough", "Certain", "Certainly",
                                 "Certainty", "Committed", "Concept", "Confidence", "Confident", "Confidently",
                                 "Convinced", "Dare say", "Deduce", "Deduced", "Deduces", "Deducing", "Desire",
                                 "Desired", "Desires", "Desiring", "Doubt not", "Energy", "Engage", "Engagement",
                                 "Expectancy", "Faith", "Foresaw", "Foresee", "Foreseeing", "Foreseen", "Foresees", "Goal",
                                 "Goals", "Hearten", "Heartened", "Heartening", "Hearteningly", "Heartens", "Hope",
                                 "Hoped", "Hopeful", "Hopefully", "Hopefulness", "Hoper", "Hopes", "Hoping", "Idea",
                                 "Innovation", "Innovative", "Ongoing", "Opportunity", "Promise", "Promising",
                                 "Propitious", "Propitiously", "Propitiousness", "Solution", "Solutions", "Upbeat",
                                 "Wishes", "Wishing", "Yearn", "Yearn for", "Yearning", "Yearning for", "Yearns for"),
                       efficacy = c("Ability", "Accomplish", "Accomplished", "Accomplishes", "Accomplishing",
                                    "Accomplishments", "Achievements", "Achieving", "Adept", "Adeptly", "Adeptness",
                                    "Adroitly", "Adroitness", "All-in", "Aplomb", "Arrogance", "Arrogant", "Arrogantly",
                                    "Assurance", "Assured", "Assuredly", "Assuredness", "Backbone", "Bandwidth", "Belief",
                                    "Capable", "Capableness", "Capably", "Certain", "Certainly", "Certainness", "Certainty",
                                    "Certitude", "Cocksurely", "Cocksureness", "Cocky", "Commitment", "Commitments",
                                    "Committed", "Compelling", "Competence", "Competency", "Competent", "Competently",
                                    "Confidence", "Confident", "Confidently", "Conviction", "Effective", "Effectively",
                                    "Effectiveness", "Effectual", "Effectually", "Effectualness", "Efficacious", "Efficaciously",
                                    "Efficaciousness", "Efficacy", "Equanimity", "Equanimous", "Equanimously", "Expertise",
                                    "Expertly", "Fortitude", "Fortitudinous", "Forward", "Forwardness", "Know-how",
                                    "Knowledgability", "Knowledgeable", "Knowledgably", "Masterful", "Masterfully", "Masterfulness",
                                    "Masterly", "Mastery", "Overconfidence", "Overconfident", "Overconfidently",
                                    "Persuasion", "Power", "Powerful", "Powerfully", "Powerfulness", "Prevailed",
                                    "Prevailing", "Prevails", "Prevalence", "Prevalent", "Reassurance", "Reassure", "Reassured",
                                    "Reassures", "Reassuring", "Self-assurance", "Self-assured", "Self-assuring", "Self-confidence",
                                    "Self-confident", "Self-dependence", "Self-dependent", "Self-reliance",
                                    "Self-reliant", "Stamina", "Steadily", "Steadiness", "Steady", "Strength", "Strong", "Stronger",
                                    "Strongish", "Strongly", "Strongness", "Superior", "Superiority", "Sure", "Surely", "Sureness",
                                    "Unblinking", "Unblinkingly", "Undoubtedly", "Undoubting", "Unflappability", "Unflappable",
                                    "Unflinching", "Unflinchingly", "Unhesitating", "Unhesitatingly", "Unwavering",
                                    "Unwaveringly"),
                       resiliency = c("Adamant", "Adamantly", "Assiduous", "Assiduously", "Assiduousness", "Backbone",
                                      "Bandwidth", "Bears up", "Bounce", "Bounced", "Bounces", "Bouncing", "Buoyant",
                                      "Commitment", "Commitments", "Committed", "Consistent", "Determination",
                                      "Determined", "Determinedly", "Determinedness", "Devoted", "Devotedly",
                                      "Devotedness", "Devotion", "Die trying", "Died trying", "Dies trying", "Disciplined",
                                      "Dogged", "Doggedly", "Doggedness", "Drudge", "Drudged", "Drudges", "Endurance",
                                      "Endure", "Endured", "Endures", "Enduring", "Grit", "Hammer away", "Hammered away",
                                      "Hammering away", "Hammers away", "Held fast", "Held good", "Held up", "Hold fast",
                                      "Holding fast", "Holding up", "Holds fast", "Holds good", "Immovability", "Immovable",
                                      "Immovably", "Indefatigable", "Indefatigableness", "Indefatigably", "Indestructibility",
                                      "Indestructible", "Indestructibleness", "Indestructibly", "Intransigence", "Intransigency",
                                      "Intransigent", "Keep at", "Keep going", "Keep on", "Keeping at", "Keeping going",
                                      "Keeping on", "Keeps at", "Keeps going", "Keeps on", "Kept at", "Kept going", "Kept on",
                                      "Labored", "Laboring", "Never-tiring", "Never-wearying", "Perdure", "Perdured", "Perduring",
                                      "Perseverance", "Persevere", "Persevered", "Persevering", "Persist", "Persisted",
                                      "Persistence", "Persistent", "Persisting", "Pertinacious", "Pertinaciously", "Pertinacity",
                                      "Rebound", "Rebounded", "Rebounding", "Rebounds", "Relentlessness", "Remain",
                                      "Remained", "Remaining", "Remains", "Resilience", "Resiliency", "Resilient", "Resolute",
                                      "Resolutely", "Resoluteness", "Resolve", "Resolved", "Resolves", "Resolving", "Robust",
                                      "Sedulity", "Sedulous", "Sedulously", "Sedulousness", "Snap back", "Snapped back",
                                      "Snapping back", "Snaps back", "Spring back", "Springing back", "Springs", "Springs back",
                                      "Sprung back", "Stalwart", "Stalwartly", "Stalwartness", "Stand fast", "Stand firm", "Standing fast",
                                      "Standing firm", "Stands fast", "Stands firm", "Stay", "Steadfast", "Steadfastly",
                                      "Steadfastness", "Stood fast", "Stood firm", "Strove", "Survive", "Surviving", "Surviving",
                                      "Tenacious", "Tenaciously", "Tenaciousness", "Tenacity", "Tough", "Uncompromising",
                                      "Uncompromisingly", "Uncompromisingness", "Unfaltering", "Unfalteringly", "Unflagging",
                                      "Unrelenting", "Unrelentingly", "Unrelentingness", "Unshakable", "Unshakablely",
                                      "Unshakeable", "Unshaken", "Unshaking", "Unswervable", "Unswerved", "Unswerving",
                                      "Unswervingly", "Unswervingness", "Untiring", "Unwavered", "Unwavering", "Unweariedness",
                                      "Unyielding", "Unyieldingly", "Unyieldingness", "Upheld", "Uphold", "Upholding",
                                      "Upholds", "Zeal", "Zealous", "Zealously", "Zealousness"),
                       optimism = c("Aspire", "Aspirer", "Aspires", "Aspiring", "Aspiringly", "Assurance", "Assured", "Assuredly",
                                    "Assuredness", "Assuring", "Auspicious", "Auspiciously", "Auspiciousness", "Bank on",
                                    "Beamish", "Believe", "Believed", "Believes", "Believing", "Bullish", "Bullishly", "Bullishness",
                                    "Confidence", "Confident", "Confidently", "Encourage", "Encouraged", "Encourages",
                                    "Encouraging", "Encouragingly", "Ensuring", "Expectancy", "Expectant", "Expectation",
                                    "Expectations", "Expected", "Expecting", "Faith", "Good omen", "Hearten", "Heartened",
                                    "Heartener", "Heartening", "Hearteningly", "Heartens", "Hope", "Hoped", "Hopeful",
                                    "Hopefully", "Hopefulness", "Hoper", "Hopes", "Hoping", "Ideal", "Idealist", "Idealistic",
                                    "Idealistically", "Ideally", "Looking up", "Looks up", "Optimism", "Optimist", "Optimistic",
                                    "Optimistical", "Optimistically", "Outlook", "Positive", "Positively", "Positiveness",
                                    "Positivity", "Promising", "Propitious", "Propitiously", "Propitiousness", "Reassure",
                                    "Reassured", "Reassures", "Reassuring", "Roseate", "Rosy", "Sanguine", "Sanguinely",
                                    "Sanguineness", "Sanguinity", "Sunniness", "Sunny")))
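
The word lists above mix capitalized, hyphenated and multi-word entries, so it may be worth checking how they match before counting anything. The following is a minimal sketch, assuming a current quanteda release; mini_dict and the example sentence are made up for illustration and are not part of the real data. It relies on tokens_lookup() matching case-insensitively by default and treating entries that contain a space as token sequences:

```r
library(quanteda)

# Hypothetical mini-dictionary reusing a few entries from the lists above
mini_dict <- dictionary(list(
  hope       = c("Hopeful"),
  efficacy   = c("Self-reliant"),
  resiliency = c("Keep going")
))

# One made-up sentence, tokenized with the same options as in the question
toks <- tokens("We keep going, self-reliant and hopeful.",
               remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)

# Exclusive lookup (the default) keeps only the matched dictionary keys
hits <- tokens_lookup(toks, mini_dict)
matched_keys <- unname(unlist(as.list(hits)))
print(matched_keys)
```

Note that a multi-word entry such as "Keep going" collapses two tokens into a single key, which matters when choosing the denominator for a relative frequency.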

Now I would like to compute the relative frequency of each dimension, i.e. the number of words in the tweets that reflect a PsyCap dimension divided by the total number of words in the corpus. Unfortunately, I got stuck at this point. In the end I would like to have a table that looks like this (values are made up):

  dimensions Frequency
1       hope      0.36
2   optimism      0.50
3   efficacy      0.22
4 resiliency      0.10

I hope my explanations are sufficient. If not, do not hesitate to ask. Thank you!


Solution

  • The easiest way to do this is to use tokens_lookup() with a catch-all category (the nomatch argument) for tokens that match no dictionary key, compile the result into a dfm, and then convert the counts to within-document proportions with dfm_weight(scheme = "prop").

    To use a reproducible example from built-in quanteda objects, the process would be the following. (You can substitute your own corpus and dictionary and the code should work fine.)

    library("quanteda")
    ## Package version: 3.2
    ## Unicode version: 13.0
    ## ICU version: 69.1
    ## Parallel computing: 12 of 12 threads used.
    ## See https://quanteda.io for tutorials and examples.
    
    tok_before_failure <- tokens(tail(data_corpus_inaugural, 5))
    dict <- data_dictionary_LSD2015[1:2]
    
    tokens_lookup(tok_before_failure, dict, nomatch = "other") %>%
      dfm() %>%
      dfm_weight(scheme = "prop")
    ## Document-feature matrix of: 5 documents, 3 features (0.00% sparse) and 4 docvars.
    ##             features
    ## docs           negative   positive     other
    ##   2005-Bush  0.03719723 0.09169550 0.8711073
    ##   2009-Obama 0.04428731 0.07182732 0.8838854
    ##   2013-Obama 0.03366422 0.07337074 0.8929650
    ##   2017-Trump 0.02831325 0.07409639 0.8975904
    ##   2021-Biden 0.04049168 0.06182213 0.8976862
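
The output above gives proportions per document, whereas the question asks for one value per dimension across the whole corpus. A minimal sketch of one way to get there follows, assuming the data frame has a column identifying the user (called user here, which is an assumption, as are the toy tweets and toy_dict). It divides the dictionary hits by the raw token count from ntoken(), so multi-word matches do not shrink the denominator:

```r
library(quanteda)

# Toy stand-ins for before_failure and dict (hypothetical data, not the real tweets)
toy <- data.frame(
  user = c("a", "a", "b"),
  text = c("I hope we keep going", "Strong and confident team", "Rosy outlook today"),
  stringsAsFactors = FALSE
)
toy_dict <- dictionary(list(
  hope       = c("hope"),
  efficacy   = c("strong", "confident"),
  resiliency = c("keep going"),
  optimism   = c("rosy", "outlook")
))

toks <- tokens(corpus(toy, text_field = "text"), remove_punct = TRUE)

# Dictionary hits per document (only matched keys are kept)
dfmat <- dfm(tokens_lookup(toks, toy_dict))

# Corpus level: hits per dimension divided by all tokens in the corpus
counts <- colSums(as.matrix(dfmat))
result <- data.frame(dimensions = names(counts),
                     Frequency  = as.numeric(counts) / sum(ntoken(toks)),
                     stringsAsFactors = FALSE)
print(result)

# Per user: group the hits by user, then divide by each user's token total
user_counts <- as.matrix(dfm_group(dfmat, groups = docvars(dfmat, "user")))
user_words  <- tapply(ntoken(toks), docvars(toks, "user"), sum)
by_user <- user_counts / as.numeric(user_words[rownames(user_counts)])
print(by_user)
```

With the real data, before_failure and dict would take the place of toy and toy_dict. Because several words appear in more than one of the four lists, it may be worth checking how such overlaps are counted before interpreting the shares.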