Analyzing Mentions of Death in Covid-19 Tweets
DOI: 10.5281/zenodo.10839649
Zenodo: 10839649
MaRDI QID: Q6708237
Dataset published at Zenodo repository.
Author name not available
Publication date: 27 March 2024
Dataset preparation and annotation

The dataset is a subset of the TBCOV dataset collected at QCRI, filtered for mentions of personally related COVID-19 deaths. The filtering was done using regular expressions such as: my * passed, my * died, my * succumbed, lost * battle. A sample of the dataset was annotated on Appen. Please see 'annotation-instructions.txt' for the full instructions provided to the annotators.

Dataset description

The "classifier_filtered_english.csv" file contains 33k deduplicated and classifier-filtered tweets (following X's content redistribution policy) for 6 countries (Australia, Canada, India, Italy, United Kingdom, and United States) from March 2020 to March 2021, with classifier-predicted death labels, regular expression-filtered gender and relationship labels, and the user device label. The full 57k regex-filtered collection of tweets can be made available in special cases to academics and researchers.

The file has the following columns:

date: the date of the tweet
country_name: the country name from the Nominatim API
tweet_id: the ID of the tweet
url: the full URL of the tweet
full_text: the full-text content of the tweet (also includes the URL of any attached media)
does_the_tweet_refer_to_the_covidrelated_death_of_one_or_more_individuals_personally_known_to_the_tweets_author: the classifier-predicted label for the death (also includes the original labels for the annotated samples)
what_is_the_relationship_between_the_tweets_author_and_the_victim_mentioned: the annotated relationship labels
relative_to_the_time_of_the_tweet_when_did_the_mentioned_death_occur: the annotated relative time labels
user_is_verified: whether the user is verified
user_gender: the gender of the Twitter user (from the user profile)
user_device: the Twitter client the user uses
has_media: whether the tweet has any attached media
has_url: whether the tweet text contains a URL
matched_device: the device (Apple or Android) based on the Twitter client
regex_gender: the gender inferred from regular expression-based filtering
regex_relationship: the relationship label from regular expression-based filtering

Inferring gender using regular expressions

We first determine the mapping from the relationship labels mentioned in the tweet to a gender. We do not use any relationship like "cousin" from which we cannot easily infer the gender.

Male relationships: 'father', 'dad', 'daddy', 'papa', 'pop', 'pa', 'son', 'brother', 'uncle', 'nephew', 'grandfather', 'grandpa', 'gramps', 'husband', 'boyfriend', 'fiancé', 'groom', 'partner', 'beau', 'friend', 'buddy', 'pal', 'mate', 'companion', 'boy', 'gentleman', 'man', 'father-in-law', 'brother-in-law', 'stepfather', 'stepbrother'

Female relationships: 'mother', 'mom', 'mama', 'mum', 'ma', 'daughter', 'sister', 'aunt', 'niece', 'grandmother', 'grandma', 'granny', 'wife', 'girlfriend', 'fiancée', 'bride', 'partner', 'girl', 'lady', 'woman', 'miss', 'mother-in-law', 'sister-in-law', 'stepmother', 'stepsister'

Based on these mappings, we used the following regex, built once per gender label from the corresponding relationship list, to determine the gender of the deceased mentioned in the tweet:

    "[m|M]y\s(" + "|".join([r + "s?" for r in relationships]) + ")\s(died|succumbed|deceased)"

Age groups from relationship labels

First, we get the relationship labels using regex filtering, and then we group them into age-group categories as shown in the following table. The UK and the US use different age groups because the official data for the two countries define age groups differently.

Category      Relationship (from tweets)     Age Group (UK)  Age Group (US)
Grandparents  grandfather, grandmother       65+             65+
Parents       father, mother, uncle, aunt    45-64           35-64
Siblings      brother, sister, cousin        15-44           15-34
Children      son, daughter, nephew, niece   0-14            0-14

Training the classifier

The 'english-training.csv' file contains about 13k deduplicated human-annotated tweets. We use a random seed (42) to create the train/test split.
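A minimal sketch of a seeded split with the sizes reported for this dataset (671 test, 255 validation, and 12,494 training tweets) is given here for orientation. It is not the authors' code: the function name `split_ids`, the use of Python's `random` module, and the shuffle-then-slice procedure are assumptions, so it will not reproduce the exact partition used for the published model.

```python
import random


def split_ids(ids, n_test=671, n_val=255, seed=42):
    """Shuffle ids with a fixed seed and slice into train/val/test.

    Hypothetical helper: the sizes and the seed come from the dataset
    description, but the shuffle-and-slice procedure is an assumption.
    """
    rng = random.Random(seed)      # private generator, so the split is repeatable
    shuffled = list(ids)
    rng.shuffle(shuffled)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# 13,420 = 12,494 + 255 + 671 annotated tweets, represented here by row indices
train, val, test = split_ids(range(13420))
print(len(train), len(val), len(test))  # 12494 255 671
```

Using a dedicated `random.Random(seed)` instance rather than the module-level generator keeps the split independent of any other random calls in the same program.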
The model Covid-Bert-V2 was fine-tuned on the training set for 2 epochs with the following hyperparameters (obtained using 10-fold cross-validation): random_seed: 42, batch_size: 32, dropout: 0.1. We obtained an F1-score of 0.81 on the test set. We used about 5% (671) of the combined and deduplicated annotated tweets as the test set and about 2% (255) as the validation set; the remaining 12,494 tweets were used for fine-tuning the model. The tweets were preprocessed to replace mentions, URLs, emojis, etc. with generic keywords. The model was trained on a system with a single Nvidia A4000 16GB GPU. The fine-tuned model is also available as the 'model.bin' file. The code for fine-tuning the model and reproducing the experiments is available in this GitHub repository.

Datasheet

We also include a datasheet for the dataset, following the recommendation of "Datasheets for Datasets" (Gebru et al.), which provides more information about how the dataset was created and how it can be used. Please see "Datasheet.pdf".

NOTE: We recommend that researchers try to rehydrate the individual tweets to ensure that the user has not deleted the tweet since posting. This gives users a mechanism to opt out of having their data analyzed.

The dataset will only be made available on reasonable request to academics and researchers. Please use only your institutional email address when requesting the dataset, as requests from any other address (such as gmail.com) will be rejected. Please state why you need the dataset and how you plan to use it when making a request.
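As an illustration of the pattern described under "Inferring gender using regular expressions", the sketch below builds one compiled regex per gender label and applies it to a tweet text. It makes several assumptions: the relationship lists are abbreviated subsets of the published ones, the names `build_pattern` and `infer_gender` are hypothetical, raw strings are used for the `\s` escapes, and the character class is written `[mM]` rather than the published `[m|M]`, which would also accept a literal `|`.

```python
import re

# Abbreviated subsets of the published relationship lists (illustration only)
MALE = ["father", "dad", "son", "brother", "uncle", "grandfather", "grandpa", "husband"]
FEMALE = ["mother", "mom", "daughter", "sister", "aunt", "grandmother", "grandma", "wife"]


def build_pattern(relationships):
    """Compile "my <relationship>[s] died|succumbed|deceased" for one label."""
    alternatives = "|".join(re.escape(r) + "s?" for r in relationships)
    return re.compile(r"[mM]y\s(" + alternatives + r")\s(died|succumbed|deceased)")


PATTERNS = {"male": build_pattern(MALE), "female": build_pattern(FEMALE)}


def infer_gender(text):
    """Return (gender label, matched relationship), or (None, None) if no match."""
    for label, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            return label, match.group(1)
    return None, None


print(infer_gender("My father died of covid last week"))
print(infer_gender("so sad, my grandma succumbed to the virus"))
print(infer_gender("my cousin died yesterday"))
```

Because the pattern requires the verb to follow the relationship word directly, a phrase such as "my dear father died" is not matched; this mirrors the strictness of the published pattern rather than adding any looser matching.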