The website Footnote 2 was used to gather tweet-ids Footnote 3; this website provides researchers with metadata of a (third-party-collected) corpus of Dutch tweets (Tjong Kim Sang and Van den Bosch, 2013). Requesting tweets by their ids circumvents the historical limit that applies when requesting tweets with a search query. The R package ‘rtweet’ and its complementary ‘lookup_status’ function were used to collect the tweets in JSON format. The JSON file comprises a table with the tweets’ information, such as the creation date, the tweet text, and the source (i.e., type of Twitter client).
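The per-tweet records can be flattened into the table described above. A minimal sketch in Python (the field names `created_at`, `text`, and `source` are assumptions for illustration, not the documented rtweet schema):

```python
import json

# Hypothetical sample of the per-tweet JSON records; the field names
# are illustrative assumptions, not the exact rtweet output schema.
sample = '''[{"status_id": "1",
              "created_at": "2017-11-01",
              "text": "voorbeeld tweet",
              "source": "Twitter for Android"}]'''

tweets = json.loads(sample)

# Flatten each record into the columns named in the text:
# creation date, tweet text, and source (type of Twitter client).
rows = [(t["created_at"], t["text"], t["source"]) for t in tweets]
print(rows[0])
```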
Data cleaning and preprocessing
The JSON Footnote 4 files were converted into an R data frame object. Non-Dutch tweets, retweets, and automated tweets (e.g., forecast-, advertisement-related, and traffic-related tweets) were removed. In addition, we excluded tweets based on three user-related criteria: (1) we removed tweets that belonged to the top 0.5 percentile of user activity because we considered them non-representative of the normal user population, such as users who created more than 2000 tweets within four weeks. (2) Tweets from users with early access to the 280-character limit were removed. (3) Tweets from users who were not represented in both pre- and post-CLC datasets were removed; this procedure ensured a consistent user sample over time (within-group design, N users = 109,661). All cleaning procedures and corresponding exclusion numbers are presented in Table 2.
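The user-related criteria above can be sketched as a small filtering routine. This is a toy illustration (the data layout and cutoff value are stand-ins, not the authors' code):

```python
from collections import Counter

# Toy pre- and post-CLC datasets: lists of (user_id, tweet_text).
pre = [("u1", "a"), ("u1", "b"), ("u2", "c"), ("u3", "d")]
post = [("u1", "e"), ("u2", "f"), ("u4", "g")]

def drop_hyperactive(tweets, max_tweets):
    """Drop users whose tweet count exceeds a cutoff (the study removed
    the top 0.5 percentile, e.g. >2000 tweets within four weeks)."""
    counts = Counter(u for u, _ in tweets)
    return [(u, t) for u, t in tweets if counts[u] <= max_tweets]

# Within-group design: keep only users present in BOTH periods.
shared = {u for u, _ in pre} & {u for u, _ in post}
pre_kept = [(u, t) for u, t in drop_hyperactive(pre, 2000) if u in shared]
post_kept = [(u, t) for u, t in drop_hyperactive(post, 2000) if u in shared]
print(sorted({u for u, _ in pre_kept}))  # users retained in both datasets
```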
The tweet texts were transformed into ASCII encoding. URLs, line breaks, tweet headers, screen names, and references to screen names were removed. URLs increase the character count when located within the tweet. However, URLs do not increase the character count when they are located at the end of a tweet. To prevent a misrepresentation of the actual character limit that users had to deal with, tweets with URLs (but not media URLs such as attached images or videos) were excluded.
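A minimal cleaning sketch of these steps, assuming simple regex approximations for URLs and screen-name mentions (the patterns are illustrative, not the authors' exact preprocessing):

```python
import re

# Illustrative patterns; real tweet entities are more varied than this.
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")

def clean(text):
    if URL.search(text):      # tweets containing (non-media) URLs were excluded
        return None
    text = MENTION.sub("", text)    # drop references to screen names
    text = text.replace("\n", " ")  # drop line breaks
    # ASCII transformation: non-ASCII characters are dropped here for brevity
    text = text.encode("ascii", "ignore").decode("ascii")
    return " ".join(text.split())

print(clean("@user hé\ndit is een test"))  # mention, newline, accent removed
print(clean("zie https://example.com"))    # excluded -> None
```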
Token and bigram analysis
The R package Footnote 5 ‘quanteda’ was used to tokenize the tweet texts into tokens (i.e., isolated words, punctuation marks, and symbols). Additionally, token-frequency matrices were computed with: the frequency pre-CLC [f(token pre)], the relative frequency pre-CLC [P(token pre)], the frequency post-CLC [f(token post)], the relative frequency post-CLC [P(token post)], and T-scores. The T-score is similar to a standard T-statistic and calculates the statistical difference between means (i.e., the relative word frequencies). Negative T-scores indicate a relatively higher occurrence of a token pre-CLC, whereas positive T-scores indicate a relatively higher occurrence of a token post-CLC. The T-score equation used in this analysis is shown as Eqs. (1) and (2). N is the total number of tokens per dataset (i.e., pre- and post-CLC). This formula is based on the method for linguistic computations by Church et al. (1991; Tjong Kim Sang, 2011).
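Since Eqs. (1) and (2) are not reproduced in this excerpt, the sketch below assumes the common Church et al. (1991) approximation of the T-score for comparing relative frequencies across two corpora; the exact form used in the study may differ:

```python
from math import sqrt

def t_score(f_pre, n_pre, f_post, n_post):
    """T-score under the assumed Church et al. (1991) approximation.
    f_* are raw token counts; n_* are total tokens per dataset."""
    p_pre = f_pre / n_pre      # relative frequency pre-CLC,  P(token pre)
    p_post = f_post / n_post   # relative frequency post-CLC, P(token post)
    return (p_post - p_pre) / sqrt(p_post / n_post + p_pre / n_pre)

# Negative -> token relatively more frequent pre-CLC;
# positive -> relatively more frequent post-CLC.
print(round(t_score(200, 10_000, 100, 10_000), 2))  # -> -5.77
```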
Part-of-speech (POS) analysis
The R package Footnote 6 ‘openNLP’ was used to classify and count POS categories in the tweets (i.e., adjectives, adverbs, articles, conjunctives, interjections, nouns, numerals, prepositions, pronouns, punctuation, verbs, and miscellaneous). The POS tagger operates using a maximum entropy (maxent) probability model to predict the POS category based on contextual features (Ratnaparkhi, 1996). The Dutch maxent model used for the POS classification was trained on the CoNLL-X Alpino Dutch Treebank data (Buchholz and Marsi, 2006). The openNLP POS model has been reported with an accuracy score of 87.3% when used on English social media data (Horsmann et al., 2015). An ostensible limitation of the current study is the accuracy of the POS tagger. However, equivalent analyses were performed for both pre-CLC and post-CLC datasets, meaning the accuracy of the POS tagger should be consistent over both datasets. Hence, we assume there are no systematic confounds.
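Once each token has a predicted POS category, the counting step reduces to tallying tags per dataset. A toy sketch (the tagged pairs below are invented stand-ins; the study used openNLP's Dutch maxent model, not this hand-labeled list):

```python
from collections import Counter

# Hypothetical (token, POS-category) pairs standing in for tagger output.
tagged = [("de", "article"), ("kat", "noun"), ("zit", "verb"),
          ("op", "preposition"), ("de", "article"), ("mat", "noun")]

# Count each POS category and derive its relative frequency, which is
# what gets compared between the pre- and post-CLC datasets.
counts = Counter(tag for _, tag in tagged)
total = sum(counts.values())
rel = {tag: n / total for tag, n in counts.items()}
print(counts["noun"], round(rel["article"], 2))  # -> 2 0.33
```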