coherence & proposition

measuring reference distance in :schizophrenia: threads

st. schwarz

2025-10-22

github repo for scripts & other materials

Schwarz (2025)

data

  • target corpus: subreddit channel (n = 1,500,371 tokens)
  • reference corpus: r/unpopularopinion (n = 980,731 tokens)
  • POS-tagged using the R udpipe package (Wijffels 2023), Universal Dependencies tagset
  • dataframe of 142321 distance datapoints
  • the distances are normalised to the target corpus
  • outliers are excluded from the analysis
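The source does not state how the outlier threshold (the "distance ceiling" named in the figure captions) is chosen. As a minimal sketch only, one common convention is Tukey's upper fence, i.e. the third quartile plus 1.5 times the interquartile range. That rule and the names `upper_fence` and `drop_outliers` are assumptions for illustration, not the method of the author's R scripts; Python is used here purely for illustration. The toy input is the ten `dist` values of the sample subset plus one invented extreme value.

```python
from statistics import quantiles

def upper_fence(distances, k=1.5):
    """Tukey-style ceiling: Q3 + k * IQR (an assumed rule, not the author's)."""
    q1, _, q3 = quantiles(distances, n=4)  # quartiles, default 'exclusive' method
    return q3 + k * (q3 - q1)

def drop_outliers(distances, ceiling):
    """Keep only distances at or below the ceiling."""
    return [d for d in distances if d <= ceiling]

# ten `dist` values from the sample subset, plus 5000 as an invented outlier
toy = [132, 7, 24, 35, 11, 37, 600, 160, 607, 23, 5000]
ceiling = upper_fence(toy)
kept = drop_outliers(toy, ceiling)
print(ceiling, len(kept))
```

With a fixed ceiling instead of a computed one, `drop_outliers(toy, ceiling=...)` can be called directly with whatever threshold the analysis uses.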

sample subset

token          upos  target  pos      prepos  url_id  range  q  det  aut_id  total_mentions  dist  embed.score  dist_rel_within  dist_rel_all  dist_rel_obs  dist_rel_ref  embed_c  dist_rel_scaled
police         NOUN  obs     192087   DET     370     2071   c  1    810     6               132   40.79        121              204           121           286           -1.51    0.06
case           NOUN  ref     818899   ADJ     2243    6849   a  0    10059   7               7     29.89        5                3             2             5             -12.41   0.00
days           NOUN  ref     332152   VERB    1989    5726   a  0    4577    62              24    53.05        19               13            8             19            10.75    0.00
fighting       NOUN  ref     655504   PART    2140    2599   a  0    4195    2               35    38.23        60               43            26            60            -4.07    0.01
number         NOUN  ref     101561   SCONJ   1899    3001   a  0    4469    71              11    49.23        16               12            7             16            6.92     0.00
defense        NOUN  ref     309218   PUNCT   1987    12593  a  0    6279    6               37    35.90        13               9             6             13            -6.41    0.00
Religion       NOUN  obs     941735   PUNCT   1462    2277   a  0    2590    4               600   57.32        501              844           501           1182          15.01    0.26
schizophrenia  NOUN  obs     453902   DET     695     1702   c  1    1459    8               160   40.20        179              301           179           422           -2.10    0.09
way            NOUN  obs     1010964  DET     1548    926    d  1    2722    2               607   30.20        1247             2101          1247          2940          -12.11   0.66
artist         NOUN  ref     56247    PRON    1881    2625   a  0    3968    13              23    36.39        39               28            17            39            -5.92    0.01
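The sample rows hint at how the normalisation works. In all ten rows, `dist_rel_scaled` equals `dist / range` rounded to two decimals, and `dist_rel_obs` equals that same ratio multiplied by a constant of about 1902. The sketch below checks this. It is an inference from the rows shown, not a description of the author's scripts: the function name `normalise` is hypothetical, `range` is presumed to be the thread length in tokens, and `OBS_SCALE = 1902` is a value fitted to the sample, plausibly a typical thread length in the target corpus.

```python
# (token, dist, range, dist_rel_obs, dist_rel_scaled), copied from the sample subset
SAMPLE = [
    ("police", 132, 2071, 121, 0.06),
    ("case", 7, 6849, 2, 0.00),
    ("days", 24, 5726, 8, 0.00),
    ("fighting", 35, 2599, 26, 0.01),
    ("number", 11, 3001, 7, 0.00),
    ("defense", 37, 12593, 6, 0.00),
    ("Religion", 600, 2277, 501, 0.26),
    ("schizophrenia", 160, 1702, 179, 0.09),
    ("way", 607, 926, 1247, 0.66),
    ("artist", 23, 2625, 17, 0.01),
]

OBS_SCALE = 1902  # assumed constant, fitted to the sample rows (not from the source)

def normalise(dist, thread_range, scale=OBS_SCALE):
    """Express a raw distance relative to thread length, rescaled by `scale`."""
    return dist / thread_range * scale

# compare the recomputed values with the columns in the table
scaled_ok = all(round(d / r, 2) == s for _, d, r, _, s in SAMPLE)
rel_obs_ok = all(round(normalise(d, r)) == o for _, d, r, o, _ in SAMPLE)
print(scaled_ok, rel_obs_ok)
```

The same ratio with a different constant would give the `dist_rel_ref` and `dist_rel_all` columns, which is why a single raw distance appears under several normalised values.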

general

compare distances by corpus, normalised to obs, distance ceiling =  outliers removed

mean

mean distances over query/corpus, normalised to obs, distance ceiling =  outliers removed

mean/median

table (normalised) for model: 1
target  q  n      mean  median
obs     a  42836  234   117
ref     a  58615  121   47
obs     b  2116   286   165
ref     b  1130   121   44
obs     c  5770   231   114
ref     c  1274   120   48
obs     d  5654   260   144
ref     d  1525   122   49
obs     e  3911   281   147
ref     e  671    125   45
obs     f  2311   222   133
ref     f  413    116   47
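A table of this shape comes from grouping the distance datapoints by corpus (`target`) and query (`q`) and summarising each group. A minimal sketch, in Python for illustration, using six invented datapoints rather than corpus data; the function name `summarise` is hypothetical and not taken from the author's scripts.

```python
from collections import defaultdict
from statistics import mean, median

def summarise(rows):
    """Group (target, q, distance) rows; return n, mean and median per group."""
    groups = defaultdict(list)
    for target, q, dist in rows:
        groups[(target, q)].append(dist)
    return {
        key: {"n": len(vals), "mean": mean(vals), "median": median(vals)}
        for key, vals in sorted(groups.items())
    }

# invented toy datapoints (not corpus data)
rows = [
    ("obs", "a", 10), ("obs", "a", 20), ("obs", "a", 300),
    ("ref", "a", 5), ("ref", "a", 15), ("ref", "a", 40),
]
summary = summarise(rows)
for (target, q), stats in summary.items():
    print(target, q, stats["n"], stats["mean"], stats["median"])
```

A mean well above the median, as in every row of the table, indicates a right-skewed distribution with a long tail of large distances, which is why both statistics are reported.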

median

median distances over query/corpus, normalised to obs, distance ceiling =  outliers removed

estimates

distances relation, normalised to obs, distance ceiling =  outliers removed

normalised vs. raw

distances normalised vs. raw

why normalise

mean/median table (raw) for model: 2
target  q  n      mean  median
obs     a  46318  525   92
ref     a  68618  2305  118
obs     b  2287   275   109
ref     b  1315   1771  111
obs     c  6253   666   89
ref     c  1504   1147  119
obs     d  6171   441   105
ref     d  1765   2214  124
obs     e  4278   298   109
ref     e  795    2636  116
obs     f  2520   249   77
ref     f  497    1627  124
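Setting the raw and the normalised table side by side shows what normalisation and outlier removal change. The sketch below copies the mean and median for query a from both tables and computes the ref-to-obs ratio of each; only the variable and function names are introduced here, the numbers are those reported above.

```python
# (mean, median) for query "a", copied from the raw and the normalised table
raw = {"obs": (525, 92), "ref": (2305, 118)}
normalised = {"obs": (234, 117), "ref": (121, 47)}

def ref_to_obs(table):
    """Return (mean ratio, median ratio) of ref relative to obs."""
    obs_mean, obs_median = table["obs"]
    ref_mean, ref_median = table["ref"]
    return ref_mean / obs_mean, ref_median / obs_median

raw_mean_ratio, raw_median_ratio = ref_to_obs(raw)
norm_mean_ratio, norm_median_ratio = ref_to_obs(normalised)
print(round(raw_mean_ratio, 2), round(raw_median_ratio, 2))
print(round(norm_mean_ratio, 2), round(norm_median_ratio, 2))
```

In the raw data the ref mean is about 4.4 times the obs mean, while the medians differ by a factor of only about 1.3, which is consistent with a small number of very large raw distances in the reference corpus. After normalisation and outlier removal, both mean and median are lower in ref than in obs (ratios of about 0.5 and 0.4).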

REF

Schwarz, St. 2025. “Poster Appendix: This Papers Scripts for Corpus Build and Statistics on Github.” https://github.com/esteeschwarz/SPUND-LX/tree/main/psych/HA.
Wijffels, Jan. 2023. Udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit. https://CRAN.R-project.org/package=udpipe.