coherence & proposition

GitHub repo for scripts and other materials

Schwarz (2025)

data

target corpus: subreddit channel (n = 1500371 tokens)
reference corpus: r/unpopularopinion (n = 980731 tokens)
POS-tagged with the R udpipe package (Wijffels 2023), using the Universal Dependencies tagset
dataframe of 142321 distance datapoints
the distances are normalised to the target corpus (obs)
outliers are excluded from the analysis

sample subset

token	upos	target	pos	prepos	url_id	range	q	det	aut_id	total_mentions	dist	embed.score	dist_rel_within	dist_rel_all	dist_rel_obs	dist_rel_ref	embed_c	dist_rel_scaled
police	NOUN	obs	192087	DET	370	2071	c	1	810	6	132	40.79	121	204	121	286	-1.51	0.06
case	NOUN	ref	818899	ADJ	2243	6849	a	0	10059	7	7	29.89	5	3	2	5	-12.41	0.00
days	NOUN	ref	332152	VERB	1989	5726	a	0	4577	62	24	53.05	19	13	8	19	10.75	0.00
fighting	NOUN	ref	655504	PART	2140	2599	a	0	4195	2	35	38.23	60	43	26	60	-4.07	0.01
number	NOUN	ref	101561	SCONJ	1899	3001	a	0	4469	71	11	49.23	16	12	7	16	6.92	0.00
defense	NOUN	ref	309218	PUNCT	1987	12593	a	0	6279	6	37	35.90	13	9	6	13	-6.41	0.00
Religion	NOUN	obs	941735	PUNCT	1462	2277	a	0	2590	4	600	57.32	501	844	501	1182	15.01	0.26
schizophrenia	NOUN	obs	453902	DET	695	1702	c	1	1459	8	160	40.20	179	301	179	422	-2.10	0.09
way	NOUN	obs	1010964	DET	1548	926	d	1	2722	2	607	30.20	1247	2101	1247	2940	-12.11	0.66
artist	NOUN	ref	56247	PRON	1881	2625	a	0	3968	13	23	36.39	39	28	17	39	-5.92	0.01
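
Two of the derived columns can be checked against the sample rows directly. The sketch below (Python, standard library only; it is not taken from the linked scripts) tests two relations that are inferred from the ten rows above and are not stated in the source: that `dist_rel_scaled` equals `dist / range` to two decimals, and that `embed_c` is `embed.score` minus a single constant. The variable names are illustrative.

```python
# Columns copied from the sample rows above:
# (token, range, dist, embed.score, embed_c, dist_rel_scaled)
rows = [
    ("police",        2071,  132, 40.79,  -1.51, 0.06),
    ("case",          6849,    7, 29.89, -12.41, 0.00),
    ("days",          5726,   24, 53.05,  10.75, 0.00),
    ("fighting",      2599,   35, 38.23,  -4.07, 0.01),
    ("number",        3001,   11, 49.23,   6.92, 0.00),
    ("defense",      12593,   37, 35.90,  -6.41, 0.00),
    ("Religion",      2277,  600, 57.32,  15.01, 0.26),
    ("schizophrenia", 1702,  160, 40.20,  -2.10, 0.09),
    ("way",            926,  607, 30.20, -12.11, 0.66),
    ("artist",        2625,   23, 36.39,  -5.92, 0.01),
]

# Assumption (inferred, not stated in the source):
# dist_rel_scaled is dist / range, reported to two decimals.
# A two-decimal value can differ from the exact one by at most 0.005.
scaled_matches = all(
    abs(dist / rng - scaled) <= 0.005
    for _, rng, dist, _, _, scaled in rows
)

# Assumption (inferred, not stated in the source):
# embed_c is embed.score minus one constant, i.e. a centred score.
offsets = [score - centred for _, _, _, score, centred, _ in rows]
offset_spread = max(offsets) - min(offsets)

print("dist / range matches dist_rel_scaled:", scaled_matches)
print("implied centring constant: %.2f to %.2f" % (min(offsets), max(offsets)))
```

If both checks hold on the full dataframe as well, `dist_rel_scaled` can be read as the distance expressed as a fraction of `range`, and `embed_c` as the embedding score relative to its overall level.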

general

compare distances by corpus, normalised to obs, distance ceiling = outliers removed

mean

mean distances over query/corpus, normalised to obs, distance ceiling = outliers removed

mean/median

mean/median table (normalised) for model: 1
target	q	n	mean	median
obs	a	42836	234	117
ref	a	58615	121	47
obs	b	2116	286	165
ref	b	1130	121	44
obs	c	5770	231	114
ref	c	1274	120	48
obs	d	5654	260	144
ref	d	1525	122	49
obs	e	3911	281	147
ref	e	671	125	45
obs	f	2311	222	133
ref	f	413	116	47
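
The table can also be read as obs/ref ratios per query. The following sketch is plain arithmetic on the values above and makes no claim about the underlying model; the dictionary layout and names are illustrative.

```python
# (n, mean, median) per query and corpus, copied from the normalised table above
normalised = {
    "a": {"obs": (42836, 234, 117), "ref": (58615, 121, 47)},
    "b": {"obs": (2116, 286, 165), "ref": (1130, 121, 44)},
    "c": {"obs": (5770, 231, 114), "ref": (1274, 120, 48)},
    "d": {"obs": (5654, 260, 144), "ref": (1525, 122, 49)},
    "e": {"obs": (3911, 281, 147), "ref": (671, 125, 45)},
    "f": {"obs": (2311, 222, 133), "ref": (413, 116, 47)},
}

# Ratio of obs to ref for the mean and the median of each query
ratios = {}
for q, t in normalised.items():
    _, obs_mean, obs_median = t["obs"]
    _, ref_mean, ref_median = t["ref"]
    ratios[q] = (obs_mean / ref_mean, obs_median / ref_median)
    print(f"{q}: mean ratio {ratios[q][0]:.2f}, median ratio {ratios[q][1]:.2f}")
```

In every query the obs value exceeds the ref value for both statistics, and the gap is larger for the medians than for the means.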

median

median distances over query/corpus, normalised to obs, distance ceiling = outliers removed

estimates

distances relation, normalised to obs, distance ceiling = outliers removed

normalised vs. raw

distances normalised vs. raw

why normalise

mean/median table (raw) for model: 2
target	q	n	mean	median
obs	a	46318	525	92
ref	a	68618	2305	118
obs	b	2287	275	109
ref	b	1315	1771	111
obs	c	6253	666	89
ref	c	1504	1147	119
obs	d	6171	441	105
ref	d	1765	2214	124
obs	e	4278	298	109
ref	e	795	2636	116
obs	f	2520	249	77
ref	f	497	1627	124
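
A short comparison of the two tables shows what changes between the raw and the normalised view. The sketch below only uses the numbers printed above; that the smaller n in the normalised table results from the outlier ceiling is an assumption based on the captions, not something the tables state. Names are illustrative.

```python
# (n, mean, median) keyed by (corpus, query), copied from the two tables
normalised = {
    ("obs", "a"): (42836, 234, 117), ("ref", "a"): (58615, 121, 47),
    ("obs", "b"): (2116, 286, 165),  ("ref", "b"): (1130, 121, 44),
    ("obs", "c"): (5770, 231, 114),  ("ref", "c"): (1274, 120, 48),
    ("obs", "d"): (5654, 260, 144),  ("ref", "d"): (1525, 122, 49),
    ("obs", "e"): (3911, 281, 147),  ("ref", "e"): (671, 125, 45),
    ("obs", "f"): (2311, 222, 133),  ("ref", "f"): (413, 116, 47),
}
raw = {
    ("obs", "a"): (46318, 525, 92),  ("ref", "a"): (68618, 2305, 118),
    ("obs", "b"): (2287, 275, 109),  ("ref", "b"): (1315, 1771, 111),
    ("obs", "c"): (6253, 666, 89),   ("ref", "c"): (1504, 1147, 119),
    ("obs", "d"): (6171, 441, 105),  ("ref", "d"): (1765, 2214, 124),
    ("obs", "e"): (4278, 298, 109),  ("ref", "e"): (795, 2636, 116),
    ("obs", "f"): (2520, 249, 77),   ("ref", "f"): (497, 1627, 124),
}
queries = "abcdef"

# Datapoints in each table; the difference is assumed to be the removed outliers
raw_total = sum(n for n, _, _ in raw.values())
kept_total = sum(n for n, _, _ in normalised.values())
removed_share = (raw_total - kept_total) / raw_total

# A mean far above the median indicates a long right tail
raw_skew = max(mean / median for _, mean, median in raw.values())
norm_skew = max(mean / median for _, mean, median in normalised.values())

# Direction of the obs/ref difference in the medians
raw_ref_higher = all(raw[("ref", q)][2] > raw[("obs", q)][2] for q in queries)
norm_obs_higher = all(
    normalised[("obs", q)][2] > normalised[("ref", q)][2] for q in queries
)

print(f"raw n = {raw_total}, normalised n = {kept_total}, "
      f"share removed = {removed_share:.3f}")
print(f"largest mean/median ratio: raw {raw_skew:.1f}, normalised {norm_skew:.1f}")
print("raw medians: ref > obs in all queries:", raw_ref_higher)
print("normalised medians: obs > ref in all queries:", norm_obs_higher)
```

The raw total equals the 142321 datapoints reported in the data section. The raw means are far larger than the raw medians, especially in the reference corpus, and the direction of the obs/ref difference in the medians reverses once distances are normalised.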

REF

Schwarz, St. 2025. “Poster Appendix: This Paper’s Scripts for Corpus Build and Statistics on Github.” https://github.com/esteeschwarz/SPUND-LX/tree/main/psych/HA.

Wijffels, Jan. 2023. Udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the ’UDPipe’ ’NLP’ Toolkit. https://CRAN.R-project.org/package=udpipe.