| dataset | group | mean_clip |
|---|---|---|
| 1 | hc | 0.2213 |
| 1 | scz | 0.2398 |
| 1 | DIFF (hc - scz) | -0.0185 |
SMI paper
1 index
1.1 snc
16065.1.2
2 introduction
Inspired by the paper […] (ref?), which found evidence for […], we tried to replicate the pipeline of […].
Our first draft attempt supports their hypothesis […].
3 background
3.1 preliminary
- embedding into class subject
- […]
3.2 literature
3.3 hypothesis
[…]
4 material/methods
see Schwarz (2026#todo) for the corpus-building and evaluation scripts.
4.1 data
our corpus consists of interview transcripts available as raw text, which were tokenised and segmented into 10-unit chunks for further processing (see Section 4.2).
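The segmentation step can be sketched as follows. This is a minimal illustration, not the script from Schwarz (2026#todo): it assumes that a "unit" is a token, that tokens come from a plain whitespace split, and that a shorter trailing chunk is kept. The function name `chunk_tokens` and the example sentence are hypothetical.

```python
def chunk_tokens(tokens, size=10):
    """Split a token list into consecutive chunks of `size` tokens.

    The final chunk may be shorter than `size`; whether the actual
    pipeline keeps or drops such a remainder is an assumption here.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


# Hypothetical transcript snippet; a whitespace split stands in for the real tokeniser.
text = "well I went to the shop and then I came back home again later that day"
tokens = text.split()
chunks = chunk_tokens(tokens)

for index, chunk in enumerate(chunks):
    print(index, len(chunk), " ".join(chunk))
```

If the pipeline instead discards incomplete chunks, the list comprehension would need an extra length filter.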
4.1.1 corpus stats
| dataset | group | participants | tokens |
|---|---|---|---|
| 1 | HC | 13 | 21650 |
| 1 | SCZ | 16 | 92250 |
4.2 computations
clip scores were computed on HU resources with a Python script, cf. Nenchev (2026).
5 results
5.1 basic descriptive
first simple descriptive statistics, which still need to be tested for stability…
5.2 visuals
[…]
raffael script (commit hash!) from ggl-doc: Richter (2026)

5.3 linear regression
to test the descriptive results, we compute the stability […] with linear regression models, using R’s lm() and lme4::lmer() functions, cf. Bates et al. (2015). coefficients are printed below […]
5.3.1 basic (lm)
formula: cl_score ~ group
Call:
lm(formula = cl_score ~ group, data = dff4)
Residuals:
     Min       1Q   Median       3Q      Max
-0.21020 -0.04017 -0.00256  0.03779  0.27403
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2213234  0.0003823  578.94   <2e-16 ***
groupscz    0.0184706  0.0004248   43.48   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.05625 on 113898 degrees of freedom
Multiple R-squared: 0.01633, Adjusted R-squared: 0.01632
F-statistic: 1891 on 1 and 113898 DF, p-value: < 2.2e-16
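As a reading aid for the lm output above: with a single two-level predictor, the intercept of an ordinary least-squares fit equals the mean of the reference group (here hc) and the slope equals the difference between the group means (scz minus hc). This explains why groupscz is +0.0185 while the DIFF row of the group-means table reports -0.0185, which is consistent with hc minus scz. The sketch below checks the identity on made-up toy values; the data and the helper name `ols_binary` are hypothetical and are not taken from dff4.

```python
from statistics import mean


def ols_binary(scores, is_scz):
    """Closed-form OLS fit of score ~ group for a 0/1 group indicator."""
    x_bar = mean(is_scz)
    y_bar = mean(scores)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(is_scz, scores))
    sxx = sum((x - x_bar) ** 2 for x in is_scz)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Toy scores, NOT the study data: three hc observations (0) and three scz observations (1).
scores = [0.20, 0.22, 0.24, 0.23, 0.25, 0.27]
is_scz = [0, 0, 0, 1, 1, 1]

intercept, slope = ols_binary(scores, is_scz)
hc_mean = mean([s for s, g in zip(scores, is_scz) if g == 0])
scz_mean = mean([s for s, g in zip(scores, is_scz) if g == 1])

print("intercept:", round(intercept, 4), "hc mean:", round(hc_mean, 4))
print("slope:", round(slope, 4), "scz - hc:", round(scz_mean - hc_mean, 4))
```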
5.3.2 mixed effects model (lmer)
formula: cl_score ~ group + (1|TN) + (1|text_chunk) + ld + fstPPr_rate, positing random intercepts for participant (TN) and token (text_chunk), and fixed effects for group, type/token ratio (ld, the lexical diversity factor) and 1st-person-pronoun rate (fstPPr_rate).
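Written out, the lmer formula corresponds to the following model. This is a sketch under assumptions: the observation index n, the indicator coding of group (hc as reference level, matching the coefficient name groupscz) and the symbol names are introduced here for illustration only; the variable names are those of the formula.

```latex
\begin{align}
\mathrm{cl\_score}_{n} &= \beta_0 + \beta_1\,\mathrm{scz}_{n} + \beta_2\,\mathrm{ld}_{n}
  + \beta_3\,\mathrm{fstPPr\_rate}_{n}
  + u_{\mathrm{TN}[n]} + v_{\mathrm{text\_chunk}[n]} + \varepsilon_{n} \\
u_{j} &\sim \mathcal{N}\!\left(0, \sigma^2_{\mathrm{TN}}\right), \qquad
v_{k} \sim \mathcal{N}\!\left(0, \sigma^2_{\mathrm{text\_chunk}}\right), \qquad
\varepsilon_{n} \sim \mathcal{N}\!\left(0, \sigma^2\right)
\end{align}
```

Here scz_n is 1 if observation n belongs to the scz group and 0 otherwise, u_j is the random intercept of participant j, and v_k is the random intercept of level k of text_chunk.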
Linear mixed model fit by REML. t-tests use Satterthwaite's method [lmerModLmerTest]
Formula: cl_score ~ group + (1 | TN) + (1 | text_chunk) + ld + fstPPr_rate
Data: dff4
REML criterion at convergence: -406293.2
Scaled residuals:
    Min      1Q  Median      3Q     Max
-4.7324 -0.5891 -0.0624  0.4691  6.2275
Random effects:
 Groups     Name        Variance  Std.Dev.
 text_chunk (Intercept) 1.631e-03 0.040381
 TN         (Intercept) 7.633e-05 0.008737
 Residual               1.476e-03 0.038419
Number of obs: 113900, groups: text_chunk, 3653; TN, 29
Fixed effects:
             Estimate Std. Error        df t value Pr(>|t|)
(Intercept)  0.234962   0.010406 27.942796  22.579  < 2e-16 ***
groupscz     0.023571   0.004146 28.049269   5.685 4.26e-06 ***
ld          -0.026691   0.034733 36.448483  -0.768    0.447
fstPPr_rate -0.189069   0.122592 24.946504  -1.542    0.136
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
            (Intr) grpscz ld
groupscz    -0.217
ld          -0.731  0.280
fstPPr_rate -0.696 -0.261  0.105
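One possible aid for reading the random-effects table above is to express each variance component as a share of the total variance (an intraclass-correlation-style summary). The sketch below only re-uses the three variances printed in the lmer output; treating the simple sum of the components as the total is an assumption that ignores the variance explained by the fixed effects.

```python
# Variances copied from the "Random effects" table of the lmer output above.
variances = {
    "text_chunk": 1.631e-03,
    "TN": 7.633e-05,
    "Residual": 1.476e-03,
}

# Assumption: total variance is the plain sum of the three components.
total = sum(variances.values())
shares = {name: value / total for name, value in variances.items()}

for name, share in shares.items():
    print(f"{name:<10} {share:.3f}")
```

Under that assumption, most of the non-residual variance sits in the text_chunk grouping and only a small share is attributable to participants (TN).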
5.3.2.1 helper interpretation, to be tested
the coefficients of interest to us are […]