3 appendix

overall wordcount of paper: 3071.

3.1 legend

Table 3.1: model vars
variable     explanation                                 values
target       corpus                                      obs, ref
q            condition                                   a, b, c, d, e, f
det          antecedent POS==DET                         TRUE, FALSE
aut_id       author                                      author hash
lemma        lemma                                       noun lemma
range        url range of distance devised               1..maxlength(urlthread)
embed.score  semantic similarity score lemma vs. thread  0..1
q:a          query condition                             .*
q:b          query condition                             this, that, those, these
q:c          query condition                             the
q:d          query condition                             a, an, any, some
q:e          query condition                             my
q:f          query condition                             his, her, their, your
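The query conditions q:a to q:f in Table 3.1 can be read as patterns over determiner forms. The following is a minimal sketch of that reading, assuming the patterns are matched against the whole lower-cased determiner. The names `Q_CONDITIONS` and `matching_conditions` are illustrative and not taken from the paper's scripts, and how the scripts assign a single condition to each observation is not shown here.

```python
import re

# Patterns per query condition, copied from Table 3.1.
# Condition "a" is the catch-all pattern ".*".
Q_CONDITIONS = {
    "a": r".*",
    "b": r"this|that|those|these",
    "c": r"the",
    "d": r"a|an|any|some",
    "e": r"my",
    "f": r"his|her|their|your",
}


def matching_conditions(determiner):
    """Return every condition whose pattern matches the whole determiner."""
    token = determiner.lower()
    return [q for q, pattern in Q_CONDITIONS.items()
            if re.fullmatch(pattern, token)]


print(matching_conditions("These"))
```

Under this reading every form satisfies condition a, while conditions b to f single out demonstrative, definite, indefinite and possessive determiners.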

3.2 evaluation model: 1

3.2.1 meta

eval output data: 13, normalised to obs, distance ceiling = outliers removed

3.2.2 parameter setting

##                value
## norm_target _rel_obs
## det.t           TRUE
## limit           TRUE
## author          TRUE
## url             TRUE
## embed1          TRUE
## embed2             f
## range1          TRUE
## range2             f
## rel             TRUE
## lme            FALSE
## lemma          FALSE

3.2.3 anova analysis

3.2.3.1 anova plain

formula: [dist_rel_obs ~ target*q*det]

##                  Df     Sum Sq   Mean Sq   F value    Pr(>F)    
## target            1  452303747 452303747 7336.4625 < 2.2e-16 ***
## q                 5   12320667   2464133   39.9688 < 2.2e-16 ***
## det               1    1636109   1636109   26.5380 2.588e-07 ***
## target:q          5    2747371    549474    8.9126 1.786e-08 ***
## target:det        1     251297    251297    4.0761  0.043496 *  
## q:det             2     905292    452646    7.3420  0.000648 ***
## target:q:det      1     717222    717222   11.6335  0.000648 ***
## Residuals    126209 7780971239     61651                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
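As a reading aid for the table above: each F value is the term's mean square divided by the residual mean square, and the degrees of freedom of all terms plus the residuals add up to n - 1. A minimal sketch that recomputes this from the printed values (variable names are illustrative, the numbers are copied from the table):

```python
# Values copied from the plain ANOVA table of model 1.
sum_sq_target = 452303747
df_target = 1
sum_sq_resid = 7780971239
df_resid = 126209

mean_sq_target = sum_sq_target / df_target
mean_sq_resid = sum_sq_resid / df_resid       # residual mean square
f_target = mean_sq_target / mean_sq_resid     # F value for "target"

# Df column without the residuals:
# target, q, det, target:q, target:det, q:det, target:q:det
df_terms = [1, 5, 1, 5, 1, 2, 1]
n_obs = sum(df_terms) + df_resid + 1

print(round(mean_sq_resid), round(f_target, 2), n_obs)
```

The recomputed observation count agrees with the "Number of obs" line of the mixed model in 3.2.3.3.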

3.2.3.2 anova of linear regression model

[anova(summary(lmer))]

## Type III Analysis of Variance Table with Satterthwaite's method
##                Sum Sq  Mean Sq NumDF  DenDF   F value    Pr(>F)    
## target        1144219  1144219     1   3519   23.4567 1.333e-06 ***
## q              737483   147497     5 122421    3.0237 0.0098706 ** 
## det             12165    12165     1 118425    0.2494 0.6175055    
## range        50399647 50399647     1   1025 1033.2042 < 2.2e-16 ***
## embed.score  25101881 25101881     1 122690  514.5942 < 2.2e-16 ***
## target:q       776335   155267     5 123486    3.1830 0.0070933 ** 
## target:det     541078   541078     1 123325   11.0922 0.0008672 ***
## q:det          359520   179760     2 120804    3.6851 0.0250971 *  
## target:q:det   219844   219844     1 123315    4.5068 0.0337615 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.2.3.3 linear regression coefficients

formula: [dist_rel_obs ~ target*q*det+(1|aut_id)+range+(embed.score)+(1|url_id)]

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: eval(expr(lmeform))
##    Data: dfa
## 
## REML criterion at convergence: 1727639
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8643 -0.5282 -0.1721  0.2469  6.9244 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  aut_id   (Intercept)  2856     53.44  
##  url_id   (Intercept)  8187     90.48  
##  Residual             48780    220.86  
## Number of obs: 126226, groups:  aut_id, 8238; url_id, 2145
## 
## Fixed effects:
##                    Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)       4.625e+02  5.159e+00  8.969e+03  89.651  < 2e-16 ***
## targetref        -4.342e+01  6.299e+00  1.300e+03  -6.893 8.50e-12 ***
## qb               -2.013e+01  1.527e+01  1.218e+05  -1.318 0.187483    
## qc               -2.207e+01  5.499e+00  1.226e+05  -4.014 5.98e-05 ***
## qd               -3.178e+01  2.225e+02  1.184e+05  -0.143 0.886426    
## qe                2.492e+01  3.835e+00  1.247e+05   6.498 8.14e-11 ***
## qf               -1.891e+01  4.892e+00  1.244e+05  -3.866 0.000111 ***
## det               1.273e+01  4.774e+00  1.229e+05   2.667 0.007662 ** 
## range            -5.810e-02  1.808e-03  1.025e+03 -32.143  < 2e-16 ***
## embed.score      -1.829e+02  8.061e+00  1.227e+05 -22.685  < 2e-16 ***
## targetref:qb      1.862e+01  1.719e+01  1.225e+05   1.083 0.278599    
## targetref:qc      2.281e+01  1.279e+01  1.237e+05   1.784 0.074435 .  
## targetref:qd      4.413e-01  1.254e+01  1.238e+05   0.035 0.971935    
## targetref:qe     -2.321e+01  9.511e+00  1.239e+05  -2.441 0.014662 *  
## targetref:qf      1.801e+01  1.210e+01  1.238e+05   1.488 0.136766    
## targetref:det    -1.478e+01  1.084e+01  1.239e+05  -1.363 0.172784    
## qb:det            5.915e+01  1.678e+01  1.219e+05   3.526 0.000423 ***
## qd:det            3.648e+01  2.225e+02  1.184e+05   0.164 0.869736    
## targetref:qb:det -5.198e+01  2.448e+01  1.233e+05  -2.123 0.033761 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## fit warnings:
## fixed-effect model matrix is rank deficient so dropping 7 columns / coefficients
## Some predictor variables are on very different scales: consider rescaling
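The random-effects block above lists variance components for author (aut_id), url (url_id) and the residual. A minimal sketch of how much of the total variance each component carries, computed only from the printed, rounded values; these shares are not reported in the paper and the variable names are illustrative:

```python
# Variance components copied from the "Random effects" block of model 1.
variances = {"aut_id": 2856, "url_id": 8187, "Residual": 48780}

total = sum(variances.values())
shares = {group: var / total for group, var in variances.items()}

for group, share in shares.items():
    print(f"{group}: {share:.3f}")
```

On these figures the url grouping accounts for a larger share of the variance than the author grouping, while most of the variance remains residual.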

3.2.4 plots

Figure 3.1: compare distances by corpus, normalised to obs, distance ceiling = outliers removed

Figure 3.2: mean distances over query/corpus, normalised to obs, distance ceiling = outliers removed

Table 3.2: mean/median table for model: 1
target  q      n  mean  median
obs     a  42836   234     117
ref     a  58615   121      47
obs     b   2116   286     165
ref     b   1130   121      44
obs     c   5770   231     114
ref     c   1274   120      48
obs     d   5654   260     144
ref     d   1525   122      49
obs     e   3911   281     147
ref     e    671   125      45
obs     f   2311   222     133
ref     f    413   116      47

Figure 3.3: median distances over query/corpus, normalised to obs, distance ceiling = outliers removed

Figure 3.4: distances relation, normalised to obs, distance ceiling = outliers removed

Figure 3.5: distances normalised vs. raw

3.3 evaluation model: 2

3.3.1 meta

eval output data: 13, not normalised, distance ceiling = outliers not removed

3.3.2 parameter setting

##             value
## norm_target      
## det.t        TRUE
## limit       FALSE
## author       TRUE
## url          TRUE
## embed1       TRUE
## embed2          f
## range1       TRUE
## range2          f
## rel         FALSE
## lme         FALSE
## lemma       FALSE

3.3.3 anova analysis

3.3.3.1 anova plain

formula: [dist ~ target*q*det]

##                  Df     Sum Sq    Mean Sq  F value  Pr(>F)    
## target            1 1.1152e+11 1.1152e+11 268.8154 < 2e-16 ***
## q                 5 9.8792e+08 1.9758e+08   0.4763 0.79425    
## det               1 4.1537e+08 4.1537e+08   1.0012 0.31702    
## target:q          5 2.3050e+09 4.6101e+08   1.1112 0.35184    
## target:det        1 2.7199e+09 2.7199e+09   6.5561 0.01045 *  
## q:det             2 2.4028e+08 1.2014e+08   0.2896 0.74857    
## target:q:det      1 7.0024e+06 7.0024e+06   0.0169 0.89663    
## Residuals    142304 5.9037e+13 4.1487e+08                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.3.3.2 anova of linear regression model

[anova(summary(lmer))]

## Type III Analysis of Variance Table with Satterthwaite's method
##                  Sum Sq    Mean Sq NumDF  DenDF  F value  Pr(>F)    
## target       1.2717e+09 1.2717e+09     1   3751   5.5781 0.01824 *  
## q            6.3534e+08 1.2707e+08     5 137654   0.5574 0.73281    
## det          7.3359e+05 7.3359e+05     1 133172   0.0032 0.95476    
## range        2.8637e+07 2.8637e+07     1   2113   0.1256 0.72306    
## embed.score  2.7199e+10 2.7199e+10     1 141732 119.3005 < 2e-16 ***
## target:q     3.0753e+09 6.1507e+08     5 138840   2.6979 0.01920 *  
## target:det   8.1028e+08 8.1028e+08     1 138434   3.5541 0.05940 .  
## q:det        4.8717e+08 2.4358e+08     2 135770   1.0684 0.34355    
## target:q:det 2.4585e+06 2.4585e+06     1 138496   0.0108 0.91729    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.3.3.3 linear regression coefficients

formula: [dist ~ target*q*det+(1|aut_id)+range+(embed.score)+(1|url_id)]

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: eval(expr(lmeform))
##    Data: dfa
## 
## REML criterion at convergence: 3153644
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -23.760  -0.034  -0.006   0.025  55.672 
## 
## Random effects:
##  Groups   Name        Variance  Std.Dev.
##  aut_id   (Intercept)  28985985  5384   
##  url_id   (Intercept)  98381104  9919   
##  Residual             227983636 15099   
## Number of obs: 142321, groups:  aut_id, 8395; url_id, 2145
## 
## Fixed effects:
##                    Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)       2.873e+03  4.211e+02  8.594e+03   6.823 9.53e-12 ***
## targetref         1.341e+03  6.536e+02  2.412e+03   2.051   0.0404 *  
## qb                6.895e+01  1.008e+03  1.363e+05   0.068   0.9454    
## qc               -6.307e+02  3.622e+02  1.372e+05  -1.741   0.0816 .  
## qd               -1.993e+03  1.522e+04  1.332e+05  -0.131   0.8958    
## qe               -1.006e+02  2.520e+02  1.385e+05  -0.399   0.6899    
## qf               -1.355e+02  3.218e+02  1.384e+05  -0.421   0.6737    
## det               7.031e+02  3.145e+02  1.375e+05   2.236   0.0254 *  
## range             6.798e-02  1.918e-01  2.113e+03   0.354   0.7231    
## embed.score      -5.793e+03  5.304e+02  1.417e+05 -10.922  < 2e-16 ***
## targetref:qb      6.675e+02  1.124e+03  1.371e+05   0.594   0.5527    
## targetref:qc      3.752e+01  8.128e+02  1.395e+05   0.046   0.9632    
## targetref:qd      2.022e+03  7.989e+02  1.395e+05   2.531   0.0114 *  
## targetref:qe      2.269e+02  6.042e+02  1.395e+05   0.376   0.7073    
## targetref:qf      3.210e+02  7.643e+02  1.393e+05   0.420   0.6745    
## targetref:det    -1.416e+03  6.890e+02  1.397e+05  -2.055   0.0398 *  
## qb:det           -1.077e+03  1.107e+03  1.364e+05  -0.973   0.3304    
## qd:det            1.039e+03  1.521e+04  1.332e+05   0.068   0.9456    
## targetref:qb:det -1.651e+02  1.590e+03  1.385e+05  -0.104   0.9173    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## fit warnings:
## fixed-effect model matrix is rank deficient so dropping 7 columns / coefficients
## Some predictor variables are on very different scales: consider rescaling
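Each t value in the fixed-effects table above is the estimate divided by its standard error. A minimal sketch that recomputes three rows of model 2 from the printed values, which are rounded to four significant digits; the variable names are illustrative:

```python
# (estimate, standard error, reported t value) copied from model 2.
rows = {
    "(Intercept)": (2.873e+03, 4.211e+02, 6.823),
    "targetref": (1.341e+03, 6.536e+02, 2.051),
    "embed.score": (-5.793e+03, 5.304e+02, -10.922),
}

t_values = {name: estimate / std_error
            for name, (estimate, std_error, _) in rows.items()}

for name, (_, _, reported_t) in rows.items():
    print(f"{name}: t = {t_values[name]:.3f} (reported {reported_t})")
```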

3.3.4 plots

Figure 3.6: compare distances by corpus, not normalised, distance ceiling = outliers not removed

Figure 3.7: mean distances over query/corpus, not normalised, distance ceiling = outliers not removed

Table 3.3: mean/median table for model: 2
target  q      n  mean  median
obs     a  46318   525      92
ref     a  68618  2305     118
obs     b   2287   275     109
ref     b   1315  1771     111
obs     c   6253   666      89
ref     c   1504  1147     119
obs     d   6171   441     105
ref     d   1765  2214     124
obs     e   4278   298     109
ref     e    795  2636     116
obs     f   2520   249      77
ref     f    497  1627     124
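The group sizes in Table 3.3 can be checked against the "Number of obs" line of the model 2 summary and compared with Table 3.2 to see how many observations the distance ceiling removes. A minimal sketch using only the printed n values; the variable names are illustrative:

```python
# Group sizes (n) copied from Table 3.3 (model 2, outliers not removed)
# and Table 3.2 (model 1, outliers removed), ordered q = a..f.
n_model2 = {"obs": [46318, 2287, 6253, 6171, 4278, 2520],
            "ref": [68618, 1315, 1504, 1765, 795, 497]}
n_model1 = {"obs": [42836, 2116, 5770, 5654, 3911, 2311],
            "ref": [58615, 1130, 1274, 1525, 671, 413]}

total_model2 = sum(sum(ns) for ns in n_model2.values())
total_model1 = sum(sum(ns) for ns in n_model1.values())
removed = total_model2 - total_model1

print(total_model2, total_model1, removed)
print(f"share removed: {removed / total_model2:.1%}")
```

Both totals agree with the observation counts printed in the respective lmer summaries.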

Figure 3.8: median distances over query/corpus, not normalised, distance ceiling = outliers not removed

Figure 3.9: distances relation, not normalised, distance ceiling = outliers not removed

Figure 3.10: distances normalised vs. raw

3.4 evaluation model: 3

3.4.1 meta

eval output data: 13, normalised to all, distance ceiling = outliers removed

3.4.2 parameter setting

##                value
## norm_target _rel_all
## det.t           TRUE
## limit           TRUE
## author          TRUE
## url             TRUE
## embed1          TRUE
## embed2             f
## range1          TRUE
## range2             f
## rel             TRUE
## lme            FALSE
## lemma          FALSE

3.4.3 anova analysis

3.4.3.1 anova plain

formula: [dist_rel_all ~ target*q*det]

##                  Df     Sum Sq    Mean Sq   F value    Pr(>F)    
## target            1 1.2830e+09 1283010757 7336.4625 < 2.2e-16 ***
## q                 5 3.4949e+07    6989793   39.9688 < 2.2e-16 ***
## det               1 4.6410e+06    4641007   26.5380 2.588e-07 ***
## target:q          5 7.7932e+06    1558646    8.9126 1.786e-08 ***
## target:det        1 7.1283e+05     712833    4.0761  0.043496 *  
## q:det             2 2.5680e+06    1283981    7.3420  0.000648 ***
## target:q:det      1 2.0345e+06    2034482   11.6335  0.000648 ***
## Residuals    126209 2.2072e+10     174881                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.4.3.2 anova of linear regression model

[anova(summary(lmer))]

## Type III Analysis of Variance Table with Satterthwaite's method
##                 Sum Sq   Mean Sq NumDF  DenDF   F value    Pr(>F)    
## target         3245706   3245706     1   3519   23.4567 1.333e-06 ***
## q              2091953    418391     5 122421    3.0237 0.0098706 ** 
## det              34508     34508     1 118425    0.2494 0.6175055    
## range        142964301 142964301     1   1025 1033.2042 < 2.2e-16 ***
## embed.score   71204325  71204325     1 122690  514.5942 < 2.2e-16 ***
## target:q       2202162    440432     5 123486    3.1830 0.0070933 ** 
## target:det     1534830   1534830     1 123325   11.0922 0.0008672 ***
## q:det          1019818    509909     2 120804    3.6851 0.0250971 *  
## target:q:det    623611    623611     1 123315    4.5068 0.0337615 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.4.3.3 linear regression coefficients

formula: [dist_rel_all ~ target*q*det+(1|aut_id)+range+(embed.score)+(1|url_id)]

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: eval(expr(lmeform))
##    Data: dfa
## 
## REML criterion at convergence: 1859224
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8643 -0.5282 -0.1721  0.2469  6.9244 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  aut_id   (Intercept)   8101    90.01  
##  url_id   (Intercept)  23223   152.39  
##  Residual             138370   371.98  
## Number of obs: 126226, groups:  aut_id, 8238; url_id, 2145
## 
## Fixed effects:
##                    Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)       7.789e+02  8.688e+00  8.969e+03  89.651  < 2e-16 ***
## targetref        -7.312e+01  1.061e+01  1.300e+03  -6.893 8.50e-12 ***
## qb               -3.390e+01  2.572e+01  1.218e+05  -1.318 0.187483    
## qc               -3.717e+01  9.261e+00  1.226e+05  -4.014 5.98e-05 ***
## qd               -5.353e+01  3.748e+02  1.184e+05  -0.143 0.886426    
## qe                4.198e+01  6.460e+00  1.247e+05   6.498 8.14e-11 ***
## qf               -3.185e+01  8.240e+00  1.244e+05  -3.866 0.000111 ***
## det               2.144e+01  8.041e+00  1.229e+05   2.667 0.007662 ** 
## range            -9.786e-02  3.044e-03  1.025e+03 -32.143  < 2e-16 ***
## embed.score      -3.080e+02  1.358e+01  1.227e+05 -22.685  < 2e-16 ***
## targetref:qb      3.136e+01  2.894e+01  1.225e+05   1.083 0.278599    
## targetref:qc      3.842e+01  2.154e+01  1.237e+05   1.784 0.074435 .  
## targetref:qd      7.432e-01  2.113e+01  1.238e+05   0.035 0.971935    
## targetref:qe     -3.910e+01  1.602e+01  1.239e+05  -2.441 0.014662 *  
## targetref:qf      3.033e+01  2.039e+01  1.238e+05   1.488 0.136766    
## targetref:det    -2.490e+01  1.826e+01  1.239e+05  -1.363 0.172784    
## qb:det            9.962e+01  2.826e+01  1.219e+05   3.526 0.000423 ***
## qd:det            6.144e+01  3.747e+02  1.184e+05   0.164 0.869736    
## targetref:qb:det -8.754e+01  4.124e+01  1.233e+05  -2.123 0.033761 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## fit warnings:
## fixed-effect model matrix is rank deficient so dropping 7 columns / coefficients
## Some predictor variables are on very different scales: consider rescaling
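Models 1 and 3 print identical F, t and p values and identical scaled residuals; only the scale of the estimates and variance components differs. A minimal sketch that compares a few printed values from both summaries suggests that the two responses differ by one constant factor. This is inferred from the rounded output, not stated in the paper, and the variable names are illustrative:

```python
# (model 1 value, model 3 value) pairs copied from the two lmer summaries.
pairs = {
    "(Intercept) estimate": (4.625e+02, 7.789e+02),
    "(Intercept) std. error": (5.159e+00, 8.688e+00),
    "targetref estimate": (-4.342e+01, -7.312e+01),
    "embed.score estimate": (-1.829e+02, -3.080e+02),
    "residual std. dev.": (220.86, 371.98),
}

ratios = {name: m3 / m1 for name, (m1, m3) in pairs.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f}")
```

If the response is only rescaled by a constant, test statistics are unchanged, which would explain why the significance pattern of model 1 carries over.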

3.4.4 plots

Figure 3.11: compare distances by corpus, normalised to all, distance ceiling = outliers removed

Figure 3.12: mean distances over query/corpus, normalised to all, distance ceiling = outliers removed

Table 3.4: mean/median table for model: 3
target  q      n  mean  median
obs     a  42836   395     196
ref     a  58615   203      79
obs     b   2116   481     279
ref     b   1130   204      75
obs     c   5770   388     191
ref     c   1274   203      80
obs     d   5654   437     243
ref     d   1525   205      83
obs     e   3911   473     248
ref     e    671   211      75
obs     f   2311   374     224
ref     f    413   195      79

Figure 3.13: median distances over query/corpus, normalised to all, distance ceiling = outliers removed

Figure 3.14: distances relation, normalised to all, distance ceiling = outliers removed

Figure 3.15: distances normalised vs. raw

3.5 evaluation model: 4

3.5.1 meta

eval output data: 13, normalised to ref, distance ceiling = outliers removed

3.5.2 parameter setting

##                value
## norm_target _rel_ref
## det.t           TRUE
## limit           TRUE
## author          TRUE
## url             TRUE
## embed1          TRUE
## embed2             f
## range1          TRUE
## range2             f
## rel             TRUE
## lme            FALSE
## lemma          FALSE

3.5.3 anova analysis

3.5.3.1 anova plain

formula: [dist_rel_ref ~ target*q*det]

##                  Df     Sum Sq    Mean Sq   F value    Pr(>F)    
## target            1 2.5135e+09 2513546743 7336.4625 < 2.2e-16 ***
## q                 5 6.8469e+07   13693706   39.9688 < 2.2e-16 ***
## det               1 9.0922e+06    9092198   26.5380 2.588e-07 ***
## target:q          5 1.5268e+07    3053543    8.9126 1.786e-08 ***
## target:det        1 1.3965e+06    1396511    4.0761  0.043496 *  
## q:det             2 5.0309e+06    2515448    7.3420  0.000648 ***
## target:q:det      1 3.9858e+06    3985754   11.6335  0.000648 ***
## Residuals    126209 4.3240e+10     342610                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.5.3.2 anova of linear regression model

[anova(summary(lmer))]

## Type III Analysis of Variance Table with Satterthwaite's method
##                 Sum Sq   Mean Sq NumDF  DenDF   F value    Pr(>F)    
## target         6358663   6358663     1   3519   23.4567 1.333e-06 ***
## q              4098347    819669     5 122421    3.0237 0.0098706 ** 
## det              67605     67605     1 118425    0.2494 0.6175055    
## range        280081403 280081403     1   1025 1033.2042 < 2.2e-16 ***
## embed.score  139496414 139496414     1 122690  514.5942 < 2.2e-16 ***
## target:q       4314256    862851     5 123486    3.1830 0.0070933 ** 
## target:det     3006886   3006886     1 123325   11.0922 0.0008672 ***
## q:det          1997926    998963     2 120804    3.6851 0.0250971 *  
## target:q:det   1221717   1221717     1 123315    4.5068 0.0337615 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.5.3.3 linear regression coefficients

formula: [dist_rel_ref ~ target*q*det+(1|aut_id)+range+(embed.score)+(1|url_id)]

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: eval(expr(lmeform))
##    Data: dfa
## 
## REML criterion at convergence: 1944096
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8643 -0.5282 -0.1721  0.2469  6.9244 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  aut_id   (Intercept)  15871   126.0   
##  url_id   (Intercept)  45496   213.3   
##  Residual             271080   520.7   
## Number of obs: 126226, groups:  aut_id, 8238; url_id, 2145
## 
## Fixed effects:
##                    Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)       1.090e+03  1.216e+01  8.969e+03  89.651  < 2e-16 ***
## targetref        -1.024e+02  1.485e+01  1.300e+03  -6.893 8.50e-12 ***
## qb               -4.744e+01  3.600e+01  1.218e+05  -1.318 0.187483    
## qc               -5.203e+01  1.296e+01  1.226e+05  -4.014 5.98e-05 ***
## qd               -7.492e+01  5.246e+02  1.184e+05  -0.143 0.886426    
## qe                5.876e+01  9.042e+00  1.247e+05   6.498 8.14e-11 ***
## qf               -4.458e+01  1.153e+01  1.244e+05  -3.866 0.000111 ***
## det               3.001e+01  1.125e+01  1.229e+05   2.667 0.007662 ** 
## range            -1.370e-01  4.261e-03  1.025e+03 -32.143  < 2e-16 ***
## embed.score      -4.311e+02  1.900e+01  1.227e+05 -22.685  < 2e-16 ***
## targetref:qb      4.389e+01  4.051e+01  1.225e+05   1.083 0.278599    
## targetref:qc      5.378e+01  3.015e+01  1.237e+05   1.784 0.074435 .  
## targetref:qd      1.040e+00  2.957e+01  1.238e+05   0.035 0.971935    
## targetref:qe     -5.472e+01  2.242e+01  1.239e+05  -2.441 0.014662 *  
## targetref:qf      4.246e+01  2.853e+01  1.238e+05   1.488 0.136766    
## targetref:det    -3.485e+01  2.556e+01  1.239e+05  -1.363 0.172784    
## qb:det            1.394e+02  3.955e+01  1.219e+05   3.526 0.000423 ***
## qd:det            8.600e+01  5.244e+02  1.184e+05   0.164 0.869736    
## targetref:qb:det -1.225e+02  5.772e+01  1.233e+05  -2.123 0.033761 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## fit warnings:
## fixed-effect model matrix is rank deficient so dropping 7 columns / coefficients
## Some predictor variables are on very different scales: consider rescaling

3.5.4 plots

Figure 3.16: compare distances by corpus, normalised to ref, distance ceiling = outliers removed

Figure 3.17: mean distances over query/corpus, normalised to ref, distance ceiling = outliers removed

Table 3.5: mean/median table for model: 4
target  q      n  mean  median
obs     a  42836   553     275
ref     a  58615   285     111
obs     b   2116   674     390
ref     b   1130   286     104
obs     c   5770   543     268
ref     c   1274   284     112
obs     d   5654   612     340
ref     d   1525   287     116
obs     e   3911   662     347
ref     e    671   295     105
obs     f   2311   523     313
ref     f    413   273     111

Figure 3.18: median distances over query/corpus, normalised to ref, distance ceiling = outliers removed

Figure 3.19: distances relation, normalised to ref, distance ceiling = outliers removed

Figure 3.20: distances normalised vs. raw

3.6 evaluation model: 6

3.6.1 meta

eval output data: 13, not normalised, distance ceiling = outliers removed

3.6.2 parameter setting

##             value
## norm_target      
## det.t        TRUE
## limit        TRUE
## author       TRUE
## url          TRUE
## embed1       TRUE
## embed2          f
## range1       TRUE
## range2          f
## rel         FALSE
## lme         FALSE
## lemma       FALSE

3.6.3 anova analysis

3.6.3.1 anova plain

formula: [dist ~ target*q*det]

##                  Df     Sum Sq Mean Sq F value    Pr(>F)    
## target            1    3284330 3284330 84.1223 < 2.2e-16 ***
## q                 5    1633205  326641  8.3663  6.39e-08 ***
## det               1     431404  431404 11.0496 0.0008873 ***
## target:q          5     441118   88224  2.2597 0.0457798 *  
## target:det        1      16732   16732  0.4286 0.5126999    
## q:det             2      25549   12774  0.3272 0.7209470    
## target:q:det      1       6009    6009  0.1539 0.6948226    
## Residuals    126209 4927490433   39042                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.6.3.2 anova of linear regression model

[anova(summary(lmer))]

## Type III Analysis of Variance Table with Satterthwaite's method
##                Sum Sq  Mean Sq NumDF  DenDF   F value Pr(>F)    
## target            218      218     1  17034    0.0061 0.9377    
## q              109358    21872     5 124317    0.6129 0.6901    
## det             20678    20678     1 121247    0.5794 0.4465    
## range        15332432 15332432     1    912  429.6377 <2e-16 ***
## embed.score  77286239 77286239     1 105351 2165.6761 <2e-16 ***
## target:q       304923    60985     5 125126    1.7089 0.1287    
## target:det      17833    17833     1 124982    0.4997 0.4796    
## q:det           37151    18576     2 123066    0.5205 0.5942    
## target:q:det    23985    23985     1 124972    0.6721 0.4123    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.6.3.3 linear regression coefficients

formula: [dist ~ target*q*det+(1|aut_id)+range+(embed.score)+(1|url_id)]

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: eval(expr(lmeform))
##    Data: dfa
## 
## REML criterion at convergence: 1685333
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.0402 -0.6622 -0.3317  0.3419  4.1697 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  aut_id   (Intercept)  1394     37.34  
##  url_id   (Intercept)  1072     32.74  
##  Residual             35687    188.91  
## Number of obs: 126226, groups:  aut_id, 8238; url_id, 2145
## 
## Fixed effects:
##                    Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)       2.533e+02  3.618e+00  1.966e+04  70.000  < 2e-16 ***
## targetref         1.326e+00  2.954e+00  1.890e+03   0.449  0.65362    
## qb               -8.195e+00  1.300e+01  1.239e+05  -0.630  0.52845    
## qc               -8.144e+00  4.675e+00  1.243e+05  -1.742  0.08150 .  
## qd               -1.117e+02  1.902e+02  1.212e+05  -0.587  0.55726    
## qe                1.392e+01  3.248e+00  1.256e+05   4.285 1.83e-05 ***
## qf               -6.628e+00  4.145e+00  1.253e+05  -1.599  0.10981    
## det               3.793e+00  4.058e+00  1.245e+05   0.935  0.35005    
## range             1.535e-02  7.406e-04  9.124e+02  20.728  < 2e-16 ***
## embed.score      -3.110e+02  6.682e+00  1.054e+05 -46.537  < 2e-16 ***
## targetref:qb      4.017e+00  1.464e+01  1.244e+05   0.274  0.78373    
## targetref:qc      4.577e+00  1.089e+01  1.253e+05   0.420  0.67442    
## targetref:qd     -2.061e+00  1.069e+01  1.253e+05  -0.193  0.84707    
## targetref:qe     -2.134e+01  8.099e+00  1.255e+05  -2.635  0.00841 ** 
## targetref:qf      8.889e+00  1.031e+01  1.254e+05   0.862  0.38849    
## targetref:det     1.178e+00  9.236e+00  1.253e+05   0.127  0.89855    
## qb:det            1.714e+01  1.428e+01  1.239e+05   1.200  0.23002    
## qd:det            1.126e+02  1.902e+02  1.212e+05   0.592  0.55380    
## targetref:qb:det -1.710e+01  2.086e+01  1.250e+05  -0.820  0.41233    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## fit warnings:
## fixed-effect model matrix is rank deficient so dropping 7 columns / coefficients
## Some predictor variables are on very different scales: consider rescaling

3.6.4 plots

Figure 3.21: compare distances by corpus, not normalised, distance ceiling = outliers removed

Figure 3.22: mean distances over query/corpus, not normalised, distance ceiling = outliers removed

Table 3.6: mean/median table for model: 6
target  q      n  mean  median
obs     a  42836   161      77
ref     a  58615   172      81
obs     b   2116   169     109
ref     b   1130   168      78
obs     c   5770   159      75
ref     c   1274   175      84
obs     d   5654   170      86
ref     d   1525   176      83
obs     e   3911   175      92
ref     e    671   165      71
obs     f   2311   144      62
ref     f    413   171      82
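Table 3.6 reports means per query condition. A pooled mean per corpus can be approximated as the n-weighted average of these cell means. The following is a minimal sketch under that assumption: it uses the rounded means from the table, so the results are approximate, and the pooled figures are not reported in the paper. All names are illustrative.

```python
# (n, mean) per query condition a..f, copied from Table 3.6.
cells = {
    "obs": [(42836, 161), (2116, 169), (5770, 159),
            (5654, 170), (3911, 175), (2311, 144)],
    "ref": [(58615, 172), (1130, 168), (1274, 175),
            (1525, 176), (671, 165), (413, 171)],
}

pooled = {}
for corpus, rows in cells.items():
    n_total = sum(n for n, _ in rows)
    # Weight each cell mean by its group size.
    pooled[corpus] = sum(n * mean for n, mean in rows) / n_total

for corpus, value in pooled.items():
    print(f"{corpus}: pooled mean distance = {value:.1f}")
```

Because condition a holds most observations in both corpora, the pooled values stay close to the means of that condition.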

Figure 3.23: median distances over query/corpus, not normalised, distance ceiling = outliers removed

Figure 3.24: distances relation, not normalised, distance ceiling = outliers removed

Figure 3.25: distances normalised vs. raw

3.7 declaration of independent work: aids used

No unauthorised aids were used to produce the content of this paper. The use of AI was limited to the tools listed in Table 3.7:

Table 3.7: aids used
aid                       use
github copilot            help with scripting (R, Python) for the distance calculation, semantic embeddings and statistical evaluation
chatgpt.com               ditto
claude.ai                 ditto
deepseek.com              ditto
nomic-embed-text (model)  calculate semantic embeddings

3.8 references

literature and other sources used

Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models Using Lme4.” Journal of Statistical Software 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.
De Marneffe, Marie-Catherine, Christopher D. Manning, Joakim Nivre, and Daniel Zeman. 2021. “Universal Dependencies.” Computational Linguistics, May, 1–54. https://doi.org/10.1162/coli_a_00402.
HuggingFace. 2025. “All-MiniLM-L6-V2 · Hugging Face.” Sentence Transformers. https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2.
Kjell, Oscar, Salvatore Giorgi, and H. Andrew Schwartz. 2023. “The Text-Package: An R-Package for Analyzing and Visualizing Human Language Using Natural Language Processing and Deep Learning.” Psychological Methods. https://doi.org/10.1037/met0000542.
Kuperberg, Gina R. 2010. “Language in Schizophrenia Part 2: What Can Psycholinguistics Bring to the Study of Schizophrenia…and Vice Versa?” Language and Linguistics Compass 4 (8): 590–604. https://doi.org/10.1111/j.1749-818X.2010.00217.x.
Lee, Kenton, Luheng He, and Luke Zettlemoyer. 2018. “Higher-Order Coreference Resolution with Coarse-to-Fine Inference.” arXiv. https://doi.org/10.48550/arXiv.1804.05392.
Mishara, Aaron L. 2010. “Klaus Conrad (1905–1961): Delusional Mood, Psychosis, and Beginning Schizophrenia.” Schizophrenia Bulletin 36 (1): 9–13. https://doi.org/10.1093/schbul/sbp144.
“Nomic-Ai/Nomic-Embed-Text-V1.5 · Hugging Face.” 2024. https://huggingface.co/nomic-ai/nomic-embed-text-v1.5.
“Nomic-Embed-Text.” n.d. Accessed October 6, 2025. https://ollama.com/nomic-embed-text.
Nussbaum, Zach, John X. Morris, Brandon Duderstadt, and Andriy Mulyar. 2024. “Nomic Embed: Training a Reproducible Long Context Text Embedder.” https://huggingface.co/nomic-ai/nomic-embed-text-v1.5.
ottiram. 2025. “Ottiram/MMAX2.” https://github.com/ottiram/MMAX2.
Poesio, Massimo, Ron Artstein, Olga Uryupina, Kepa Rodriguez, Francesca Delogu, Antonella Bristot, and Janet Hitzeman. 2013. “The ARRAU Corpus of Anaphoric Information.” Linguistic Data Consortium. https://doi.org/10.35111/Y3MR-HE10.
Prince, Ellen F. 1981. “Toward a Taxonomy of Given-New Information.” In Syntax and Semantics: Vol. 14. Radical Pragmatics, edited by P. Cole, 223–55. New York: Academic Press.
Rivera, Ivan. 2023. RedditExtractoR: Reddit Data Extraction Toolkit. https://CRAN.R-project.org/package=RedditExtractoR.
Schwarz, St. 2025. “Poster Appendix: This Papers Scripts for Corpus Build and Statistics on Github.” https://github.com/esteeschwarz/SPUND-LX/tree/main/psych/HA.
Wijffels, Jan. 2023. Udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the ’UDPipe’ ’NLP’ Toolkit. https://CRAN.R-project.org/package=udpipe.
Zimmerer, Vitor C., Stuart Watson, Douglas Turkington, I. Nicol Ferrier, and Wolfram Hinzen. 2017. “Deictic and Propositional Meaning: New Perspectives on Language in Schizophrenia.” Frontiers in Psychiatry 8 (February). https://doi.org/10.3389/fpsyt.2017.00017.