2 15303.ha.draft

2.1 subject

In this paper we explore reference marking, coherence and information structure in the language of people with schizophrenia by measuring the distance between repeated nouns that are preceded by specified determiners.¹
Inspired by Zimmerer et al. (2017), we are interested in coherence and the conditions of propositional statements in schizophrenia language. These linguistic markers appear underinvestigated in the field, although they seem to play a crucial role among the language features of the target group. (We regard them as a facet of thinking or world-building capacity, which might suffer from deviations from the linguistic standard within the range of negative symptoms.) A lot of research seems to rely on frequency-based analyses of what typical patient language looks like and how it deviates in terms of keywords or word fields. Our interest lies rather in the structural layer of language, which raw frequencies might not capture. In our opinion, disturbances on that layer may be hidden and hard to grasp, so that a listener cannot always pin down precisely what the disturbing factor is. Missing coherence, which we investigate here, may be too narrow an explanation for many of the impressions that schizophrenic language leaves on the listener. But it seems a good starting point for unveiling structural patterns in patients' language.

2.2 definitions, terminology, assumptions

2.2.1 coherence

Successful communication has several preconditions. One is the coherence of a text (or, more generally, of a way of communicating), which enables the partner to follow the topic and to relate the subjects and objects referred to. References can be more or less common and need to be embedded in context to be understood. The underlying network of information that creates this context is what we call the information structure of a text. The complexity of that network determines how easily a reference can be resolved from the given information. In one case we might have to go back many sentences, or even infer the reference from metaphors, to understand what is said; in another we simply recall the subject of the last sentence to resolve the pronoun in "also {she} said thisandthat...". The capacity to imagine, or keep in mind, what concrete information is accessible to the addressee (what he or she actually knows or can infer) is key to successful communication, since factors like commonness, world knowledge (Weltwissen), the knowledge shared between addresser and addressee, and the information accessible from the text itself vary with topic, setting, intimacy of the partners and so on. One can therefore never be sure that the information provided is sufficient. The degree to which a speaker estimates this sufficiency correctly serves here as a measure for our hypothesis: coherence in disturbed language is deficient, which makes an utterance harder to understand within the frame of the given information. One indicator of coherence, we assume, is reference distance. According to our hypothesis, a larger distance should be observed where the addresser overestimates² the ability of the partner to follow a reference. We would thus expect a shorter mean distance between referent and reference in the reference corpus³ and larger distances in the target corpus.
The references we are interested in are nouns that appear as anaphors, i.e. here as repetitions of the same noun. The assumption is that if a noun is repeated and combined with certain preceding determiners, the speaker assumes that the addressee has some knowledge of what is being talked about, depending on the strength of the determination. For example, this, that, those and these would be rather strong determiners, requiring that the noun was introduced before; these four determiners form one of our conditions (condition b) listed below.

2.2.2 premises

2.2.2.1 deictic anchoring and propositional complexity

Zimmerer et al. (2017) consider “Deictic anchoring […] an inherent part of the process by which we make references to aspects in the world including entities, events, locations, and time” and define propositions as “statements about the world which can be true or false.” Citing Kuperberg (2010), they mention “that in people with schizophrenia, cortical activity to semantic abnormalities in sentences is particularly small compared to controls if interpretation requires integration of several sentences”, which may mean that patients do not notice when their utterances are disturbed on the semantic level. If “Delusions and thought disorder can be considered disruptions of propositional meaning”, then the patients' sense of the propositions they address to the addressee, and further their estimate of what can be assumed to be familiar to the addressee, can be wrong. Following Klaus Conrad (Mishara 2010), who “described the onset of a delusion as the loss of ability to transcend an experience and see it with the eyes of others”, Zimmerer et al. (2017) assume that “in thought disorder, the ability to express coherent propositions can be severely impaired.” We take this as the premise for our research question.

2.3 questions

By measuring the referent-reference distance, which we take as an indicator of coherence, we hope to find empirical evidence for or against disturbed world-building capacities in schizophrenia language. Premising that a large noun distance indicates a weak reference-referent association, we hypothesise that in a language/ToM setting where the speaker's estimate of the audience's capacity to understand the context is disturbed, we will find higher mean distance scores under matching conditions. An environment with the potential to test our hypothesis is the subreddit r/schizophrenia. As reference corpus we chose r/unpopularopinion. The measured distance should give us information-structural evidence of how strongly the noun occurrences⁴ are connected, i.e. whether a noun mostly appears out of the blue or has been introduced to the audience somewhere before and is thus more or less legitimately marked by a preceding determiner. Our basic assumptions rely on the taxonomy of given and new information coined by Prince (1981). She develops a hierarchy of references⁵ with specific relations to each other, in which each item is characterised in terms of familiarity⁶, and distinguishes 1. givenness in the sense of predictability/recoverability, 2. givenness in the sense of saliency, and 3. givenness in the sense of “shared knowledge” (cf. Prince (1981), pp. 226). We base our hypothesis of reference distance as an indicator of coherence on this model, assuming that the reference/association strength⁷ determines the level of text coherence.

2.4 data

We built a corpus from the subreddit r/schizophrenia (n = 1500371 tokens) and a reference corpus from r/unpopularopinion (n = 980731 tokens). Both were POS-tagged using the R udpipe package (Wijffels (2023)), which tags according to the Universal Dependencies tagset maintained by De Marneffe et al. (2021). Since we are still steadily growing the corpus and refining how the noun distances are derived, the data available so far can only be a starting point; statistical evaluation becomes more meaningful as more datapoints are added.
The dataframe used for our model (currently dataset 13) consists of 142321 distance datapoints (for a sample cf. Tab. 2.1 below) derived from the POS-tagged corpus. Because the ranges of the url threads vary heavily between target and reference corpus, the distances are (in evaluation M1) normalised to the target corpus (cf. Fig. 3.5 for a comparison of raw vs. normalised distances). Outliers are excluded from the analysis since they very probably do not qualify as anaphoric references. We tacitly assume that all noun distances not excluded as outliers by value occur as anaphoric references. Manual annotation and close reading of the text would be necessary to determine exactly whether the references are associated at all. This may be the task of a further, qualitative evaluation of our quantitative study.
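The text does not state which rule is used to exclude outliers by value. As a minimal sketch only, the following Python snippet (the pipeline itself is written in R) applies the common 1.5 × IQR fence to a list of distances. The rule, the function name `drop_outliers_iqr` and the toy values are assumptions for illustration and not necessarily the procedure behind the reported models.

```python
from statistics import quantiles


def drop_outliers_iqr(distances, k=1.5):
    """Keep distances inside [Q1 - k*IQR, Q3 + k*IQR].

    This is an assumed, generic outlier rule, not the rule used in the study.
    """
    q1, _, q3 = quantiles(distances, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [d for d in distances if low <= d <= high]


# Toy distances in tokens; 2000 stands in for an implausibly distant repetition.
toy_distances = [10, 12, 14, 15, 18, 20, 22, 2000]
kept = drop_outliers_iqr(toy_distances)
```

With these toy values only the 2000-token distance falls outside the fence; whether such a distance can still count as an anaphoric reference is exactly what manual annotation would have to decide.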

Table 2.1: data sample of distances df
token	upos	target	pos	prepos	url_id	range	q	det	aut_id	total_mentions	dist	embed.score	dist_rel_within	dist_rel_all	dist_rel_obs	dist_rel_ref	embed_c
drinks	NOUN	ref	45773	PROPN	1875	5000	a	0	3824	40	96	0.536	86	62	37	86	0.113
today	NOUN	obs	123999	PUNCT	257	309	a	0	195	4	106	0.544	653	1099	653	1539	0.121
drawer	NOUN	ref	502184	AUX	2051	4942	a	0	7700	26	71	0.403	64	46	27	64	-0.020
people	NOUN	ref	442314	PUNCT	2028	6775	a	0	3838	82	27	0.448	18	13	8	18	0.025
head	NOUN	obs	378683	PRON	605	1032	e	0	1270	2	263	0.392	485	817	485	1143	-0.031
office	NOUN	ref	411782	ADP	2019	3894	a	0	7023	7	459	0.440	529	378	224	529	0.017
game	NOUN	obs	1077698	DET	1630	1969	c	1	127	35	37	0.447	36	60	36	84	0.024
government	NOUN	ref	599613	NOUN	2107	2922	a	0	8477	2	33	0.324	51	36	21	51	-0.099
musicians	NOUN	ref	577699	AUX	2095	2281	a	0	4155	2	655	0.393	1288	920	546	1288	-0.030
Dad	NOUN	obs	556138	PUNCT	828	8308	a	0	33	13	739	0.389	169	285	169	399	-0.034

2.5 methods

To compute distances we queried the corpus for matching conditions in which certain (probable) determiners appear before repeated nouns (anaphors).

condition	value
a	any !(b,c,d,e,f)
b	this, that, those, these
c	the
d	a, any, some
e	my
f	his, her, their, your

We chose these 5 sets of determiners (conditions b to f, with condition a covering all remaining cases) in order to see whether distances may be influenced when the repeated nouns are preceded by them. We would expect condition b to show different, if not reciprocal, effects compared to condition d ⁸; yet only the texts in the reference corpus show the expected behaviour, while those in the target corpus do not.
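The condition table can be read as a simple lookup on the token that precedes a repeated noun. The following Python sketch illustrates that reading (the pipeline itself is written in R). The function names `classify_condition` and `is_determiner` are hypothetical, and the case-insensitive matching is an assumption; the word lists are copied from the table.

```python
# Word lists copied from the condition table; condition a is the fallback.
CONDITIONS = {
    "b": {"this", "that", "those", "these"},
    "c": {"the"},
    "d": {"a", "any", "some"},
    "e": {"my"},
    "f": {"his", "her", "their", "your"},
}


def classify_condition(preceding_token: str) -> str:
    """Return the condition label (a-f) for the token preceding a noun."""
    word = preceding_token.lower()  # assumption: matching ignores case
    for label, words in CONDITIONS.items():
        if word in words:
            return label
    return "a"  # condition a: anything that matches none of b-f


def is_determiner(upos_of_preceding: str) -> int:
    """1 if the preceding token carries the POS tag DET, else 0."""
    return int(upos_of_preceding == "DET")


labels = [classify_condition(t) for t in ["These", "the", "my", ","]]
```

Keeping the lexical match (q) and the POS tag (det) as two separate values mirrors the two columns of the sample dataframe, where a matching word need not be tagged as a determiner.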

For each datapoint we collect the following variables:

thread url
author (anonymised)
thread length (tokens)
lexical diversity (type/token ratio)
lemma
distance (to the preceding occurrence, e.g. for three occurrences of dog we collect 2 distance datapoints)

The main function for determining the distances runs on a subset of the corpus that contains only the nouns and their positions in the corpus. It finds all repeated nouns per url thread and computes their distances by token position.
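The distance computation described above can be sketched in a few lines. This is an illustrative Python version (the pipeline itself is written in R) and not the original function; the tuple layout `(url_id, position, upos, lemma)` and the name `noun_distances` are assumptions made for the example.

```python
def noun_distances(tokens):
    """Compute distances between repeated nouns within each thread.

    tokens: iterable of (url_id, position, upos, lemma) tuples, where
    position is the token index in the corpus (hypothetical layout).
    Returns (url_id, lemma, position, distance) tuples, one per repeated
    occurrence, measured to the preceding occurrence in the same thread.
    """
    last_seen = {}  # (url_id, lemma) -> position of the previous occurrence
    distances = []
    for url_id, position, upos, lemma in sorted(tokens, key=lambda t: (t[0], t[1])):
        if upos != "NOUN":
            continue  # only nouns enter the distance model
        key = (url_id, lemma)
        if key in last_seen:
            distances.append((url_id, lemma, position, position - last_seen[key]))
        last_seen[key] = position
    return distances


sample = [
    (1, 10, "NOUN", "dog"),
    (1, 11, "VERB", "bark"),
    (1, 25, "NOUN", "dog"),
    (1, 60, "NOUN", "dog"),
    (2, 70, "NOUN", "dog"),  # other thread: no distance to thread 1
]
result = noun_distances(sample)
```

Three occurrences of dog in thread 1 yield two distance datapoints (15 and 35 tokens), and the occurrence in thread 2 yields none because distances never cross a thread boundary.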

2.6 reflections

2.6.1 range

Evaluating with a growing corpus (and reaching M[odel] 12 with our methods of computing distances), we interestingly find our basic hypothesis supported again: the target corpus shows an overall larger distance between repeated nouns within the range of one thread url. Up to M7 we derived distances from a manually assigned url identifier. We then saw the need to define our “range of interest” by the original http url of the thread: with a growing corpus, the old url id scheme (derived from the get_thread_url() method of the RedditExtractoR package (Rivera (2023)) used for fetching the reddit content) no longer produced new url ids, since a single fetch only ever returns around 1000 urls. To ensure unique url ranges within the corpus, we assigned the range (within which the noun distance is calculated) to the real thread url. After each fetch the corpus is sorted by url and timestamp, so that it represents the real flow of conversation within a thread. This is important because our distance model is based on token distances within that thread, which should therefore follow their natural occurrence in time.
The url range is an important variable, which we used for normalising the distance values, since the mean distances could also depend on the overall thread length. For each of the three normalisation methods (1. per target, 2. within target and 3. cross target) we calculated a range factor by which the distance values are divided. The final regression model posits fixed effects of condition, target, det, range and embed score (where target, condition and det interact) and a random effect of url_id.
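The exact definition of the range factor is not given in the text. The following Python sketch (the pipeline itself is written in R) shows one plausible reading, assuming the factor is the length of a thread divided by the mean thread length of a chosen group of threads. The function names, the toy thread lengths and the mapping of the three variants onto "per target", "within target" and "cross target" are assumptions for illustration.

```python
from statistics import mean


def range_factor(thread_range, group_ranges):
    """Thread length relative to the mean thread length of a group (assumed definition)."""
    return thread_range / mean(group_ranges)


def normalise(dist, thread_range, group_ranges):
    """Divide a raw distance by the range factor of its thread."""
    return dist / range_factor(thread_range, group_ranges)


# Invented thread lengths (tokens) for the two corpora.
obs_ranges = [300, 1000, 2000]
ref_ranges = [3000, 5000, 7000]

# A raw distance of 100 tokens in a 5000-token thread of the reference corpus:
per_target = normalise(100, 5000, obs_ranges)                 # scaled to the target corpus
within_target = normalise(100, 5000, ref_ranges)              # scaled to its own corpus
cross_target = normalise(100, 5000, obs_ranges + ref_ranges)  # scaled over both corpora
```

Under this reading, a distance from a long thread shrinks when it is scaled to a corpus of shorter threads, which is why the choice of normalisation group can change the direction of an observed effect.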

2.6.2 author trace id

Another new feature in M11 is the aut_id variable, which uniquely identifies the comment author. In the base .sqlite database the authors are already anonymised, so there should be no way to get from the published data back to the original author name of a comment. As expected, when aut_id is included as a random effect in the linear regression model, the significance level for the covariables of interest, namely

q = the condition matching of the noun-preceding token
det = whether that match has the POS tag “DET”
target = obs or reference corpus

finally increases.

2.6.3 lexical diversity

We considered some serious caveats in M11: if (luckily for our hypothesis) the target corpus has significantly higher distance scores across nearly all conditions, does that automatically indicate a less coherent reference-referent association in what is expressed in the comments? Couldn’t we also assume that, if the repeated nouns generally appear further apart, a topic that includes these nouns simply extends over a wider range or timeframe? What does that mean for our assumptions about coherence? A good approach here could be to integrate (from M3) a general lexical diversity factor per url as a fixed effect: we can assume that a higher type/token ratio lowers the probability of a noun appearing multiple times within a range, and we could take that effect into account.
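The lexical diversity factor mentioned above is the type/token ratio per thread. A minimal Python sketch of that computation follows (the pipeline itself is written in R); the function names, the lower-casing of tokens and the toy rows are assumptions for illustration.

```python
def type_token_ratio(tokens):
    """Number of distinct tokens divided by the total number of tokens."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def ttr_per_thread(rows):
    """rows: iterable of (url_id, token) pairs; returns {url_id: ratio}."""
    threads = {}
    for url_id, token in rows:
        # assumption: tokens are compared case-insensitively
        threads.setdefault(url_id, []).append(token.lower())
    return {url_id: type_token_ratio(toks) for url_id, toks in threads.items()}


rows = [
    (1, "the"), (1, "dog"), (1, "The"), (1, "dog"),
    (2, "a"), (2, "cat"), (2, "sleeps"),
]
ttr = ttr_per_thread(rows)
```

Raw type/token ratios tend to fall as texts get longer, so with threads of very different length the ratio partly reflects thread length and should be interpreted together with the range variable.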

2.6.4 semantics, word field, embeddings

Further, we created another covariable that can be integrated into the evaluation model: the semantic embedding of a specific noun at its specific position in the thread range, computed with the help of an open language-model-based embedding model (Nussbaum et al. (2024)). This is a common AI-based way of deriving semantic relations in a corpus that goes beyond a purely frequency-based keyword analysis. Using a language model here allows for a distinct identification of the word field embedding of the noun in question. In this way we extract another variable linguistic feature, which may give general insights into the level of standardisation that applies to the corpora. If a noun is found to be embedded in its context (the url thread) with a high score, then it can very much be expected to be found there and appears less out of context.⁹
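How the embedding score is derived from the model output is not spelled out here. As a minimal sketch, assuming the score is the cosine similarity between an embedding vector of the noun and an embedding vector of its thread context, the computation could look as follows in Python (the pipeline itself is written in R). The three-dimensional vectors are invented; real vectors would come from the embedding model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Invented toy vectors standing in for model output.
noun_vec = [1.0, 0.0, 1.0]
thread_vec = [1.0, 1.0, 1.0]
score = cosine_similarity(noun_vec, thread_vec)
```

A cosine similarity is bounded between -1 and 1, with higher values meaning that the noun points in a similar semantic direction as its context.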

2.6.5 statistics

In this context we considered what it means statistically if a word with a high embedding score also ranks high in (distance) significance, i.e. more generally what the relations between the covariates express in the context of the linear regression evaluation. Let us picture this:

a word receives a high embed score if it is semantically closely related to the context in which it appears, here the comment thread.
therefore the need to introduce or elaborate on it decreases, since it may be considered a “known” or “inferable” entity within the given context.
now if a person uses this word, the determined use appears less incoherent by itself.
the reference distance may thus increase without a loss of coherence.
conclusion: if we use a (base) formula like distance ~ corpus for our linear regression, a continuous embed_score predictor between -1 and 1 should correlate positively with the estimates for dist if applied correctly, shouldn't it?

2.6.6 caveats

Since computing the word embedding score takes considerable computing resources, we had a script run on a server to do the computation. But the first attempt to integrate the new variable into the evaluation model failed due to levels < 2. Why? Because in the beginning we ran the script over just a few chunks of the complete url ranges in the corpus¹⁰, and since the corpus is sorted by target,¹¹ we did not compute any values for the reference corpus. We thus learned once more that linear regression models require a variable to have more than one level. This is not the case when the lmer() function excludes all NA rows: no observations with target=ref are left, since all of their embed.score values are NA, so all target=ref rows are removed during regression.
The issue is solved, since we found a resource-saving method of computing the embed scores with a local instance of ollama, which provides an API for using the model.
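The "levels < 2" failure described in this section can be reproduced in miniature. The following Python sketch (the pipeline itself is written in R) mimics the removal of incomplete rows before model fitting; the helper names `complete_cases` and `levels` and the four toy rows are assumptions for illustration.

```python
def complete_cases(rows, columns):
    """Keep only rows that have no missing value (None) in the given columns."""
    return [r for r in rows if all(r[c] is not None for c in columns)]


def levels(rows, column):
    """Distinct values of a categorical column, sorted."""
    return sorted({r[column] for r in rows})


# Toy data: embedding scores exist only for the target (obs) corpus so far.
rows = [
    {"target": "obs", "dist": 106, "embed_score": 0.544},
    {"target": "obs", "dist": 263, "embed_score": 0.392},
    {"target": "ref", "dist": 96, "embed_score": None},  # not yet computed
    {"target": "ref", "dist": 71, "embed_score": None},  # not yet computed
]

kept = complete_cases(rows, ["target", "dist", "embed_score"])
remaining = levels(kept, "target")
can_estimate_target_effect = len(remaining) >= 2
```

After the incomplete rows are dropped, only the obs level remains, so no contrast between target and reference corpus can be estimated. A check like this before fitting would flag the problem early.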

2.7 model evaluations

2.7.1 covariances

Effects in the same direction for target OBS and REF are observed for qc and range (with positive effects for qc), while opposite effects are observed for qb, qd, qe, qf, det, embed.score, qb:det and qd:det (with negative effects in target=obs and vice versa).
In words:

the antecedent the seems to allow a wider distance between referent and reference in both target=OBS and target=REF.
the antecedents this, that, these, those (condition b), my (condition e) and your, their, his, her (condition f) decrease distance in target=OBS and increase distance values in target=REF; condition d (a, an, some, any) vice versa.
higher embed.score values (better embedded noun) decrease distance in target=OBS and increase distance values in target=REF (cf. par 3.7.5.4: better embedding allows a wider distance; this expectation seems valid only for the reference corpus!).

Side note: positing the url range only as a fixed effect, instead of normalising the distances, still estimates smaller distances for the reference corpus, but without significance; the only significant difference with that regression formula shows in target=REF under condition e (antecedent: my).

2.7.2 model summary

As can be seen in the appendix, which gives separate coefficient tables for each evaluation model, we find significantly smaller distances in the reference corpus across all normalised subsets (vs. obs/ref/all), with varying effects for the conditions. In the subsets where we did not normalise or remove outliers, we find the opposite effect; the raw data do not support our hypothesis. But the plot of (raw) mean values in Fig. 3.7 clearly shows that normalising and removing outliers is necessary: mean distances there extend to over 2000 tokens, and we would not want to count all repeated noun occurrences there as anaphora.

2.8 conclusion

After evaluating the different approaches, we find our hypothesis confirmed: anaphora distances in the target corpus (target=OBS) stretch over a significantly (p<0.001) wider range of tokens between reference and referent than in the chosen reference corpus. Under our assumptions, this could indicate that speakers of schizophrenic language estimate the coherence of their own texts less appropriately. It should be kept in mind, though, that a wider distance does not imply incoherence in general, but only that these speakers allow for a wider anaphora distance in their text production. Whether these distances indeed lead to less coherent texts compared to the reference corpus must be established by close reading and manual annotation of samples, with skilled readers assessing their coherence, although annotation may vary strongly with the disposition of the readers and their general capacity for inferring references. But if we agree that shorter reference distances increase text coherence, then we may say that the texts in the target corpus are less coherent than those in the reference corpus, which aligns with the expected classification of patients' language in the literature.

2.9 limitations

We had to make many tacit assumptions, but the main limitation is that our specification of the target corpus as one containing schizophrenic language rests mainly on the statements of the reddit users in that corpus, a large share of whom describe themselves as being diagnosed with schizophrenia. To what extent these statements, attributions or identifications are true we cannot say; we therefore limit the validity of our findings to that group of speakers.

References

De Marneffe, Marie-Catherine, Christopher D. Manning, Joakim Nivre, and Daniel Zeman. 2021. “Universal Dependencies.” Computational Linguistics, May, 1–54. https://doi.org/10.1162/coli_a_00402.

Kuperberg, Gina R. 2010. “Language in Schizophrenia Part 2: What Can Psycholinguistics Bring to the Study of Schizophrenia…and Vice Versa?” Language and Linguistics Compass 4 (8): 590–604. https://doi.org/10.1111/j.1749-818X.2010.00217.x.

Mishara, Aaron L. 2010. “Klaus Conrad (1905–1961): Delusional Mood, Psychosis, and Beginning Schizophrenia.” Schizophrenia Bulletin 36 (1): 9–13. https://doi.org/10.1093/schbul/sbp144.

Nussbaum, Zach, John X. Morris, Brandon Duderstadt, and Andriy Mulyar. 2024. “Nomic Embed: Training a Reproducible Long Context Text Embedder.” https://huggingface.co/nomic-ai/nomic-embed-text-v1.5.

Prince, Ellen F. 1981. “Toward a Taxonomy of Given-New Information.” In Syntax and Semantics: Vol. 14. Radical Pragmatics, edited by P. Cole, 223–55. New York: Academic Press.

Rivera, Ivan. 2023. “RedditExtractoR: Reddit Data Extraction Toolkit.” https://CRAN.R-project.org/package=RedditExtractoR.

Wijffels, Jan. 2023. Udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the ’UDPipe’ ’NLP’ Toolkit. https://CRAN.R-project.org/package=udpipe.

Zimmerer, Vitor C., Stuart Watson, Douglas Turkington, I. Nicol Ferrier, and Wolfram Hinzen. 2017. “Deictic and Propositional Meaning—New Perspectives on Language in Schizophrenia.” Frontiers in Psychiatry 8 (February). https://doi.org/10.3389/fpsyt.2017.00017.

⁵ information in a text
⁶ cf. Prince: speaker assumptions about hearer familiarity = assumed familiarity
⁷ which should be weaker with growing distance between reference-referent
⁸ which can be considered as a control condition as it should naturally allow wider distances between the following noun and the reference than all other conditions.
⁹ only according to the LLM training data, which is still a black box
¹⁰ to spare resources
¹¹ where “obs” comes first