Note. M (SD) for continuous variables, n (%) for categorical variables. p-values were calculated using
independent-samples t-tests for differences in continuous variables
and \(\chi^{2}\)-tests for differences in categorical variables. Beck
Anxiety Inventory and Beck Depression Inventory–II scores are those at the start
of therapy.
Procedure
A detailed codebook was developed to extract study-relevant information
from the archive therapy files. Each file consisted of diagnostic
results, including the patient self-report form and symptom severity
questionnaires, therapy reports, and the therapist’s session-by-session
documentation of therapy content. Three research assistants were trained
in using the codebook by the first author, and interrater reliabilities
were calculated (see below).
Inclusion Criteria.
First, we screened all archive files for usability in the study,
starting with the most recently archived files and progressing
chronologically to the older files. Files were included if they met the
following criteria: diagnosis of unipolar depression (ICD-10: F32.X,
F33.X) or an anxiety disorder (ICD-10: F40.0, F40.1, F41.0, F41.1) and
at least one therapy session after the initial diagnostic phase.
Incomplete files were not included (missing patient self-report form,
completely missing symptom severity questionnaires at the start of
therapy, missing therapy reports, unreadable or missing
session-by-session documentation).
Coding of Files.
Second, based on the codebook, we extracted study-relevant information
on sociodemographic characteristics, symptom severity, trauma history,
and therapy characteristics from all included files. Information on
sociodemographic characteristics and medication at the start of therapy
was retrieved from the patient self-report forms. The severity of
depression symptoms at the start and end of therapy was obtained from
the Beck Depression Inventory–II (BDI-II, Beck & Steer, 1993;
German version by Hautzinger, Keller, & Kühner, 2009), a 21-item
self-report questionnaire based on DSM-IV. The severity of anxiety
symptoms at the start and end of therapy was obtained from the Beck
Anxiety Inventory (BAI, Beck, Steer, & Brown, 1996; German
version by Margraf & Ehlers, 2007), a self-report questionnaire with 21
items and a focus on bodily symptoms of stress or anxiety. A measure of
global symptom severity was obtained from the Global Severity
Index (GSI) of the Brief Symptom Inventory (BSI, Derogatis,
1993; German version by Franke, 2000), a 53-item self-report
questionnaire that measures symptoms of a broad range of psychiatric
disorders.
For trauma, as defined by DSM-5 (American Psychiatric Association, 2013), interpersonal and
non-interpersonal, as well as single and repeated traumatic events
(Maercker & Augsburger, 2019) before and after age 18, were rated as
present or not present after reading the entire file. For childhood
maltreatment, the five subtypes of physical, emotional, and sexual abuse
and physical and emotional neglect (Bernstein, Ahluvalia, Pogge, &
Handelsman, 1997; Bernstein et al., 2003) were likewise coded as present
or not present. Additionally, the total number of traumatic events was
counted.
The duration of therapy in months and the number of therapy sessions
were obtained from the therapist’s final report. Trauma-specific
interventions were defined as self-calming techniques (Reddemann, 2010),
skills training (Linehan, 1993), trauma-focused therapy techniques
(Neuner, 2012), schema mode interventions (Young et al., 2006), and
other interventions with an explicit trauma focus that did not fall
among the first four interventions. Trauma-unspecific interventions were
defined as inpatient treatment episodes during therapy and a combination
of psychotherapy with other non-medical psychosocial interventions. Both
types of interventions were rated as either used or not used based on
the therapist’s session-by-session documentation and the final report.
Monitoring of Data Quality.
Third, the collected data were screened for implausible values, and the
original files were consulted again to correct data entry errors. To
assess the interrater reliability, a total of 70 files (20.59%), chosen
randomly, were rated again by another research assistant. In case of
disagreement between two raters, the first author decided on the rating
to use for the data analysis, consulting the original therapy file again
if necessary.
Data Analysis
All analyses were conducted using R (Version 4.2.2; R Core Team, 2021)
and the R-packages irr (Version 0.84.1; Gamer, Lemon, & Singh,
2019), lme4 (Version 1.1.31; Bates, Mächler, Bolker, & Walker,
2015), mice (Version 3.15.0; van Buuren & Groothuis-Oudshoorn,
2011), mitml (Version 0.4.4; Grund, Robitzsch, & Luedtke, 2021),
papaja (Version 0.1.1; Aust & Barth, 2022), and tidyverse
(Version 1.3.2; Wickham et al., 2019).
Scoring and Indices.
First, we calculated questionnaire scores and aggregated trauma and
intervention indices. For the BAI, BDI-II, and GSI, scores were
calculated as described in the corresponding manuals (sum scores for BAI
and BDI-II, mean score for GSI). Additionally, the Reliable Change
Index (RCI, Jacobson, Follette, & Revenstorf, 1984; Jacobson & Truax,
1991) was applied for BAI, BDI-II, and GSI to assess the number of
patients who recovered or improved during therapy, as recommended by
Loerinc et al. (2015). To calculate the RCI as Jacobson and Truax (1991)
described, we first divided each patient’s change score by its standard
error, resulting in a continuous change score with change values
> 1.96 indicating reliable improvement and change values
< -1.96 indicating reliable deterioration. We then categorized
patients who reliably improved and fell below the cut-off score for
clinically relevant symptom severity of the corresponding questionnaire
as recovered, patients who reliably improved and did not fall below the
cut-off score as improved, patients with reliable deterioration as
deteriorated, and all others as unchanged.
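The classification rule above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function names, the use of the pre-treatment standard deviation and a test reliability coefficient to obtain the standard error of the difference (following Jacobson & Truax, 1991), and all numeric values in the usage note are assumptions made for the example. Change is taken as pre minus post, so that positive values indicate symptom reduction.

```python
import math


def reliable_change_index(pre, post, sd_pre, reliability):
    """Pre-minus-post change divided by the standard error of the difference.

    sd_pre and reliability are assumed inputs (normative or sample values).
    """
    se_measurement = sd_pre * math.sqrt(1.0 - reliability)
    se_difference = math.sqrt(2.0 * se_measurement ** 2)
    return (pre - post) / se_difference


def classify_change(pre, post, sd_pre, reliability, cutoff):
    """Assign one of the four outcome categories described in the text."""
    rci = reliable_change_index(pre, post, sd_pre, reliability)
    if rci > 1.96:
        # Reliable improvement: recovered only if the post score is below the cut-off.
        return "recovered" if post < cutoff else "improved"
    if rci < -1.96:
        return "deteriorated"
    return "unchanged"
```

With hypothetical values (standard deviation 10, reliability .90, cut-off 14), a patient moving from 30 to 10 would be classified as recovered, and a patient moving from 30 to 20 as improved.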
For trauma, two indices were calculated. The first index, global trauma,
differentiated between cases with at least one traumatic event and cases
without any traumatic event. The second index, interpersonal trauma,
differentiated between cases with at least one interpersonal traumatic
event, cases without interpersonal traumatic event but at least one
non-interpersonal traumatic event, and cases without any traumatic
event. Diverging from the pre-registration for Hypothesis 3, we could
not further differentiate between childhood interpersonal and
adulthood-only interpersonal trauma because cases with adulthood-only
interpersonal trauma were sporadic in the sample (2.35%).
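As an illustration of how the two indices are derived from the coded events, the following sketch assumes each file is represented as a list of event records with a hypothetical `interpersonal` flag. The data layout and the category labels are assumptions for the example and do not reproduce the codebook.

```python
def global_trauma(events):
    """True if the file documents at least one traumatic event."""
    return len(events) > 0


def interpersonal_trauma(events):
    """Three-level index: interpersonal, non-interpersonal only, or none."""
    if any(event["interpersonal"] for event in events):
        return "interpersonal"
    if events:
        return "non-interpersonal only"
    return "none"
```

A file with one accident and one assault would count as "interpersonal", because a single interpersonal event is sufficient for that category.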
Interrater Reliability.
Second, we assessed agreement percentages and interrater reliabilities
(Cohen’s kappa or Fleiss’ kappa, as appropriate) for all trauma and
intervention ratings and all rater pairs. Regarding the trauma ratings,
the aggregated global and interpersonal trauma indices achieved
satisfying interrater reliabilities (\(\kappa\) = 0.62 to 1), as did the
number of traumatic events rating (ICC = 0.81). Some trauma
subtypes, in contrast, had insufficient reliabilities (\(\kappa\) =
-0.05 to 1). For 52.78% of all trauma subtypes, interrater reliability was
unacceptably low (\(\kappa\) < .61), so trauma subtypes could
not be used for exploratory analysis as initially intended.
Consequently, all further analyses were based on the global and
interpersonal trauma indices.
Regarding the intervention ratings, the aggregated intervention index
achieved satisfying interrater reliability (\(\kappa\) = 0.78). The
intervention subtypes generally had satisfying interrater reliabilities
as well (\(\kappa\) = 0.38 to 1), except for the rarely used trauma-focused
interventions (\(\kappa\) = -0.01) and other, not further specified
interventions (\(\kappa\) = 0). Since the interrater agreement was
acceptable for all intervention types (70.00% to 93.10%), we included all
intervention subtypes in the analysis as pre-registered to provide a
more detailed picture. The interrater reliability and agreement for all
indices and subtypes used for analyses can be found in the Supplemental
Material, Tables 5 and 6.
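High percentage agreement can coincide with a kappa near zero when a category is rarely used, because almost all of the observed agreement is then expected by chance. The sketch below computes Cohen's kappa for two raters from scratch to show this; it is an illustration with made-up ratings, not the study's data, and the analyses reported here used the R package irr rather than this function.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters rating the same files."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    categories = set(counts_a) | set(counts_b)
    # Chance agreement from the product of the raters' marginal proportions.
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / n ** 2
    if expected == 1.0:
        return float("nan")  # undefined when both raters use one identical category
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings of 20 files for a rarely used intervention (1 = used).
rater_1 = [1] + [0] * 19
rater_2 = [0, 1] + [0] * 18
agreement = sum(a == b for a, b in zip(rater_1, rater_2)) / len(rater_1)
kappa = cohens_kappa(rater_1, rater_2)
```

In this made-up example the raters agree on 18 of 20 files (90%), yet kappa is slightly negative, because each rater marked the intervention as used only once and never for the same file.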
Treatment of Missing Data.
Third, a significant challenge for data analyses was the substantial
percentage of missing values, primarily due to post-treatment symptom
severity questionnaires missing in the original files. Whereas trauma
and intervention ratings had no missing values and missingness was low
on sociodemographic variables, treatment-related variables, and
pre-treatment BDI-II and BAI questionnaires, at post-treatment, both
BDI-II (37.06% missingness) and BAI (43.24% missingness) suffered from
missing data. Missingness on the GSI was even higher (pre-treatment:
23.24%, post-treatment: 65.00%). For further information on missing
values, see Supplemental Material, Table 7.
We applied Multiple Imputation (Rubin, 1987) for missing BAI and BDI-II
values to minimize bias due to missing values during hypothesis testing.
We used fully conditional specification and a two-level imputation model
for longitudinal data, including all model-relevant variables and their
interaction terms as well as auxiliary variables with a correlation of
r \(\geq\) .2 with either the imputed variable or
missingness (Kleinke, Reinecke, Salfrán, & Spiess, 2020; Spiess,
Kleinke, & Reinecke, 2021; van Buuren, 2018). Fifty imputed data sets
were generated with 20 iterations each (White, Royston, & Wood, 2011).
For the GSI, the percentage of missing values at post-treatment exceeded
50%, making it prone to estimation errors even using Multiple
Imputation (Barzi & Woodward, 2004; Marshall, Altman, Royston, &
Holder, 2010). Hence, contrary to the preregistration, we excluded the
GSI from further analysis.
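For a single parameter, results from the imputed data sets are combined with Rubin's (1987) rules: the pooled estimate is the mean across imputations, and its total variance adds the average within-imputation variance to the between-imputation variance inflated by 1 + 1/m. The following sketch shows only this scalar case with made-up numbers. It is not the study's procedure, which used the R packages mice and mitml and pooled multi-parameter tests.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed data sets with Rubin's rules.

    estimates: point estimates, one per imputed data set
    variances: squared standard errors, one per imputed data set
    Returns the pooled estimate and its total variance.
    """
    m = len(estimates)
    pooled_estimate = mean(estimates)
    within = mean(variances)        # average sampling variance
    between = variance(estimates)   # sample variance across imputations (m - 1)
    total = within + (1.0 + 1.0 / m) * between
    return pooled_estimate, total


# Hypothetical results from three imputed data sets.
estimate, total_variance = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term grows as the imputed data sets disagree more, which is how the extra uncertainty caused by missing values enters the pooled standard error.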
Hypothesis Testing.
Finally, to investigate differences between the trauma groups in
categorical outcome measures, e.g., trauma-specific interventions and
Reliable Change Index, Fisher’s exact tests were conducted and pooled in
case of missingness as proposed by Eekhout, Wiel, and Heymans (2017).
Differences between the two trauma groups in continuous outcome
variables were tested using between-subjects t-tests. Differences
between the trauma groups, the depression groups, and the intervention
groups in symptom severity trajectories were tested using the
recommended pooling procedure for multi-parameter pooling of mixed
designs described by van Ginkel and Kroonenberg (2014). We deviated from
the pre-registered mixed-effects ANOVA models because of the multiple
imputation context. Instead, we compared linear mixed effects models
with and without the time, trauma, depression, and intervention factors
and applied the D1 pooling method (Grund, Lüdtke, & Robitzsch, 2016).
Full models always contained the factors of time point and trauma and
their interaction in the fixed part and contained random intercepts by
patient. The full model for the effect of the depression group
(Hypothesis 2c) additionally contained a depression factor and its
interaction effects. The full model for the effect of trauma-specific
interventions (Hypothesis 4) contained an intervention factor and its
interaction effects and number of therapy sessions as a covariate. In
line with Ludbrook (2013), we used one-sided tests for pre-registered
hypotheses. The standard significance level of p < .050
was set for all tests. Where several comparisons were made, the
Bonferroni method was applied to adjust for multiple testing.
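The Bonferroni adjustment itself is simple arithmetic: each p-value is multiplied by the number of comparisons in its family and capped at 1, then compared against the unadjusted significance level. The sketch below uses made-up p-values, and how tests were grouped into families in the study is not specified here.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]


# Hypothetical family of three comparisons.
adjusted = bonferroni_adjust([0.01, 0.04, 0.5])
significant = [p < 0.05 for p in adjusted]
```

With three comparisons, only the first made-up p-value (.01, adjusted to .03) stays below .05.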