Note. M (SD) for continuous variables, n (%) for categorical variables. p-values were calculated using independent-samples t-tests for differences in continuous variables and \(\chi^{2}\)-tests for differences in categorical variables. Beck Anxiety Inventory and Beck Depression Inventory–II scores were assessed at the start of therapy.

Procedure

A detailed codebook was developed to extract study-relevant information from the archive therapy files. Each file consisted of diagnostic results, including the patient self-report form and symptom severity questionnaires, therapy reports, and the therapist’s session-by-session documentation of therapy content. Three research assistants were trained in using the codebook by the first author, and interrater reliabilities were calculated (see below).

Inclusion Criteria.

First, we screened all archive files for usability in the study, starting with the most recently archived files and progressing chronologically to the older files. Files were included if they met the following criteria: diagnosis of unipolar depression (ICD-10: F32.X, F33.X) or an anxiety disorder (ICD-10: F40.0, F40.1, F41.0, F41.1) and at least one therapy session after the initial diagnostic phase. Incomplete files were not included (missing patient self-report form, completely missing symptom severity questionnaires at the start of therapy, missing therapy reports, unreadable or missing session-by-session documentation).

Coding of Files.

Second, based on the codebook, we extracted study-relevant information on sociodemographic characteristics, symptom severity, trauma history, and therapy characteristics from all included files. Information on sociodemographic characteristics and medication at the start of therapy was retrieved from the patient self-report forms. The severity of depression symptoms at the start and end of therapy was obtained from the Beck Depression Inventory–II (BDI-II, Beck & Steer, 1993; German version by Hautzinger, Keller, & Kühner, 2009), a 21-item self-report questionnaire based on DSM-IV. The severity of anxiety symptoms at the start and end of therapy was obtained from the Beck Anxiety Inventory (BAI, Beck, Steer, & Brown, 1996; German version by Margraf & Ehlers, 2007), a self-report questionnaire with 21 items and a focus on bodily symptoms of stress or anxiety. A measure of global symptom severity was obtained from the Global Severity Index (GSI) of the Brief Symptom Inventory (BSI, Derogatis, 1993; German version by Franke, 2000), a 53-item self-report questionnaire that measures symptoms of a broad range of psychiatric disorders.
For trauma as defined by DSM-5 (American Psychiatric Association, 2013), interpersonal and non-interpersonal traumatic events, as well as single and repeated events (Maercker & Augsburger, 2019), occurring before and after age 18 were rated as present or not present after reading the entire file. For childhood maltreatment, the five subtypes of physical, emotional, and sexual abuse and physical and emotional neglect (Bernstein, Ahluvalia, Pogge, & Handelsman, 1997; Bernstein et al., 2003) were likewise coded as present or not present. Additionally, the total number of traumatic events was counted.
The duration of therapy in months and the number of therapy sessions were obtained from the therapist’s final report. Trauma-specific interventions were defined as self-calming techniques (Reddemann, 2010), skills training (Linehan, 1993), trauma-focused therapy techniques (Neuner, 2012), schema mode interventions (Young et al., 2006), and other interventions with an explicit trauma focus that did not fall among the first four interventions. Trauma-unspecific interventions were defined as inpatient treatment episodes during therapy and a combination of psychotherapy with other non-medical psychosocial interventions. Both types of interventions were rated as either used or not used based on the therapist’s session-by-session documentation and the final report.

Monitoring of Data Quality.

Third, the collected data were screened for implausible values, and the original files were consulted again to correct data entry errors. To assess the interrater reliability, a total of 70 files (20.59%), chosen randomly, were rated again by another research assistant. In case of disagreement between two raters, the first author decided on the rating to use for the data analysis, consulting the original therapy file again if necessary.

Data Analysis

All analyses were conducted using R (Version 4.2.2; R Core Team, 2021) and the R packages irr (Version 0.84.1; Gamer, Lemon, & Singh, 2019), lme4 (Version 1.1.31; Bates, Mächler, Bolker, & Walker, 2015), mice (Version 3.15.0; van Buuren & Groothuis-Oudshoorn, 2011), mitml (Version 0.4.4; Grund, Robitzsch, & Luedtke, 2021), papaja (Version 0.1.1; Aust & Barth, 2022), and tidyverse (Version 1.3.2; Wickham et al., 2019).

Scoring and Indices.

First, we calculated questionnaire scores and aggregated trauma and intervention indices. For the BAI, BDI-II, and GSI, scores were calculated as described in the corresponding manuals (sum scores for BAI and BDI-II, mean score for GSI). Additionally, the Reliable Change Index (RCI, Jacobson, Follette, & Revenstorf, 1984; Jacobson & Truax, 1991) was applied for BAI, BDI-II, and GSI to assess the number of patients who recovered or improved during therapy, as recommended by Loerinc et al. (2015). To calculate the RCI as Jacobson and Truax (1991) described, we first divided each patient’s change score by its standard error, resulting in a continuous change score with change values > 1.96 indicating reliable improvement and change values < -1.96 indicating reliable deterioration. We then categorized patients who reliably improved and fell below the cut-off score for clinically relevant symptom severity of the corresponding questionnaire as recovered, patients who reliably improved and did not fall below the cut-off score as improved, patients with reliable deterioration as deteriorated, and all others as unchanged.
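The classification rule can be sketched in a few lines. This is an illustration only, not the code used in the study: it assumes the Jacobson and Truax (1991) standard error of the difference, \(S_{diff} = \sqrt{2}\, SD_{pre} \sqrt{1 - r_{xx}}\), and it assumes that change is scored as pre minus post so that positive values indicate fewer symptoms. The function names and all numeric inputs (standard deviation, reliability, cut-off) are hypothetical placeholders rather than the questionnaire norms.

```python
import math

def reliable_change(pre, post, sd_pre, reliability):
    """Standardized change (pre - post) / S_diff; positive = symptom reduction."""
    # Standard error of measurement from the pre-treatment SD and the
    # reliability of the questionnaire (assumed inputs).
    se_measurement = sd_pre * math.sqrt(1.0 - reliability)
    # Standard error of the difference between two measurements.
    s_diff = math.sqrt(2.0) * se_measurement
    return (pre - post) / s_diff

def classify(pre, post, sd_pre, reliability, cutoff):
    """Assign one of the four outcome categories described in the text."""
    rci = reliable_change(pre, post, sd_pre, reliability)
    if rci > 1.96:
        # Reliable improvement; recovered only if the post score is below the cut-off.
        return "recovered" if post < cutoff else "improved"
    if rci < -1.96:
        return "deteriorated"
    return "unchanged"

# Hypothetical example: SD = 10, reliability = .90, clinical cut-off = 14.
for pre, post in [(30, 10), (40, 25), (20, 32), (20, 16)]:
    print(pre, post, classify(pre, post, sd_pre=10, reliability=0.9, cutoff=14))
```

With these placeholder values, a change of roughly nine points or more is needed before it counts as reliable, which shows why small score differences are classified as unchanged.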
For trauma, two indices were calculated. The first index, global trauma, differentiated between cases with at least one traumatic event and cases without any traumatic event. The second index, interpersonal trauma, differentiated between cases with at least one interpersonal traumatic event, cases without interpersonal traumatic event but at least one non-interpersonal traumatic event, and cases without any traumatic event. Diverging from the pre-registration for Hypothesis 3, we could not further differentiate between childhood interpersonal and adulthood-only interpersonal trauma because cases with adulthood-only interpersonal trauma were sporadic in the sample (2.35%).
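As a minimal sketch of how the two indices relate to the event-level ratings, the following assumes that each case is represented by a list of rated events, each flagged as interpersonal or not. The data layout, function names, and category labels are hypothetical and chosen for illustration only.

```python
def global_trauma(events):
    """Two-level index: at least one traumatic event vs. none."""
    return "trauma" if events else "no trauma"

def interpersonal_trauma(events):
    """Three-level index; any interpersonal event takes precedence."""
    if any(event["interpersonal"] for event in events):
        return "interpersonal"
    if events:
        # Events are present, but none of them is interpersonal.
        return "non-interpersonal only"
    return "no trauma"

# Hypothetical cases: each dict is one event rated as present in the file.
case_a = [{"interpersonal": False}, {"interpersonal": True}]
case_b = [{"interpersonal": False}]
case_c = []

for case in (case_a, case_b, case_c):
    print(global_trauma(case), "|", interpersonal_trauma(case))
```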

Interrater Reliability.

Second, we assessed agreement percentages and interrater reliabilities (Cohen’s kappa or Fleiss’ kappa, as appropriate) for all trauma and intervention ratings and all rater pairs. Regarding the trauma ratings, the aggregated global and interpersonal trauma indices achieved satisfactory interrater reliabilities (\(\kappa\) = 0.62 to 1), as did the rating of the number of traumatic events (ICC = 0.81). Some trauma subtypes, in contrast, had insufficient reliabilities (\(\kappa\) = -0.05 to 1). For 52.78% of all trauma subtypes, interrater reliability was unacceptably low (\(\kappa\) < .61), so trauma subtypes could not be used for exploratory analysis as initially intended. Consequently, all further analyses were based on the global and interpersonal trauma indices.
Regarding the intervention ratings, the aggregated intervention index achieved satisfactory interrater reliability (\(\kappa\) = 0.78). The intervention subtypes generally had satisfactory interrater reliabilities as well (\(\kappa\) = 0.38 to 1), except for the rarely used trauma-focused interventions (\(\kappa\) = -0.01) and other, not further specified interventions (\(\kappa\) = 0). Because percentage agreement was acceptable for all intervention types (70.00% to 93.10%), we included all intervention subtypes in the analysis as pre-registered to provide a more detailed picture. The interrater reliability and agreement for all indices and subtypes used for analyses can be found in the Supplemental Material, Tables 5 and 6.
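High percentage agreement can coincide with a kappa near zero when a category is rarely used, because kappa corrects observed agreement for the agreement expected by chance: \(\kappa = (p_o - p_e) / (1 - p_e)\). The following sketch illustrates this with made-up ratings. It is not the study's computation (which used the irr package in R), and the function name and data are hypothetical.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who rated the same items."""
    n = len(rater_a)
    # Observed proportion of items on which both raters agree.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal category frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    p_expected = sum(counts_a[c] * counts_b[c] for c in categories) / n**2
    if p_expected == 1.0:
        return float("nan")  # undefined if both raters use only one category
    return (p_observed - p_expected) / (1.0 - p_expected)

# Hypothetical ratings of 20 files: 1 = intervention used, 0 = not used.
# Each rater codes the rare intervention twice, but never in the same file.
rater_a = [1, 1, 0, 0] + [0] * 16
rater_b = [0, 0, 1, 1] + [0] * 16
agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
kappa = cohens_kappa(rater_a, rater_b)
print(f"agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```

In this constructed example the raters agree on 80% of the files, yet kappa is slightly negative (about -0.11), since almost all of the agreement is on the frequent "not used" category.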

Treatment of Missing Data.

Third, a major challenge for data analyses was the substantial percentage of missing values, primarily due to post-treatment symptom severity questionnaires missing in the original files. Trauma and intervention ratings had no missing values, and missingness was low for sociodemographic variables, treatment-related variables, and the pre-treatment BDI-II and BAI. At post-treatment, however, missingness was substantial for both the BDI-II (37.06%) and the BAI (43.24%). Missingness on the GSI was even higher (pre-treatment: 23.24%, post-treatment: 65.00%). For further information on missing values, see Supplemental Material, Table 7.
We applied Multiple Imputation (Rubin, 1987) for missing BAI and BDI-II values to minimize bias due to missing values during hypothesis testing. We used fully conditional specification and a two-level imputation model for longitudinal data, including all model-relevant variables and their interaction terms as well as auxiliary variables with a correlation of r \(\geq\) .2 with either the imputed variable or missingness (Kleinke, Reinecke, Salfrán, & Spiess, 2020; Spiess, Kleinke, & Reinecke, 2021; van Buuren, 2018). Fifty imputed data sets were generated with 20 iterations each (White, Royston, & Wood, 2011). For the GSI, the percentage of missing values at post-treatment exceeded 50%, making it prone to estimation errors even using Multiple Imputation (Barzi & Woodward, 2004; Marshall, Altman, Royston, & Holder, 2010). Hence, contrary to the pre-registration, we excluded the GSI from further analysis.
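For orientation, the basic logic of combining results across imputed data sets can be sketched with Rubin's (1987) rules for a single parameter: the pooled estimate is the mean of the per-data-set estimates, and the total variance is \(T = \bar{W} + (1 + 1/m) B\), with \(\bar{W}\) the mean within-imputation variance and \(B\) the between-imputation variance. The sketch below covers only this scalar case, with a hypothetical function name and made-up numbers; it does not reproduce the imputation model or the pooling of model comparisons used in the study.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed data sets using Rubin's rules."""
    m = len(estimates)
    pooled_estimate = mean(estimates)
    within = mean(variances)        # average squared standard error
    between = variance(estimates)   # sample variance (denominator m - 1)
    total = within + (1.0 + 1.0 / m) * between
    return pooled_estimate, total

# Made-up estimates and squared standard errors from m = 3 imputed data sets.
estimate, total_variance = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(estimate, total_variance ** 0.5)
```

The between-imputation term is what makes pooled standard errors larger than those from any single completed data set, reflecting the uncertainty introduced by the missing values.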

Hypothesis Testing.

Finally, to investigate differences between the trauma groups in categorical outcome measures, e.g., trauma-specific interventions and Reliable Change Index, Fisher’s exact tests were conducted and pooled in case of missingness as proposed by Eekhout, van de Wiel, and Heymans (2017). Differences between the two trauma groups in continuous outcome variables were tested using between-subjects t-tests. Differences between the trauma groups, the depression groups, and the intervention groups in symptom severity trajectories were tested using the recommended pooling procedure for multi-parameter pooling of mixed designs described by van Ginkel and Kroonenberg (2014). We deviated from the pre-registered mixed-effects ANOVA models because of the multiple imputation context. Instead, we compared linear mixed-effects models with and without the time, trauma, depression, and intervention factors and applied the D1 pooling method (Grund, Lüdtke, & Robitzsch, 2016). Full models always contained the factors of time point and trauma and their interaction in the fixed part and random intercepts by patient. The full model for the effect of the depression group (Hypothesis 2c) additionally contained a depression factor and its interaction effects. The full model for the effect of trauma-specific interventions (Hypothesis 4) contained an intervention factor and its interaction effects, as well as the number of therapy sessions as a covariate. In line with Ludbrook (2013), we used one-sided tests for pre-registered hypotheses. The significance level was set at p < .05 for all tests. The Bonferroni method was applied to adjust for multiple comparisons.
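The Bonferroni adjustment itself is simple to illustrate: each p-value in a family of k tests is multiplied by k (capped at 1) and then compared with the significance level. The sketch below uses a hypothetical function name and made-up p-values, and it does not state which tests were grouped into a family in the study.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (adjusted p-value, significant?) for each test in one family."""
    k = len(p_values)
    results = []
    for p in p_values:
        p_adjusted = min(1.0, p * k)  # adjusted p-values cannot exceed 1
        results.append((p_adjusted, p_adjusted < alpha))
    return results

# Made-up p-values for a family of three comparisons.
for p_adjusted, significant in bonferroni([0.01, 0.02, 0.5]):
    print(f"{p_adjusted:.3f}", significant)
```

Equivalently, the unadjusted p-values can be compared with alpha divided by k; both formulations lead to the same decisions.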