Methodology & Limitations

Last updated: May 31, 2026

Overview

TriSift is a title-and-abstract screening accelerator for systematic reviews. It helps research teams triage large citation libraries quickly by applying three independent AI reviewers to each record and surfacing a consensus decision together with the reasoning behind it.

TriSift screens titles and abstracts only. It does not read or assess full-text articles, extract data, assess risk of bias, or perform any later stage of a systematic review. It is a tool to speed up the first screening pass — not a replacement for the reviewers, the protocol, or the published reporting standards that govern a systematic review. Every decision TriSift makes is advisory and reversible by the research team.

The three-reviewer model

TriSift mirrors the independent two-reviewer structure that Cochrane recommends and that PRISMA 2020 asks review teams to report, then adds an independent third model to resolve disagreements.

  • Reviewer 1 and Reviewer 2 screen every record independently. They run on models from two different AI vendors and receive the same neutral instructions — the same inclusion and exclusion criteria, the same output format, and the same requirement to ground each decision in a quote from the record. Using two vendors reduces the chance that a single model's blind spot is silently shared by both reviewers.
  • Reviewer 3 (the methodology judge) is invoked only when the first two reviewers disagree or are uncertain. It receives both reviewers' decisions, reasoning, and supporting quotes, and applies PRISMA 2020 / Cochrane screening principles to reach an independent ruling. It is not a simple vote-counter: it can side with either reviewer or disagree with both.

Each reviewer returns a decision (INCLUDE, EXCLUDE, or UNCERTAIN), a confidence level, written reasoning, and the criteria it matched. Supporting quotes are checked against the actual record text; a quote that cannot be found in the title or abstract is flagged, so unsupported claims do not pass unnoticed.
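The quote check described above can be sketched as follows. This is a minimal illustration, not TriSift's actual implementation: the function names are hypothetical, and the choice to ignore case and whitespace differences when matching is an assumption.

```python
import re


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so line breaks do not cause false flags."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unsupported_quotes(quotes: list[str], title: str, abstract: str) -> list[str]:
    """Return the quotes that cannot be found in the title or abstract.

    Hypothetical helper: an empty quote is treated as unsupported.
    """
    haystack = normalise(f"{title} {abstract}")
    flagged = []
    for quote in quotes:
        needle = normalise(quote)
        if not needle or needle not in haystack:
            flagged.append(quote)
    return flagged


title = "Aspirin for primary prevention"
abstract = "We randomised 1,200 adults\nto aspirin or placebo."
print(unsupported_quotes(["randomised 1,200 adults to aspirin"], title, abstract))  # []
print(unsupported_quotes(["double-blind trial"], title, abstract))  # ['double-blind trial']
```

Exact substring matching is deliberately strict: a paraphrased "quote" is flagged rather than accepted, which matches the stated goal of not letting unsupported claims pass unnoticed.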

The exact models used are admin-configurable and are shown under Current models at the top of this page.

Consensus and conflict resolution

TriSift resolves the two independent reviews as follows:

  • Both INCLUDE → agreed include.
  • Both EXCLUDE → agreed exclude.
  • One INCLUDE, one EXCLUDE → conflict → referred to the methodology judge.
  • Any UNCERTAIN → referred to the methodology judge.

Records the methodology judge cannot resolve confidently are marked NEEDS REVIEW and surfaced to the team rather than being silently included or excluded. At every stage the research team can inspect the full reasoning and override any decision. TriSift never hides a borderline call.
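The resolution rules above can be read as a small decision function. The sketch below is illustrative only: the outcome labels, the function name, and the convention that the judge signals "cannot resolve confidently" by returning UNCERTAIN or None are assumptions, not TriSift's documented interface.

```python
from enum import Enum
from typing import Callable, Optional


class Decision(Enum):
    INCLUDE = "INCLUDE"
    EXCLUDE = "EXCLUDE"
    UNCERTAIN = "UNCERTAIN"


def resolve(
    r1: Decision,
    r2: Decision,
    judge: Callable[[], Optional[Decision]],
) -> str:
    """Combine two independent reviews; call the judge only when needed."""
    # Both reviewers agree on a definite decision: no referral.
    if r1 is r2 and r1 is not Decision.UNCERTAIN:
        return f"AGREED_{r1.value}"
    # Conflict, or at least one UNCERTAIN: refer to the methodology judge.
    ruling = judge()
    if ruling in (Decision.INCLUDE, Decision.EXCLUDE):
        return f"JUDGE_{ruling.value}"
    # Assumption: an UNCERTAIN or missing ruling means the judge could not
    # resolve the record confidently, so it goes to the team.
    return "NEEDS_REVIEW"


print(resolve(Decision.INCLUDE, Decision.INCLUDE, lambda: None))
# AGREED_INCLUDE
print(resolve(Decision.INCLUDE, Decision.EXCLUDE, lambda: Decision.EXCLUDE))
# JUDGE_EXCLUDE
print(resolve(Decision.UNCERTAIN, Decision.EXCLUDE, lambda: Decision.UNCERTAIN))
# NEEDS_REVIEW
```

Passing the judge as a callable keeps the third model from being invoked for records where the first two reviewers already agree.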

Deduplication

Before screening, TriSift removes duplicate records that commonly arise when the same study is exported from several databases:

  1. DOI match — records sharing a DOI (compared case-insensitively, with surrounding whitespace removed) are treated as duplicates.
  2. Title similarity — when neither record has a DOI, titles are normalised (lower-cased, punctuation and extra whitespace removed) and compared using Levenshtein edit distance. Records are merged only above a high similarity threshold (about 0.92, raised to about 0.97 for very short titles) to avoid collapsing genuinely distinct papers.

Records at different metadata levels — one with a DOI, one without — are not merged on title alone, which guards against false merges.
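The two matching rules can be sketched as below. Several details are assumptions made for illustration: the record shape (a dict with optional "doi" and a "title"), the conversion of edit distance to a similarity score as 1 − distance / longer length, and the 30-character cut-off for a "very short" title, which the text above does not specify.

```python
import string

SHORT_TITLE_CHARS = 30  # hypothetical cut-off for "very short" titles


def normalise_doi(doi: str) -> str:
    return doi.strip().lower()


def normalise_title(title: str) -> str:
    # Removes ASCII punctuation only; a fuller version might also handle Unicode.
    no_punct = title.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def is_duplicate(rec_a: dict, rec_b: dict) -> bool:
    doi_a, doi_b = rec_a.get("doi"), rec_b.get("doi")
    if doi_a and doi_b:  # rule 1: DOI match
        return normalise_doi(doi_a) == normalise_doi(doi_b)
    if doi_a or doi_b:  # different metadata levels: never merge on title alone
        return False
    t_a, t_b = normalise_title(rec_a["title"]), normalise_title(rec_b["title"])
    if not t_a or not t_b:
        return False
    threshold = 0.97 if min(len(t_a), len(t_b)) < SHORT_TITLE_CHARS else 0.92
    return similarity(t_a, t_b) >= threshold  # rule 2: title similarity
```

For example, two DOI-less records titled "Effects of aspirin on cardiovascular outcomes: a randomised trial." and "Effects of Aspirin on Cardiovascular Outcomes - A Randomised Trial" normalise to the same string and are treated as duplicates, while the same pair is left unmerged if only one of them carries a DOI.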

Inter-rater reliability (Cohen's κ)

To make screening agreement transparent, TriSift automatically computes Cohen's kappa (κ) between Reviewer 1 and Reviewer 2 across the INCLUDE / EXCLUDE / UNCERTAIN categories, using the standard formula κ = (po − pe) / (1 − pe), where po is the observed agreement and pe is the agreement expected by chance.

κ is interpreted using the bands from Landis & Koch (1977): below 0.00 poor, 0.00–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and 0.81–1.00 almost perfect.
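As a worked illustration of the formula and the bands, here is a minimal sketch in Python. The function names are hypothetical and the example data is invented; negative values are labelled poor, as in the original Landis & Koch paper.

```python
import math
from collections import Counter


def cohen_kappa(r1: list[str], r2: list[str]) -> float:
    """Cohen's kappa for two raters' categorical decisions on the same records."""
    n = len(r1)
    if n == 0 or n != len(r2):
        raise ValueError("both reviewers must rate the same non-empty set of records")
    # po: share of records on which the two reviewers agree.
    po = sum(a == b for a, b in zip(r1, r2)) / n
    # pe: chance agreement from each reviewer's marginal category proportions.
    c1, c2 = Counter(r1), Counter(r2)
    pe = sum((c1[c] / n) * (c2[c] / n) for c in set(c1) | set(c2))
    if math.isclose(pe, 1.0):
        return float("nan")  # both reviewers used one identical category; kappa is undefined
    return (po - pe) / (1 - pe)


def landis_koch(kappa: float) -> str:
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Ten records: the reviewers agree on nine, so po = 0.9.
# Marginals give pe = 0.4*0.3 + 0.6*0.7 = 0.54, so kappa = 0.36 / 0.46.
r1 = ["INCLUDE"] * 4 + ["EXCLUDE"] * 6
r2 = ["INCLUDE"] * 3 + ["EXCLUDE"] * 7
k = cohen_kappa(r1, r2)
print(round(k, 2), landis_koch(k))  # 0.78 substantial
```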

κ has well-known limitations, and TriSift reports them honestly. It requires a reasonable sample, so TriSift does not report κ when fewer than ten records have been screened by both reviewers. It can also be distorted when one category dominates (the kappa paradox): κ can be low even though the reviewers agree on almost every record. TriSift therefore also reports the raw observed agreement and a prevalence-adjusted, bias-adjusted kappa (PABAK), and flags any run where a single category exceeds 80% of decisions.
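The safeguards in the paragraph above might look like the sketch below. It rests on assumptions the text does not settle: PABAK is computed with the k-category generalisation (k·po − 1) / (k − 1), which reduces to 2·po − 1 for two categories; the 80% dominance check pools both reviewers' decisions; and the function and field names are hypothetical.

```python
from collections import Counter

MIN_RECORDS = 10        # below this, agreement statistics are suppressed
DOMINANCE_LIMIT = 0.80  # share of decisions above which a category "dominates"


def agreement_report(r1: list[str], r2: list[str], n_categories: int = 3) -> dict:
    n = len(r1)
    if n != len(r2):
        raise ValueError("both reviewers must rate the same records")
    if n < MIN_RECORDS:
        return {"suppressed": True, "n": n}
    # Raw observed agreement.
    po = sum(a == b for a, b in zip(r1, r2)) / n
    # Assumed PABAK form for k categories: chance agreement fixed at 1/k.
    pabak = (n_categories * po - 1) / (n_categories - 1)
    # Assumption: dominance is measured over both reviewers' decisions pooled.
    pooled = Counter(r1) + Counter(r2)
    top_share = max(pooled.values()) / (2 * n)
    return {
        "suppressed": False,
        "n": n,
        "observed_agreement": po,
        "pabak": pabak,
        "dominant_category_flag": top_share > DOMINANCE_LIMIT,
    }


# Twenty records, mostly excluded: high agreement, but one category dominates.
r1 = ["EXCLUDE"] * 18 + ["INCLUDE"] * 2
r2 = ["EXCLUDE"] * 19 + ["INCLUDE"] * 1
report = agreement_report(r1, r2)
print(report["observed_agreement"], report["dominant_category_flag"])  # 0.95 True
```

In a run like this one, reporting the observed agreement and PABAK next to κ lets the team see that a modest κ reflects the skewed category mix rather than frequent disagreement.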

PRISMA 2020 alignment

TriSift generates a PRISMA-style flow diagram from your actual results: records identified, duplicates removed, records screened, records excluded, and studies retained, with the inter-rater κ shown alongside.

This diagram covers the title/abstract screening sub-process only. A complete PRISMA 2020 flow also documents full-text eligibility assessment and the reasons for full-text exclusions — stages that happen outside TriSift. The generated diagram is labelled accordingly and is intended as a starting point that your team completes with the later stages of the review.

Limitations

  • AI reviewers are not infallible. They can misread abstracts, miss relevant studies, or include irrelevant ones. TriSift is designed to reduce these errors through independence and consensus, not to eliminate them.
  • No guaranteed sensitivity. TriSift does not promise that every relevant paper will be retained. The calibration round and the NEEDS REVIEW queue exist precisely because borderline judgements require human verification.
  • Criteria quality matters. Screening is only as good as the inclusion and exclusion criteria you provide. Vague criteria produce vague decisions.
  • Models change. The underlying models are configurable and evolve over time, so behaviour can shift between runs. The model and prompt used for each decision are logged for reproducibility.
  • Verify before relying. Treat TriSift's output as a first pass to be checked — especially for any study where the decision was borderline or the reviewers disagreed.

Not medical or clinical advice

TriSift is a research-workflow tool. It is not certified for clinical, diagnostic, or regulatory decision-making, and its output is not medical advice. AI decisions are advisory and are not a substitute for trained human judgement. Researchers remain responsible for the final inclusion and exclusion decisions in their review, and for compliance with their own institutional and publication standards.

Evidence base

TriSift's design follows evidence that combining multiple independent language models improves citation-screening performance over a single model:

Oami T, Okada Y, Nakada T. Optimal large language models to screen citations for systematic reviews. Research Synthesis Methods. 2025;16(6):859–875.

In that study, a single model reached roughly 78% sensitivity, two models in combination roughly 88%, and the addition of a tiebreaking model roughly 94%. These figures describe the published study's experimental setup — they are not TriSift's own validated performance. TriSift has not yet completed and published an independent validation against gold-standard human screening; until it does, these numbers should be read as motivation for the multi-model design, not as a performance guarantee for your review.

The κ interpretation bands above follow:

Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33(1):159–174.