Assessment Methodology

This page describes how Codicier evaluates papers. The methodology is public and auditable, and changes go through a governed process with full version history.

Dual-Model Assessment

Every paper is assessed independently by two AI models from different providers. Currently: Claude Sonnet 4.6 (Anthropic) and GPT-4.1 (OpenAI). Both receive the same structured prompt and produce independent analyses.

Model versions are pinned and logged. Every assessment records which specific model version produced it. When models are updated, we run comparison tests before switching.
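In rough pseudocode, the flow looks like the Python sketch below. The model identifiers, the run_model callable, and ASSESSMENT_PROMPT are illustrative placeholders, not Codicier's actual configuration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical pinned identifiers; a real deployment pins exact provider
# version strings and logs them with every assessment.
PINNED_MODELS = ("anthropic/claude-sonnet-4.6", "openai/gpt-4.1")

ASSESSMENT_PROMPT = "<structured review prompt; internal, placeholder here>"

@dataclass
class Assessment:
    model_version: str  # exact pinned version that produced this result
    produced_at: str    # UTC timestamp for the audit log
    analysis: dict      # structured gate results and soft scores

def assess_paper(paper_text: str, run_model) -> list[Assessment]:
    """Run the same structured prompt through both pinned models.

    `run_model` stands in for the provider API call; each model sees
    only the paper and the prompt, never the other model's output.
    """
    results = []
    for model in PINNED_MODELS:
        analysis = run_model(model, ASSESSMENT_PROMPT, paper_text)
        results.append(Assessment(
            model_version=model,
            produced_at=datetime.now(timezone.utc).isoformat(),
            analysis=analysis,
        ))
    return results
```

Keeping the two calls fully independent is the design point: no shared context means a later divergence between the two analyses is informative rather than an echo.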

Hard Gates (must pass to publish)

Methodology Present and Coherent

The paper must describe a method that connects to its claims. For empirical papers: experimental design, data collection, analysis approach. For surveys: systematic review methodology (search strategy, inclusion criteria, synthesis approach). For theoretical papers: formal framework or proof structure.
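As a rough illustration, the gate's type-specific criteria could be represented as a lookup like the sketch below. The names and the elements_found extraction step are assumptions; the real criteria live in the internal review prompt.

```python
# Hypothetical mapping from paper type to the methodology elements the
# gate looks for, mirroring the criteria described above.
GATE_CRITERIA = {
    "empirical": ["experimental design", "data collection", "analysis approach"],
    "survey": ["search strategy", "inclusion criteria", "synthesis approach"],
    "theoretical": ["formal framework or proof structure"],
}

def methodology_gate(paper_type: str, elements_found: set[str]) -> bool:
    """Pass only if every required element for this paper type is present."""
    return all(e in elements_found for e in GATE_CRITERIA[paper_type])
```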

Citations Exist and Are Verifiable

Referenced works must be real publications. Citations are mechanically verified against Crossref and Semantic Scholar (not by the AI models). Future-dated references, preprint sources, and self-citation clusters are flagged.
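A minimal sketch of that mechanical check, assuming each citation resolves to a DOI: query Crossref first and fall back to Semantic Scholar. Rate limiting, retries, and the flagging rules for future dates and self-citation clusters are omitted.

```python
import requests

def citation_exists(doi: str) -> bool:
    """Verify a cited DOI against Crossref, then Semantic Scholar.

    Purely mechanical: no AI model is involved in this check.
    """
    r = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    if r.status_code == 200:
        return True
    r = requests.get(
        f"https://api.semanticscholar.org/graph/v1/paper/DOI:{doi}",
        params={"fields": "title,year"},
        timeout=10,
    )
    return r.status_code == 200
```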

Discernible Research Contribution

The paper must advance knowledge beyond summarizing existing work. There must be a specific claim that could, in principle, be tested, falsified, or built upon. For surveys: the synthesis must generate insight (gaps, contradictions, trends), not just list what others have done.

AI-Written Threshold

AI-generated content is permitted. Visibly AI-written output (clusters of characteristic tells, generic phrasing patterns, uniform sentence structure) is flagged for revision. This is a quality signal, not a ban on AI tools.

Soft Scores (published alongside paper)

Six dimensions, each scored 1-10 by both models independently. The median paper at a reasonable venue is a 6. Most papers score between 4 and 7. Scores of 8+ require explicit justification.

Methodological Soundness

Are the methods appropriate for the claims? Are assumptions justified? Are limitations acknowledged?

Internal Consistency

Do conclusions follow from the data? Are there contradictions between sections? Does the abstract match the results?

Citation Thoroughness

Does the paper adequately cover relevant prior work? Are there obvious omissions?

Novelty Signal

How does this paper's contribution relate to existing work? A 6 is 'someone would have predicted this.' An 8 is 'this reframes something.'

Reproducibility

Could another researcher replicate this work from what's provided? Code, data, parameters, and procedures.

Clarity and Structure

Organization, prose quality, and figure/table effectiveness. Evaluated as three sub-components, reported as one composite score.
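For concreteness, a per-model score record might look like the sketch below. The field names are illustrative, and the plain average used for the clarity composite is an assumption; the actual aggregation is internal.

```python
from statistics import mean

# Illustrative dimension names; the published record format may differ.
SOFT_DIMENSIONS = [
    "methodological_soundness",
    "internal_consistency",
    "citation_thoroughness",
    "novelty_signal",
    "reproducibility",
    "clarity_and_structure",
]

def clarity_composite(organization: int, prose: int, figures: int) -> float:
    """Combine the three clarity sub-components into one reported score.

    A plain average is assumed here; the real aggregation is internal.
    """
    return mean([organization, prose, figures])

def score_record(model_version: str, scores: dict[str, int]) -> dict:
    """Check a model returned exactly the six dimensions, each in 1-10."""
    assert set(scores) == set(SOFT_DIMENSIONS)
    assert all(1 <= s <= 10 for s in scores.values())
    return {"model": model_version, "scores": scores}
```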

Disagreement Flagging

When the two models diverge by 2+ points on any soft score, or disagree on a hard gate, that divergence is flagged visibly on the paper page. This is an honest signal of uncertainty, not a defect.
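A minimal sketch of the divergence check, assuming each model's output is reduced to a dict of soft scores and hard-gate booleans:

```python
def flag_disagreements(scores_a: dict[str, int], scores_b: dict[str, int],
                       gates_a: dict[str, bool],
                       gates_b: dict[str, bool]) -> list[str]:
    """Return the dimensions and gates where the two models diverge.

    Soft scores differing by 2+ points, or any hard-gate disagreement,
    are surfaced on the paper page rather than averaged away.
    """
    flags = [d for d in scores_a if abs(scores_a[d] - scores_b[d]) >= 2]
    flags += [g for g in gates_a if gates_a[g] != gates_b[g]]
    return flags
```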

How the Methodology Improves

This page describes what Codicier evaluates and how the process works. The specific review prompts, scoring calibration, and detection heuristics are internal. The published assessments themselves are the transparency mechanism: anyone can read what two independent models said about any paper and judge the quality of the analysis directly.

If you have a suggestion for how the review process should work differently, contact us at methodology@codicier.io. Useful suggestions include dimensions that should be added or redefined, edge cases the process handles poorly, or discipline-specific concerns. All methodology changes are logged below with rationale.

Changelog

Date | Change | Rationale
2026-04-08 | Scoring calibration overhaul | Both models showed optimism bias, clustering scores at 7. Added calibration anchors with a median expectation of 6.
2026-04-08 | Paper type classification | Survey papers were failing the methodology gate, which was designed for empirical research. Review now adapts criteria by paper type.
2026-04-08 | Clarity dimension split | Models were weighting different aspects (prose vs. structure), causing unexplained divergence. Now evaluated as three explicit sub-components.
2026-04-08 | Initial methodology | First structured review prompt, tested against two papers.