Chapter 7 Data Standards and Analysis Readiness

From CRF Design to CDISC: Where Statistical Quality Truly Begins

In clinical trials, the quality of statistical analysis does not begin at database lock (DBL).
It begins much earlier—at the moment the first Case Report Form (CRF) field is designed.

Many projects that struggle during analysis do so not because of incorrect models or software issues, but because the data were never designed to be analyzable in the first place.

This chapter focuses on the practical responsibilities of a Project Biostatistician in the data standards and analysis preparation phase, emphasizing a core principle:

The statistician’s job is not only to analyze data, but to ensure that the data are born analyzable.


7.1 Why Must Biostatisticians Participate in CRF Design Review?

7.1.1 A Hard Industry Truth

CRF design is the upstream source of every table, figure, and listing (TFL) the study will produce.

If the CRF has structural problems—such as unclear timing, inconsistent units, ambiguous definitions, or broken logical relationships—then:

  • SDTM mapping turns into a patchwork of workarounds
  • ADaM derivations become overly complex and fragile
  • TFLs face repeated QC findings
  • CSR interpretations become defensive and less credible

When statisticians are absent during CRF design, analysis teams are forced to compensate for design flaws using statistical assumptions—a risky and often irreversible situation.


7.1.2 The Statistician’s Role in CRF Review

The statistician is not responsible for:

  • Page layout
  • Font size
  • User interface aesthetics

Instead, the statistician must answer one critical question:

Can these CRF fields support the analyses defined in the SAP?


7.2 Four Core Dimensions of “Analyzability” in CRF Review

7.2.1 Data Collection Frequency: Does It Support the Planned Model?

This is one of the most underestimated yet critical review points.

Statisticians must verify alignment between:

  • Study endpoints (single time point vs. longitudinal)
  • Planned methods (ANCOVA, MMRM, time-to-event, responder analysis)
  • Actual visit structure and timing in the CRF

Key risks include:

  • Irregular visit spacing undermining covariance assumptions
  • Missing actual assessment dates preventing window definitions
  • Absent critical visits reducing the effective analysis population

The correct question is not “Is this visit collected?” but rather:

Does this collection frequency sustain the statistical model described in the SAP?
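The dependence of window definitions on actual assessment dates can be illustrated with a minimal sketch. The window boundaries, visit labels, and function names below are hypothetical and chosen only for illustration; real windows come from the SAP. Study day is counted with no day 0, so the first dose date is day 1.

```python
from datetime import date

# Hypothetical analysis windows: label -> (target day, lower bound, upper bound),
# expressed in study days relative to first dose. Real values come from the SAP.
WINDOWS = {
    "Week 4": (29, 22, 36),
    "Week 8": (57, 50, 64),
}


def study_day(assessment_date, first_dose_date):
    """Study day with no day 0: the first dose date is day 1."""
    delta = (assessment_date - first_dose_date).days
    return delta + 1 if delta >= 0 else delta


def assign_window(assessment_date, first_dose_date, windows=WINDOWS):
    """Return the label of the window containing the assessment, or None."""
    if assessment_date is None:
        # No actual date collected: the record cannot be windowed at all.
        return None
    day = study_day(assessment_date, first_dose_date)
    for label, (_target, lower, upper) in windows.items():
        if lower <= day <= upper:
            return label
    return None
```

If the CRF records only the nominal visit and not the actual date, `assign_window` has nothing to work with, which is exactly the risk listed above.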


7.2.2 Units, Ranges, and Logical Validity

7.2.2.1 Units: Simple but Dangerous

Common pitfalls:

  • Allowing multiple units for a single parameter
  • Missing linkage between unit and normal range
  • Free-text unit fields

The statistical concern is not conversion difficulty, but:

  • Increased SDTM mapping risk
  • Inconsistent derivation logic
  • Audit challenges on data consistency

Best practice:
One analysis variable → one analysis unit.
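A simple way to test this rule against draft or test data is to list every parameter that arrives with more than one unit. The sketch below assumes records are plain dictionaries with hypothetical `param` and `unit` keys; it is an illustration, not a prescribed check.

```python
from collections import defaultdict


def units_per_parameter(records):
    """Map each parameter to the set of units observed for it."""
    seen = defaultdict(set)
    for rec in records:
        seen[rec["param"]].add(rec["unit"].strip())
    return dict(seen)


def parameters_with_mixed_units(records):
    """Return, in sorted order, the parameters reported in more than one unit."""
    observed = units_per_parameter(records)
    return sorted(p for p, units in observed.items() if len(units) > 1)
```

Any parameter returned by `parameters_with_mixed_units` needs either a single collection unit in the CRF or a documented conversion rule before mapping begins.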


7.2.2.2 Ranges: Supporting Outlier Detection

Statisticians should review whether:

  • Physiologically reasonable ranges are defined
  • Extreme values are recorded rather than blocked

A critical principle:

Statistical analysis can handle outliers—but cannot handle outliers that were never captured.

Overly restrictive range checks may eliminate clinically real but extreme observations, undermining sensitivity analyses and robustness checks.
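The difference between blocking and flagging can be shown with a small sketch of a "soft" range check. The parameter name and limits are hypothetical placeholders; real limits require clinical input.

```python
# Hypothetical expected ranges per parameter; real limits need clinical input.
EXPECTED_RANGES = {"SYSBP": (70, 220)}


def soft_range_check(param, value, ranges=EXPECTED_RANGES):
    """Keep the entered value and raise a query flag instead of rejecting it."""
    lower, upper = ranges[param]
    out_of_range = not (lower <= value <= upper)
    return {"param": param, "value": value, "query": out_of_range}
```

The extreme value stays in the database with a query attached, so it remains available for sensitivity analyses once the site confirms or corrects it.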


7.2.2.3 Logical Consistency

Key checks include:

  • Temporal logic (e.g., dosing date vs. assessment date)
  • Conditional field completeness
  • Avoidance of structurally missing data

A classic example:

If an assessment is not performed due to an adverse event, does the CRF capture the reason for missingness?

Without this:

  • Missing data mechanisms (MCAR, MAR, MNAR) cannot be justified
  • SAP missing data strategies lack empirical support
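The checks in this subsection can be sketched as a single record-level function. The field names (`performed`, `reason_not_done`, `post_baseline`, and so on) are hypothetical stand-ins for whatever the CRF actually collects.

```python
from datetime import date


def consistency_findings(rec):
    """Return a list of findings for one assessment record (a plain dict)."""
    findings = []
    if rec["performed"]:
        if rec["assessment_date"] is None:
            findings.append("performed but assessment date missing")
        elif rec["post_baseline"] and rec["assessment_date"] < rec["first_dose_date"]:
            # Temporal logic: a post-baseline assessment cannot precede dosing.
            findings.append("post-baseline assessment dated before first dose")
    elif not rec.get("reason_not_done"):
        # Conditional completeness: "not done" must carry a reason.
        findings.append("not performed and no reason recorded")
    return findings
```

A record marked as not performed with a reason such as an adverse event produces no finding, and that recorded reason is what later supports the missing data assumptions in the SAP.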


7.3 Confirming the CDISC Strategy (SDTM and ADaM)

7.3.1 The Statistician as a Standards Gatekeeper

While statisticians may not write SDTM or ADaM code, they must be involved in confirming:

  • Adoption of SDTM and ADaM
  • Sponsor-specific standards
  • Reuse of legacy structures from prior studies

Standards strategy is a cross-functional decision, not solely a DM or programming task.


7.3.2 Key SDTM Considerations from a Statistical Perspective

Statisticians should verify:

  • Proper domain placement for primary endpoints
  • Stable visit mapping (VISIT / VISITNUM)
  • Sufficient timing variables to support analysis windows

A useful rule of thumb:

If you can already picture the SDTM structure while the CRF is still under review, you are involved at the right time.
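One concrete reading of "stable visit mapping" is that each VISITNUM corresponds to exactly one VISIT name and vice versa. The sketch below checks that on records held as plain dictionaries; the function name and the result keys are hypothetical.

```python
from collections import defaultdict


def unstable_visit_mappings(records):
    """Find VISITNUM values with several VISIT names, and the reverse."""
    names_by_num = defaultdict(set)
    nums_by_name = defaultdict(set)
    for rec in records:
        names_by_num[rec["VISITNUM"]].add(rec["VISIT"])
        nums_by_name[rec["VISIT"]].add(rec["VISITNUM"])
    return {
        "visitnum_conflicts": {
            num: sorted(names) for num, names in names_by_num.items() if len(names) > 1
        },
        "visit_conflicts": {
            name: sorted(nums) for name, nums in nums_by_name.items() if len(nums) > 1
        },
    }
```

Two empty dictionaries mean the mapping is one-to-one in the data examined; any entry points to a visit whose label or number changed between sources.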


7.3.3 Early ADaM Thinking

Before CRF finalization, statisticians should consider:

  • Complexity of primary endpoint derivations
  • Dependency on multiple SDTM domains
  • Definitions of analysis populations (ITT, FAS, PPS)

If essential derivation inputs do not exist in the CRF, or are fragmented across poorly defined fields, then ADaM specifications will become increasingly verbose and analytically fragile.
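Whether derivation inputs exist can be checked mechanically before the CRF is finalized. The sketch below compares the inputs each planned analysis variable needs against the list of fields in the draft CRF. The function name and the CRF field names used with it are hypothetical.

```python
def missing_derivation_inputs(derivation_inputs, crf_fields):
    """Return {analysis variable: [required inputs absent from the CRF]}."""
    available = set(crf_fields)
    gaps = {}
    for variable, inputs in derivation_inputs.items():
        missing = sorted(set(inputs) - available)
        if missing:
            gaps[variable] = missing
    return gaps
```

For example, if a change-from-baseline variable needs `score_value`, `assessment_date`, and `first_dose_date`, and the draft CRF lacks `assessment_date`, the function reports that gap while the CRF can still be changed.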


7.4 Alignment with Data Management: Collaboration, Not Handoff

7.4.1 Key Topics for Statistician–DM Alignment

Statisticians should proactively align with Data Management on:

  • Key analysis variable lists
  • CRF fields supporting primary and key secondary endpoints
  • Expectations for handling missing, duplicate, or abnormal data

This alignment is not a formality—it establishes interpretation consistency across SAP, ADaM, TFLs, and CSR.


7.4.2 A Sign of a Mature Project Team

In well-functioning teams:

  • DM asks statisticians during CRF design: “Will this field be used for analysis?”
  • Statisticians can clearly articulate: “If this field is missing, primary endpoint interpretation will be compromised.”

This reflects a mature sense of shared responsibility for the data, not a contest over authority.


7.5 Chapter Summary: The Statistician’s True Value at This Stage

At the data standards and analysis preparation stage, a biostatistician’s value is not measured by:

  • Lines of code written
  • Number of tables produced

It is measured by:

  • Structural problems prevented before DBL
  • Analytical credibility preserved for regulators and stakeholders

Key takeaway:

Biostatisticians are not merely data users—they are co-designers of data quality and analyzability.

If the first time you examine data structure is after CRF finalization, the project is already one step behind.