Transparent reporting in behavioral science studies involving large language models (LLMs) demands rigorous standards to ensure replicability, interpretability, and ethical compliance. Researchers must meticulously document model selection criteria, including architecture specifics, training data provenance, and fine-tuning methodologies. Disclosure of prompt design and preprocessing pipelines is equally vital, as subtle variations can significantly influence outcomes. Furthermore, detailed reporting on evaluation metrics – beyond simple accuracy figures – such as consistency, bias evaluation, and error analysis, provides a multidimensional perspective on model performance.

  • Model transparency: Specify version, parameters, and training corpus characteristics.
  • Data lineage: Describe all input datasets, including sources, annotations, and preprocessing steps.
  • Prompt engineering: Present prompt templates and any iterative tuning strategies clearly.
  • Evaluation rigor: Report comprehensive metrics and disclose potential failure modes.
  • Ethical considerations: Address biases, consent, and privacy implications explicitly.
Reporting Aspect Key Details Impact
Model Version GPT-4, 175B parameters Ensures replicability of outputs
Training Data OpenWebText, Common Crawl Determines bias and coverage
Prompt Description Standardized query templates Allows assessment of input influence
Evaluation Metrics Accuracy, Fairness scores Multifaceted performance insights
Ethical Review Bias audit reported Enhances trustworthiness