Jan 2026 · Method Reflection

Research infrastructure is an AI problem too

Metadata, synchronization, dataset release strategy, and protocol traceability are not peripheral to AI. They determine whether a result can scale beyond the first paper.

Research question. When does infrastructure stop being support work and start becoming part of the technical object itself?

In computational research, a published result is not just a number or a figure. It is the output of a transformation pipeline applied to specific data in a specific environment. A compact way to state this is:

Reproducibility Relation
\[R = F(D, C, P, E)\]

Here \(D\) is data, \(C\) is code, \(P\) is parameters, and \(E\) is the execution environment; \(F\) stands for running the pipeline on them. If any of these is missing, the result \(R\) becomes harder to audit or reproduce. Peng argued that reproducibility should be treated as a minimum standard for judging computational claims, particularly when independent replication is expensive or impossible [1].
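One way to make the relation concrete is to reduce the four inputs to a single fingerprint stored next to each result. The sketch below is illustrative only: the function names (`run_fingerprint`, `capture_environment`) and the choice of SHA-256 over canonical JSON are assumptions for this example, not the procedure used in any particular study.

```python
import hashlib
import json
import platform
import sys


def sha256_bytes(payload: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(payload).hexdigest()


def capture_environment() -> dict:
    """Record a minimal description of E (hypothetical field choice)."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }


def run_fingerprint(data: bytes, code: bytes, params: dict, env: dict) -> str:
    """Hash D, C, P and E together into one identifier for a result R."""
    record = {
        "D": sha256_bytes(data),
        "C": sha256_bytes(code),
        "P": params,
        "E": env,
    }
    # Sorted keys and fixed separators give a stable serialization,
    # so equal inputs always produce the same fingerprint.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return sha256_bytes(canonical.encode("utf-8"))


if __name__ == "__main__":
    fingerprint = run_fingerprint(
        data=b"participant_id,value\n1,0.5\n",
        code=b"print('analysis')",
        params={"window_s": 30},
        env=capture_environment(),
    )
    print(fingerprint)
```

If two runs report different results but share a fingerprint, the difference lies outside the four recorded inputs, which is itself useful to know. If the fingerprints differ, the component hashes show which of \(D\), \(C\), \(P\), or \(E\) changed.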

My own view of this problem has been shaped less by abstract tooling debates and more by protocol-centred human studies such as HERO and HEROx. In that setting, infrastructure decides whether protocol phases, participant records, signal streams, feature tables, and publication outputs remain connected. If that chain breaks, the science becomes harder to extend even when the paper itself is technically sound.

Why FAIR and provenance belong together

The FAIR principles were designed to make data findable, accessible, interoperable, and reusable [2]. In practice, that means a dataset should not be treated as a pile of files. It needs identifiers, metadata, and structures that let someone else understand what was measured, under what conditions, and how it connects to analysis artifacts. FAIR alone does not solve reproducibility, but without FAIR-like structure, reproducibility remains fragile.

In protocol-driven work, provenance is not an afterthought. It is the link between what participants experienced, what sensors recorded, what preprocessing changed, and what finally appeared in the figure or model input. Sandve and colleagues pushed this argument further in computational terms: every result should remain connected to the exact commands, versions, and inputs that produced it [3]. That is not bureaucracy. It is how a research pipeline becomes inspectable.

A useful way to think about this is as a traceability graph:

Traceability Graph
\[G = (A, L), \qquad A = \{\text{protocol}, \text{raw}, \text{qc}, \text{features}, \text{figures}, \text{paper}\}\]

The links \(L\) are directed connections between the artifacts in \(A\), and they are only trustworthy if each stage preserves identifiers, timestamps, and transformation history. That is exactly the sort of structure that makes a dataset or protocol reusable by someone who was not present during collection.
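The graph can be written down directly and queried. The sketch below assumes, purely for illustration, that the six artifacts form a single chain from protocol to paper; a real project would likely have branching links. The helper names `upstream` and `is_traceable` are hypothetical.

```python
from collections import deque

# Artifact set A from the traceability graph.
STAGES = ["protocol", "raw", "qc", "features", "figures", "paper"]

# Link set L as directed (source, target) pairs.
# Assumption: a simple linear chain, chosen only for this example.
LINKS = {
    ("protocol", "raw"),
    ("raw", "qc"),
    ("qc", "features"),
    ("features", "figures"),
    ("figures", "paper"),
}


def upstream(artifact: str, links: set) -> set:
    """Return every artifact that `artifact` depends on, directly or not."""
    parents = {}
    for source, target in links:
        parents.setdefault(target, set()).add(source)

    seen = set()
    queue = deque([artifact])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def is_traceable(target: str, origin: str, links: set) -> bool:
    """True if `target` can be followed back to `origin` through the links."""
    return origin in upstream(target, links)


if __name__ == "__main__":
    print(is_traceable("paper", "protocol", LINKS))
    # Dropping one link models a stage that lost its identifiers.
    broken = LINKS - {("raw", "qc")}
    print(is_traceable("paper", "protocol", broken))
```

The second check shows the failure mode described above: once a single link is lost, the paper can no longer be followed back to the protocol, even though every individual artifact still exists.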

What good infrastructure looks like

Good infrastructure is concrete. It means versioned code, data manifests, environment capture, protocol identifiers, timestamp integrity, and analysis logs that can be traced from raw inputs to figures and claims. For a research group, that often matters more than one extra percentage point of benchmark performance, because it determines whether a result can survive collaboration, extension, or peer review.
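A data manifest is among the simplest of these pieces to build. The sketch below records a SHA-256 checksum for every file under a directory and later reports files that are missing, changed, or unlisted. The function names and the flat path-to-checksum layout are assumptions for illustration, not the format of any specific dataset release.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large recordings need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file's path (relative to root, POSIX style) to its checksum."""
    return {
        path.relative_to(root).as_posix(): file_sha256(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_manifest(root: Path, manifest: dict) -> list:
    """Return sorted relative paths that are missing, changed, or unlisted."""
    current = build_manifest(root)
    problems = [
        name for name, checksum in manifest.items()
        if current.get(name) != checksum
    ]
    problems += [name for name in current if name not in manifest]
    return sorted(problems)
```

An empty list from `verify_manifest` means the directory still matches what was recorded. In this sketch the manifest should be stored outside `root`, because any file inside it would be reported as unlisted.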

That is why I treat dataset releases and protocol publications as technical outputs rather than administrative supplements. The HERO and HEROx protocols [4, 5] formalize the collection logic. The HERO and HEROx dataset releases (for HEROx, see [6]) formalize the data object that follows from that logic. Together they make the research legible beyond the original team.

This is why I see research infrastructure as part of the AI stack. If an intelligent system is only as trustworthy as the chain of decisions behind it, then provenance, metadata, and workflow design are not support services around the model. They are part of the model’s epistemic foundation.

References

  1. Peng RD. Reproducible Research in Computational Science. Science, 2011.
  2. Wilkinson MD et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 2016.
  3. Sandve GK et al. Ten Simple Rules for Reproducible Computational Research. PLoS Computational Biology, 2013.
  4. Tomar P et al. Human Experience in Regulated Offices (HERO) protocol. Protocols.io, 2026.
  5. Tomar P et al. Human Experience in Regulated Offices Extended (HEROx) protocol. Protocols.io, 2026.
  6. Tomar P et al. Human Experience in Regulated Offices Extended (HEROx) dataset. Zenodo, 2026.