Mar 2026 · Essay

Why the future of AI will depend on better human data

The next technical leap in human-aware AI will come from datasets that preserve physiology, perception, environment, and time as a coupled system rather than flattening people into static labels.

Research question. What kind of data layer is required for future systems that are expected to understand people, context, and change over time?

Most AI discussions still focus on architecture, scale, and inference speed. Those matter, but in human-aware systems the bottleneck often appears earlier, at the level of the dataset itself. A model cannot recover what the acquisition protocol never measured and cannot disambiguate what the metadata never described. This becomes visible in studies where physiology, perception, and environment interact continuously rather than behaving like isolated variables. My own work on office studies, thermal adaptation, and comfort-oriented datasets keeps returning to the same point: the decisive technical problem is not only prediction, but representation of the human condition as a structured temporal object.

The multimodal machine learning literature has already framed fusion, representation, and alignment as core technical problems rather than peripheral engineering tasks [1]. In practice, that means moving beyond single-label datasets toward designs where physiological streams, subjective reports, and environmental descriptors remain linked. A useful abstraction is:

State Formulation
\[x_t = [p_t, b_t, e_t], \qquad y = g(x_1, \ldots, x_T)\]

Here \(p_t\) is physiological state, \(b_t\) is behavioural or perceptual state, and \(e_t\) is environmental context. The learning problem is not classification on a single frame but inference over a coupled trajectory. If one of those components is weakly captured or badly aligned, the downstream estimate \(y\) becomes unstable.
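The coupled-trajectory view can be made concrete with a small sketch. This is a minimal illustration, not an implementation from any of the cited datasets: the channel counts, the synthetic data, and the trajectory functional `g` are all assumptions chosen only to show the shape of the problem.

```python
import numpy as np

# Illustrative dimensions: 120 time steps, with hypothetical channel
# counts for physiology, behaviour/perception, and environment.
T = 120
rng = np.random.default_rng(0)

p = rng.normal(size=(T, 3))   # physiological state p_t (e.g. 3 channels)
b = rng.normal(size=(T, 2))   # behavioural/perceptual state b_t
e = rng.normal(size=(T, 4))   # environmental context e_t

# x_t = [p_t, b_t, e_t]: concatenate the components at each time step
# so the coupling between streams is preserved in one object.
x = np.concatenate([p, b, e], axis=1)   # shape (T, 9)

# y = g(x_1, ..., x_T): any trajectory-level functional. Here a trivial
# stand-in that summarizes the whole window rather than a single frame.
def g(traj: np.ndarray) -> np.ndarray:
    return traj.mean(axis=0)

y = g(x)
print(x.shape, y.shape)  # (120, 9) (9,)
```

The point of the sketch is structural: `y` is computed from the whole aligned trajectory, so a misaligned or weakly captured component corrupts every downstream estimate, not just one frame.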

Why multimodality is not optional

Human-centred systems rarely fail because one signal is missing. They fail because the relation between signals is underspecified. A heart-rate change may indicate stress, thermal load, a posture shift, or task demand; the same electrodermal fluctuation can be read differently depending on ambient conditions, time of day, or protocol stage. Without environmental context and timing integrity, a physiological feature is easy to overread.

This is exactly why datasets such as HEROx, HERO, and COC matter to me as more than publication outputs. They represent attempts to keep subjective reports, physiological traces, and surrounding context in the same analytic frame. Better human data therefore means more than collecting more channels. It means preserving synchrony, capturing provenance, and attaching metadata that explains how the signals were produced, filtered, segmented, and validated. The FAIR principles are relevant here not as administrative ideals but as conditions for real downstream reuse [2].

The technical frontier behind model quality

In human-aware AI, model quality depends on dataset structure more directly than many workflows admit. A technically serious dataset should preserve at least four things: time, context, provenance, and uncertainty. Time matters because many state estimates depend on change rather than absolute value. Context matters because physiological and behavioural signals are not self-explanatory. Provenance matters because rerunning a pipeline should reproduce the same transformations. Uncertainty matters because missingness, artifact rejection, and windowing decisions all alter the statistical object being learned.
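One way to make those four requirements tangible is a record schema that carries them explicitly. The following sketch is hypothetical: the field names and example values are illustrative choices, not a published dataset specification.

```python
from dataclasses import dataclass

# Hypothetical per-window record keeping time, context, provenance,
# and uncertainty together rather than as scattered side files.
@dataclass
class SignalWindow:
    t_start: float                 # time: window onset (epoch seconds)
    t_end: float                   # time: window offset
    values: list[float]            # the signal samples themselves
    context: dict[str, float]      # context: e.g. ambient conditions, protocol stage
    provenance: dict[str, str]     # provenance: pipeline version, filter settings
    missing_fraction: float = 0.0  # uncertainty: share of rejected/imputed samples

w = SignalWindow(
    t_start=0.0, t_end=60.0,
    values=[72.0, 74.5, 73.1],
    context={"air_temp_c": 26.5},
    provenance={"pipeline": "v1.2", "filter": "butterworth_0.5-4hz"},
    missing_fraction=0.05,
)
print(w.missing_fraction)  # 0.05
```

The design choice worth noting is that uncertainty and provenance travel with the window itself, so rerunning or auditing a pipeline does not require reconstructing them from lab notes.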

The PPD study on physiological-perceptual divergence in controlled indoor exposure sharpened this issue for me. Once perception and physiology are measured together, it becomes obvious that human state cannot be reduced to a single proxy. A compact way to express that mismatch is with a divergence term between normalized physiological and perceptual trajectories:

Perceptual-Physiological Divergence
\[d_t = \lVert z_t^{phys} - z_t^{perc} \rVert_2\]

Even when this is used only as a conceptual statistic, the implication is strong: human-aware datasets should preserve disagreement, not average it away. That is why reproducibility is not just a publication concern. Peng argued that reproducibility functions as a minimum standard for judging computational claims when full independent replication is not feasible [3]. Sandve and colleagues made the same point operational: every result should remain connected to the exact steps, parameters, and data that produced it [4].

What a research-grade human dataset should contain

If I were defining a baseline for future-facing human datasets, I would ask for raw streams, synchronized timestamps, protocol states, intermediate quality-control outputs, feature definitions, environment descriptors, and versioned analysis manifests. That bundle is what makes a signal usable beyond its first experiment. Without it, the dataset behaves like a one-off benchmark. With it, it becomes infrastructure.
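The "versioned analysis manifest" in that bundle can be as simple as a structured file enumerating the other components. The sketch below is a hypothetical example; every key and value is illustrative, not a schema from HEROx, COC, or any other published dataset.

```python
import json

# Hypothetical manifest tying raw streams, sync, protocol, QC, features,
# environment descriptors, and analysis versioning into one artifact.
manifest = {
    "dataset": "example-office-study",
    "version": "1.0.0",
    "raw_streams": ["ecg.csv", "eda.csv", "env.csv"],
    "sync": {"clock": "ntp", "max_skew_ms": 50},
    "protocol_states": ["baseline", "exposure", "recovery"],
    "qc": {"artifact_rejection": "amplitude_threshold", "rejected_pct": 3.2},
    "features": {"hr_mean": "mean heart rate per 60 s window"},
    "environment": ["air_temp_c", "rh_pct", "co2_ppm"],
    "analysis": {"pipeline_version": "v1.2", "seed": 42},
}
with open("manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```

Because the manifest is machine-readable and versioned alongside the data, a reanalysis can verify it is operating on the same statistical object the original pipeline produced.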

The practical implication is simple: the next leap in human-aware AI will not come only from better models. It will come from better structured human data, of the kind now emerging in office comfort studies, thermal adaptation experiments, and multimodal datasets designed for reuse rather than one-time reporting.

References

  1. Baltrušaitis T, Ahuja C, Morency LP. Multimodal Machine Learning: A Survey and Taxonomy. IEEE TPAMI, 2018.
  2. Wilkinson MD et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 2016.
  3. Peng RD. Reproducible Research in Computational Science. Science, 2011.
  4. Sandve GK et al. Ten Simple Rules for Reproducible Computational Research. PLoS Computational Biology, 2013.
  5. Tomar P, Pisello AL. Physiological-perceptual divergence in human thermal adaptation: multimodal evidence of decoupled body-mind responses during controlled indoor exposure. Building and Environment, 2026.
  6. Tomar P et al. Human Experience in Regulated Offices Extended (HEROx) dataset. Zenodo, 2026.
  7. Tomar P et al. Car Office Comfort (COC) dataset. Zenodo, 2025.