Combining biomarker and food intake data

Recent developments in biomarker discovery have demonstrated that combining biomarkers with self-reported intake data has the potential to improve estimation of food intake. Here, statistical methods for combining biomarker and self-reported food intake data are discussed. The calibration equations method is a widely applied method that corrects for measurement error in self-reported food intake data through the use of biomarker data. The method is outlined and illustrated through an example where citrus intake is estimated. In order to estimate stable calibration equations, a simulation-based framework is delineated which estimates the percentage of study subjects from whom biomarker data is required. The method of triads is frequently used to assess the validity of self-reported food intake data by combining it with biomarker data. The method is outlined and sensitivity to its underlying assumptions is illustrated through simulation studies.


Introduction
Measuring what human subjects have consumed is fundamental in nutrition science. Typically, food intake data are collected using self-reported tools or instruments such as food diaries, food frequency questionnaires (FFQ) and 24-hour recalls. These self-reported methods have inherent measurement error as they rely on the accurate documentation or recollection of subjects' food consumption over the specified time period [Bingham et al., 1995, Carroll et al., 1998, Heitmann, 1995, Johansson et al., 1998, Kirkpatrick and Collins, 2016, Ocké, 2013, Poppitt et al., 1998, Pryer et al., 1997, Thürigen et al., 2000]. Issues such as energy underreporting, recall errors and portion size assessment errors are well established [Kipnis et al., 2002]. Such measurement errors result in reduced power and underestimated associations, the propagation of which into subsequent analyses raises concern about inference validity and has been associated with inconsistent and confusing conclusions [Dhurandhar et al., 2015, Marshall and Chen, 1999].
To overcome these issues in self-reported food intake data, the concept of dietary biomarkers has emerged. Dietary biomarkers (see stat07590) are found in biological samples and are small molecules called metabolites that can provide information on the level of intake of a nutrient or a food [Gao et al., 2017]. For example, salt, protein, sugar and energy intake have been assessed through the use of sodium, urinary nitrogen, sucrose and doubly labelled water biomarkers, respectively [Kipnis et al., 2002, Bingham and Day, 1997, Kipnis et al., 2003, Tinker et al., 2011, Tooze et al., 2004]. Data emerging from the application of metabolomics to biomarker discovery have revealed that objective food intake biomarkers may be used to estimate food intake [Poppitt et al., 1998, Pryer et al., 1997, Thürigen et al., 2000]. While biomarkers offer the potential to overcome some of the limitations of self-reported data, they will not replace self-reported data completely. Thus, there is a need for apposite statistical methods that combine dietary biomarker data with self-reported food intake data to ensure accurate assessment of food intake.
There is a rich literature in which biomarker and self-reported intake data have been combined to assess nutrient or food intake. Gormley et al. [2020] review this literature. Prentice et al. [2020] demonstrate that FFQ data may usefully augment blood concentrations when estimating carotenoid and tocopherol intake, and when assessing intake at a macronutrient (carbohydrate and protein) level [Prentice et al., 2021]. Biomarker and intake data have also been combined in more complex settings such as the use of multiple biomarkers for one nutrient [Tasevska et al., 2011] or food [D'Angelo et al., to appear], when assessing intake of infrequently consumed foods [Geelen et al., 2015, Kipnis et al., 2009] and settings in which repeated measurements are used in conjunction with biomarkers [Rosner et al., 2008]. Thus, in recent years there has been an increased interest in the combination of intake data and dietary biomarkers due to the latter's potential to assist in accurately and objectively assessing dietary intake [Keogh et al., 2013, Prentice et al., 2002].
We provide the key concepts that characterise the methods used to combine biomarker and self-reported food intake data. Section 2 focuses on the statistical details underpinning the calibration equations method that uses a biomarker to address the error in associated self-reported intake data. An illustrative example is provided through the development of calibration equations for citrus intake using urinary proline betaine as a biomarker and data from 4-day semi-weighed food diaries. As acquisition of biomarkers from all study subjects may not be feasible due to high costs and limited biological sample availability, Section 2.2 reviews an approach to estimating the quantity of biomarker data required to develop accurate calibration equations. Section 3 focuses on the method of triads (MoT), a popular approach to estimating the validity of a set of self-reported food intake data which requires associated biomarker measurements. Where applicable, details are provided for software that assists practitioners in combining biomarker and self-reported food intake data.
Measurement errors [Keogh and White, 2014] can be classical, systematic, heteroskedastic or differential, and the error form may be related to the outcome of interest. Here it is assumed that the measurement errors follow the classical measurement error framework, i.e. the truth is measured with additive error, usually with constant variance. Calibration equations (see stat01565) are used to correct for inherent classical measurement error in data from a self-reported dietary intake instrument, by combining biomarker data and self-reported dietary intake data. Biomarker and self-reported data generated in a small sub-sample of a larger study are used to develop the calibration equations, with the results then employed to account for measurement error in the self-reported data from the remainder of the study's subjects.

Calibration equations
The calibration equations method assumes that data derived from the self-reports and biomarkers of study subjects $i = 1, \ldots, n$ are linearly related to the "true" but unknown intake $X_i$, with additive Gaussian errors. As the true intake is unknown, directly estimating these relationships is not feasible. In what follows the biomarker-derived estimate of intake for study subject $i$ is denoted $M_i$ and the self-reported intake $W_i$. The calibration method first regresses the biomarker data on the self-reported data and then derives a conditional prediction $\hat{X}_i$ of the unknown intake given the estimated model parameters and the self-reported data, i.e.
$$M_i = \beta_0 + \beta_1 W_i + \epsilon_i,$$
$$\hat{X}_i = \hat{\beta}_0 + \hat{\beta}_1 W_i, \tag{1}$$
where $\epsilon_i$ is a zero-mean Gaussian error and $\hat{\beta}_0$ and $\hat{\beta}_1$ are the estimated regression coefficients.
These are the so-called 'calibration equations' and the predicted values $\hat{X}_i$ are the calibrated self-reported intakes. The prediction equation (1) can also be applied for subjects for whom $M_i$ is not observed. The calibration equations hold under the crucial assumption that the Gaussian errors (see stat01072) in the linear relationships between the self-reported data and the true intake, and between the biomarkers and the true intake, are uncorrelated [Carroll et al., 2006, Keogh et al., 2013, Gormley et al., 2020]. Another assumption is that the sample on whom $M_i$ is recorded is a good representation of those to whom the prediction equation (1) will be applied. In some scenarios it may be advantageous or necessary to incorporate covariates in the calibration equations; in such cases the biomarker data are regressed on the self-reported and covariate data, with the resulting regression coefficients employed to derive the estimated true intakes. Incorporating covariates may introduce a requirement to choose the set of influential covariates, and this model selection process will introduce additional uncertainty. A number of model selection tools are available, for example the Akaike Information Criterion [Akaike, 1974] or the Bayesian Information Criterion [Kass and Raftery, 1995], but the set of covariates deemed influential under them may vary.
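The two steps described above (regress the biomarker-derived intakes on the self-reports in the subsample, then predict for every subject from the self-report alone) can be sketched in Python as follows. This is a minimal illustration using ordinary least squares on synthetic data; the function names, sample sizes and error scales are assumptions made for the example and are not taken from any of the cited studies.

```python
import numpy as np

def fit_calibration(w_sub, m_sub):
    """Regress biomarker-derived intake M on self-reported intake W (ordinary least squares)."""
    w_sub = np.asarray(w_sub, dtype=float)
    m_sub = np.asarray(m_sub, dtype=float)
    design = np.column_stack([np.ones_like(w_sub), w_sub])  # intercept column + W
    (beta0, beta1), *_ = np.linalg.lstsq(design, m_sub, rcond=None)
    return beta0, beta1

def calibrate(w, beta0, beta1):
    """Calibrated intake X_hat = beta0 + beta1 * W, usable where M was never measured."""
    return beta0 + beta1 * np.asarray(w, dtype=float)

# Synthetic illustration; every number below is hypothetical.
rng = np.random.default_rng(0)
n_total, n_sub = 2000, 500
x = rng.normal(100.0, 30.0, n_total)                   # unobserved true intake
w = 10.0 + 0.8 * x + rng.normal(0.0, 25.0, n_total)    # self-report: linear in X plus error
m = x + rng.normal(0.0, 15.0, n_total)                 # biomarker-derived intake, independent error

beta0, beta1 = fit_calibration(w[:n_sub], m[:n_sub])   # biomarker subsample only
x_hat = calibrate(w[n_sub:], beta0, beta1)             # remaining subjects, self-report only

print(f"beta0 = {beta0:.2f}, beta1 = {beta1:.2f}")
print(f"variance of calibrated intake: {x_hat.var():.1f}; of true intake: {x[n_sub:].var():.1f}")
```

In this synthetic setting the calibrated intakes have a smaller variance than the simulated true intakes, since they are conditional predictions that depend on the self-report alone.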

Calibration equations for citrus intake
We illustrate the calibration equations by estimating citrus intake for participants in the Irish National Adult Nutrition Survey using urinary proline betaine as the biomarker [Gibbons et al., 2017, D'Angelo et al., 2019]. Proline betaine has been demonstrated to be a good biomarker of citrus intake, with correlation of 0.9 with actual citrus intake and accurate estimation of total citrus intake in a cross-sectional population [Gibbons et al., 2017, Garcia-Perez et al., 2017, Pujos-Guillot et al., 2013, Heinzmann et al., 2010, Lloyd et al., 2011]. Calibration equations are developed using mean daily self-reported citrus intake and biomarker-derived intake data. In detail, a cross-sectional study collected food consumption data using 4-day semi-weighed food diaries from $N = 1500$ subjects. The mean daily citrus intake (average citrus intake based on the 4 days of recording) in grams per day (g/day) was estimated for the total citrus food group for each subject. For a subsample of $n = 565$ of the $N$ subjects, levels of the biomarker proline betaine were also collected and citrus intake estimated from the biomarker values. Calibration equations were fitted to the self-reported data and biomarker-estimated intake data for these $n$ subjects. The resulting estimates, $\hat{\beta}_0$ and $\hat{\beta}_1$, were then employed to correct self-reported mean daily citrus intake for the remaining $n^* = N - n = 935$ subjects.
D'Angelo et al. [2019] explored a range of transformations of the biomarker-derived intake data, and zero-inflated models, and compared them with the calibration equations' standard linear regression framework. The optimal model specification was indicated by the lowest average mean squared error (MSE) between biomarker-derived estimates of intake and calibrated mean daily self-reported intake of the $n$ subjects, using a cross-validation approach. The lowest MSE was achieved by the standard linear regression model on the original biomarker-derived estimates of intake, i.e. the estimated calibration equation for mean daily calibrated citrus intake $\hat{X}_i$ for subject $i$, given their self-reported data $W_i$, was derived to be $\hat{X}_i = 33.60 + 0.63 W_i$. Figure 1 illustrates calibrated and self-reported mean daily citrus intakes for the $n^*$ subjects on which only self-reported data are available. There is good agreement between self-reported and calibrated mean daily intakes with some discrepancies emerging at high intakes. Addressing measurement error in self-reported mean daily citrus intake data by the wider community is facilitated by the associated 'Bio-Intake' web application based software, available at adiet.shinyapps.io/Bio-Intake.
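As a small worked example, the reported equation (intercept 33.60, slope 0.63) can be applied directly to self-reported values. The sketch below is illustrative only: the function name and the example intakes are hypothetical, and the crossover value is simple arithmetic on the two reported coefficients rather than a result from the study.

```python
def calibrated_citrus_intake(w_grams_per_day):
    """Apply the reported calibration equation X_hat = 33.60 + 0.63 * W (g/day)."""
    return 33.60 + 0.63 * w_grams_per_day

# Self-reported intake at which calibration leaves the value unchanged:
# W = 33.60 + 0.63 * W  =>  W = 33.60 / (1 - 0.63)
crossover = 33.60 / (1 - 0.63)

for w in (0.0, 50.0, 100.0, 200.0):  # hypothetical self-reported intakes
    print(f"self-reported {w:6.1f} g/day -> calibrated {calibrated_citrus_intake(w):6.1f} g/day")
print(f"crossover at about {crossover:.1f} g/day")
```

Because the slope is below one, self-reported values under roughly 91 g/day are adjusted upwards and larger values are adjusted downwards; for instance, a self-reported 200 g/day is calibrated to 159.6 g/day.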

How much biomarker data is required?
Acquiring biomarker-derived estimates of intake for all subjects in a study may not be feasible due to financial constraints and limited biological sample availability. Thus, in order to develop calibration equations, both biomarker and self-reported data are collected on a random sample of the subjects, with self-reported data only recorded on the remaining subjects. A simulation-based framework can be used to assess the percentage of subjects from whom biomarker data should be collected in order to estimate stable calibration equations [D'Angelo et al., 2019]. Based on the citrus intake data, true intake values were simulated from a Gaussian distribution, with mean and variance set to be the empirical mean and variance of the calibrated mean daily self-reported citrus intakes. Self-reported and biomarker-derived intakes were then constructed from the simulated true intakes, following the classical measurement error framework for calibration equations. Simulated self-reported and biomarker intakes were considered to have either moderate or high variability around the simulated true intakes; the parameters $\alpha_W$ and $\alpha_M$ denote this variability for the self-reported and biomarker-derived intakes, respectively. In a similar vein, different strengths of the relationship between self-reported and true intakes were considered by varying a parameter $\beta_W$, with larger values indicating higher-quality self-reported data. Of the $N = 100{,}000$ subjects, biomarker-derived intakes are assumed available only for $n_g$ of the subjects, where $n_g = 0.01 g N$ and $g \in \{1, 2, \ldots, 99, 100\}$. For each value of $n_g$, calibration equations were estimated from 100 sets of simulated biomarker and self-reported data, and the standard deviations of the estimated parameters were computed.
Stabilization of the parameter estimates after a particular $n_g$ would indicate that acquiring biomarker-derived intakes from additional subjects beyond that value of $n_g$ would not yield a substantial improvement in the efficient estimation of food intake. Figure 2 illustrates the standard deviations of the estimated calibration equation slope parameter $\hat{\beta}_1$ under different, realistic settings of $\alpha_W$, $\alpha_M$ and $\beta_W$. The standard deviations stabilize quickly at relatively small values of $n_g$, even in poor-quality self-reported data scenarios, i.e. when $\beta_W = 0.1$. In general, with $N = 100{,}000$, the standard deviations of the calibration equation parameters tend to stabilize when biomarker data are obtained from 25% of the study population.
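A simulation of this kind can be sketched in a few lines of Python. The version below is a simplified illustration under stated assumptions, not the implementation of D'Angelo et al. [2019]: the variability parameters are treated as error standard deviations, the self-report is generated as a multiple of the true intake plus noise, true intakes are redrawn in every replicate, and the population size, fractions and all numerical defaults are hypothetical (and much smaller than the N = 100,000 used in the text, to keep the run short).

```python
import numpy as np

def slope_sd_by_fraction(n_total=5000, fractions=(0.01, 0.05, 0.10, 0.25, 0.50, 1.0),
                         n_rep=100, mu=60.0, sigma=40.0,
                         beta_w=0.5, alpha_w=30.0, alpha_m=20.0, seed=1):
    """Standard deviation of the estimated calibration slope for each biomarker fraction.

    All defaults are hypothetical. alpha_w and alpha_m are used here as error
    standard deviations, which is an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    result = {}
    for frac in fractions:
        n_g = max(int(round(frac * n_total)), 3)        # subjects with biomarker data
        slopes = np.empty(n_rep)
        for rep in range(n_rep):
            x = rng.normal(mu, sigma, n_total)                      # simulated true intake
            w = beta_w * x + rng.normal(0.0, alpha_w, n_total)      # self-reported intake
            m = x + rng.normal(0.0, alpha_m, n_total)               # biomarker-derived intake
            idx = rng.choice(n_total, size=n_g, replace=False)      # random biomarker subsample
            slopes[rep] = np.polyfit(w[idx], m[idx], deg=1)[0]      # slope of M regressed on W
        result[frac] = slopes.std(ddof=1)
    return result

sd_by_fraction = slope_sd_by_fraction()
for frac, sd in sd_by_fraction.items():
    print(f"biomarker fraction {frac:5.2f}: SD of slope estimate = {sd:.4f}")
```

Plotting the returned standard deviations against the fraction gives a curve analogous in spirit to the one described for Figure 2; the point at which it flattens depends on the assumed parameter values and population size.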

The method of triads
The validity coefficient of a set of self-reported intake data is its correlation with the true intake. The magnitude of any loss of statistical power or bias in inference based on the set of self-reported data relates to the validity coefficient. With its roots in factor and path analysis (see stat02421.pub2, stat06517) [Loehlin, 1998], and close links to structural equation models [Shavlik, 2004, Kaaks, 1997], the method of triads (MoT) is often applied to estimate the validity coefficient of a set of self-reported food intake data by combining it with biomarker data.
As true intake is unobserved, the MoT indirectly estimates the correlation between the true and self-reported intake by using the correlations between three dietary measurements. Typically, for subject $i$ the three dietary measurements are an FFQ (denoted $W_i$), a reference method such as 24-hour recall data (denoted $R_i$) and a biomarker measure (denoted $M_i$). The validity coefficient $VC_{WX}$ of self-reported data $W$ and the unobserved true intake $X$ is [Kaaks, 1997]
$$VC_{WX} = \sqrt{\frac{\rho_{MW}\,\rho_{WR}}{\rho_{MR}}}, \tag{2}$$
where $\rho_{MW}$, $\rho_{WR}$ and $\rho_{MR}$ denote the three pairwise correlations between the biomarker and the self-reported data, between the self-reported data and the reference, and between the biomarker and the reference, respectively. These pairwise correlations are easily estimated from the three observed dietary measurements. Thus the MoT provides a quick and straightforward approach to estimating the validity coefficient without requiring the practitioner to estimate the specifics of the relationship between the self-reported data and true intake. Yokota et al. [2010] provide a review of the use of the MoT in the literature. The MoT has been used to assess the validity of an FFQ by combining it with different biomarkers and forms of reference data. For example, Kabagambe et al. [2001] use the MoT to assess the validity of an FFQ among Hispanic Americans by combining the self-reported data with biomarkers and 24-hour recall data. Fowke et al. [2002] employ the MoT on multiple 24-hour recalls, a food-counting questionnaire and urinary dithiocarbamates excretion levels when assessing cruciferous vegetable consumption. Many other studies combine biomarker and food intake data to assess the validity of the self-reported data [Dixon et al., 2006, McNaughton et al., 2005, 2007, Daures et al., 2000].
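The MoT estimate needs only the three sample correlations. When each measurement is linearly related to the true intake with mutually independent errors, each pairwise correlation is the product of the two corresponding correlations with the true intake (for example, the biomarker and self-report correlation equals the product of their individual correlations with the truth), so the ratio of correlations reduces to the squared validity coefficient. The Python sketch below computes the estimate and checks it on synthetic data; the function name and the simulation settings are assumptions made for illustration.

```python
import numpy as np

def method_of_triads(w, r, m):
    """Validity coefficient of W: sqrt(rho_MW * rho_WR / rho_MR), from sample correlations."""
    w, r, m = (np.asarray(v, dtype=float) for v in (w, r, m))
    rho_mw = np.corrcoef(m, w)[0, 1]
    rho_wr = np.corrcoef(w, r)[0, 1]
    rho_mr = np.corrcoef(m, r)[0, 1]
    ratio = rho_mw * rho_wr / rho_mr
    if ratio < 0:
        raise ValueError("correlation ratio is negative; the estimate is undefined")
    return float(np.sqrt(ratio))

# Synthetic check with independent errors; all settings are hypothetical.
rng = np.random.default_rng(42)
n = 20_000
x = rng.normal(0.0, 1.0, n)          # unobserved true intake
w = x + rng.normal(0.0, 1.0, n)      # FFQ
r = x + rng.normal(0.0, 0.8, n)      # reference method, e.g. 24-hour recall
m = x + rng.normal(0.0, 0.6, n)      # biomarker

vc_hat = method_of_triads(w, r, m)
vc_true = 1.0 / np.sqrt(2.0)         # corr(W, X) when the error variance equals var(X)
print(f"MoT estimate {vc_hat:.3f}, true validity coefficient {vc_true:.3f}")
```

Because the estimate is a ratio of sample correlations, it can exceed one in small samples or when the assumptions fail, so the result should be checked against the admissible range.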
The validity coefficient in (2) is unbiased if the three measurements are pairwise linearly associated with the true intake and if the three measurement errors are mutually independent. The benefit of incorporating the biomarker data is that its errors are likely to be independent of those of the self-reported and the reference data; sources of error in biomarker data are likely to be very different to those in self-reported data of habitual intake. A simulation study provides insight into the sensitivity of the MoT approach to the assumption of measurement error independence across the dietary measurements. The MoT was used to estimate the validity coefficient in each of 500 simulated data sets with $n = 1000$, across varying levels of measurement error correlation. First, uncorrelated self-report and reference errors were assumed, i.e. $\rho_{WR} = 0$, followed by three realistic settings of weak correlation, i.e. $\rho_{WR} \in \{0.1, 0.3, 0.5\}$; in this simulation $\rho_{WR}$ denotes the correlation between the self-report and reference errors. In all settings both the reference and FFQ are assumed to have strongly positive and linear relationships with the true intake (see Gormley et al. [2020] for additional detail). Figure 3 presents the empirical distribution of the estimated validity coefficient across the 500 simulated data sets for each of the correlated error settings. As expected, when the self-report and reference errors are independent ($\rho_{WR} = 0$), the MoT on average correctly estimates the validity coefficient, with mean estimate $\widehat{VC}_{WX} = 0.579$ (true value 0.58). However, the MoT increasingly overestimates the validity of the self-reported data as the correlation of the errors increases. Even at low correlation values, the validity is estimated with appreciable bias: for $\rho_{WR} = 0.1$ the mean estimated validity coefficient significantly exceeds the true value (p < 0.001). In such cases researchers will be overconfident in the validity of their set of self-reported data. Even if good statistical practice is followed and the uncertainty in a validity coefficient estimate is also assessed (e.g. through the use of the bootstrap), the validity of self-reported data whose errors are correlated with the reference will be overestimated. When the errors are likely to be correlated, viewing the estimated validity coefficient as the upper limit of a range of possible values for the truth has been suggested [Dixon et al., 2006, McNaughton et al., 2005]; such a view could provide researchers with a degree of overconfidence in the validity of their self-reported data.
While it is reasonable to assume that the biomarker errors are independent of the self-reported and reference errors, it is less plausible for the errors between the self-reported data and the reference: as highlighted in Geelen et al. [2015] and shown in Figure 3, in such cases the MoT is biased. Fraser et al. [2005] address this issue by combining two biomarkers with self-reported data to estimate FFQ validity. Rosner et al. [2008] allow for correlated errors between self-reported and reference data in a repeated-measures setting. In general, when combining biomarker and self-reported food intake data to assess its validity, the MoT is a readily accessible and widely useful tool. However, its simplicity relies on key assumptions and the validity coefficient's own validity requires cognisance of these.
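The sensitivity to correlated self-report and reference errors can be explored with a short simulation. The sketch below mirrors the structure described above (500 data sets of 1000 subjects, error correlations of 0, 0.1, 0.3 and 0.5) but is an illustration under assumed settings, not a reproduction of the study in Gormley et al. [2020]: the unit-variance true intake, the unit regression slopes and the error scales are hypothetical, with the FFQ error scale of 1.4 chosen so that the true validity coefficient is approximately 0.58.

```python
import numpy as np

def simulate_mot(rho_err, n=1000, n_sim=500, seed=2):
    """MoT estimates over n_sim data sets when FFQ and reference errors have correlation rho_err."""
    rng = np.random.default_rng(seed)
    err_cov = np.array([[1.0, rho_err], [rho_err, 1.0]])
    estimates = np.empty(n_sim)
    for s in range(n_sim):
        x = rng.normal(0.0, 1.0, n)                                  # true intake
        e = rng.multivariate_normal(np.zeros(2), err_cov, size=n)    # correlated W and R errors
        w = x + 1.4 * e[:, 0]                                        # FFQ (hypothetical error scale)
        r = x + 1.0 * e[:, 1]                                        # reference method
        m = x + rng.normal(0.0, 1.0, n)                              # biomarker, independent error
        c = np.corrcoef([w, r, m])                                   # rows are W, R, M
        estimates[s] = np.sqrt(c[0, 2] * c[0, 1] / c[1, 2])          # sqrt(rho_MW * rho_WR / rho_MR)
    return estimates

true_vc = 1.0 / np.sqrt(1.0 + 1.4 ** 2)   # corr(W, X) under the assumed settings
mean_vc = {rho: simulate_mot(rho).mean() for rho in (0.0, 0.1, 0.3, 0.5)}
for rho, value in mean_vc.items():
    print(f"error correlation {rho:.1f}: mean MoT estimate {value:.3f} (true {true_vc:.3f})")
```

Under these assumed settings the mean estimate is close to the true value when the errors are uncorrelated and moves upwards as the error correlation grows, which is the qualitative pattern described for Figure 3.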

Discussion
An overview of the commonly applied methods for combining biomarker and food intake data has been provided. The calibration method in particular is widely used and is often the first step in nutritional epidemiology studies where the method's predictions are employed in subsequent models, e.g. to assess diet-disease relations using regression calibration. However, valid use of the calibration method requires adherence to the underpinning assumptions of classical measurement error, linear relations of the dietary measurements with the true intake and independence of the associated Gaussian errors. The intake values predicted by the calibration method are less dispersed than the true intakes, which introduces a bias to any subsequent analysis that uses the predicted values. For example, as in errors-in-variables models (see stat05747), standard errors will be underestimated. Prentice et al. [2020, 2021] highlight the importance of critical evaluation of the research methods employed when assessing intake, particularly in the context of nutritional epidemiology research. Drawing on the calibration equations framework, a simulation approach was also outlined to estimate the percentage of subjects from whom biomarker data should be collected, suggesting stable results when biomarker data are obtained from 25% of the study population.
Additionally, an overview of the popular MoT approach was provided. The MoT provides researchers with a tool that quantifies the validity of their self-reported food intake data in terms of its correlation with the unobserved dietary intake. The method requires that three dietary measurements are recorded on each subject; the validity coefficient of the self-reported data can be derived from the associated empirical pairwise correlation coefficients. The impact of even minimal violations of the MoT's underlying assumption of independence of the measurement errors was illuminated through a simulation study; ensuring independence of the measurement errors between the self-reported and reference data in particular should be prioritised by researchers.
Both the calibration method and MoT have to be explored further in the setting specific to the researcher's study. For example, systematic measurement errors, covariates and/or a longitudinal setting are likely to impact accuracy. While Spiegelman et al. [2005], Day et al. [2001], Kaaks et al. [1994], Kipnis et al. [1999] and Kipnis et al. [2001] address the issue of combining biomarker and self-reported food intake data in the presence of correlated measurement errors, this area is ripe for progressive and impactful research. The MoT simulation study presented here relied on specific parameter settings and similar exploration relevant to a researcher's study is recommended. Additionally the calibration and MoT methods assume that the self-reported data are (or can be transformed to be) normally distributed; using heavier tailed distributions such as the t or skew-Normal distribution may be more appropriate. All the potential issues highlighted here, in a relatively simple experimental design framework, may be exacerbated in more complex settings.
While the overview presented here focused on the calibration and MoT methods, the review is not comprehensive. Developing and combining biomarkers and self-reported food intake data is a burgeoning and active area of research. For example, combining biomarker and self-reported food intake data in the Bayesian inferential framework is feasible and intuitive given the context, and is likely to become more prevalent in the coming years. The biomarker field indicates that multiple biomarkers will be needed for accurate assessment of a particular food [Gibbons et al., 2017]. D'Angelo et al. [to appear] develop a method that combines multiple biomarkers with apple intake data from an intervention study in a Bayesian framework; the approach is easily implemented by practitioners either through an associated web application (available at adiet.shinyapps.io/multiMarker) or via an open-source R [R Core Team, 2021] software package (available at CRAN.R-project.org/package=multiMarker). There is ample scope for new statistical methods to combine biomarker and self-reported intake data to improve estimation of food intake.