Calorie Estimation From Pictures of Food: Crowdsourcing Study

doi:10.2196/ijmr.9359

Original Paper

¹Department of Computer Science, Columbia University, New York, NY, United States

²Department of Linguistics, University of Arizona, Tucson, AZ, United States

³Department of Computer Science, University of Arizona, Tucson, AZ, United States

⁴Department of Nutritional Sciences, University of Arizona, Tucson, AZ, United States

Corresponding Author:

Stephen Kobourov, BS, MS, PhD

Department of Computer Science

University of Arizona

Gould-Simpson 917

1040 East 4th Street

Tucson, AZ, 85721

United States

Phone: 1 520 621 4632

Email: kobourov@email.arizona.edu

Background: Software designed to accurately estimate food calories from still images could help users and health professionals identify dietary patterns and food choices associated with health and health risks more effectively. However, calorie estimation from images is difficult, and no publicly available software can do so accurately while minimizing the burden associated with data collection and analysis.

Objective: The aim of this study was to determine the accuracy of crowdsourced annotations of calorie content in food images and to identify and quantify sources of bias and noise as a function of respondent characteristics and food qualities (eg, energy density).

Methods: We invited adult social media users to provide calorie estimates for 20 food images (for which ground truth calorie data were known) using a custom-built webpage that administers an online quiz. The images were selected to provide a range of food types and energy density. Participants optionally provided age range, gender, and their height and weight. In addition, 5 nutrition experts provided annotations for the same data to form a basis of comparison. We examined estimation accuracy on the basis of expertise, demographic data, and food qualities using linear mixed-effects models with participant and image as random effects. We also analyzed the advantage of aggregating nonexpert estimates.

Results: A total of 2028 respondents agreed to participate in the study (males: 770/2028, 37.97%, mean body mass index: 27.5 kg/m²). Average accuracy was 5 out of 20 correct guesses, where “correct” was defined as a number within 20% of the ground truth. Even a small crowd of 10 individuals achieved an accuracy of 7, exceeding the average individual and expert annotator’s accuracy of 5. Women were more accurate than men (P<.001), and younger people were more accurate than older people (P<.001). The calorie content of energy-dense foods was overestimated (P=.02). Participants performed worse when images contained reference objects, such as credit cards, for scale (P=.01).

Conclusions: Our findings provide new information about how calories are estimated from food images, which can inform the design of related software and analyses.

Interact J Med Res 2018;7(2):e17


Keywords

calorie estimation; image annotation; crowdsourcing; obesity; public health

Background

Estimating calories in pictures of food is an important task, providing data to inform nutrition research and practice and helping individuals achieve optimal, balanced dietary intakes. Yet this task turns out to be difficult for both experts and nonexperts. We are using this study as an opportunity to enhance our understanding of whether and how calorie estimation works “in the wild,” that is, in real-world scenarios. There are many applications of this understanding, ranging from improving the methodological rigor (and reducing the associated burden) of dietary assessment, a pervasive and unresolved problem in nutrition science, to influencing the design of interventions focused on dietary behavior change.

The fact that individuals do not estimate calories well [1-4] has motivated the design of software apps to help individuals better estimate different aspects of dietary intake (eg, calories, energy density, nutrient density, and portions) using machine learning (ML) and by harnessing the “wisdom of the crowd.” The latter phenomenon was first documented in a 1907 Nature paper [5] and has been successfully used in many domains, ranging from gene network inference [6] to computational problems [7]. Apps in this space remain quite difficult to use, requiring burdensome manual logging of what one eats, or, when ML is used to classify pictures of foods, explicit weight values to be entered manually. To a large extent, the identification of calorie content from images of food either through crowdsourcing or ML remains an open research question. This work is a necessary step toward the automated identification of calorie content from images of food.

Objectives

The aim of this study was to determine the accuracy of crowdsourced annotations of calorie content in food images and to identify and quantify sources of bias and noise as a function of respondent characteristics and food qualities (eg, energy density).

Procedure

The proposed task essentially combines 2 tests individuals must engage in when estimating calories. The first concerns the relative energy density of the food pictured, whereas the second concerns portion size. Because the task mirrors the judgments people make when estimating calories in daily life, we contend that the ecological validity of our approach is high, despite the task’s complexity. The study protocol described herein was reviewed by an institutional review board at the University of Arizona and met the criteria for exemption under 45 CFR 46.101(b).

We designed a simple online quiz administered by a custom-built webpage to measure the accuracy of calorie estimation in pictures of food, verify the existence of collective wisdom, and analyze the data to find patterns and trends that can be useful in the design of calorie-tracking apps.

We posted the quiz to SampleSize [8], a subreddit (ie, a forum on reddit) dedicated to posting surveys and survey results. We chose this forum because it has a large, active user base whose demographics resemble those of people likely to make large-scale food annotations out of a personal interest in self-quantification.

The quiz began with a short introduction: “We would like to see whether you have a good understanding about calories. We will show you several pictures of food and your task, should you choose to accept it, is to guess how many calories are in the food. We will not share any identifying information about you. All of the data is anonymous.”

The quiz included 20 questions. Each question consisted of a picture of some food item (see Figure 1) and the prompt, “How many calories are in the food pictured here? (Type a number in the box between 50 and 800).” Implausible dietary data, from (un)intentional under-reporting or over-reporting, are a pervasive problem in nutrition research and can introduce bias or lead to erroneous interpretations of diet-weight or diet-disease relationships. A common way of handling this issue is to exclude extreme values after the fact based on the distribution of the data (eg, removing data more than 2 SDs from the mean) or by subjective assessment [9]. In contrast, we provided upper and lower limits on the guesses, based on the ground truth data, to ease an already difficult task and thereby reduce the amount of data that would later need to be removed. The numbers also helped clarify that we were referring to kilocalories and helped reduce outliers. Neither the correct calorie amounts nor other participants’ answers were visible to a participant during the estimation portion of the experiment, although it is possible that some might have read the reddit comments before participation, which revealed some calorie values. We decided not to add additional information to the pictures (eg, does the sandwich contain mayonnaise?) to keep the task closer to a realistic image annotation task.

Following the food-related questions, the participants were asked to provide their age group, gender, and body mass index (BMI). An option to calculate BMI from height and weight was also available. We deliberately chose not to ask additional demographic questions (eg, location, income, and education) to protect participant privacy. At the end of the quiz, we showed each participant their own accuracy, together with the average accuracy of all prior participants, broken down by question.

Quiz Materials

We used 2 categories of food for the pictures: single-ingredient (eg, broccoli, cheese) and mixed-ingredient (eg, sandwich, pizza). There were 20 pictures of food items in total (Figure 1): 12 single-ingredient and 8 mixed. The food shown in the pictures ranged from 100 to 720 kcal. Importantly, we chose these food items according to the United States Department of Agriculture’s (USDA) MyPlate model [10], which captures the building blocks for a healthy diet and includes 5 types of food (vegetables, fruits, protein, dairy, and grain), as well as mixed foods containing these ingredients. Our selection aimed to follow this model, to include realistic foods that appear in daily consumption, and to keep the quiz concise so that participants would stay engaged.

The food portions selected are summarized in Table 1. The images were ordered so that each food type was maximally separated from other instances of its type, and the order was the same for each participant. We collected nutrition information about some food items from official restaurant websites. Although the calorie content of the foods pictured was not directly measured, US federal statute requires the published calorie values of restaurant food items to be within 20% of actual calorie value [11].

We chose not to inform the participants of the sources of the images, to reduce the potential that they would search the Web for “ground truth” data, for example, by visiting the Burger King website. Likewise, the participants were not explicitly told that some images were from fast food restaurants.

Patterns and Analysis

A total of 3 measures were relevant to our analysis:

Error, e, is estimated kilocalories (ĉ) minus ground truth kilocalories (c), e = ĉ − c, and percent error, η, is error as a percentage of the ground truth kilocalories, η = 100 × e / c; both are positive for overestimation and negative for underestimation. Because the ground truth kilocalories vary across foods, percent error is the more reliable indicator of the scale of response bias.
Absolute error, |e|, measures accuracy irrespective of the direction of estimation bias (|e| = |ĉ − c|).
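Discrete accuracy, D, is the number of estimates (out of 20) that were within 20% of the true calorie value; discrete accuracy was the measure reported to quiz participants:

$$D = \sum_{i=1}^{20} \mathbf{1}\left[\, |\hat{c}_i - c_i| \le 0.2\, c_i \,\right]$$

where ĉᵢ and cᵢ are the estimated and ground truth kilocalories for item i, and 1[·] equals 1 when the condition holds and 0 otherwise.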
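The three measures can be written as small functions. This is a minimal sketch, assuming estimates and ground truth values are plain kcal numbers in matching item order; the function names are hypothetical, and counting an estimate exactly 20% away as correct (≤ rather than <) is an assumption, since the text says only "within 20%."

```python
# Sketch of the accuracy measures e, eta, |e|, and D (hypothetical helper names).

def error(estimate: float, truth: float) -> float:
    """Signed error e = c_hat - c; positive means overestimation."""
    return estimate - truth


def percent_error(estimate: float, truth: float) -> float:
    """Percent error eta = 100 * e / c."""
    return 100.0 * error(estimate, truth) / truth


def absolute_error(estimate: float, truth: float) -> float:
    """Absolute error |e|, ignoring the direction of the bias."""
    return abs(error(estimate, truth))


def discrete_accuracy(estimates, truths, tolerance: float = 0.20) -> int:
    """D: number of estimates within `tolerance` (20%) of the true value.

    Assumes the boundary counts as correct (<=).
    """
    if len(estimates) != len(truths):
        raise ValueError("estimates and truths must have the same length")
    return sum(
        1
        for est, true in zip(estimates, truths)
        if absolute_error(est, true) <= tolerance * true
    )
```

For example, with the Table 1 values for cheddar (200 kcal) and Gouda (300 kcal), answers of 230 and 400 kcal give D = 1: the first is 15% too high, the second 33% too high.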

Before this analysis, we removed participants who reported a BMI less than 15 or more than 50 kg/m² (values that are unlikely to be correct) and participants who did not report their gender. In addition, we eliminated responses of less than 50 kcal or greater than 800 kcal and retained all remaining responses.
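The exclusion rules can be sketched as a filter. The record layout (one dict per participant with hypothetical keys `bmi`, `gender`, and `estimates`) is an assumption for illustration. Participants with missing BMI are dropped, matching the removal of participants with missing or invalid demographic data; out-of-range answers are replaced by `None` rather than deleted, so item positions stay aligned across participants.

```python
# Sketch of the participant- and response-level exclusions described above.
# The dict keys "bmi", "gender", and "estimates" are hypothetical.

MIN_BMI, MAX_BMI = 15.0, 50.0      # plausible self-reported BMI, kg/m^2
MIN_KCAL, MAX_KCAL = 50.0, 800.0   # answer range given in the quiz prompt


def keep_participant(p: dict) -> bool:
    """Keep participants with a plausible BMI and a reported gender."""
    bmi = p.get("bmi")
    if bmi is None or not (MIN_BMI <= bmi <= MAX_BMI):
        return False
    return bool(p.get("gender"))


def clean(participants: list[dict]) -> list[dict]:
    """Apply participant-level exclusions, then response-level exclusions."""
    cleaned = []
    for p in participants:
        if not keep_participant(p):
            continue
        # Out-of-range answers become None so each list keeps 1 slot per item.
        kept = [
            e if e is not None and MIN_KCAL <= e <= MAX_KCAL else None
            for e in p["estimates"]
        ]
        cleaned.append({**p, "estimates": kept})
    return cleaned
```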

We analyzed the results of the survey using linear mixed-effects modeling in R [21,22], allowing regression with random intercepts for both participants and foods simultaneously. The R² values are the proportion of the variance in the data that is explained by the models’ predicted values. For all analyses, P<.05 was considered indicative of a statistically significant relation.
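As a sketch of what a model with crossed random intercepts looks like (the notation here is illustrative, not taken from the authors), let y_{pf} be an outcome such as the absolute error of participant p on food f, and let x_{pf} collect fixed-effect predictors such as gender, age group, BMI, or energy density:

$$
y_{pf} = \beta_0 + \mathbf{x}_{pf}^{\top}\boldsymbol{\beta} + u_p + v_f + \varepsilon_{pf},
\qquad u_p \sim \mathcal{N}(0, \sigma_u^2),\quad v_f \sim \mathcal{N}(0, \sigma_v^2),\quad \varepsilon_{pf} \sim \mathcal{N}(0, \sigma^2)
$$

Here u_p absorbs how much a given participant tends to err across all foods and v_f how hard a given food is for everyone, so the fixed effects β are estimated net of both. An interaction such as BMI × energy density enters as an extra column of x_{pf}. If fitted with the R package lme4 (assumed here for illustration), the corresponding model formula would be y ~ x + (1 | participant) + (1 | food).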

Figure 1. Untrained participants estimated the food calories in these 20 images.

Table 1. Foods were chosen for the quiz to attain maximum coverage of food types encountered in daily life by likely participants. Scaling refers to the presence of reference objects, such as credit cards, which could indicate food volume.

Food	Type	Energy (kcal)	Mass (g)	Reference object?	Source
Cheddar cheese	Dairy	200	51	No	wiseGEEK [12]
Gouda cheese	Dairy	300	84	Yes	HealthAssist [13]
Avocado	Fruit	200	125	No	wiseGEEK [12]
Kiwi	Fruit	200	328	No	wiseGEEK [12]
Brown rice	Grain	420	297.7	No	Panda Express [14]
Cereal	Grain	200	55	No	wiseGEEK [12]
Ham	Meat	300	185.1	Yes	HealthAssist [13]
Salami	Meat	300	72.9	Yes	HealthAssist [13]
Red onion	Vegetable	200	475	No	wiseGEEK [12]
Potato	Vegetable	100	141.7	No	Food Network [15]
Broccoli	Vegetable	200	588	No	wiseGEEK [12]
Cauliflower	Vegetable	300	1200	Yes	HealthAssist [13]
Cheeseburger	Mixed	270	104	No	Burger King [16]
Hot dog	Mixed	310	123	No	Burger King [16]
Green tea cake	Mixed	136	40	No	Wit Co, Ltd [17]
Long cheeseburger	Mixed	590	213	No	Burger King [16]
Pepperoni and sausage pizza	Mixed	240	97	No	Papa John’s Pizza [18]
Swiss roll	Mixed	251	96	No	Slism [19]
Tuna sandwich	Mixed	720	420	Yes	Jimmy John’s [20]
Turkey sandwich	Mixed	510	254	Yes	Jimmy John’s [20]

In total, 2125 individuals participated in our reddit quiz. After removing 97 participants with missing or invalid demographic data, 2028 individuals were included in the analysis.

Participant Demographics

The demographics of the participants are summarized in Figures 2-5. Although we collected no location data, an earlier study, again recruiting from the SampleSize subreddit, found that 67.4% (421/625) of participants reported a location within the United States [23], a rate that is similar to the 64% reported in another voluntary survey with participants from across reddit [24]. Our sample also has a higher percentage of female participants than the US average, and a larger fraction of people with BMI around 25 kg/m². It is possible that the participants in our quiz were more interested in this topic than the average person. However, because they self-selected, they are likely more demographically similar than the average person to the crowdsourcing annotators who would contribute to future apps.

Participant Feedback

The participants volunteered their BMI and other demographic information, and 18 participants left 31 comments on the reddit thread. Table 2 summarizes the types of feedback comments we received, as well as some examples.

The feedback from the participants demonstrates engagement, interest, and curiosity. This suggests that such tasks could be legitimately gamified (ie, game mechanics and game design techniques could be applied to engage and motivate people to achieve their goals). It also suggests that, unlike paid Mechanical Turk workers, the participants in our study were motivated by intrinsic interest.

Note that our work addresses some of the requests shown in Table 2. For example, we found that reference objects for scale in the pictures did not increase accuracy.

Figure 2. The reported gender of respondents to our quiz is compared with data from the National Health and Nutrition Examination Survey (NHANES).

Figure 3. The age of respondents to our quiz is compared with data from the National Health and Nutrition Examination Survey (NHANES).

Figure 4. The body mass index (kg/m²) of respondents to our quiz is compared with data from the National Health and Nutrition Examination Survey (NHANES).

Figure 5. The body mass index (kg/m²) category of respondents to our quiz is compared with data from the National Health and Nutrition Examination Survey (NHANES).

Table 2. Representative comments from the reddit post of the calorie estimation quiz.

Type	Example
Fun	“That was fun! I think the folks in ‘loseIt’ [another subreddit] and on various MFP [MyFitnessPal] forums would enjoy taking this, too.”
Surprise	“I’m really really doubtful that burger is only 270 cal.”
	“[N]o way are two red onions 200 calories.”
Units	“[C]ountries other than the US use the actual unit of energy- Joules”
Scale	“It would have been great to have a ruler next to the food.”
	“[I]f you show me a plate of rice, I can’t guess how much rice are on the plate because I don’t know how big the plate is.”
Difficulty	“Shoot, got 1 right out of 20 LOL. No wonder my BMI is 29.”
	“I dont know if there was mayo on [the submarine sandwiches] or not, which changes things a lot.”

How Good Are People at Estimating Calories?

The participants' estimates had a mean absolute error (|e|) of 57.9% (136 kcal). In terms of discrete accuracy (D), the mean participant answered 5.15 questions correctly out of 20. Figure 6 shows the distribution of correct responses. Absolute error varied considerably by item, from the most accurately estimated item, a turkey sandwich with a mean absolute error of 23.0% (39 kcal), to the least accurately estimated, green tea cake, at 241.0% (327 kcal). Figure 7 illustrates the variety of estimates and percent error (η) distributions for different items. Together, these facts show that human calorie estimates are both inaccurate overall and inconsistent in their inaccuracies.

Does the Wisdom of the Crowd Phenomenon Apply Here?

A consensus formed rapidly for each food, as shown in Figure 8 (dashed orange line), so that 10 responses gave a very good estimate of the consensus of the next 1000 responses. In fact, a bootstrap significance test found no significant difference in accuracy between the average guess of 10 randomly selected participants and that of 1000 randomly selected participants (P=.36). Moreover, the consensus responses had greater discrete accuracy (D) than the individual participants, achieving 7 correct responses out of 20, a 36% relative improvement over the 5.15 correct among individual participants. This result is consistent with previous studies demonstrating the wisdom of the crowd, in which the accuracy of consensus judgments exceeds that of individual judgments (see Comparison With Prior Work).
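The consensus procedure can be sketched as follows. This is a minimal illustration, not the authors' analysis code: it assumes each participant's answers are stored as a list of kcal values in a fixed item order, draws hypothetical crowds of k participants without replacement, averages their estimates per food, and scores the crowd with the same 20% criterion used for D. The function names and the resampling scheme are assumptions, and the sketch estimates the expected accuracy of a k-person crowd rather than reproducing the reported bootstrap significance test.

```python
import random
from statistics import mean


def crowd_estimate(estimates_by_participant, k, rng):
    """Per-food mean estimate of k participants sampled without replacement."""
    crowd = rng.sample(estimates_by_participant, k)
    n_foods = len(crowd[0])
    return [mean(p[i] for p in crowd) for i in range(n_foods)]


def within_tolerance_count(estimates, truths, tolerance=0.20):
    """Discrete accuracy D of one set of estimates (boundary counts as correct)."""
    return sum(1 for e, c in zip(estimates, truths) if abs(e - c) <= tolerance * c)


def bootstrap_crowd_accuracy(estimates_by_participant, truths, k, n_boot=1000, seed=0):
    """Average D of a k-person crowd over n_boot random crowds."""
    rng = random.Random(seed)  # seeded for reproducibility
    scores = [
        within_tolerance_count(crowd_estimate(estimates_by_participant, k, rng), truths)
        for _ in range(n_boot)
    ]
    return mean(scores)
```

Comparing `bootstrap_crowd_accuracy` for k=1 with k=10 or k=1000 on real responses shows how quickly the consensus stabilizes. Errors in opposite directions cancel in the mean, which is why a crowd can score higher than any of its members.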

Another important observation is that although error was high for individual responses and individual foods, the errors showed little bias overall: pooling all 2028 participants, the median error across items and participants was 0. Although this result is not actionable in itself, as it is averaged across all questions, it does demonstrate the power of crowds to converge toward high-accuracy judgments.

Figure 6. A histogram of the number of correct estimates each participant made. See Patterns and Analysis for the definition of this measure, D.

Figure 7. Calorie estimates and percent error (η) for each food item. For each food item, the violin plots represent the distribution of the participants’ calorie estimates and their percent error. The bottom and top of the boxes represent the first and third quartiles, and the red band represents the mean calorie estimate. The green band represents the actual calorie value for each food item.

Figure 8. Mean estimates for each food as more participants are added show that a consensus forms rapidly. The dotted blue lines show the true calorie value for each food. The x-axis uses a logarithmic scale. The orange dashed line indicates the estimates of nonexperts. The green continuous line represents the estimates of nutrition science experts. Note that the range of acceptable calorie estimates was 50 to 800 kcal for each food item.

Figure 9. Participants underestimated the calorie content of calorie-sparse foods and overestimated that of calorie-rich foods.

Do the Nutritional Experts Outperform the Crowd?

In addition to redditors, we solicited participation from 5 nutritional experts: faculty members recruited on a voluntary basis from the Department of Nutritional Sciences at the University of Arizona and the School of Nutrition and Health Promotion at Arizona State University. Somewhat surprisingly, neither the absolute error of their responses nor their discrete accuracy was significantly different from that of the average nonexpert participant (P=.19). In fact, a crowd of only 2 randomly selected nonexperts was sufficient to outperform the highest performing expert, achieving an average absolute error (|e|) of 119.3 kcal (52.3%) compared with the expert’s 130.2 kcal (55.3%). Expert performance is shown in comparison with nonexpert performance in Figure 8. This result is consistent with the hypothesis that the sources of error (eg, erroneous volume estimation due to a notion of typical portion size) apply equally to experts and nonexperts. Prior work in many domains of estimation has supported the notion that a relatively small group of nonexperts can estimate just as well as a single expert [25,26] (see also Comparison With Prior Work).

Does Having an Object for Scale in the Picture Help?

Several comments in the reddit thread expressed the hypothesis that pictures featuring a standard-sized reference object (such as a credit card) were easier to answer. The results showed that reference objects, far from aiding estimation, increased absolute error (|e|) by a mean of 4.6 kcal (P=.01, R²=.31). Our hypothesis is that participants used background knowledge about the typical size of foods to scale foods but were not able to profit from comparison against the reference objects. This is statistically significant evidence for the notion that scale information does not aid calorie estimation in digital images (compare [27]). However, it is important to note that this was a post hoc analysis only; the experiment was not designed to analyze this hypothesis. For example, we included objects that come in many different sizes (eg, forks) as reference objects, which may have confused the quiz takers. We leave a more careful evaluation of this particular observation as future work.

Does Energy Density of Foods Predict Estimation Error?

As shown in Figure 9, the caloric content of energy-dense foods was systematically overestimated, and that of energy-sparse foods underestimated, as measured by error (e, P=.02, R²=.57). This bias is similar to one found by Almiron-Roig et al [28] in estimating in-person portion sizes and could reflect 2 nonexclusive sources. First, it could result from the perceived healthiness of the food items [29]. For example, broccoli is a prototypically healthy food but is not devoid of calories; conversely, prototypically unhealthy foods such as cheeseburgers have often been “engineered” for low calories [30]. This explanation aligns with the results of Carels et al [1], who found that college students overestimated the caloric content of foods considered to be unhealthy while they underestimated the number of calories in healthy foods. Second, the bias could result from an assumption that the items would have a similar weight to one another, when in fact there was an inverse relationship between the energy density and weight of the items (Pearson correlation: ρ=–.70). We hypothesize that inelastic adjustment of portion size according to energy density could contribute to obesity.
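The inverse relationship between energy density and weight can be checked directly from Table 1. The sketch below assumes, since the text does not say so explicitly, that the reported correlation was computed across the 20 quiz items between energy density (kcal/g, ie, energy divided by mass) and mass (g).

```python
from math import sqrt

# (food, energy in kcal, mass in g), transcribed from Table 1.
TABLE_1 = [
    ("Cheddar cheese", 200, 51),
    ("Gouda cheese", 300, 84),
    ("Avocado", 200, 125),
    ("Kiwi", 200, 328),
    ("Brown rice", 420, 297.7),
    ("Cereal", 200, 55),
    ("Ham", 300, 185.1),
    ("Salami", 300, 72.9),
    ("Red onion", 200, 475),
    ("Potato", 100, 141.7),
    ("Broccoli", 200, 588),
    ("Cauliflower", 300, 1200),
    ("Cheeseburger", 270, 104),
    ("Hot dog", 310, 123),
    ("Green tea cake", 136, 40),
    ("Long cheeseburger", 590, 213),
    ("Pepperoni and sausage pizza", 240, 97),
    ("Swiss roll", 251, 96),
    ("Tuna sandwich", 720, 420),
    ("Turkey sandwich", 510, 254),
]


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


density = [kcal / grams for _, kcal, grams in TABLE_1]  # energy density, kcal/g
mass = [grams for _, _, grams in TABLE_1]
r = pearson(density, mass)
print(f"Pearson r(energy density, mass) = {r:.2f}")
```

Run as written, this prints r = -0.70, consistent with the reported value under the stated assumption. Low-density foods such as cauliflower (1200 g) and broccoli (588 g) were pictured in much larger portions than dense foods such as salami or cheddar, which drives the negative correlation.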

Does Body Mass Index Predict Estimation Errors?

BMI itself does not predict accuracy or bias in these data, similar to Blake et al [31] and Chandon and Wansink [32]. Other studies show that overweight and obese individuals consistently under-report calorie intake to a greater degree than nonoverweight individuals [33,34]. However, BMI does significantly interact with energy density in predicting percent error (η, P=.002, R²=.57), such that the higher a participant's BMI, the more they exaggerated the calorie content of calorie-rich foods. We hypothesize that overweight individuals are more sensitive to perceptions of food.

Do Gender and Age Predict Estimation Errors?

No biasing effect (toward underestimation, for example) was found, but absolute error (|e|) was greater for men than for women (P<.001, R²=.31), similar to the portion judgment result by Almiron-Roig et al [28]. In addition, the absolute error was greater for older participants (P<.001, R²=.31), but these effects did not interact. Figures 10 and 11 summarize these differences. We hypothesize that the primary reason for these differences is cultural, reflecting gender norms and the relatively recent cultural emphasis on calories as a measure of healthiness.

Do Estimation Errors Cluster by Food Type?

Over- and underestimation errors for some foods correlate with those for others. For example, a participant who underestimates the calories in broccoli is likely to do so for cauliflower as well. Figure 12 shows an automatically generated map [35] illustrating these correlations, with clusters showing groups of similar subnetworks. A larger map with more food items would be a strong basis for predicting human bias on clusters of food types (eg, vegetables).
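The input to such a map can be sketched as a weighted edge list. This is a hypothetical illustration, not the method of [35]: it assumes each food's errors are stored per participant in the same participant order, links 2 foods when the Pearson correlation of their errors reaches an arbitrary threshold, and, following the description that distances are inverse to correlation, uses 1/r as the edge length. A layout or clustering step would then run on these edges.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # correlation undefined for constant input; treat as no link
    return sxy / sqrt(sxx * syy)


def error_network(errors_by_food, threshold=0.3):
    """Edges (food_a, food_b, r, distance) for strongly correlated foods.

    errors_by_food maps a food name to its per-participant errors (e, in kcal),
    aligned by participant. The threshold value is a hypothetical choice.
    """
    edges = []
    for a, b in combinations(sorted(errors_by_food), 2):
        r = pearson(errors_by_food[a], errors_by_food[b])
        if r >= threshold:
            edges.append((a, b, r, 1.0 / r))  # stronger correlation, shorter edge
    return edges
```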

Figure 10. The absolute error (|e|) of participants differs by gender. Box edges show the first and third quartiles and are split by the median. The boxes’ whiskers extend to the farthest point within 1.5 times the interquartile range from the box ends. The notches denote the 95% CI of the median. The y-axis is on a square-root scale.

Figure 11. The absolute error (|e|) of participants differs by age. Box edges show the first and third quartiles and are split by the median. The boxes’ whiskers extend to the farthest point within 1.5 times the interquartile range from the box ends. The notches denote the 95% CI of the median. The y-axis is on a square-root scale.

Figure 12. This map shows a network of food items in our survey based on correlation of estimation errors. Pairs of strongly correlated foods are connected by edges. The stronger the correlation between 2 foods, the closer they are (distances are inversely related to correlation). Clusters show groups of similar subnetworks.

Principal Findings

The above analysis identifies several patterns that are important for the design of calorie estimation apps.

First and foremost, our study demonstrates that individuals are poor judges of calorie content in images, and prior work has shown that they are poor judges of portion size in real-life situations (see Comparison With Prior Work). This suggests the utility of an ML approach to calorie estimation to facilitate meal planning. Keeping track of calories by describing foods and guessing quantities and values is a tedious and inaccurate strategy, yet it is the strategy most commonly used in apps today. Given that “a picture is worth a thousand words,” our initial hypothesis was that using images (rather than descriptions of foods) should lead to better estimates. Our results, however, do not support this hypothesis: participants performed poorly at estimating the number of calories in pictures of food, answering 5.15 of 20 questions correctly on average. Our analysis indicates that participants in our dataset tended to exaggerate common dietary knowledge; they underestimated the number of calories in energy-sparse foods and overestimated them in energy-dense ones.

Our related work discussion (Comparison With Prior Work, below) highlights that estimating calories using ML remains an open research problem. However, our work suggests that such apps could take advantage of the wisdom of the crowd for estimation. We showed that the crowd performs better than experts, on average, even when the crowd is small. This suggests that crowdsourced calorie annotation could be implemented accurately and at low cost.

The results suggest that apps focused on calorie monitoring (including self-reporting) might benefit from collecting users’ age, gender, and BMI, characteristics that we found to influence the accuracy of calorie estimates either directly or in combination with other factors such as energy density.

We identified additional patterns that could simplify the design and implementation of calorie-tracking apps. The first is that scale information does not improve estimation accuracy. The second is that estimation errors cluster by food type, which suggests that an app could extrapolate a user’s error patterns between foods in the same group.

It is important to note that the observations of this study are statistically significant and applicable to the population of interest to us (ie, individuals likely to participate in crowdsourced annotations). This population is considerably younger than the US population (χ²₆=3362.5, P<.001) and contains more women proportionally (χ²₁=81.0, P<.001). In future work, we aim to repeat this study for a larger population that matches known demographics to verify the validity of our analysis on such populations.

Comparison With Prior Work

Related work includes prior work in nutritional sciences, ML, image processing, and crowdsourcing. We review a small but representative subset below.

Nutrition and Diet

Bandini et al [36] and Schoeller et al [37] have reported that individuals tend to selectively under-report energy intake when these data are manually logged. This seems to be especially true for overweight and obese individuals [33,34] and could be associated with a failure to accurately estimate portions, although Blake et al [31] and Chandon and Wansink [32] found that BMI does not correlate with the ability to estimate calories when this task is conducted in person. Estimation of real food portions remains poor whether the reference images are shown on computer screens or printed [38]. Moreover, calorie estimation of large meals may be worse than that of small meals [39].

To monitor dietary intake more accurately, third-party automated food analysis systems have been proposed. Martin et al [40] used the remote food photography method (RFPM), which requires individuals to upload 3 pictures when having a meal: the plate of the foods selected by an individual, standard portions of known quantities of the foods, and the leftovers. These pictures are sent to trained dietitians who verify portions with participants and analyze these data using a standardized nutrient database. Martin et al argue for the validity of RFPM, an approach that relies on the judgment of trained nutrition professionals. Providing all 3 pictures for each meal is a challenge, as indicated by Williamson et al [41]. Beltran et al [42] tested the reliability of the eButton system, in which a camera worn on the chest records images continuously. The images are captured passively while the participant goes about their day, but such a system still requires experts to identify foods in the images and confirm them with participants. Similar to the RFPM employed by Martin et al [40], the eButton system requires valid pictures before and after each meal, camera placement at a certain angle, and proper lighting. Although promising, such systems are unlikely to scale to the millions of people who would like to accurately track their nutritional intake.

Machine Learning and Image Processing

Given the challenges of the systems described above, a system that can automatically measure calories in pictures of food would be in great demand. Image processing techniques can be used to recognize food in images, and ML can be used to estimate the calories in the food.

Menu-Match [43] uses a database of restaurants and Global Positioning System locations and attempts to guess what is in the picture, using image features such as color and scale-invariant feature transform (SIFT) descriptors [44]. It has not been made available to the general public. Im2Calories [45] is built on the work of Menu-Match. A multi-label classifier is trained on a collection of images of food. The app locates the restaurant a user is dining in and, given an image from the user, the classifier (running on the user’s phone) guesses which foods are present in the meal. Looking up the restaurant’s published nutrition facts for the recognized foods yields good results. Note, however, that Im2Calories has not been made available to the general public or even for research purposes.

Bettadapura et al [46] show that food recognition using location data improves accuracy. Such systems, however, are inherently limited to the restaurants whose menus are in the database. These also assume that menus do not change often and that the volume of food is the same from plate to plate. In reality, most meals are eaten either outside of restaurants or in restaurants whose menus are not included in any given dataset. The “in the wild” problem is more natural but also more difficult.

The Web app Foodlog [47,48] divides each food image into 300 blocks and extracts discrete cosine transform coefficients and a color histogram from each block. Using these features, Foodlog classifies the food into 5 categories according to the USDA's MyPyramid system. Experimental results report 88% accuracy in the extraction of food and 73% accuracy in food balance estimation. The FoodCam system [49] segments the region of each food with GrabCut (an image segmentation approach based on iterated graph cuts) [50], extracts histogram of oriented gradients [51] and color patch features, encodes them as Fisher Vectors (an image representation obtained by pooling local image features) [52], and finally classifies each region into 1 of 100 food categories using linear support vector machines.
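
Block-wise feature extraction in the spirit of Foodlog can be sketched as follows. This is an illustration under our own assumptions, not the Foodlog implementation: the image is a plain nested list of (R, G, B) tuples at least as large as the grid, the grid is 15 × 20 (giving 300 blocks), each block yields a joint color histogram with 4 bins per channel plus the first 3 × 3 coefficients of an orthonormal 2D DCT-II of its luminance (BT.601 weights). All of these parameter choices are hypothetical; a real system would pass the resulting vectors to a classifier.

```python
import math


def split_into_blocks(image, rows, cols):
    """Yield the pixels of each cell of a rows x cols grid, scanning row by row.

    image: list of pixel rows, each pixel an (R, G, B) tuple with values 0-255.
    """
    height, width = len(image), len(image[0])
    for r in range(rows):
        y0, y1 = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            x0, x1 = c * width // cols, (c + 1) * width // cols
            yield [row[x0:x1] for row in image[y0:y1]]


def color_histogram(block, bins_per_channel=4):
    """Normalized joint RGB histogram with bins_per_channel ** 3 bins."""
    step = 256 // bins_per_channel
    hist = [0.0] * bins_per_channel ** 3
    count = 0
    for row in block:
        for r, g, b in row:
            # Quantize each channel, clamping so 255 always lands in the top bin.
            qr, qg, qb = (min(v // step, bins_per_channel - 1) for v in (r, g, b))
            hist[(qr * bins_per_channel + qg) * bins_per_channel + qb] += 1
            count += 1
    return [h / count for h in hist] if count else hist


def dct_low_frequencies(block, k=3):
    """First k x k coefficients of the orthonormal 2D DCT-II of the block's luminance."""
    gray = [[0.299 * r + 0.587 * g + 0.114 * b for r, g, b in row] for row in block]
    m, n = len(gray), len(gray[0])
    coeffs = []
    for u in range(k):
        for v in range(k):
            total = 0.0
            for y in range(m):
                for x in range(n):
                    total += (gray[y][x]
                              * math.cos(math.pi * (2 * y + 1) * u / (2 * m))
                              * math.cos(math.pi * (2 * x + 1) * v / (2 * n)))
            scale_u = math.sqrt((1 if u == 0 else 2) / m)
            scale_v = math.sqrt((1 if v == 0 else 2) / n)
            coeffs.append(scale_u * scale_v * total)
    return coeffs


def block_features(image, rows=15, cols=20):
    """One feature vector per block: color histogram followed by low-frequency DCT terms.

    The default 15 x 20 grid gives 300 blocks.
    """
    return [color_histogram(b) + dct_low_frequencies(b)
            for b in split_into_blocks(image, rows, cols)]
```

For a uniformly colored block, all DCT terms except the first (the DC term, proportional to mean brightness) are zero, so the DCT coefficients capture texture while the histogram captures color.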

With the exception of Im2Calories, the systems above achieve relatively good food recognition but do not estimate volume. To estimate volume, Chae et al [53] minimize falsely segmented regions, smooth the segmentation boundaries of food, and reconstruct 3D primitive shapes from a single food image. He et al [54] estimate the weight of food from a single image using a shape template for regularly shaped foods and area-based weight estimation for irregularly shaped foods. The Im2Calories system [45] estimates the distance of every pixel from the camera using a convolutional neural network, converts the resulting depth map into a voxel representation, and estimates the volume of the food. Although such approaches are effective, no app for estimating food volume is available to the general public.
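
The depth-map-to-voxel step can be illustrated with a deliberately simplified sketch. It is not the Im2Calories method: we assume the depth map is already given (rather than predicted by a network), the camera looks straight down at a flat plate or table plane at a known distance, a segmentation mask marks the food pixels, and each pixel covers a known square footprint on that plane. Every food pixel becomes a column of stacked voxels whose height is the food surface's elevation above the plane.

```python
import math


def food_volume_from_depth(depth, mask, plane_depth, pixel_size, voxel_size):
    """Estimate food volume by voxelizing a top-down depth map (hypothetical setup).

    depth: 2D list, distance from the camera to the visible surface at each pixel.
    mask: 2D list of bools, True where segmentation marks the pixel as food.
    plane_depth: distance from the camera to the plate/table plane.
    pixel_size: side length of the plane area covered by one pixel.
    voxel_size: height of one voxel; all lengths share the same unit (eg, cm).
    """
    voxels = 0
    for depth_row, mask_row in zip(depth, mask):
        for d, is_food in zip(depth_row, mask_row):
            if not is_food:
                continue
            height = max(0.0, plane_depth - d)  # food height above the plane
            voxels += math.floor(height / voxel_size)  # whole voxels in this column
    # Each voxel is a pixel_size x pixel_size x voxel_size box.
    return voxels * pixel_size * pixel_size * voxel_size
```

With depths in centimeters, a 2 × 2 map whose 3 food pixels rise 2, 3, and 1 cm above a plane 50 cm from the camera, and 0.5 cm pixels and voxels, yields 12 voxels, or 1.5 cm³. Flooring discards partial voxels, so smaller voxels trade extra computation for less underestimation.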

Crowdsourcing

Crowdsourcing sometimes makes it possible to use multiple nonexpert judgments to approach the quality of expert annotation [55]. Surowiecki [25] argues that in many instances, the average of nonexpert estimates can even outperform a single expert. Watson showed that the average of individual judgments can equal or exceed the judgment of the best individual in the group [26]. Moreover, the validity of judgments increases with the number of judges [56]. The advantages of the wisdom of the crowd over ML for certain tasks are well recognized and exploited in industry. For example, CardMunch (now a service of Evernote [57]) uses crowdsourcing with Amazon Mechanical Turk to convert pictures of business cards into digital contact information. Eloquent Labs [58] combines crowdsourcing with artificial intelligence to implement a conversational assistant for customer service.
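
Why averaging helps can be seen in a short sketch. Individual errors in opposite directions cancel when estimates are averaged, and by the triangle inequality the absolute error of the crowd mean can never exceed the mean absolute error of the individuals (although it can exceed the error of the single best individual). The calorie estimates below are hypothetical and serve only to illustrate the comparison.

```python
from statistics import mean


def crowd_vs_individual_error(estimates, truth):
    """Return (error of the crowd mean, mean individual error, best individual error)."""
    crowd_error = abs(mean(estimates) - truth)
    individual_errors = [abs(e - truth) for e in estimates]
    return crowd_error, mean(individual_errors), min(individual_errors)
```

For hypothetical estimates of 300, 450, 520, and 700 kcal for a 480 kcal dish, the crowd mean (492.5 kcal) is off by 12.5 kcal, compared with a mean individual error of 117.5 kcal and a best individual error of 30 kcal; here the crowd beats even its best member.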

In the nutrition domain, Mamykina et al [59] show that crowdsourced ingredient annotations of food images are improved by expert annotation and by showing annotators previous annotations of the images. The PlateMate [60] app leverages crowdsourcing to implement the first step of the RFPM. Rather than typing names of foods and estimating portions, users photograph their plates at both the beginning and the end of the meal to capture how much food was actually eaten. PlateMate uses annotations from nonexpert Amazon Mechanical Turk workers, instead of expert dietitians, to estimate the composition of foods in static images, and its estimates are as accurate as those of the experts. Similarly, the Im2Calories [45] project uses crowdsourcing to annotate all the food terms that apply to an image. After manually merging synonymous terms, the authors create the Food201-Multilabel dataset for training. Compared with the original Food101 classes, the new classes in Food201-Multilabel perform worse in terms of mean average precision, as they often correspond to side dishes or small food items.
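
Mean average precision (mAP), the metric behind the Food101 versus Food201-Multilabel comparison above, averages a per-class average precision (AP) score. The sketch below uses one common non-interpolated definition of AP; the evaluation code used for Im2Calories may use a different variant (eg, interpolated precision), and the scores and labels in the usage example are hypothetical.

```python
def average_precision(scores, labels):
    """Non-interpolated AP for one class: the mean of precision@k over the ranks k
    at which a positive example (label True) appears, ranking by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for k, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(per_class):
    """per_class: dict mapping class name -> (scores, labels) over the same test images."""
    aps = [average_precision(scores, labels) for scores, labels in per_class.values()]
    return sum(aps) / len(aps)
```

A class whose positives are ranked 1st and 3rd of 4 images gets AP = (1 + 2/3) / 2 = 5/6; a class whose only positive is ranked first gets AP = 1, so the mAP of the two is 11/12. Small items such as side dishes tend to receive lower scores and slip down the ranking, which lowers AP.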

In sum, despite the abundance of interest in this and related topics, including calorie-tracking apps with manual entry, there exists no publicly available app that will accurately estimate calories from a single image. Likewise, although there are many studies of human bias in tracking calories and lack of skill in estimating portion sizes, no previous work establishes the accuracy and biases of crowdsourcing for calorie estimation, or what demographic factors might correlate with accuracy.

Based on our study, we envision a very simple app in which the only action required from users is to take a picture of their food. The estimation logic, driven by the wisdom of the crowd and ML, would be transparent to the user; that is, it would be triggered automatically when the camera is used. The logic includes (1) detecting whether the picture shows food, using image classification [61,62], and (2) routing the image for crowd annotation (similar to CardMunch, which routes the task of processing images of business cards to the crowd). We hope that this simplicity will yield wide adoption, which, in turn, will lead to measurable effects on dietary choices.
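
The two-step logic of the envisioned app can be sketched as a small routing function. Everything here is hypothetical: `is_food` stands in for a food/non-food image classifier returning a probability, `ask_crowd` stands in for whatever service collects calorie estimates from annotators, and the crowd size and 0.5 threshold are placeholders. We aggregate with the median as one robust choice against a single wild estimate; the mean, as in the wisdom-of-the-crowd studies cited above, is an equally simple alternative.

```python
from statistics import median
from typing import Callable, Optional


def annotate_photo(
    image: bytes,
    is_food: Callable[[bytes], float],
    ask_crowd: Callable[[bytes, int], list[float]],
    crowd_size: int = 3,
    food_threshold: float = 0.5,
) -> Optional[float]:
    """Hypothetical pipeline: (1) skip non-food photos, (2) otherwise collect
    crowd_size calorie estimates and return their median."""
    if is_food(image) < food_threshold:
        return None  # not a food photo: nothing is sent to the crowd
    estimates = ask_crowd(image, crowd_size)
    if not estimates:
        return None  # no annotator responded
    return median(estimates)
```

Gating on the classifier first keeps crowd costs proportional to the number of actual food photos rather than to every picture the user takes.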

Limitations

Participants were not directly informed that some images were of fast food and thus more likely to be subject to food engineering, for example, replacing sugar or using sweetness enhancers, or adding water or protein to enhance food properties and palatability [30,63]. Because this was not explicitly mentioned, participants might have considered these foods to be “homemade,” which may have influenced their perception of energy density and calories. However, since a majority of hamburgers are eaten at restaurants rather than made at home, judgments about engineered foods are as relevant as, or more relevant than, judgments about home-cooked foods for both naturalistic and app-related purposes.

Conclusions

We described a study measuring the ability of over 2000 individuals to estimate calories in 20 pictures of food chosen to capture the building blocks of a healthy diet [10].

Our analysis confirms some earlier observations (eg, calorie estimation is a difficult task, even for the experts) and offers new insights:

Even a small crowd of 2 nonexperts achieves calorie estimation accuracy greater than that of the expert annotators. This suggests that semiautomated food labeling apps can be implemented at low cost by harnessing the wisdom of the crowd, even when the crowd is small. Note that some prior approaches in this space, such as PlateMate [60], use crowdsourcing to provide calorie information to users. To the best of our knowledge, however, crowdsourcing has not previously been tested as a source of data for algorithmic calorie estimation.
We found new type-of-food effects: energy-dense foods (such as hamburgers) were consistently overestimated, and energy-sparse foods (such as broccoli) were consistently underestimated. Future crowdsourcing (or ML) projects that annotate food for calorie content would benefit from correcting for these biases.
Some expected correlations were absent. For example, the presence of reference objects for scale does not improve accuracy but rather slightly decreases it, and BMI is not correlated with accuracy. These observations inform the design of interfaces for annotation apps, as well as data collection protocols.
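
One simple way to correct for the type-of-food biases listed above is a multiplicative correction factor per food category, learned from calibration images whose true calorie content is known. The sketch below is a hypothetical illustration, not the method used in this study: the category labels, the calibration data, and the choice of the median ratio of true to estimated calories are all assumptions.

```python
from statistics import median


def fit_category_corrections(calibration):
    """calibration: list of (category, crowd_estimate_kcal, true_kcal) tuples for
    images with known calories. Returns category -> correction factor, taken as
    the median of true / estimated calories within that category."""
    ratios = {}
    for category, estimate, truth in calibration:
        ratios.setdefault(category, []).append(truth / estimate)
    return {category: median(r) for category, r in ratios.items()}


def correct_estimate(category, estimate, corrections):
    """Scale a new crowd estimate by its category's factor (1.0 for unseen categories)."""
    return estimate * corrections.get(category, 1.0)
```

With hypothetical calibration pairs where energy-dense dishes are overestimated (800 kcal estimated vs 600 kcal true, 1000 vs 800) and energy-sparse dishes underestimated (40 vs 60, 50 vs 70), the factors come out to about 0.775 and 1.45, so a new energy-sparse estimate of 100 kcal is corrected to about 145 kcal. A factor below 1 shrinks overestimates; a factor above 1 inflates underestimates.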

All in all, this work suggests that calorie-estimation apps are needed and can be built at low cost (eg, using small annotator groups, without the overhead of including reference objects in images, and without controlling for users' BMI).
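
How much a small annotator group helps can be estimated from a pool of estimates for one image by averaging the error of the group mean over every possible group of size k. The sketch below does this exhaustively; the estimates in the example are hypothetical, and this is an illustration of the idea rather than the exact aggregation analysis performed in the study. Because each k-member mean is itself the average of its (k - 1)-member sub-means, the triangle inequality guarantees that this expected error never increases as k grows.

```python
from itertools import combinations
from statistics import mean


def expected_crowd_error(estimates, truth, k):
    """Mean absolute error of the average of k annotators, taken over every
    k-member subset of the available estimates for one image."""
    errors = [abs(mean(group) - truth) for group in combinations(estimates, k)]
    return mean(errors)
```

For hypothetical estimates of 300, 450, 520, and 700 kcal for a 480 kcal dish, the expected error is 117.5 kcal for single annotators, about 70.8 kcal for pairs, and 12.5 kcal when all 4 are averaged, so even a group of 2 removes a large share of the individual error.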

Several interesting research questions remain. First, given the low calorie estimation accuracy (5 out of 20) and some clear patterns (underestimating “healthy” foods and overestimating “unhealthy” foods), it is natural to ask whether simple training with feedback can help improve accuracy for nonexperts. If so, how much training is required, what gains in accuracy can be obtained, and how much further can the crowd boost the results? Second, can we factor in biases (eg, age, gender) to obtain better crowdsourced prediction? Third, can better (more consistent) reference objects lead to improvements in accuracy? Fourth, assuming the baseline accuracy for “simple” foods (eg, fruits, vegetables, and sandwiches) can be improved with some of the ideas above, can we hope to tackle more difficult challenges, such as amorphous foods (porridge, mashed potatoes) and liquids (soups, smoothies) in which ingredients and volume are less obvious? Lastly, but perhaps most importantly, we aim to apply the knowledge gained from this study beyond the understanding of how (or how well) people estimate calories to include assessment of diet quality, which has become a dietary construct of interest in the past 5 years [64]. This change has occurred because dietary patterns and dietary quality (eg, increased nutrient density, nutrient diversity, and nutrient adequacy) have been strongly associated with health and disease outcomes. This information provides potentially more meaningful metrics than the number of calories (which says nothing about the quality or “healthiness” of the food) when providing participants or patients with feedback.

We believe this study should be read as an analysis that informs the design of future food-related apps (in particular, apps that feature calorie estimation), with additional potential impacts on crowdsourcing strategies and the design of human-computer interfaces. Our future goal is to use crowdsourced judgments to estimate calories from images for the purpose of mass annotation (eg, in support of a calorie-estimation app), which, in turn, would be part of a larger system that analyzes text, images, and videos to estimate the risk of diet-sensitive diseases such as type 2 diabetes mellitus [23,65].

Conflicts of Interest

DB and MS disclose a financial interest in lum.ai. This interest has been disclosed to the University of Arizona Institutional Review Committee and is being managed in accordance with its conflict of interest policies.

References

1. Carels RA, Konrad K, Harper J. Individual differences in food perceptions and calorie estimation: an examination of dieting status, weight, and gender. Appetite 2007 Sep;49(2):450-458.
2. Carels RA, Harper J, Konrad K. Qualitative perceptions and caloric estimations of healthy and unhealthy foods by behavioral weight loss participants. Appetite 2006 Mar;46(2):199-206.
3. Block JP, Condon SK, Kleinman K, Mullen J, Linakis S, Rifas-Shiman S, et al. Consumers' estimation of calorie content at fast food restaurants: cross sectional observational study. Br Med J 2013 May 23;346:f2907.
4. Brown RE, Canning KL, Fung M, Jiandani D, Riddell MC, Macpherson AK, et al. Calorie estimation in adults differing in body weight class and weight loss status. Med Sci Sports Exerc 2016 Mar;48(3):521-526.
5. Galton F. Vox Populi. Nature 1907 Mar 7;75:450-451.
6. Marbach D, Costello JC, Küffner R, Vega NM, Prill RJ, Camacho DM, The DREAM5 Consortium, et al. Wisdom of crowds for robust gene network inference. Nat Methods 2012 Jul 15;9(8):796-804.
7. Yi SK, Steyvers M, Lee MD, Dry MJ. The wisdom of the crowd in combinatorial problems. Cogn Sci 2012 Apr;36(3):452-470.
8. Reddit. 2017. /r/SampleSize: Where your opinions actually matter! URL: https://www.reddit.com/r/SampleSize/ [accessed 2017-09-22]
9. Huang TT, Roberts SB, Howarth NC, McCrory MA. Effect of screening out implausible energy intake reports on relationships between diet and BMI. Obes Res 2005 Jul;13(7):1205-1217.
10. United States Department of Agriculture. Choosemyplate. 2017. MyPlate Model URL: https://www.choosemyplate.gov/MyPlate
11. Food and Drug Administration. Food labeling; nutrition labeling of standard menu items in restaurants and similar retail food establishments. Final rule. Fed Reg 2014;79(230):71155-71259.
12. wiseGEEK. 2017. What does 200 Calories Look Like? URL: http://www.wisegeek.com/what-does-200-calories-look-like.htm
13. HealthAssist. 2017. 300 Calorie Food Picture Gallery URL: http://www.healthassist.net/food/300kcal/300.shtml
14. Panda Express. 2017. Brown Steamed Rice URL: https://www.pandaexpress.com/menu/sides/brown-steamed-rice
15. Food Network. 2017. Foods with 100 Calories URL: http://www.foodnetwork.com/healthy/photos/foods-with-100-calories
16. Burger King. 2017. URL: https://www.bk.com/
17. Wit Co, Ltd. 2017. Matcha calorie calculation of cake URL: http://www.asken.jp/calculate/meal/94371
18. Papa John's Pizza. 2017. URL: https://www.papajohns.com/
19. Slism. 2017. Food nutrients can be seen calorie calculation at a glance URL: http://calorie.slism.jp/200528/
20. Jimmy John's. 2017. Jimmy John's Gourmet Sandwiches URL: https://www.jimmyjohns.com/
21. Bates D, Mächler M, Bolker B, Walker S. Fitting linear mixed-effects models using lme4. J Stat Softw 2015;67(1).
22. Kuznetsova A, Brockhoff PB, Christensen R. lmerTest Package: tests in linear mixed effects models. J Stat Softw 2017;82(13):1-26.
23. Bell D, Fried D, Huangfu L, Surdeanu M, Kobourov S. Towards using social media to identify individuals at risk for preventable chronic illness. 2016 Presented at: Tenth International Conference on Language Resources and Evaluation (LREC); May 23-28, 2016; Portorož, Slovenia p. 23-28 URL: http://www.lrec-conf.org/proceedings/lrec2016/summaries/30.html
24. Reddit. 2011. Who in the World is reddit? Results are in URL: https://redditblog.com/2011/09/12/who-in-the-world-is-reddit-results-are-in/
25. Surowiecki J. The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations. New York: Doubleday; 2004.
26. Watson GB. Do groups think more efficiently than individuals? J Abnorm Soc Psychol 1928;23(3):328-336.
27. Hernández T, Wilder L, Kuehn D, Rubotzky K, Moser-Veillon P, Godwin S, et al. Portion size estimation and expectation of accuracy. J Food Compos Anal 2006 Aug;19:S14-S21.
28. Almiron-Roig E, Solis-Trapala I, Dodd J, Jebb SA. Estimating food portions. Influence of unit number, meal type and energy density. Appetite 2013 Dec;71:95-103.
29. Faulkner GP, Pourshahidi LK, Wallace JM, Kerr MA, McCaffrey TA, Livingstone MBE. Perceived ‘healthiness’ of foods can influence consumers’ estimations of energy density and appropriate portion size. Int J Obes Relat Metab Disord 2013 May 07;38(1):106-112.
30. Pérez-Escamilla R, Obbagy JE, Altman JM, Essery EV, McGrane MM, Wong YP, et al. Dietary energy density and body weight in adults and children: a systematic review. J Acad Nutr Diet 2012;112(5).
31. Blake AJ, Guthrie HA, Smiciklas-Wright H. Accuracy of food portion estimation by overweight and normal-weight subjects. J Am Diet Assoc 1989 Jul;89(7):962-964.
32. Chandon P, Wansink B. Is obesity caused by calorie underestimation? A psychophysical model of meal size estimation. J Mark Res 2007 Feb;44(1):84-99.
33. Bailey RL, Mitchell DC, Miller C, Smiciklas-Wright H. Assessing the effect of underreporting energy intake on dietary patterns and weight status. J Am Diet Assoc 2007 Jan;107(1):64-71.
34. Kretsch MJ, Fong AK, Green MW. Behavioral and body size correlates of energy intake underreporting by obese and normal-weight women. J Am Diet Assoc 1999 Mar;99(3):300-306.
35. Kobourov SG, Pupyrev S, Simonetto P. Visualizing graphs as maps with contiguous regions. 2014 Presented at: Eurographics Conference on Visualization (EuroVis); 2014; Swansea, Wales, UK URL: https://pdfs.semanticscholar.org/c3dd/bbb159ece5672c4488604fdbe1acd25fae59.pdf
36. Bandini LG, Schoeller DA, Cyr HN, Dietz WH. Validity of reported energy intake in obese and nonobese adolescents. Am J Clin Nutr 1990 Sep;52(3):421-425.
37. Schoeller DA, Bandini LG, Dietz WH. Inaccuracies in self-reported intake identified by comparison with the doubly labelled water method. Can J Physiol Pharmacol 1990 Jul;68(7):941-949.
38. Hernández T, Wilder L, Kuehn D, Rubotzky K, Moser-Veillon P, Godwin S, et al. Portion size estimation and expectation of accuracy. J Food Comp Anal 2006;19:S14-S21.
39. Wansink B, Chandon P. Meal size, not body size, explains errors in estimating the calorie content of meals. Ann Intern Med 2006 Sep 05;145(5):326.
40. Martin CK, Han H, Coulon SM, Allen HR, Champagne CM, Anton SD. A novel method to remotely measure food intake of free-living individuals in real time: the remote food photography method. Br J Nutr 2009 Feb;101(3):446-456.
41. Williamson DA, Allen HR, Martin PD, Alfonso A, Gerald B, Hunt A. Digital photography: a new method for estimating food intake in cafeteria settings. Eat Weight Disord 2013 Jul 26;9(1):24-28.
42. Beltran A, Dadabhoy H, Chen T, Lin C, Jia W, Baranowski J, et al. Adapting the eButton to the abilities of children for diet assessment. 2016 Presented at: 10th International Conference on Methods and Techniques in Behavioral Research; May 25-27, 2016; Dublin, Ireland p. 25-27 URL: https://www.measuringbehavior.org/files/2016/MB2016_Proceedings.pdf
43. Beijbom O, Joshi N, Morris D, Saponas S, Khullar S. Menu-Match: Restaurant-Specific Food Logging from Images. 2015 Presented at: IEEE Winter Conference on Applications of Computer Vision; 2015; Waikoloa, HI, USA.
44. Lowe D. Object recognition from local scale-invariant features. 1999 Presented at: Seventh IEEE International Conference on Computer Vision; 1999; Kerkyra, Greece.
45. Myers A, Johnston N, Rathod V, Korattikara A, Gorban A, Silberman N, et al. Im2Calories: Towards an Automated Mobile Vision Food Diary. 2015 Presented at: IEEE International Conference on Computer Vision (ICCV); 2015; Santiago, Chile.
46. Bettadapura V, Thomaz E, Parnami A, Abowd G, Essa I. Leveraging Context to Support Automated Food Recognition in Restaurants. 2015 Presented at: IEEE Winter Conference on Applications of Computer Vision; January 5-9, 2015; Waikoloa, HI, USA.
47. Kitamura K, Yamasaki T, Aizawa K. Food log by analyzing food images. 2008 Presented at: 16th ACM international conference on Multimedia - MM 08; 2008; Vancouver, BC, Canada.
48. Kitamura K, Yamasaki T, Aizawa K. FoodLog. 2009 Presented at: ACM multimedia workshop on Multimedia for cooking and eating activities - CEA 09; 2009; Beijing, China.
49. Kawano Y, Yanai K. FoodCam: A real-time food recognition system on a smartphone. Multimed Tools Appl 2014 Apr 12;74(14):5263-5287.
50. Rother C, Kolmogorov V, Blake A. Grabcut: interactive foreground extraction using iterated graph cuts. ACM Trans Graph 2004;23(3):309-314.
51. Dalal N, Triggs B. Histograms of oriented gradients for human detection. 2005 Presented at: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05); June 20-25, 2005; San Diego, CA, USA.
52. Csurka G, Perronnin F. Fisher vectors: beyond bag-of-visual-words image representations. In: Communications in Computer and Information Science Computer Vision, Imaging and Computer Graphics Theory and Applications. Berlin, Heidelberg: Springer; 2011:28-42.
53. Chae J, Woo I, Kim S, Maciejewski R, Zhu F, Delp EJ, et al. Volume estimation using food specific shape templates in mobile image-based dietary assessment. Proc SPIE Int Soc Opt Eng 2011 Feb 07;7873:78730K.
54. He Y, Xu C, Khanna N, Boushey CJ, Delp EJ. Food image analysis: Segmentation, identification and weight estimation. 2013 Presented at: IEEE International Conference on Multimedia and Expo (ICME); 2013; San Jose, CA, USA.
55. Brabham DC. Crowdsourcing as a model for problem solving: an introduction and cases. Convergence 2008;14(1):75-90.
56. Gordon K. Group judgments in the field of lifted weights. J Exp Psychol 1924;7(5):398-400.
57. Evernote. 2017. Get organized. Work smarter. Remember everything URL: http://evernote.com [accessed 2017-09-21]
58. Eloquent Labs. 2017. URL: https://www.eloquent.ai/
59. Mamykina L, Smyth T, Dimond J, Gajos K. Learning from the crowd: Observational learning in crowdsourcing communities. 2016 Presented at: CHI Conference on Human Factors in Computing Systems - CHI 16; 2016; San Jose, CA, USA.
60. Noronha J, Hysen E, Zhang H, Gajos K. Platemate. 2011 Presented at: 24th Annual ACM symposium on User interface software and technology - UIST 11; 2011; Santa Barbara, CA, USA.
61. Kagaya H, Aizawa K. Highly accurate food/non-food image classification based on a deep convolutional neural network. 2015 Presented at: International Conference on Image Analysis and Processing; September 7, 2015; Genova, Italy p. 350-357.
62. Kagaya H, Aizawa K, Ogawa M. Food Detection and Recognition Using Convolutional Neural Network. 2014 Presented at: Proceedings of the ACM International Conference on Multimedia - MM 14; 2014; Orlando, FL, USA.
63. Meister K, Doyle M. Obesity and Food Technology. American Council on Science and Health; 2009. URL: https://www.scribd.com/document/37170221/Obesity-and-Food-Technology
64. Tapsell LC, Neale EP, Satija A, Hu FB. Foods, nutrients, and dietary patterns: interconnections and implications for dietary guidelines. Adv Nutr 2016 Dec;7(3):445-454.
65. Rains SA, Hingle MD, Surdeanu M, Bell D, Kobourov S. A test of the risk perception attitude framework as a message tailoring strategy to promote diabetes screening. Health Commun 2018 Jan 26:1-8.

Abbreviations

BMI: body mass index

ML: machine learning

RFPM: remote food photography method

USDA: United States Department of Agriculture

Edited by G Eysenbach; submitted 08.11.17; peer-reviewed by S Partridge, R Rosenkranz; comments to author 15.03.18; revised version received 26.04.18; accepted 20.08.18; published 05.11.18

©Jun Zhou, Dane Bell, Sabrina Nusrat, Melanie Hingle, Mihai Surdeanu, Stephen Kobourov. Originally published in the Interactive Journal of Medical Research (http://www.i-jmr.org/), 05.11.2018.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Interactive Journal of Medical Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.i-jmr.org/, as well as this copyright and license information must be included.
