Image-Based Calorie Estimation using Deep Learning

Published in Leapfrog, Jul 9, 2019

According to the WHO, almost 20% of deaths worldwide are attributable to an unhealthy diet. In 2016, 39% of adults aged 18 years and over were overweight, and 13% were obese. Most of the world’s population lives in countries where overweight and obesity kill more people than underweight.

The problem here is not about having enough food; it is that people do not know what is in their diet. If people could estimate their calorie intake from images of their food, they could easily decide how many calories they want to consume. An image-based calorie estimator built using deep learning could be a convenient app for keeping track of what an individual’s diet contains.

If people knew how many calories their food contains, this problem could be brought somewhat under control.

Proposed Solution

People often take pictures of their food before they eat and post them on social media; the solution lies in that very process. We propose to estimate the calorie content of a user-provided image by identifying the food and estimating its quantity using deep learning.

To estimate calories, we need accurate object detection with a high IoU (intersection over union) between the predicted and the actual food regions. Single-shot detectors can achieve an impressive IoU and are also faster than their counterparts, but they output only bounding boxes. A bounding box is too coarse to approximate the amount of food; we need pixel-level precision. So, the solution is to use instance segmentation.
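To make the IoU measure concrete, here is a minimal sketch in plain Python. The function names `box_iou` and `mask_iou` are illustrative assumptions, not part of the project's code; boxes are assumed to be `(x1, y1, x2, y2)` tuples and masks equally sized nested lists of 0/1.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Overlap rectangle; width and height clamp to 0 when the boxes are disjoint.
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def mask_iou(m1, m2):
    """IoU of two binary masks given as equally sized nested lists of 0/1."""
    inter = 0
    union = 0
    for row1, row2 in zip(m1, m2):
        for p1, p2 in zip(row1, row2):
            if p1 and p2:
                inter += 1  # pixel belongs to both masks
            if p1 or p2:
                union += 1  # pixel belongs to at least one mask
    return inter / union if union else 0.0
```

Box IoU rewards a detector for finding the right rectangle, while mask IoU measures agreement pixel by pixel, which is the level of detail a quantity estimate depends on.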

How Was It Done?

First, we needed data to train the Mask R-CNN model. The data had to be annotated with the boundary and class of each food item on a plate. After looking around for a while, I found that the food images prepared by the University of Milano-Bicocca, Italy, fit our requirements; the dataset is called UNIMIB-2016. After some pre-processing, the dataset was ready for training Mask R-CNN.

Food-item Identification

To identify what’s on the plate, we need to instance-segment the given food image into the possible food categories. Instance segmentation classifies each pixel in the picture into one of the possible classes, i.e. foods in our case, while keeping separate objects apart. The architecture of Mask R-CNN is a good match for this problem. Mask R-CNN takes an image and produces three outputs: masks of the identified items, bounding boxes, and a class for each detected mask. Masks are binary single-channel matrices of the same size as the input image that mark which pixels belong to the identified object.
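The difference between a mask and a bounding box can be seen on a toy example. The sketch below is only an illustration in plain Python: `mask_area`, `mask_bbox`, and `toy_mask` are hypothetical names, and a real mask would be a full-resolution array rather than a 5x5 list.

```python
def mask_area(mask):
    """Number of nonzero pixels in a binary mask (nested lists of 0/1)."""
    return sum(1 for row in mask for p in row if p)


def mask_bbox(mask):
    """Tightest box (x1, y1, x2, y2) around the nonzero pixels.

    x2 and y2 are exclusive. Returns None for an empty mask.
    """
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, p in enumerate(row) if p]
    if not ys:
        return None
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


# A triangular "food item": 6 mask pixels inside a 3x3 bounding box.
toy_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

x1, y1, x2, y2 = mask_bbox(toy_mask)
box_area = (x2 - x1) * (y2 - y1)
```

Here the mask covers 6 pixels while its bounding box covers 9, so a box-based estimate would overstate the food area by half. This is why the quantity estimate is taken from the mask.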

Mask R-CNN builds on Faster R-CNN, a region-based convolutional neural network. A convolutional neural network (ConvNet/CNN) is a deep learning algorithm that takes an input image, assigns importance (learnable weights and biases) to various aspects or objects in it, and can differentiate one type of image from another. For object detection, we need to know the class of each object as well as the size and location of its bounding box. The conventional approach slides a window over every position in the image. It is a simple solution, but different objects, or even objects of the same kind, can have different aspect ratios and sizes depending on the object's size and its distance from the camera, and different image sizes also change the effective window size. Running a deep CNN classifier at every location, for every window shape, is therefore extremely slow.
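A rough count shows why the sliding-window approach is slow. The sketch below is an illustration under assumed numbers (image size, window sizes, aspect ratios, and stride are all hypothetical); `count_windows` and `total_windows` are names made up for this example.

```python
def count_windows(img_w, img_h, win_w, win_h, stride):
    """Number of positions a win_w x win_h window can take in the image."""
    if win_w > img_w or win_h > img_h:
        return 0
    nx = (img_w - win_w) // stride + 1
    ny = (img_h - win_h) // stride + 1
    return nx * ny


def total_windows(img_w, img_h, sizes, aspect_ratios, stride):
    """Total windows over every combination of size and aspect ratio."""
    total = 0
    for s in sizes:
        for r in aspect_ratios:
            # r is width / height; the window area stays close to s * s.
            win_w = round(s * r ** 0.5)
            win_h = round(s / r ** 0.5)
            total += count_windows(img_w, img_h, win_w, win_h, stride)
    return total


# One 64x64 window with a stride of 8 on a 640x480 image.
single_shape = count_windows(640, 480, 64, 64, 8)
# Three sizes and three aspect ratios multiply the work further.
all_shapes = total_windows(640, 480, [64, 128, 256], [0.5, 1.0, 2.0], 8)
```

Under these assumptions a single window shape already needs 3869 classifier runs per image, and each extra size or aspect ratio adds more on top of that.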

To bypass the problem of evaluating a huge number of regions, Ross Girshick et al. proposed a method that uses selective search to extract just 2000 regions from the image, which they called region proposals. The architecture is called R-CNN.

The author of the R-CNN paper then addressed some of its drawbacks to build a faster object detection algorithm called Fast R-CNN. The approach is similar to R-CNN, but instead of feeding each region proposal to the CNN, we feed the whole input image to the CNN once to generate a convolutional feature map, and the region proposals are projected onto that map. Faster R-CNN goes one step further and replaces selective search with a learned region proposal network.

Mask R-CNN extends the head to three branches, compared with just two in Faster R-CNN: a mask prediction branch is added to the Faster R-CNN architecture. A mask is simply an image in which some pixel values are zero and others are non-zero, which marks the boundaries of an object. Apart from that, Mask R-CNN uses RoIAlign, which samples features for each Region of Interest (RoI) with bilinear interpolation, instead of the floor division used in Faster R-CNN. That rounding was accurate enough for bounding box prediction but badly misplaced the output masks.
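The effect of floor division versus bilinear interpolation can be shown on a tiny feature map. This is a simplified sketch of the idea, not the actual RoIAlign layer: `bilinear_sample`, `quantized_sample`, and the 2x2 `fmap` are hypothetical, and coordinates are assumed to lie inside the map.

```python
import math


def bilinear_sample(fmap, x, y):
    """Sample a 2D feature map (nested lists) at a fractional (x, y).

    Assumes 0 <= x <= width - 1 and 0 <= y <= height - 1.
    """
    x0, y0 = math.floor(x), math.floor(y)
    # Clamp the right/bottom neighbours so edge coordinates stay in range.
    x1 = min(x0 + 1, len(fmap[0]) - 1)
    y1 = min(y0 + 1, len(fmap) - 1)
    dx, dy = x - x0, y - y0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def quantized_sample(fmap, x, y):
    """Floor the coordinates first, as a quantizing RoI pooling step would."""
    return fmap[math.floor(y)][math.floor(x)]


fmap = [
    [0.0, 10.0],
    [20.0, 30.0],
]

aligned = bilinear_sample(fmap, 0.5, 0.5)   # blends all four neighbours
snapped = quantized_sample(fmap, 0.5, 0.5)  # snaps to the top-left cell
```

The misalignment grows once it is mapped back to image pixels. Assuming a feature map stride of 16, an RoI edge at pixel 100 lands at 6.25 on the feature map; flooring it to 6 corresponds to pixel 96, a shift of 4 pixels that matters little for a box but visibly displaces a mask boundary.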

After acquiring the food object and its mask, we need a fool-proof way to estimate the size of the identified food as well, because the calorie estimate depends on it. Estimating object size through a pinhole camera is tricky: without multiple cameras, a reference object is needed to approximate size.

Food Calorie Estimation

Since the same food can be photographed at different depths and so appear at different sizes in the picture, we need a method to estimate the real-world size of the food. Once the food items are detected along with their masks, we need the real object sizes, which cannot be recovered from pinhole camera images alone. So, we take a referencing approach: the food objects are compared against an object of known size to extract the actual size of the food in that specific image.

The reference object could be a coin, for example; we propose to use the plate itself as the reference object for estimating the food detected in images. Plates can be detected using edge detection, or included in the training data so that a single network detects them along with the foods. Once the plate is detected, pixels_per_inch_sq is calculated from the actual size of the plate in real life.

pixels_per_inch_sq = plate_pixels_area / actual_plate_area
real_food_area = masked_food_pixel_area / pixels_per_inch_sq
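The two formulas above can be sketched in Python as follows. The sketch assumes a circular plate of known diameter photographed roughly from above; the function names and the 10-inch plate in the usage lines are hypothetical, and the pixel areas would come from the plate and food masks.

```python
import math


def pixels_per_sq_inch(plate_pixel_area, plate_diameter_in):
    """Pixels per square inch, using the plate as the reference object.

    Assumes a circular plate whose real diameter (in inches) is known.
    """
    if plate_pixel_area <= 0 or plate_diameter_in <= 0:
        raise ValueError("plate area and diameter must be positive")
    actual_plate_area = math.pi * (plate_diameter_in / 2) ** 2
    return plate_pixel_area / actual_plate_area


def real_food_area(food_pixel_area, plate_pixel_area, plate_diameter_in):
    """Real-world food area in square inches from its mask pixel count."""
    scale = pixels_per_sq_inch(plate_pixel_area, plate_diameter_in)
    return food_pixel_area / scale


# Hypothetical numbers: the food mask covers half as many pixels as the plate.
area = real_food_area(food_pixel_area=5000,
                      plate_pixel_area=10000,
                      plate_diameter_in=10.0)
```

With these numbers the food covers half of the plate's real area. Turning that area into calories would need an additional per-food conversion factor, which is not covered here.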

Conclusion

From this brief test of Mask R-CNN on a food image dataset, we can conclude that an application capable of estimating calories from food images is quite achievable. Such an application could have a significant impact on how people perceive a plate of food, and on the weight-loss and weight-management market.

My takeaways

Working on this project gave me a quick introduction to deep learning. I came to understand that there are many application areas where we can harness the capabilities of Mask R-CNN.

Looking back, we are grateful to Leapfrog for giving us this opportunity, which will surely prove to be a great foundation for our careers. I would also like to thank the KC duo (Aviskar KC and Bipin KC) for being such outstanding mentors for my machine learning project.

About this project

This project was part of my internship at Leapfrog. The six-week internship program covered a broad area, from data analysis and predictive modeling to the core of machine learning and deep learning algorithms. The intensity of the program and the volume of knowledge we gained intrigued us. You can find my project for food calorie estimation on GitHub.

About the author

Binayak Pokhrel is a full-time Machine Learning Engineer at Leapfrog Technology. He has a deep fascination with AI; identifying a rooted problem and solving it with whatever it takes gets him going.