R-CNN

Goal

Object Detection:
Given an input image, job is to implement a model that has to output what’s the object inside the image
Also predict a bounding box coordinate
-How to implement bounding box?
Sorting their top left coordinates x and y, their corresponding widths and heights
How can implement model?
Object Localization:
Classify and detect only one object in the image
(1)Receive an input image
(2)Reshape the image to a square with the size of 3X224X224
(3)Pass it to through a couple of conv and pooling layers
-Capture all the necessary features existing in the image
-Result is 4X32X32
(4)Flatten all the tensors
-Can flatten the tensors or use global average pooling
-Have a vector describing all the features extracted from previous layers
-Result is 4096X1
Classification Branch
(5)Fully connected layer1(Classifer) & Class scores
-Use a fully connected classifier at the end
-Result is 4096X1 to 1000X1
-Give us class scores for every possible object you might expect
-Class scores example
Car:0.90, Bike:0.01, Motorcycle:0.03,
It has a car and its probability is 90% that there is a car in the picture
(6)Cross-Entrophy Loss & Correct label:Car
-It is a supervised learning task so we utilize the correct label and compute its loss function
Object Detection Branch
(5)’Fully connected layer2 & Class scores
-Use another fully connected layer to output our bounding box coordinates
-Result is 4096X1 to 4X1
-Looks like (x, y, w, h)
(6)’Box Coordinates & L2 loss
-Already know the correct bounding box during training and we use L2 loss as a loss function
Footnote:
Step1 ~ Step7 is an object classification task
(7)Weighted Sum
Use Weighted sum to compute the final loss
-One major disadvantage:
Can detect only one object in the picture, because we output just one bounding box coordinate
So actually can’t use in scenarios when there are multiple objects such as cars in our input
Earliest Approach
Have a picture, and a sliding window
R-CNN1
(1)Put the sliding window at the top left corner, and tell it classify the region
R-CNN2
(2)Extract the region as an image
(3)Pass it to a neural network classifier
-Neural network classifier is an existing CNN module
(4)Produces c+1 output
-c: Expected classes we already know
-+1: Background, because the specific region might not have an object in there and in such cases, we want to classify it as a background
ex. sky without anything
(5)Slide our sliding window to every possible locations
R-CNN3
-ex. Extract the region and classifies that there is a mountain in it
Repeat the process (Slide -> Extract -> Classification)
Possible Problem:

Reference:
https://www.youtube.com/watch?v=nJzQDpppFj0&t=23s