Exercise 3 - Object Detection

Published: Monday Nov. 30.

Setup

Follow the same setup steps as in Exercise 1 for setting up your robot and developing through the dts exercises interface.

Finally, make sure your shell is up to date, and the first time you run things through the dts exercises interface add the --pull flag once locally and once on the robot:

$ dts exercises test -b ![ROBOT_NAME] --pull 
$ dts exercises test --sim --pull

You may choose to copy over any of the other lane following modules that we tuned for Exercises 1 and 2, or borrow them from someone else in the class. If you borrow someone else's work, you must acknowledge them; they will receive a small bonus.

Additionally, in this case you will need more things locally, since you will be generating annotated data and training a deep neural network for object detection. It is recommended that you do so in a virtual environment.

To do so, install virtualenv (pip install virtualenv) and then run the following from the object_detection folder:

$ virtualenv venv
$ source venv/bin/activate

We will be working in the object_detection exercises folder in dt-exercises.

Enter that directory, and run

$ ./setup.sh

This will clone the simulator locally into the utils/data_collection folder, and do a number of other things such as installing requirements, downloading the real robot dataset, and processing it for you.

Data Collection

In order for your agent to work both in simulation and on the real robot, you will construct an annotated dataset that is composed of labels from both modalities. In the case of real data, you should be grateful that two students from last year annotated a huge dataset of real images. When you ran setup.sh, they were downloaded into a folder called dataset. We also wrote the transformation function for you in data_collection/data_transfer.py, which prepares the data for you. For the simulated images, you will have to generate the annotations automatically yourself.

Do note that in this exercise we don’t want to differentiate individual objects of the same type from one another: all duckies share one class, all cones share another, and so on. Our images will include duckies, buses, trucks, and cones.

We thus have five classes:

  • 0: background
  • 1: duckie
  • 2: cone
  • 3: truck
  • 4: bus

To collect our data, we’ll use the segmented flag in the simulator. Try running

$ ./utils/data_collection.py 

which cycles between the segmented simulator and the normal one.

You will have to modify the code in ./utils/data_collection.py in order to get a good dataset.

First, you will notice that there is noise that needs to be removed (it appears as “snow”). You could consider using one of the basic image processing techniques covered in class, or other OpenCV functions, to remove this.
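For instance, a small morphological opening (or a median blur) is usually enough to kill isolated speckles while leaving the large object blobs intact. A minimal sketch, assuming the segmented frame is an HxWx3 uint8 array, with a kernel size you will want to tune:

import cv2
import numpy as np

def remove_snow(seg_img):
    # Morphological opening removes isolated "snow" pixels while preserving
    # the large, solid object regions. The 3x3 kernel size is a guess to tune.
    kernel = np.ones((3, 3), np.uint8)
    return cv2.morphologyEx(seg_img, cv2.MORPH_OPEN, kernel)
    # cv2.medianBlur(seg_img, 3) is a reasonable alternative.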

Notice that when we’re in the segmented simulator, all the objects we’re interested in have the exact same color, and the lighting and domain randomization are turned off. Just like the data_collection.py file does, we can also turn the segmentation back off for the same position of the agent. In other words, we can essentially produce two 100% identical images, save for the fact that one is segmented and the other is not.

Then, to collect the dataset:

  • We want as many images as reasonable. The more data you have, the better your model, but also, the longer your training time.
  • We want to remove all non-class pixels in the segmented images. You’ll have to identify the white lines, the yellow lines, the stop lines, etc., and remove them from the masks. Do the same for the coloured “snow” that appears in the segmented images.
  • We want to identify each class by the numbers mentioned above.
  • We also want the bounding boxes and their corresponding classes.

Your dataset must respect a certain format. The images must be 224x224x3. The boxes must be in [xmin, ymin, xmax, ymax] format. The labels must be an np.array indexed the same way as the boxes (so labels[i] is the label of boxes[i]).
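Boxes in that format can be read off per-class binary masks with OpenCV's connected-components analysis. Below is a minimal sketch; it assumes you have already converted the cleaned segmented frame into an HxW array of the class ids listed above (how you map the segmentation colours to those ids is up to you), and the minimum-area threshold is an arbitrary value to tune:

import cv2
import numpy as np

def boxes_from_label_image(label_img, min_area=25):
    # label_img: HxW array of class ids (0 = background, 1..4 = objects).
    # Returns boxes as [xmin, ymin, xmax, ymax] and the matching class labels.
    boxes, classes = [], []
    for cls in (1, 2, 3, 4):
        mask = (label_img == cls).astype(np.uint8)
        n, _, stats, _ = cv2.connectedComponentsWithStats(mask)
        for i in range(1, n):  # component 0 is the background
            x, y, w, h, area = stats[i]
            if area < min_area:  # drop leftover speckles
                continue
            boxes.append([x, y, x + w, y + h])
            classes.append(cls)
    return np.array(boxes).reshape(-1, 4), np.array(classes)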

We want to be able to read your .npz, so you must respect this format:

img = data["arr_0"]
boxes = data["arr_1"]
classes = data["arr_2"]

Additionally, each .npz file must be identified by a number. So, if your dataset contains 1000 items, you’ll have npzs ranging from 0.npz to 999.npz.
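Since np.savez stores positional arguments under the keys arr_0, arr_1, arr_2, ... in order, saving each sample with the image first, then the boxes, then the classes produces files that read back exactly as shown above. A minimal sketch (the output directory name is a placeholder and must already exist):

import numpy as np

def save_sample(index, img, boxes, classes, out_dir="dataset_sim"):
    # Positional arguments become arr_0 (image), arr_1 (boxes), arr_2 (classes),
    # matching the reading code above. Files are named 0.npz, 1.npz, ...
    np.savez(f"{out_dir}/{index}.npz", img, boxes, classes)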

Do note that even though your dataset images have to be of size 224x224, you are allowed to feed smaller or bigger images to your model. If you wish to do so, simply resize the images at train/test/validation time.

Model Training

Now we can train our model using the dataset that we have created.

The two files that you will need to modify are:

  • exercise_ws/src/object_detection/include/object_detection/model.py

which defines your model, and

  • utils/train.py

which will execute the training for you.

How you do this is really up to you. You can use TensorFlow or PyTorch. As an example, we provide this PyTorch object detection tutorial.

If you follow the tutorial, in utils/train.py you will have to define your Dataset class (a minimal sketch is given after this list). It should provide:

  • The bounding boxes for each class in each image (unlike in the tutorial, you already calculated these in the Data Collection part of this exercise)
  • The class labels for each bounding box
  • The normal, non-segmented image
  • An ID for the image (you should just use the index of the .npz)
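
A minimal sketch of such a Dataset follows. It assumes your .npz files follow the format described earlier and live in a single directory; the class name and directory layout are placeholders, and the target dictionary keys follow the torchvision detection models' convention:

import os
import numpy as np
import torch
from torch.utils.data import Dataset

class DuckietownDetectionDataset(Dataset):
    def __init__(self, root):
        self.root = root
        # 0.npz, 1.npz, ... sorted numerically, so the index doubles as the image ID.
        self.files = sorted(
            (f for f in os.listdir(root) if f.endswith(".npz")),
            key=lambda f: int(f.split(".")[0]),
        )

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        data = np.load(os.path.join(self.root, self.files[idx]))
        img = data["arr_0"]      # 224x224x3 non-segmented image (uint8)
        boxes = data["arr_1"]    # [xmin, ymin, xmax, ymax] per object
        labels = data["arr_2"]   # class id per box

        # torchvision detection models expect CxHxW float images in [0, 1]
        # and a target dict with float32 boxes and int64 labels.
        img = torch.as_tensor(img, dtype=torch.float32).permute(2, 0, 1) / 255.0
        target = {
            "boxes": torch.as_tensor(boxes, dtype=torch.float32).reshape(-1, 4),
            "labels": torch.as_tensor(labels, dtype=torch.int64),
            "image_id": torch.tensor([idx]),
        }
        return img, target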

You are also free to experiment with the model.py that you define. Note that here we don’t have annotated masks, so we can use FasterRCNN instead of MaskRCNN:

import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)

# get number of input features for the classifier
in_features = model.roi_heads.box_predictor.cls_score.in_features
# replace the pre-trained head with a new one (5 classes, including background)
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, 5)
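
For reference, and independent of the tutorial's engine.py helpers, a bare-bones training loop over such a model could look like the sketch below. It reuses the hypothetical DuckietownDetectionDataset from the previous sketch; the batch size, learning rate, epoch count and output path are arbitrary placeholders:

import torch
from torch.utils.data import DataLoader

dataset = DuckietownDetectionDataset("path/to/your/npz/folder")  # placeholder path
# Detection targets are dicts of varying size, so batches are kept as tuples.
loader = DataLoader(dataset, batch_size=4, shuffle=True,
                    collate_fn=lambda batch: tuple(zip(*batch)))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9, weight_decay=5e-4)

model.train()
for epoch in range(10):
    for images, targets in loader:
        images = [img.to(device) for img in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        # In training mode the torchvision detection models return a dict of losses.
        loss_dict = model(images, targets)
        loss = sum(loss_dict.values())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

torch.save(model.state_dict(), "weights.pt")  # placeholder output path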

A note on the tutorial

Make sure to carefully read the tutorial. Blindly copying it won’t directly work. The training data it expects is very specific, and you should make sure that you follow its structure exactly.

Additionally, weirdly enough, the tutorial expects you to have some files that it does not link to.

Perhaps having a look (and a download) at these links might save you some time:

  • https://github.com/pytorch/vision/blob/master/references/detection/engine.py
  • https://github.com/pytorch/vision/blob/master/references/detection/coco_utils.py
  • https://github.com/pytorch/vision/blob/master/references/detection/transforms.py

You can also safely remove the evaluate call that the tutorial uses, and it will save you the headache of installing most of the coco_utils and coco_evaluate dependencies.

Verifying your Model

You should probably write a way to visually evaluate the performance of your model.

Something like displaying the input image and overlaying the bounding boxes (colored by class) would be simple but very effective.
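
A minimal sketch of such an overlay is shown below. It assumes your model wrapper returns boxes, class ids and confidence scores as arrays; the colour table and the score threshold are arbitrary choices:

import cv2

# One BGR colour per class id, purely for display.
VIS_COLORS = {1: (0, 255, 255), 2: (0, 165, 255), 3: (255, 0, 0), 4: (0, 0, 255)}

def draw_detections(img, boxes, classes, scores=None, score_thresh=0.5):
    out = img.copy()
    for i, (box, cls) in enumerate(zip(boxes, classes)):
        if scores is not None and scores[i] < score_thresh:
            continue  # hide low-confidence detections
        xmin, ymin, xmax, ymax = map(int, box)
        color = VIS_COLORS.get(int(cls), (255, 255, 255))
        cv2.rectangle(out, (xmin, ymin), (xmax, ymax), color, 2)
        cv2.putText(out, str(int(cls)), (xmin, max(ymin - 5, 0)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 1)
    return out

# e.g. cv2.imshow("detections", draw_detections(img, boxes, classes, scores)); cv2.waitKey(0)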

You should also carefully read model.py, as there are comments in it that describe the API your wrapper should respect.

Training Hardware

But how should you actually train your model? If you have a recent-ish NVIDIA GPU, you can train directly on your computer. For reference, training on a GTX 960 with a dataset of 2000 items was very doable.

If you don’t have a GPU, or if your GPU is too slow, you can use Google Colab. We included a .ipynb in the notebooks directory. You can open it with Google Colab, upload the root of this exercise to your Google Drive, and the provided notebook will mount the folder from your drive into the Colab runtime, and then call your training script. To access the saved weights, simply download them from your Google Drive.

Integration

Finally, we will use the exercises infrastructure from Exercises 1 and 2 to integrate our model into a functional end-to-end robot capability. You will see quite a few packages in exercise_ws, but you should only need to worry about one (or maybe two if you attempt the Bonus): the object_detection package.

The only real decision you will have to make in the object_detection_node is under what conditions to consider your detections worthy of reporting, and thus triggering a transition to the pedestrian avoidance behavior.
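
One simple policy is to report a detection only when a duckie box is both confident and large enough in the image (i.e. close enough to matter). The helper below is purely a hypothetical sketch, not part of the provided node; the thresholds will need tuning in simulation and on the robot:

DUCKIE_CLASS = 1
MIN_SCORE = 0.6        # ignore low-confidence boxes (placeholder value)
MIN_AREA_FRAC = 0.02   # box must cover at least ~2% of the image (placeholder value)

def should_report(boxes, classes, scores, img_w=224, img_h=224):
    # Return True if any duckie detection is confident and close enough
    # to justify switching to the pedestrian avoidance behavior.
    img_area = float(img_w * img_h)
    for box, cls, score in zip(boxes, classes, scores):
        if cls != DUCKIE_CLASS or score < MIN_SCORE:
            continue
        xmin, ymin, xmax, ymax = box
        if (xmax - xmin) * (ymax - ymin) / img_area >= MIN_AREA_FRAC:
            return True
    return False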

Optional Reading: What are the Other Packages Doing?

In order to integrate your object detector into the lane following stack, we have slightly modified the “Finite State Machine” of the system to include a new state, DUCKIE_AVOID. If you look at the lane_following_pedestrians.yaml in the fsm/config directory, you can probably figure out how it works. When your object detection is published as True, it triggers a state change. In turn, this causes a different controller to become active (this is regulated through the car_cmd_switch, which basically acts as a multiplexer selecting between the different controllers - if you take a look at lane_following_pedestrians.yaml in the car_cmd_switch config directory, you can probably figure out how it works). In this case the active controller is the pedestrian_avoidance_controller_node, which is defined in the pedestrian_avoidance package. At present, this controller just outputs zero linear and angular velocity (it makes the robot stop). The duckietown_demos package is included because we defined a new demo in the launch folder, lane_following_pedestrians.launch, which we now launch instead of lane_following.launch.

Bonus (max 5%)

As mentioned in the optional reading, the pedestrian_avoidance_controller_node just outputs zeroes. A fun challenge could be to publish more detailed information about the whereabouts of the pedestrian in the object_detection_node and then use that information to do planning to navigate around the pedestrian. This will take some work though.

Deliverables

To submit your assignment,

  1. You should make two different submissions to the aido5-LFP-sim-validation challenge (notice it’s LFP and not LF this time). One of them should be optimized to run in the simulator, and the other on the real robot. You should change your submission’s label in the file submission.yaml to user-label: sim-exercise-3 before submitting for the simulation, and to user-label: real-exercise-3 for the real robot. The output that you get on the challenge server for real-exercise-3 does not matter; we will run that submission on our own robots for the evaluation.

Note that you can use the same code for both submissions, but having two different submissions will allow you to tune parameters for the real robot and the simulator separately.

  2. You should also submit a video of your code running on your robot with dts exercises test -b ![YOUR_ROBOT]. This will be useful in case a problem happens when we try to run your code on our robot. You can submit the video here.

  3. Please send Liam, Anthony, and Charlie a link to your GitHub repo through a private message on Slack. Also please mention whether you borrowed any of the other lane_following packages from someone else.

Grading

This assignment is worth 15% of your final grade. Compared with Exercise 1, more weight will be given to the intermediate outputs and less to the end-to-end performance of the agent.

  • 4% creation of your dataset
  • 5% training of your model
  • 3% integration in simulation
  • 3% integration on the Duckiebot

Please report problems quickly on Discord, in class, on Slack, as GitHub issues, etc. If you are not sure whether something is working properly, ask. Don’t spend hours trying to fix it first.

Other Pointers, Helpers, and Things of Interest

You may find it helpful to look at what an existing dataset comprises. Try downloading the PennFudanPed dataset, a sample pedestrian detection dataset.

You’ll notice that if you try opening the masks in that dataset, your computer will display a black image. That’s because each pedestrian’s pixels are labelled with a small integer value and the image only has one channel, even though the mask is saved as an ordinary image file.

Try scaling the masks from 0 to 255, using something like np.floor(mask / np.max(mask) * 255).astype(np.uint8). This will turn the masks into something akin to a .bmp. Then, use OpenCV’s applyColorMap function on the result to visualize it. Try looking at the two display functions found in utils.py for inspiration.
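
Putting the scaling and the colour map together, a minimal sketch (the file path is only an example; assumes OpenCV is installed):

import cv2
import numpy as np

# Load one mask as a single-channel image; pixel values are small integers
# (0 = background, 1..N = pedestrian instances), which is why it looks black.
mask = cv2.imread("PennFudanPed/PedMasks/FudanPed00001_mask.png", cv2.IMREAD_GRAYSCALE)

# Stretch the instance ids to the full 0-255 range so they become visible,
# then map intensities to colours for easier inspection.
vis = np.floor(mask / np.max(mask) * 255).astype(np.uint8)
colored = cv2.applyColorMap(vis, cv2.COLORMAP_JET)

cv2.imshow("mask", colored)
cv2.waitKey(0)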

You’ll also notice that the dataset doesn’t include any bounding boxes. That’s okay. For training with PennFudanPed, we have to compute them through numpy and opencv, just like we will for our own dataset.

Actually, for our own training, we won’t need the masks! All we want are the bounding boxes. But PennFudanPed is a useful example, as it shows how we can extract bounding boxes from masks, something we will also do for our own dataset. To see how to do this, you may skip ahead to the tutorial linked in the Model Training section.

The deadline is officially set for Dec. 23, but no points will be deducted if it is received before Dec. 31.