Efficient semantic segmentation for real-time applications

Scene understanding is a long-standing problem in Computer Vision that aims to extract useful information about a scene from raw sensor data and to interpret its content at the level of human understanding. Over the last decade, modern deep learning techniques have enabled remarkable progress on many Computer Vision tasks for scene understanding at various levels of detail and abstraction. Semantic segmentation, the main focus of this study, is an important tool for visual scene understanding: it assigns to each pixel in an image a class label from a predefined set of classes. Semantic segmentation has proven critical to applications that need a precise, pixel-level understanding of their environment, such as autonomous vehicles, medical image analysis for computer-aided diagnosis, and robot navigation.

Recent advances in deep learning based semantic segmentation show significant accuracy gains, but these models typically require high-end GPUs to run inference in near real time. Computational efficiency, in contrast, has not been studied as thoroughly. This is a challenging problem for many robotics platforms, where high-end GPUs are not always available and both effectiveness and efficiency are required. This thesis presents our work towards solutions to the following challenges: 1) the time and computation constraints of real-time applications and of systems with limited computational power, such as autonomous driving; 2) incorporating spatial relationships and contextual information, along with other high-level extracted features, to improve scene understanding; and 3) the model's ability to generalize to unseen but similar domains when the labelled data available for training is small or limited. Addressing these problems with an emphasis on solutions that reduce inference time, the thesis makes the following contributions:

1. First, the thesis introduces a novel neural network-based semantic segmentation model that is both memory-efficient and fast, and capable of running on a CPU. Because it consumes very few computational resources in both execution time and memory, it can be embedded in real-time systems. By integrating limited prior information, such as hand-crafted features, into the model's input, excellent segmentation results can be achieved without increasing network depth, which keeps the number of parameters small and reduces computational effort (an illustrative sketch of this idea appears after this list). To showcase the practicality of this real-time CPU segmentation model, we apply it to urban scene segmentation. Specifically, we focus on efficient and highly accurate road segmentation, which holds significant potential for intelligent vehicle applications in creating a safe drivable environment. The model surpasses the performance of GPU-based state-of-the-art semantic segmentation methods while running at very fast rates.

2. To improve the precision of the proposed model, a graph-based image segmentation technique is employed, incorporating contextual information and spatial dependencies at minimal additional cost. Various optimization algorithms, including approximate inference procedures, are investigated to enhance the segmentation results within the chosen graph-based neighbourhood model (the second sketch after this list illustrates one such procedure). The introduced method achieves state-of-the-art performance on benchmark datasets for road semantic segmentation while remaining suitable for CPUs or low-end GPUs.

3. The third contribution tackles three important challenges: a) the scarcity of training data, where only a small number of fully labelled images are available alongside a large set of unlabelled data; b) the need to improve the generalizability of the model so it works effectively on unseen but similar domains; and c) the need to make the model robust to context-changing factors, such as shadows on the road surface. To address these challenges, we propose a novel semi-supervised semantic segmentation method that, combined with our previously introduced technique, yields fine-grained segmentation results (the third sketch after this list outlines the idea). The method leverages an unsupervised image-to-image translation technique, which learns a mapping between two visual domains without relying on paired data. We demonstrate its effectiveness on road segmentation, a task whose difficulty stems from the resemblance of roads to other patterns, such as walking areas and grass, and from shadows or vehicles on the road surface. By addressing limited labelled data, generalizability, and robustness to context changes, our method achieves performance comparable to state-of-the-art methods while operating efficiently on low-end GPUs with low computational effort.
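
To make the first contribution concrete, the following is a minimal, hypothetical sketch, not the thesis architecture itself: a shallow fully convolutional network in PyTorch whose input concatenates the RGB image with a hand-crafted feature channel (here a Sobel gradient magnitude, an assumed choice for illustration). The point is that injecting prior information at the input lets the network stay shallow, with few parameters, and therefore fast on a CPU.

```python
# Illustrative sketch only -- not the thesis architecture. It shows the idea of
# feeding a hand-crafted feature (here an assumed Sobel gradient magnitude)
# alongside RGB so a shallow network can stay small enough for CPU inference.
import torch
import torch.nn as nn
import torch.nn.functional as F

def gradient_magnitude(rgb: torch.Tensor) -> torch.Tensor:
    """Hand-crafted feature channel: Sobel gradient magnitude of the grey image."""
    grey = rgb.mean(dim=1, keepdim=True)                      # (N, 1, H, W)
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    gx = F.conv2d(grey, kx.reshape(1, 1, 3, 3), padding=1)
    gy = F.conv2d(grey, kx.t().reshape(1, 1, 3, 3), padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)

class TinySegNet(nn.Module):
    """A deliberately shallow FCN: 3 RGB channels plus 1 hand-crafted channel
    in, per-pixel class scores out. Few layers, few parameters."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.classifier = nn.Conv2d(32, num_classes, 1)

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        x = torch.cat([rgb, gradient_magnitude(rgb)], dim=1)  # inject the prior
        logits = self.classifier(self.encoder(x))             # (N, C, H/2, W/2)
        return F.interpolate(logits, size=rgb.shape[-2:],
                             mode="bilinear", align_corners=False)

if __name__ == "__main__":
    net = TinySegNet(num_classes=2).eval()                    # road vs. not-road
    with torch.no_grad():
        scores = net(torch.rand(1, 3, 256, 512))              # CPU inference
    print(scores.shape)                                       # [1, 2, 256, 512]
```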
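
The graph-based refinement of the second contribution can likewise be illustrated, under assumptions, with one simple approximate inference procedure. The sketch below runs iterated conditional modes (ICM) on a 4-connected Potts model: the network's per-pixel class costs act as unary terms, and a pairwise penalty encourages neighbouring pixels to agree. The weight and iteration count are placeholders, and ICM is only one of the approximate inference schemes that fit this description.

```python
# Minimal sketch of graph-based refinement via approximate inference: iterated
# conditional modes (ICM) on a 4-connected Potts model. Unary terms come from
# the network's per-pixel class costs; the Potts term penalises label
# disagreement between neighbours. Weight and iteration count are illustrative.
import numpy as np

def icm_refine(unary: np.ndarray, pairwise_weight: float = 0.5,
               n_iters: int = 5) -> np.ndarray:
    """unary: (H, W, C) per-pixel costs (e.g. negative log-probabilities).
    Returns an (H, W) label map after ICM refinement."""
    H, W, _ = unary.shape
    labels = unary.argmin(axis=2)                  # initialise from unaries
    for _ in range(n_iters):
        for y in range(H):
            for x in range(W):
                cost = unary[y, x].copy()          # data term per label
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < H and 0 <= nx < W:
                        # Potts penalty: +w for every label that disagrees
                        # with this neighbour's current label.
                        cost += pairwise_weight
                        cost[labels[ny, nx]] -= pairwise_weight
                labels[y, x] = cost.argmin()       # greedy local update
    return labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noisy_costs = rng.random((32, 32, 2))          # fake unary costs
    refined = icm_refine(noisy_costs)
    print(refined.shape)                           # (32, 32)
```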
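
For the third contribution, the sketch below shows one plausible shape of a semi-supervised training step, not the thesis algorithm: a supervised loss on the few labelled images is combined with a consistency loss that asks the segmenter to predict the same labels for an unlabelled image and for its translation into the other visual domain. The `translate` module is a stand-in for any unpaired image-to-image translation model (CycleGAN-style) and is assumed here to be already trained and frozen.

```python
# Hypothetical semi-supervised step, for illustration only. Combines a
# supervised cross-entropy term on labelled data with a consistency term
# between an unlabelled image and its domain-translated counterpart, which
# is one way to gain robustness to context changes such as shadows.
import torch
import torch.nn.functional as F

def semi_supervised_step(seg_net, translate, labelled, labels, unlabelled,
                         optimiser, consistency_weight: float = 0.5):
    seg_net.train()
    optimiser.zero_grad()

    # 1) Supervised term on the small labelled set.
    sup_loss = F.cross_entropy(seg_net(labelled), labels)

    # 2) Consistency term: pseudo-labels from the unlabelled image must be
    #    reproduced on its translation into the other visual domain.
    with torch.no_grad():
        shifted = translate(unlabelled)              # unpaired translation
        pseudo = seg_net(unlabelled).argmax(dim=1)   # (N, H, W) pseudo-labels
    cons_loss = F.cross_entropy(seg_net(shifted), pseudo)

    loss = sup_loss + consistency_weight * cons_loss
    loss.backward()
    optimiser.step()
    return loss.item()
```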

Our methods have been tested on several public semantic segmentation datasets for autonomous driving and evaluated with well-known segmentation metrics (the most common of these is sketched below). The experiments provide compelling evidence that each of our approaches produces more efficient semantic segmentation than state-of-the-art methods.
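
For reference, mean intersection-over-union (mIoU) is computed as follows; this is the standard definition of the metric, not the thesis evaluation code.

```python
# Mean intersection-over-union (mIoU), the standard segmentation metric:
# per class, IoU = |pred ∩ target| / |pred ∪ target|, averaged over classes.
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """pred, target: (H, W) integer label maps."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:               # ignore classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```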
