Suman Saha Home Page

Suman Saha

Postdoctoral Researcher, Computer Vision Lab. ETH Zürich, Switzerland

I am a postdoctoral researcher at the Computer Vision Lab (CVL), ETH Zürich, Switzerland. My research interests lie in computer vision and machine learning; in particular, most of my work applies deep learning techniques to computer vision problems. Deep learning is a type of machine learning in which a model learns to recognize patterns in data by using multiple layers of artificial neurons to build a hierarchy of representations. In simpler terms, it is a technique for training models to make decisions or predictions from input data. These models learn from large amounts of data, and their performance improves as training progresses. By stacking multiple layers of neurons, deep learning models can learn to recognize more complex patterns than traditional machine learning models.
My research can be broadly categorized into the following three areas:
The first area is unsupervised domain adaptation (UDA): training a model to work well on new, unseen data from a different domain without requiring labeled data from that domain. In simpler terms, UDA allows a model that was trained on one type of data to work well on a different type of data without the new data having to be labeled first. This is useful in many situations, for example when a model trained on data from one country or language needs to work well in another, or when a model needs to handle data from a different source, such as images from a different camera or sensor. I have studied UDA for semantic and panoptic segmentation, human action detection, and face anti-spoofing.
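As a toy illustration of the UDA idea described above, the sketch below trains a one-dimensional threshold classifier on labeled source data and adapts it to an unlabeled target domain whose features are offset, by shifting the target features so that their mean matches the source mean. This simple statistic alignment is only a stand-in chosen for brevity; it is not the method used in the papers listed on this page, and all names and numbers are hypothetical.

```python
from statistics import mean


def fit_threshold(xs, ys):
    """Fit on labeled source data: threshold midway between the two class means."""
    m0 = mean(x for x, y in zip(xs, ys) if y == 0)
    m1 = mean(x for x, y in zip(xs, ys) if y == 1)
    return (m0 + m1) / 2


def predict(xs, threshold):
    """Label a feature 1 if it lies above the threshold, else 0."""
    return [1 if x > threshold else 0 for x in xs]


def align_to_source(target_xs, source_xs):
    """Shift unlabeled target features so their mean matches the source mean."""
    shift = mean(source_xs) - mean(target_xs)
    return [x + shift for x in target_xs]


# Labeled source domain (hypothetical toy data).
source_x = [0.0, 1.0, 2.0, 8.0, 9.0, 10.0]
source_y = [0, 0, 0, 1, 1, 1]

# Target domain: same classes, but every feature is offset by +10
# (think of a different sensor). Labels are used for evaluation only.
target_x = [10.0, 11.0, 12.0, 18.0, 19.0, 20.0]
target_y = [0, 0, 0, 1, 1, 1]

threshold = fit_threshold(source_x, source_y)

# Without adaptation, the source threshold mislabels the shifted target data.
naive = predict(target_x, threshold)

# With adaptation, no target labels are used: only feature statistics.
adapted = predict(align_to_source(target_x, source_x), threshold)

print("naive:  ", naive)
print("adapted:", adapted)
```

The adaptation step never looks at `target_y`, which is what makes the setting unsupervised on the target side.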
The second area is self-supervised and semi-supervised learning for semantic segmentation. I have explored self-supervised learning for monocular depth estimation (also known as self-supervised depth estimation) and semi-supervised learning for semantic segmentation. Self-supervised learning is a type of machine learning in which a model learns to predict some aspect of its input data without requiring explicit supervision or labels. In other words, the model is trained on a large dataset without being told what the correct output should be; instead, it uses the patterns and relationships present in the data to make predictions about the data itself. In contrast, semi-supervised learning trains a model on a combination of labeled data (data that has been explicitly marked with the correct output) and unlabeled data. The model learns to recognize patterns in the labeled data and uses that knowledge to make predictions about the unlabeled data. This approach is especially useful when only a limited amount of labeled data is available but unlabeled data is plentiful. By leveraging the unlabeled data, the model can learn more about the underlying structure of the data and improve the accuracy of its predictions.
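The semi-supervised recipe described above can be sketched with pseudo-labeling (self-training): fit a model on the few labeled points, use it to label the unlabeled points, then refit on both. The example below is a minimal sketch using a one-dimensional nearest-centroid classifier; the data and function names are hypothetical, and practical systems usually keep only confident pseudo-labels, which is omitted here.

```python
from statistics import mean


def fit_centroids(xs, ys):
    """Compute one centroid (mean feature value) per class label."""
    return {
        label: mean(x for x, y in zip(xs, ys) if y == label)
        for label in set(ys)
    }


def predict(x, centroids):
    """Assign the label of the nearest centroid."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


# Two labeled samples and six unlabeled ones (hypothetical toy data).
labeled_x = [1.0, 9.0]
labeled_y = [0, 1]
unlabeled_x = [0.0, 2.0, 3.0, 7.0, 8.0, 10.0]

# Step 1: fit on the labeled data only.
centroids = fit_centroids(labeled_x, labeled_y)

# Step 2: pseudo-label the unlabeled data with the current model.
pseudo_y = [predict(x, centroids) for x in unlabeled_x]

# Step 3: refit on labeled plus pseudo-labeled data.
refined = fit_centroids(labeled_x + unlabeled_x, labeled_y + pseudo_y)

print("pseudo-labels:", pseudo_y)
print("refined centroids:", refined)
```

After the refit, each centroid is estimated from four points instead of one, which is the sense in which the unlabeled data reveals more of the underlying structure.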
The third area is unsupervised deep generative learning. I have studied deep generative learning for human facial behavior analysis; in particular, I built a deep generative model based on the VAE and GAN frameworks. Deep generative learning uses a neural network to learn the underlying patterns and relationships in a set of data and then uses that knowledge to create new data similar to the original, such as new images or text. Deep generative models can be trained in a variety of ways, for example as generative adversarial networks (GANs) or variational autoencoders (VAEs). They have many applications, such as creating realistic images for computer graphics, generating new music or speech, and augmenting data to improve the performance of other machine learning models.
I have also spent some time on research problems in multi-task learning (MTL). More specifically, I addressed two common challenges in developing multi-task models: incremental learning and task interference. Multi-task learning is a type of machine learning in which a single model is trained to perform multiple tasks simultaneously by sharing some or all of its parameters across the tasks, rather than training a separate model for each task. Through the shared parameters, the model can learn patterns and relationships that are common to all of the tasks, which can improve its performance on each individual task.
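To make parameter sharing concrete, the sketch below uses a deliberately tiny model: one shared weight feeds two task-specific heads, and the gradient of the summed squared error is computed for a single sample. The task names, weights, and targets are hypothetical and only illustrate the mechanism; they are not taken from the ECCV 2020 paper listed below.

```python
def gradients(x, targets, shared_w, heads):
    """Gradients of the summed squared error over all tasks for one sample.

    Model: prediction_t = heads[t] * (shared_w * x) for each task t.
    """
    h = shared_w * x  # shared representation used by every task
    grad_shared = 0.0
    grad_heads = {}
    for task, head_w in heads.items():
        error = head_w * h - targets[task]
        grad_heads[task] = 2 * error * h        # affects this task's head only
        grad_shared += 2 * error * head_w * x   # every task contributes here
    return grad_shared, grad_heads


# Hypothetical two-task setup.
x = 1.0
shared_w = 1.0
heads = {"depth": 2.0, "segmentation": -1.0}
targets = {"depth": 3.0, "segmentation": 0.0}

grad_shared, grad_heads = gradients(x, targets, shared_w, heads)
print("shared gradient:", grad_shared)
print("head gradients: ", grad_heads)
```

In this example the two tasks push the shared weight in opposite directions (contributions of -4.0 and +2.0), so the shared update is a compromise between them. That tension is a small-scale picture of the task interference mentioned above.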

News

Qualitative Domain-Adaptive Semantic Segmentation Results (on SYNTHIA to Cityscapes UDA benchmark) of our CVPR2021 CTRL-UDA Model:
Our WACV 2023 Domain-Adaptive Human Action Detection Paper Presentation Video:
We have released new UDA (unsupervised domain adaptation) human action detection benchmarks

Details about the new UDA benchmarks can be found in our WACV 2023 DA-AIM paper "Exploiting Instance-based Mixed Sampling via Auxiliary Source Domain Supervision for Domain-adaptive Action Detection". We have created a new human action detection dataset for one of the UDA benchmarks proposed in that paper. Visualizations of the ground-truth bounding boxes and their action class labels are shown below for selected videos from our new dataset.
Qualitative Domain-Adaptive Action Detection Results of our DA-AIM WACV2023 Model

Publications

EDAPS: Enhanced Domain-Adaptive Panoptic Segmentation

Suman Saha, Lukas Hoyer, Anton Obukhov, Dengxin Dai, and Luc Van Gool

arxiv.org 2023

| arXiv | Code |

Exploiting Instance-based Mixed Sampling via Auxiliary Source Domain Supervision for Domain-adaptive Action Detection

Yifan Lu, Gurkirt Singh, Suman Saha, Luc Van Gool

WACV 2023

Spatio-Temporal Action Detection Under Large Motion

Gurkirt Singh, Vasileios Choutas, Suman Saha, Fisher Yu, Luc Van Gool

WACV 2023

| pdf | arXiv | Video |

Learning to Relate Depth and Semantics for Unsupervised Domain Adaptation

Suman Saha, Anton Obukhov, Danda Pani Paudel, Menelaos Kanakis, Yuhua Chen, Stamatios Georgoulis, Luc Van Gool

CVPR 2021

Three Ways to Improve Semantic Segmentation with Self-Supervised Depth Estimation

Lukas Hoyer, Dengxin Dai, Yuhua Chen, Adrian Köring, Suman Saha, Luc Van Gool

CVPR 2021

Unsupervised Compound Domain Adaptation for Face Anti-Spoofing

Ankush Panwar, Pratyush Singh, Suman Saha, Danda Pani Paudel, Luc Van Gool

FG 2021

| pdf | arXiv |

Road: The road event awareness dataset for autonomous driving

Gurkirt Singh, Suman Saha, Fabio Cuzzolin et al.

IEEE TPAMI 2021

| pdf |

Reparameterizing Convolutions for Incremental Multi-Task Learning Without Task Interference

Menelaos Kanakis, David Bruggemann, Suman Saha, Stamatios Georgoulis, Anton Obukhov, Luc Van Gool

ECCV 2020

| pdf | arXiv | Code |

Domain Agnostic Feature Learning for Image and Video Based Face Anti-spoofing

Suman Saha, Wenhao Xu, Menelaos Kanakis, Stamatios Georgoulis, Yuhua Chen, Danda Pani Paudel, Luc Van Gool

CVPR WORKSHOP 2020

Book chapter title: Spatio-Temporal Action Instance Segmentation and Localisation

Suman Saha, Gurkirt Singh, Michael Sapienza, Philip H. S. Torr, Fabio Cuzzolin

Book title: Modelling Human Motion: From Human Perception to Robot Design

Publisher: Springer International Publishing, pages: 141-161, ISBN: 978-3-030-46732-6, year: 2020.

| Springer Book | Project |

Two-Stream AMTnet for Action Detection

Suman Saha, Gurkirt Singh, Fabio Cuzzolin

arXiv 2020.

| arxiv |

Unsupervised Deep Representations for Learning Audience Facial Behaviors

Suman Saha, Rajitha Navarathna, Leonhard Helminger, Romann M. Weber

CVPR 2018 Workshops

| pdf | arXiv | Poster |

Predicting Action Tubes

Gurkirt Singh, Suman Saha, Fabio Cuzzolin

ECCV 2018 Workshops

| arXiv |

Incremental Tube Construction for Human Action Detection

Harkirat Singh Behl, Michael Sapienza, Gurkirt Singh, Suman Saha, Fabio Cuzzolin, Philip H. S. Torr

BMVC 2018 (Oral)

| arXiv |

Spatio-temporal Human Action Detection and Instance Segmentation in Videos

Suman Saha

PhD thesis, Oxford Brookes University, United Kingdom, 2018

| PhD Thesis PDF | PhD Thesis Defense Slides |

TraMNet - Transition Matrix Network for Efficient Action Tube Proposals

Gurkirt Singh, Suman Saha, Fabio Cuzzolin

ACCV 2018

| arxiv |

Action Detection from a Robot-Car Perspective

Valentina Fontana, Manuele Di Maio, Stephen Akrigg, Gurkirt Singh, Suman Saha, Fabio Cuzzolin

arXiv 2018

| arxiv |

AMTnet: Action-Micro-Tube regression by end-to-end trainable deep architecture

Suman Saha, Gurkirt Singh, Fabio Cuzzolin

ICCV 2017

Online Real-time Multiple Spatiotemporal Action Localisation and Prediction

Gurkirt Singh, Suman Saha, Michael Sapienza, Philip H. S. Torr, Fabio Cuzzolin

ICCV 2017

Spatio-temporal human action localisation and instance segmentation in temporally untrimmed videos

Suman Saha, Gurkirt Singh, Michael Sapienza, Philip H. S. Torr, Fabio Cuzzolin

arXiv, 2017

| Project Page Link | arxiv |

Metric learning for Parkinsonian identification from IMU gait measurements

Fabio Cuzzolin, Michael Sapienza, Patrick Esser, Suman Saha, Miss Marloes Franssen, Johnny Collett, Helen Dawes

Gait & Posture, Volume 54, May 2017, Pages 127-132

| ScienceDirect link |

Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos

Suman Saha, Gurkirt Singh, Michael Sapienza, Philip H. S. Torr, Fabio Cuzzolin

BMVC 2016

A real-time monocular vision-based frontal obstacle detection and avoidance for low cost UAVs in GPS denied environment

Suman Saha, Ashutosh Natraj, Sonia Waharte

Aerospace Electronics and Remote Sensing Technology (ICARES), 2014 IEEE International Conference on

| pdf | project page link |

Face Recognition using PCA and Multilayer Feedforward Neural Networks

Suman Saha

European Journal of Applied Sciences and Technology [EUJAST] Volume 1 (1), March 2014

| pdf |

A Monocular Vision Approach for Obstacle Detection and Collision Avoidance for Low-cost Quadrocopters

Suman Saha

MSc Thesis, University of Bedfordshire, United Kingdom. January 2014

| MSc Thesis | MSc Defense Poster | project page link |

Research Activities and Awards

CondConv: Conditionally Parameterized Convolutions for Efficient Inference, NeurIPS 2019

Reading group presentation, Computer Vision Lab. ETH Zurich, May 20th 2020. [Google Slides]
Adaptive Activation Network and Functional Regularization for Efficient and Flexible Deep Multi-Task Learning, AAAI 2020

Reading group presentation, Computer Vision Lab. ETH Zurich, December 11th 2019. [pdf]
Deep Virtual Networks for Memory Efficient Inference of Multiple Tasks, CVPR 2019

Reading group presentation, Computer Vision Lab. ETH Zurich, August 21st 2019. [pdf]
Hybrid Task Cascade (HTC) for Instance Segmentation, CVPR 2019

Reading group presentation, Computer Vision Lab. ETH Zurich, April 17th 2019. [pdf]
Multimodal Unsupervised Image-to-Image Translation, ECCV 2018

Reading group presentation, Computer Vision Lab. ETH Zurich, December 12th 2018. [pdf]
Two-Stream AMTnet for Action Detection, arxiv 2019

Presentation, Computer Vision Lab. ETH Zurich, October 2nd 2018. [pdf]

Spatio-temporal human action localisation

Presentation at the Robotics research group seminar, Oxford Brookes University, United Kingdom, February 2016. [pdf]
Advance Computer Vision

Introductory lecture delivered to MSc students, Oxford Brookes University, United Kingdom, January 2016. [pdf]
Deep Learning Approach for Human Action Detection from Video

A survey report submitted to the Department of Computing and Communication Technologies, Oxford Brookes University, United Kingdom, September, 2015. [pdf]
Performance analysis on temporal tubes

A benchmarking report submitted to the Department of Computing and Communication Technologies, Oxford Brookes University, United Kingdom, 2014. [pdf]
Streaming hierarchical graph based video segmentation: A step-by-step guide

Presentation at the Robotics research group seminar, Oxford Brookes University, United Kingdom, November, 2014. [pdf]
Streaming hierarchical graph based video segmentation

Presentation at the Robotics research group seminar, Oxford Brookes University, United Kingdom, November, 2014. [pdf]
Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs)

Attended reading groups in the Department of Engineering Science at University of Oxford, June 2016.
Visualizing and Understanding Recurrent Neural Networks by Andrej Karpathy

Attended reading groups in the Department of Engineering Science at University of Oxford, July 2015.
Actively participated in the TeenTech event in London, UK.

1st December 2015.
Attended the Ambassadors visit at Gipsy Lane campus, Oxford Brookes University, UK.

30th September, 2015. [Event Link]
Actively participated in the BBC News live broadcast at the BBC broadcasting hub, London, UK.

14th September, 2015.

Oxford Brookes Research Team at the BBC broadcasting hub, London, UK:

Video1:

Video2:
Actively participated in The Magna Carta event, Oxford UK.

18th June, 2015
Actively participated in the humanoid robot Artie and the Naos at the Outburst festival

9th May, 2015

Received The Best Paper Award at CVPR 2020 Biometric Workshop
[Award Link]
Received The International Next 10 Studentship Award

Funding for full-time 3 years PhD in the Department of Computing and Communication Technologies at Oxford Brookes University, UK. [Award Link]
Received Overseas PhD Scholarship award from Department of Computer Science, Aberystwyth University, UK.

Declined. [Award Link]
Best Overall Performance Award for MSc Degree Course

3rd April, 2014. [Award Link]
Best Masters Project award MSc Degree Course

3rd April, 2014. [Award Link]
Merit Scholarship Award MSc Degree Course
[Award Link]
Won the reading group competition at ICVSS 2015 Summer School.
Best Overall Performance award for Masters degree course

received from Vice Chancellor and Chief Executive Bill Rammell, University of Bedfordshire, video taken on 3rd April, 2014.
MSc dissertation outcome

The proposed obstacle detection and avoidance algorithm is deployed in real time.

More About Me ...

Before joining ETH Zürich, I was a Research Associate (RA), i.e., a postdoctoral researcher, at the Department of Computing and Communication Technologies, Oxford Brookes University, where I spent four wonderful years (including my PhD studies).
I received my PhD degree under the supervision of Professor Fabio Cuzzolin at Oxford Brookes University, United Kingdom. Professor Nigel Crook and Dr Tjeerd Olde Scheper were my PhD co-supervisors.
My PhD thesis topic was Spatio-temporal Human Action Detection and Instance Segmentation in Videos. The two main objectives of my PhD thesis were to propose: (1) efficient algorithms to locate (in space and time) multiple co-occurring human action instances present in realistic videos; (2) powerful video-level deep feature representations to improve the state-of-the-art action detection accuracy.
I was also an active member of the Artificial Intelligence and Vision Research Group led by Professor Fabio Cuzzolin. I consider myself fortunate to have had the opportunity to work closely with the world-renowned Torr Vision Group (TVG) in the Department of Engineering Science at the University of Oxford. More specifically, during my PhD, I worked with my PhD guide Dr Michael Sapienza and Professor Philip H. S. Torr, who is the founder of TVG.
During summer 2017, I received a wonderful opportunity to work with Dr Romann Weber, Senior Research Scientist and Head of the Machine Intelligence and Data Science Group at Disney Research Zurich (DRZ). At DRZ, I worked on a project on unsupervised and semi-supervised learning of audience facial expressions using deep generative models. We improved the classification accuracy by 9% over the existing method.

I completed my Master's studies at the Department of Computer Science and Technology, University of Bedfordshire (UoB), United Kingdom. During my MSc thesis work (i.e., in 2013-2014), I proposed a novel real-time algorithm for frontal obstacle detection and avoidance for low-cost unmanned aerial vehicles (UAVs). The related publication can be accessed using this link. My MSc thesis supervisors were Dr Ashutosh Natraj and Sonia Waharte, postdoctoral researchers in the Department of Computer Science, University of Oxford.

Before pursuing my Master's studies in the UK, I worked as a Software Analyst at the Research and Development and Scientific Services division, Tata Steel Ltd., India. At R&D Tata Steel, I worked under the supervision of Dr Sumitesh Das, Chief (Global Research Programmes) at Tata Steel Ltd. My CV can be viewed by clicking this link.

I received my Polytechnic Diploma Engineering degree in Computer Science from Siddaganga Polytechnic College, India.

Contact

Suman Saha,
Postdoctoral Research Fellow,
Room No. ETF D113,
Computer Vision Lab. (CVL)
ETH Zürich, Switzerland
Email: suman.saha [at] vision (dot) ee [dot] ethz (dot) ch
