1. Amarjot Singh et al., "Eye in the Sky: Real-time Drone Surveillance System (DSS) for Violent Individuals Identification using ScatterNet Hybrid Deep Learning Network."
https://arxiv.org/abs/1806.00746
2. Dario Pavllo et al., "3D human pose estimation in video with temporal convolutions and semi-supervised training."
https://arxiv.org/abs/1811.11742
[paper review] Eye in the Sky & 3D human pose estimation in video with temporal convolutions and semi-supervised training
1. Eye in the Sky: Real-time Drone Surveillance System (DSS) for Violent Individuals Identification using ScatterNet Hybrid Deep Learning Network
Amarjot Singh et al.
Data Science & Business Analytics Lab
2. Index
0. Summary
1. Feature Pyramid Network
2. SHDL network - Human pose estimation
3. Support Vector Machine - Detect violent individuals
4. Aerial Violent Individual (AVI) dataset
5. Experiments
3. 0. Summary
1. Detect human regions with the Feature Pyramid Network (FPN).
2. Regress body key-point coordinates within each detected human region with the SHDL network.
3. Classify each individual as violent or non-violent from the estimated key-points with an SVM (a minimal sketch follows below).
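To make stage 3 concrete, here is a minimal sketch (not the authors' code) of classifying a pose from the regressed key-points: the orientation of each limb segment serves as a feature for a binary SVM, following the paper's idea of using orientations between limbs. The limb index pairs, the RBF kernel, and the helper names are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative limb segments as index pairs into a 14-key-point skeleton;
# the paper's exact skeleton connectivity is assumed here, not quoted.
LIMBS = [(0, 1), (1, 2), (2, 3), (1, 4), (4, 5), (1, 8), (8, 9), (9, 10)]

def limb_angles(keypoints):
    """Feature vector: orientation (radians) of each limb segment.

    keypoints: (14, 2) array of (x, y) coordinates from the SHDL network.
    """
    feats = []
    for a, b in LIMBS:
        dx, dy = keypoints[b] - keypoints[a]
        feats.append(np.arctan2(dy, dx))
    return np.array(feats)

def train_violence_svm(poses, labels):
    """Fit a binary SVM on limb-orientation features.

    poses: list of (14, 2) key-point arrays (e.g. from the AVI dataset)
    labels: 1 for violent, 0 for non-violent
    """
    X = np.stack([limb_angles(p) for p in poses])
    clf = SVC(kernel="rbf")  # kernel choice is an assumption
    clf.fit(X, labels)
    return clf
```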
21. 3D human pose estimation in video with temporal convolutions and semi-supervised training
Dario Pavllo et al.
Data Science & Business Analytics Lab
30. 3. Semi-supervised approach
Trajectory model
• The trajectory model takes the 2D key-points as input and is trained to regress the global 3D trajectory of the person (the absolute position of the root joint).
• It reuses the 2D -> 3D mapping architecture of the pose model, but outputs the global trajectory rather than a root-relative pose.
• For unlabeled data, the predicted 3D pose and trajectory are back-projected into the image to reconstruct the 2D input.
• The discrepancy between the back-projected 2D pose and the input 2D pose gives the reconstruction error (a sketch of this step follows below).
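A minimal sketch of this back-projection step, assuming a simplified pinhole camera with focal length f and principal point c; the paper's implementation also models lens distortion, which is omitted here, and the function names are placeholders:

```python
import torch

def project_to_2d(pose_3d, traj, f, c):
    """Back-project predicted 3D joints into image coordinates.

    pose_3d: (N, J, 3) root-relative 3D joint positions from the pose model
    traj:    (N, 1, 3) global root trajectory from the trajectory model
    f, c:    focal length and principal point, broadcastable to (N, J, 2)
    Simplified pinhole model; lens distortion is ignored in this sketch.
    """
    cam_space = pose_3d + traj                     # place joints in camera space
    xy = cam_space[..., :2] / cam_space[..., 2:3]  # perspective divide by depth
    return f * xy + c

def reconstruction_loss(pose_3d, traj, pose_2d_in, f, c):
    """2D reconstruction error on unlabeled data: mean joint distance
    between the back-projected pose and the input 2D pose."""
    pose_2d_rec = project_to_2d(pose_3d, traj, f, c)
    return torch.mean(torch.norm(pose_2d_rec - pose_2d_in, dim=-1))
```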
31. 3. Semi-supervised approach
Loss function
• Supervised loss
  • MPJPE against the 3D ground-truth pose.
• Global trajectory loss
  • Samples farther from the camera (larger ground-truth depth) receive smaller weights, since a fixed trajectory error matters less for distant subjects.
  • The Weighted Mean Per-Joint Position Error (WMPJPE) is used:
E = (1/y_z) · ||f(x) − y||

where f(x) is the predicted trajectory, y the ground truth, and y_z the ground-truth depth of the subject.
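In code, the WMPJPE weights each sample's trajectory error by the inverse of its ground-truth depth y_z, so distant subjects contribute less; a minimal sketch (the clamp against division by zero is an added safeguard, not from the paper):

```python
import torch

def wmpjpe(pred_traj, gt_traj):
    """Weighted MPJPE for the trajectory model: E = (1/y_z) * ||f(x) - y||.

    pred_traj, gt_traj: (N, 1, 3) predicted / ground-truth root trajectories;
    the z component of gt_traj is the ground-truth depth y_z.
    """
    w = 1.0 / gt_traj[..., 2].clamp(min=1e-3)      # (N, 1) inverse depths
    err = torch.norm(pred_traj - gt_traj, dim=-1)  # (N, 1) errors ||f(x) - y||
    return torch.mean(w * err)
```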