A digital spectrometer using an FPGA is proposed for use on a radio telescope. The spectrometer would provide high-resolution spectral analysis of wideband radio frequency signals received by the telescope. To achieve high throughput on the FPGA, a nested residue number system is used to implement the fast Fourier transforms in the spectrometer. This decomposes large moduli into smaller nested ones, allowing uniform circuit sizes and enabling fully parallel implementation of the arithmetic.
A Random Forest using a Multi-valued Decision Diagram on an FPGA (Hiroki Nakahara)
Presentation slides from ISMVL (Int'l Symp. on Multiple-Valued Logic), May 22nd, 2017, in Novi Sad, Serbia. The random forest is a machine-learning model applied here to realize high performance and low power.
Dueling network architectures for deep reinforcement learning (Kazuki Adachi)
Wang, Ziyu, et al. "Dueling network architectures for deep reinforcement learning." Proceedings of The 33rd International Conference on Machine Learning, PMLR 48:1995-2003, 2016.
FCCM2020: High-Throughput Convolutional Neural Network on an FPGA by Customiz... (Hiroki Nakahara)
This document presents a method for high-throughput convolutional neural network (CNN) inference on an FPGA using customized JPEG compression. It decomposes convolutions using channel shift and pointwise operations, employs binary weight quantization, and uses a fully pipelined architecture. Experimental results show the proposed JPEG compression achieves an 82x speedup with 0.3% accuracy drop. When implemented on an FPGA, the CNN achieves 3,321 frames per second at 75 watts, providing over 100x and 10x speedups over CPU and GPU respectively.
ISCAS'18: A Deep Neural Network on the Nested RNS (NRNS) on an FPGA: Applied ... (Hiroki Nakahara)
The document discusses implementing a deep neural network object detector called YOLOv2 on an FPGA using a technique called Nested Residue Number System (NRNS). Key points:
1. YOLOv2 is used for real-time object detection but requires high performance and low power.
2. NRNS decomposes large integer operations into smaller ones using a nested set of prime number moduli, enabling parallelization on FPGA.
3. The authors implemented a Tiny YOLOv2 model using NRNS on a NetFPGA-SUME board, achieving 3.84 FPS at 3.5W power and 1.097 FPS/W efficiency.
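The residue-number-system idea behind NRNS can be sketched in a few lines. This is an illustrative sketch only: the moduli and values below are chosen for the example and are not the paper's; the paper further nests the decomposition to equalize circuit sizes.

```python
# Residue Number System (RNS) sketch: a large integer is represented by its
# residues modulo pairwise-coprime moduli; addition and multiplication then
# proceed independently (hence in parallel) per modulus, and the result is
# recovered via the Chinese Remainder Theorem (CRT).
from math import prod

MODULI = (7, 11, 13, 15)  # pairwise coprime; dynamic range = 7*11*13*15 = 15015

def to_rns(x):
    return tuple(x % m for m in MODULI)

def rns_mul(a, b):
    # Element-wise multiply: each channel is a small, independent modular unit.
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, MODULI))

def from_rns(r):
    # CRT reconstruction (pow(Mi, -1, m) is the modular inverse, Python 3.8+).
    M = prod(MODULI)
    x = 0
    for ri, m in zip(r, MODULI):
        Mi = M // m
        x += ri * Mi * pow(Mi, -1, m)
    return x % M

print(from_rns(rns_mul(to_rns(123), to_rns(45))))  # 5535 = 123 * 45
```

Each residue channel needs only small-word arithmetic, which is what maps well onto uniform FPGA resources.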
ISMVL2018: A Ternary Weight Binary Input Convolutional Neural Network (Hiroki Nakahara)
This document summarizes a research paper that proposes a ternary weight binary input convolutional neural network (CNN).
The paper proposes using ternary (-1, 0, +1) weights instead of binary weights to improve recognition accuracy over binary CNNs. By setting many weights to zero, computations can be skipped, reducing operations. Experimental results show the ternary CNN model reduced non-zero weights to 5.3% while maintaining accuracy comparable to binary CNNs. Implementation on an ARM processor demonstrated the ternary CNN was 8 times faster than a binary CNN.
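The zero-skipping idea in the summary above can be sketched directly. The threshold rule below is an illustrative stand-in, not the paper's training procedure:

```python
# Ternary weight sketch: small weights are zeroed, the rest mapped to +/-1.
# Zero weights let the multiply-accumulate be skipped entirely, and the
# surviving terms reduce to additions and subtractions.

def ternarize(weights, threshold):
    return [0 if abs(w) < threshold else (1 if w > 0 else -1) for w in weights]

def sparse_dot(x, w_ternary):
    # Skip zero weights; +/-1 weights need no multiplier at all.
    return sum(xi if wi > 0 else -xi for xi, wi in zip(x, w_ternary) if wi != 0)

w = [0.8, -0.05, 0.02, -0.9, 0.01]
wt = ternarize(w, 0.1)          # [1, 0, 0, -1, 0] -- 60% of terms skipped
x = [2.0, 3.0, 5.0, 1.0, 4.0]
print(sparse_dot(x, wt))        # 2.0 - 1.0 = 1.0
```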
FPGA2018: A Lightweight YOLOv2: A binarized CNN with a parallel support vecto... (Hiroki Nakahara)
This document presents a mixed-precision convolutional neural network (CNN) called a Lightweight YOLOv2 for real-time object detection on an FPGA. The network uses binary precision for the feature extraction layers and half precision for the localization and classification layers. An FPGA implementation of the network achieves 40.81 FPS for object detection, outperforming an embedded GPU and CPU. Future work will apply this approach to other CNN-based applications such as semantic segmentation and pose estimation.
FPT17: An object detector based on multiscale sliding window search using a f... (Hiroki Nakahara)
1) The document describes an object detection system that uses a multiscale sliding window approach with fully pipelined binarized convolutional neural networks (BCNNs) implemented on an FPGA.
2) The system detects and classifies multiple objects in images by applying BCNNs to windows at different scales and locations, and suppresses overlapping detections.
3) Experimental results on a Zynq UltraScale+ MPSoC FPGA demonstrate that the proposed pipelined BCNN architecture can achieve higher accuracy than GPU-based detectors while using less than 5W of power.
6. Deep Neural Networks (DNNs) in AI
J. Park, "Deep Neural Network SoC: Bringing deep learning to mobile devices," Deep Neural Network SoC Workshop, 2016.
[Diagram: taxonomy of AI approaches. Brain-inspired: neuromorphic (silicon retina, electronic cochlea, attention-based processing), bio-mimic. AI: fuzzy logic, knowledge representation, natural language processing, genetic algorithms. Machine learning (within AI): SVM, decision trees, k-nearest neighbor, Bayesian methods. Deep learning (within machine learning): DNN/RNN.]
7. Artificial Neuron (AN)
[Diagram: inputs x0 = 1, x1, x2, ..., xN with weights w0 (bias), w1, w2, ..., wN summed into internal state u, passed through activation f(u) to produce output y]
xi: Input signal
wi: Weight
u: Internal state
f(u): Activation function (Sigmoid, ReLU, etc.)
y: Output signal
y = f(u)
u = Σ_{i=0}^{N} wi xi
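The neuron on this slide translates directly into code. The input and weight values below are illustrative:

```python
import math

def neuron(x, w, activation):
    # u = sum_{i=0}^{N} w_i * x_i, where x[0] = 1 carries the bias w[0]
    u = sum(wi * xi for wi, xi in zip(w, x))
    return activation(u)

# Two common activation functions f(u) named on the slide
relu = lambda u: max(0.0, u)
sigmoid = lambda u: 1.0 / (1.0 + math.exp(-u))

x = [1.0, 0.5, -0.2]   # x0 = 1 (bias input)
w = [0.1, 0.8, 0.4]    # w0 is the bias
print(neuron(x, w, relu))  # 0.1 + 0.8*0.5 + 0.4*(-0.2) = 0.42
```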
17. Artificial Neuron (AN) (recap of slide 7)
19. Required specifications
Deep learning on servers requires on the order of 20 trillion multiply-accumulate (MAC) operations.
J. Park, "Deep Neural Network SoC: Bringing deep learning to mobile devices," Deep Neural Network SoC Workshop, 2016.
J. Cong and B. Xiao, "Minimizing computation in convolutional neural networks," Artificial Neural Networks and Machine Learning (ICANN 2014), 2014, pp. 281-290.
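As a rough illustration of where MAC counts of this magnitude come from (the layer shape below is a hypothetical VGG-like layer, not a figure from the cited talks), the MACs of one convolutional layer are output pixels × kernel² × input channels × output channels:

```python
def conv_macs(out_h, out_w, k, c_in, c_out):
    # Each output pixel needs k*k*c_in multiply-accumulates per output channel.
    return out_h * out_w * k * k * c_in * c_out

# Hypothetical layer: 224x224 output, 3x3 kernel, 64 -> 64 channels
print(conv_macs(224, 224, 3, 64, 64))  # 1,849,688,064 MACs for one layer
```

A single such layer is already near two billion MACs per frame, so a full network run at video rate quickly reaches the trillions.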
31. Binarized DCNN
Treats only binarized (+1/-1) values (weights and inputs), except for the first and the last layers.
[Diagram: binarized neuron with inputs x0 = 1, x1, x2, ..., xN, weights w0 (bias), w1, w2, ..., wN, internal state u, and sign activation producing output s]
X: Input (8-bit for layer 1)
si: Output
wi: Weight
U: Internal state (integer)
sign(U): Sign bit of U, +1 or -1
M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, Y. Bengio, "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1," Computer Research Repository (CoRR), Mar. 2016, http://arxiv.org/pdf/1602.02830v3.pdf
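With both weights and activations constrained to +1/-1, the multiply-accumulate collapses to XNOR plus popcount, which is what makes binarized layers cheap on an FPGA. A bit-level sketch (encoding +1 as bit 1 and -1 as bit 0; the input patterns are illustrative):

```python
def bin_neuron(x_bits, w_bits, n):
    # XNOR marks the positions where x and w agree (i.e. the product is +1).
    agree = bin(~(x_bits ^ w_bits) & ((1 << n) - 1)).count("1")
    u = 2 * agree - n           # sum of the n products, each +1 or -1
    return 1 if u >= 0 else -1  # sign activation

# n = 8 inputs; agreement at 6 of 8 positions gives u = 2*6 - 8 = 4
x = 0b10110010
w = 0b10100110
print(bin_neuron(x, w, 8))  # 1
```

On hardware the same operation is one wide XNOR gate per input pair and a popcount tree, with no multipliers at all.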