
Transcriptions

Note: this content has been automatically generated.
00:00:00
Okay, so we will start with the first talk of the last day. Yesterday it was Torch; I don't know how familiar you all are with the overall framework landscape, but each of these frameworks has its own strengths and weaknesses. To present TensorFlow today we have Mihaela Rosca, who works at Google on these topics. She will tell you about it, and you can ask questions along the way.
00:00:56
Great. So first, as a technical test: can everybody hear me, and can everybody hear me well? Okay, perfect. If at any point there's a technical difficulty, please let me know, because I can't hear what you hear. Then let's get started. About questions: feel free to interrupt me at any time if there's a question that you think is very relevant to the slide. We will also have time for questions at the end of each talk, and there's the panel at the end of the day, but it's really important that you get your questions answered, because that's why we're here.
00:01:36
So, part of my microphone just fell off; let's see if it still works. Okay, let's just continue like this. A little bit about me: I'm a software engineer in Google Research. I've been at Google for around two years now, after I graduated from Imperial College London, and today I want to talk to you a bit about TensorFlow, and specifically about the trade-offs involved. Test, test, test. Yes, okay, perfect.
00:02:34
These talks are specifically about TensorFlow, which is the deep learning framework built by Google, and the talks are going to be structured a little bit differently. The first talk is about the core principles behind TensorFlow: specifically, what we want from a deep learning framework and how TensorFlow actually meets a lot of these requirements. In the second talk we're actually going to go through a concrete example of how to use TensorFlow for something relatively simple, linear regression, but we're also going to look at some really nice things that TensorFlow gives you, such as distributed training, how to use the GPU, and how to use some of the nice visualisation tools that we have. And in the third talk we're going to focus specifically on deep learning and neural networks, state-of-the-art models, community contributions and so on.
00:03:28
So firstly, what is TensorFlow? TensorFlow is Google's software for general machine learning, but it's great for deep learning in particular. It's open source, on GitHub, so you can check it out, and it was first released in November 2015 with a very flexible license, so you can use it as you please. Now I'm going to show you a short official video first, so that you get a very high-level overview of what TensorFlow is meant to achieve.
00:04:01
(Video plays.) We have been using deep learning at Google for the last few years. It started as a research project, and we've since collaborated with about fifty different teams to put these systems in real products, across a really wide spectrum of work. Today it's used heavily in our speech recognition systems, in the new Photos product, in email, and in many other areas. TensorFlow is the machine learning library that we use at Google for applying deep learning to a lot of different areas, doing both artificial intelligence research and deploying production models. These models are really powerful at doing various kinds of perceptual and language understanding: they make it possible for computers to actually see, to actually understand what is in an image or in a short video clip, and that enables all kinds of powerful products. Machine learning is the secret sauce of the products of tomorrow. It no longer makes sense to have separate tools for researchers in machine learning and for people who are developing real products; there should really be one set of tools that researchers can use to try out their crazy ideas, and if those ideas work, they can move them directly into products without having to rewrite them.
00:05:15
On the research side, it lets you bring new understanding to existing problems and advance the state of the art on them; on the engineering side, it lets you take insights from the research and use them to build products with the features users want. One thing that is quite effective about TensorFlow is that it allows collaboration and communication between researchers: it allows a researcher in one location to develop an idea and then just send code that someone else can use on the other side of the planet, which makes collaboration a lot easier. By making this an open source project, we really hope to speed that effort up. We expect developers to be able to do a lot more than they can do today. We think we have the best machine learning infrastructure in the world, and we want to share it; that's what we want to do.
00:06:21
So I guess that gives a very high-level overview of what the aim of TensorFlow is, and it already touches upon a lot of the discussion from yesterday and from the panel about which frameworks are good for research, which frameworks are good for development, and so on. What I really want to stress, and we will look at this in more detail in this talk, is that TensorFlow aims to be a tool for everyone. The aim is to bridge this gap between researchers, developers, data scientists and so on. So this talk will focus on two things: first, why does Google care about machine learning, as you might wonder; and second, what makes TensorFlow a good machine learning framework. We're going to go through the second one especially in detail.
00:07:05
So firstly, why does Google care about machine learning, and specifically deep learning? Well, deep learning has this really nice promise of universal machine learning: the idea is that you can use a similar set of algorithms to do speech recognition, query understanding, text to speech, and whatever else you might want to do, and you don't have to do the feature selection yourself. That is a very nice promise. And the advantage of deep learning is that, apart from giving this promise, it actually works; if it gave this very nice promise but didn't work better than the alternatives, then it wouldn't be that useful.
00:07:46
The nice thing about deep learning is that it's currently state of the art in speech recognition, image recognition, machine translation and a lot of other applications. At Google we've seen very big growth in the use of deep learning, from very little at the beginning of 2012 to more than two thousand directories containing a model description file in the Google source code repository. So that's a lot of code and a lot of models. Where are they used? Well, in a lot of products, it turns out, and I hope you can see some of your favourite products here: the Google keyboard, Inbox, Gmail, Drive, YouTube. All these products are now better because of machine learning. You also probably know about AlphaGo, which achieved a breakthrough in AI that was thought not to be possible at the moment, maybe only in a couple of years, and this is another showcase of how Google is interested in pushing the field forward.
00:08:56
Now, going to products, some of the products that I mentioned before on the products slide: I'm going to go into a bit of detail on some of them. You might know Inbox; that's an email app provided by Google, and in November 2015 it launched this feature called Smart Reply. The idea is very simple: when you send me an email, "hey, do you want to go for dinner tomorrow?", there are a couple of very likely answers that I might give: "sure, why not", or "how about today", or "how about lunch instead". And you can see how machine learning is a really good fit for this task, because from a lot of examples you can learn what the possible answers to an incoming email are. Inbox has this smart reply feature, and it has been very well received, even though it was initially taken by some for an April Fools' joke. It has also matured a lot: by February 2016 more than ten percent of mobile Inbox replies used Smart Reply, which makes sense, because if I'm on the go trying to catch the tram I don't want to start typing; I just press a button and it does that for me.
00:10:09
So that's really great. Now, another product where it makes a lot of sense to use machine learning is Google Play Music: based on your listening history, and given certain types of music that you like, it can recommend you playlists, it can recommend you other channels that are similar, and so on. One of the new additions, or relatively new additions, is the ability to query for photos, to search for photos according to some text string, in Google Photos. If you're like me, you probably take literally thousands of photos, and when you come home and you want to show your parents "hey, look at these really nice cherry blossoms", it would take you hours to scroll through all of that. Well, with this you can just say "hey, show me the cherry blossom photos", because it categorises and groups those photos for you, and that saves you a lot of time. And if you're travelling, you might be travelling to a country where you don't really know the language, or the script used by the inhabitants of the country. So you might want to use this Translate feature, which allows you to take a picture of a particular sign, for example, and translate it; this combines computer vision and translation to provide a better user experience.
00:11:38
Now, this was about why Google cares about machine learning and how our products become better because of it. Let's now talk a bit about TensorFlow, which is the engine that powers a lot of these features. So firstly, why build TensorFlow in the first place? Google already had a deep learning system; it was called DistBelief. It was really great for scalability and production training, but it was not as flexible as researchers would have wanted. So there was, again, a bit of this trade-off between research and production thinking. Having the experience of already having tried this once really allowed us to simplify the problem and to learn from previous mistakes.
00:12:24
In order to work out what we want from a machine learning system, and from a deep learning system, we have to think about who uses such a system: different use cases have different requirements. As you can probably imagine, researchers, developers and data scientists all want to use such a framework, and they all have different goals in mind. Researchers want to iterate quickly: they want to specify their new crazy idea and be able to see if it works or not, be it on MNIST or on something much bigger such as ImageNet. Developers want to take these ideas and quickly put them into products, without having to wait for a year and without having to port code that was written in a research system. And data scientists just want to tweak the ideas that researchers have on their own data sets, to get maximum performance.
00:13:20
So with this in mind, this is what we think one would want from such a system. Firstly, ease of expression, for all the crazy ideas that you might have. Scalability: you want to be able to run your experiments pretty quickly. Portability, which is especially important for developers: you want to be able to run on a variety of platforms quite easily. Reproducibility, so that researchers around the world can collaborate; they can share code, they can share models. And production readiness: again, the idea of going really quickly from research to real products. I'm going to iterate through all of these, taking them one at a time, and actually show you how TensorFlow meets each of them.
00:14:06
Let's start with ease of expression. The architecture behind TensorFlow is very flexible, and the core idea is the computational graph, something that is similar to other frameworks. The idea is that you always specify your computation as a directed acyclic graph, and optimisations are then added on top of that. The general procedure when you work with TensorFlow is to define the graph in a high-level language; you don't necessarily want to deal with memory management and so on, you just want to be able to say "hey, this is how my model should look". The graph is then compiled and optimised, and then executed, in parts or fully, on the available devices, which might be the CPU, the GPU, or whatever other device you might want to use. The core of TensorFlow, the core execution system, is in C++, and this allows it to be very efficient and very good in terms of speed. But you have different front ends for how you want to specify the computation: you can specify your computational graph in Python or C++ today, and if you really like Java, you can actually add another front end easily.
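To make this "define a graph in Python, then let the runtime execute it" workflow concrete, here is a minimal sketch using the graph-and-session style Python API that TensorFlow offered around the time of this talk (TensorFlow 1.x); it is an illustration added for the reader, not code from the talk itself.

    import tensorflow as tf

    # Define the graph in Python: a single dense layer y = relu(x * W + b).
    x = tf.placeholder(tf.float32, shape=[None, 784], name="x")   # input batch
    W = tf.Variable(tf.zeros([784, 10]), name="weights")          # model parameters
    b = tf.Variable(tf.zeros([10]), name="bias")
    y = tf.nn.relu(tf.matmul(x, W) + b)                           # so far these are only graph nodes

    # Nothing has been computed yet; the C++ runtime executes the graph inside a session.
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        result = sess.run(y, feed_dict={x: [[0.0] * 784]})        # run the graph on one example

The same graph, once defined, can then be placed on a CPU, a GPU or several devices without changing the model description.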
00:15:32
Another important point when talking about ease of expression is the interface. Again, different people want different kinds of interfaces. If I'm a researcher, I want to be able to specify my models down to the matrix multiplication level; I want to be able to say "I take this tensor, I multiply it with this tensor, and I apply this operation on them". But you might also want to be able to use higher-level APIs: you don't want to go down to the matrix multiplication level for a CNN or for a deep neural network, because you would reuse the same code again and again. TensorFlow allows you to go both ways, depending on your use case, which is very useful.
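As a rough illustration of these two levels, the sketch below builds a similar hidden layer twice: once by writing the matrix multiplication out explicitly, and once through a higher-level layer function. It assumes the tf.contrib.layers module that shipped with TensorFlow 1.x and is only meant as an example of the idea, not as code from the slides.

    import tensorflow as tf

    x = tf.placeholder(tf.float32, shape=[None, 784])

    # Low-level interface: create the parameters and spell out the matrix multiplication.
    W = tf.Variable(tf.truncated_normal([784, 256], stddev=0.1))
    b = tf.Variable(tf.zeros([256]))
    hidden_low = tf.nn.relu(tf.matmul(x, W) + b)

    # Higher-level interface: one call creates the variables and the ops for you.
    hidden_high = tf.contrib.layers.fully_connected(x, 256, activation_fn=tf.nn.relu)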
00:16:13
Now, about scalability. As you are probably already aware, in deep learning in particular we run experiments that take a lot of time, and this can easily become cumbersome. If an experiment takes a couple of minutes or hours, this is pretty great: I can start my experiment and I quickly get this feedback loop; my idea is good and I'm going to pursue it, or my idea doesn't really work, or I have a bug and I want to debug it a little bit. This is great for research, because you get this good feedback loop. If experiments take a couple of days, that's still tolerable: at this point you probably already start trying multiple ideas in parallel and seeing how each of them works, because you have to wait a couple of days. Now, if you go to weeks, you can see how progress slows down; you can only try your best ideas out. And if things take more than a month, it's probably not really worth trying. So how does TensorFlow allow you to run experiments quickly? Well, you can use GPUs, you can use multiple GPU cards, and you can also distribute training on multiple machines, so if you have a cluster in your lab you can use that to decrease your experiment time.
00:17:32
Now, when you want to distribute computation, you always have to take the communication overhead into account. If I distribute my computation on two machines but I only get a ten percent speed improvement, that's not really great, because I'm using a lot of computational power for very little gain. In TensorFlow there are two approaches in particular that I'm going to name here that are used to avoid this communication overhead. The first one is to exploit model parallelism; especially in deep learning, our models are pretty well suited for that. The second is to exploit data parallelism, because our training sets can be split into parts that are used at the same time. So let's look at each of these.
00:18:20
For model parallelism, how do you do that? Well, you can use instruction parallelism on a single core; this is pretty good, and it's pretty much free. When you want to do this across cores you have to use thread parallelism, which is almost free, unless you have to go across sockets. And across devices: if you go between multiple GPUs you are often limited by PCIe bandwidth, and across machines you are very often limited by network bandwidth or latency. With this in mind, let's look at how model parallelism actually works for a network like a convolutional neural network. The idea behind convolutional neural networks is that you have this image that goes through layers; this is the input layer, layer one, layer two, and then you get this final representation of the image at the end. And you have these kernels, also called local receptive fields, that get applied to each part of the image patch; you can see them moving around like this. The way you can split this model onto multiple machines or multiple cores is by partitioning it so that the parts of each layer that communicate a lot with each other end up on the same machine. What you want to avoid is putting one part of the model on one machine and another part on another machine when these two parts have to communicate all the time, because then we end up with this overhead. But in this case, if we partition it like this, we minimise the network traffic, because when we compute the values of the neurons in this layer we more or less always look at the ones in the same partition, on the same machine, apart from the ones at the boundaries. So, because of how a convolutional network works, you can't completely avoid the communication, but you can minimise it.
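As a small, hypothetical sketch of this kind of partitioning (assuming a machine with two GPUs and the TensorFlow 1.x Python API; this is an added illustration, not the code behind the slide), one half of a layer's units can be pinned to each device so that only the joined result crosses devices:

    import tensorflow as tf

    x = tf.placeholder(tf.float32, shape=[None, 4096])

    # Assumes this machine exposes two GPUs. The first half of the layer lives on GPU 0.
    with tf.device("/gpu:0"):
        w0 = tf.Variable(tf.truncated_normal([4096, 512], stddev=0.01))
        h0 = tf.nn.relu(tf.matmul(x, w0))

    # The second half lives on GPU 1 and never needs GPU 0's partition for its own work.
    with tf.device("/gpu:1"):
        w1 = tf.Variable(tf.truncated_normal([4096, 512], stddev=0.01))
        h1 = tf.nn.relu(tf.matmul(x, w1))

    # Joining the two partitions is the only cross-device communication in this layer.
    h = tf.concat([h0, h1], axis=1)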
00:20:22
Now, data parallelism. The data sets we use are also getting bigger, and usually we use mini-batches anyway, and the idea behind data parallelism is: how about we train on multiple batches at the same time, or we feed examples to different model replicas at the same time. So the idea here is that I don't have one model; I have multiple model replicas that each have a copy of the parameters. They each do their specific computation, and then they tell a parameter server, which keeps kind of the gold standard for what the parameter values should be, how to update the parameters. So this is roughly how it looks: as I said, you have multiple model replicas, you have the data that goes in here in parallel, and each replica sees different examples. Then, when a model replica has computed its update, for example, in the case of neural networks, the gradient of the loss function, it sends this update to the parameter server and says "okay, please update the parameters". The parameter server updates the parameters, and so do all the other replicas.
00:21:34
Now, when you think about this picture, you have to understand that there are two ways to do these updates. One option is that this replica tells the parameter server to update the parameters, then this replica does, then another one, and so on. But you can also combine these updates into one single update. If we want to be as close as possible to the original algorithm, to mini-batch gradient descent for example, then we want to do this update synchronously: we wait for all the replicas to finish, we combine their updates together, and we apply them only once to the parameter server. So this is how it looks: this model replica has computed its update, this model replica has computed its update, and this one too; they get combined together, and then only one update gets sent to the parameter server. This is actually equivalent to having an N times larger batch size: the computation that you do, the training that you do, is exactly as if you had one model with, say, a ten times larger batch size. The pro is that you have no gradient staleness; the model replicas are not computing updates from stale gradients. The con of this approach is that if one machine fails, the other machines have to wait for this one to recover.
00:22:57
You can also do asynchronous updates. Here the difference is this part: each model replica can update the parameter server on its own. But, as you can imagine, the problem here is that one model replica might send an update computed with respect to parameters that are no longer there, because some other replica has already modified them, so it's not really the same as what we usually do. The advantage is that it's relatively fault tolerant, and in practice it works, as long as you don't push it with too many replicas. But for both kinds of updates, both synchronous and asynchronous, you really want the model computation to be large enough that it's worth sending the parameters over the network.
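A rough sketch of how a synchronous data-parallel setup can be expressed with the TensorFlow 1.x distributed API is shown below. The host names are placeholders and the full training-loop plumbing (starting the servers, session hooks, chief worker) is omitted, so treat it as an outline of the idea rather than a complete recipe.

    import tensorflow as tf

    # Hypothetical cluster: one parameter server and two worker replicas.
    cluster = tf.train.ClusterSpec({
        "ps": ["ps0.example.com:2222"],
        "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    })

    # Variables are placed on the parameter server; the ops stay on this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:0", cluster=cluster)):
        x = tf.placeholder(tf.float32, [None, 784])
        targets = tf.placeholder(tf.float32, [None, 10])
        W = tf.Variable(tf.zeros([784, 10]))
        b = tf.Variable(tf.zeros([10]))
        loss = tf.reduce_mean(tf.square(tf.matmul(x, W) + b - targets))

        opt = tf.train.GradientDescentOptimizer(0.01)
        # Aggregate the gradients from both replicas before a single, synchronous
        # update is applied to the parameters on the parameter server.
        sync_opt = tf.train.SyncReplicasOptimizer(
            opt, replicas_to_aggregate=2, total_num_replicas=2)
        train_op = sync_opt.minimize(loss)

Dropping the SyncReplicasOptimizer wrapper and letting each worker apply the plain optimiser on its own corresponds to the asynchronous variant described above.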
00:23:47
We saw here that each model replica has a copy of the parameters, and the parameter server sends the parameters over the network to the model replicas every time there is an update. So if you send these parameters all the time, and if there are a lot of them, you waste a lot of time just sending parameters over the network. The idea is to strike a balance between the computation that a replica does with one set of parameters, without needing an update, versus the number of updates that you do. The speed-up you can get depends on the kind of model: for very dense models you can get a ten to forty times speed-up across replicas, while sparse models, which need fewer of the parameters at a time, can support many more replicas, even up to one thousand. In terms of models, certain models reuse each parameter many times. For example, convolutional networks apply the kernel, the local receptive field, to all possible patches of the image, so the same parameters are used a lot before you need an update; that makes them good candidates for data parallelism. The same goes for recurrent models.
00:24:55
Recurrent models are very much used for sequences; that's what they're built for. So if I want to do, for example, some language modelling, and I feed into the network "Mihaela is giving a talk about TensorFlow", either one word at a time, so "Mihaela", "is", "giving", "a", "talk", or one character at a time, the model uses the same parameters for each input until I'm done with the sentence. That makes them good candidates for this kind of data parallelism, because they do a lot of computation before they need new parameters.
00:25:30
So now let's look at some numbers, to see how much this helps. This plots the ImageNet Inception synchronous training. You probably know from yesterday what the Inception model is: it's a very big architecture, trained here on ImageNet, and you can see the time in hours versus the obtained precision, on one GPU, ten GPUs and fifty GPUs. If we look at one GPU versus fifty GPUs and we fix our precision, say that if my model reaches 0.5 precision at 1 I can go home for the day, I've done my work and I'm happy, then with one GPU I have to stay for three days, while with fifty GPUs I can already go home after 2.6 hours. So that's a very big difference, a thirty times difference. Note that it's not linear: I increased the number of GPUs fifty times but I still only get thirty times, so there is still an overhead, but I still get a massive improvement here. Now, if we look at ten GPUs versus fifty GPUs, at different accuracy levels, at 0.6 and 0.65, you also see around a four times speed-up going from ten GPUs to fifty GPUs, so this is pretty good.
00:27:04
And this is how the graph looks when you increase the number of workers: we plot, against the number of workers, how many examples per second the model can see. You see that if you use a hundred workers you get a fifty-six times speed-up versus using one, and if you use sixteen workers you get a fifteen times speed-up versus using one. So again, it's not linear; you don't actually get a hundred times speed-up with a hundred workers, so it's clear that there is some overhead, but you can still speed things up considerably.
00:27:38
And data parallelism is not only great in theory; it's actually very important for us and very much used. This ImageNet Inception training actually used fifty GPUs. Smart Reply, the Inbox feature I talked about a bit earlier, uses sixteen replicas to train the model, each with multiple GPUs. And the state-of-the-art language model on the One Billion Word benchmark uses both data and model parallelism on thirty-two GPUs. So this is actually very much used in practice.
00:28:08
Now, this talks a lot about multiple devices, but how about single-device TensorFlow performance? I put this here because it's related to my previous slides. When TensorFlow was initially released in November 2015 it definitely had some speed issues, but it has improved and it continues to improve. You can see from these benchmark numbers that it's getting quite good, but there is definitely still a lot of work to do in this respect.
00:28:47
Now, about portability. As I said before, it's very important to have a machine learning framework that runs on a variety of platforms, because that decreases the time between researchers coming up with ideas and the moment you have a productionised model, and it also saves a lot of developer time, because they don't have to port code from one architecture to another. TensorFlow works on CPUs, GPUs, on phones, on distributed systems, and even on custom machine learning hardware, so it's very flexible in that way. If you're interested in how to do this, there are a lot of tutorials out there on how to use TensorFlow on both Android and iOS. Here are some screenshots of how to use ImageNet pre-trained models; you don't have to train your own model to do image recognition on Android. And this is on iOS: if you want to check that what you're looking at is indeed ice cream with chocolate sauce, you can build an app to show you that.
00:29:59
Now, how about reproducibility? TensorFlow is open source, as I said, with the flexible Apache 2.0 license. This is very important for us, because we think it really helps push machine learning research forward: researchers can now publish code for new algorithms in TensorFlow, they can create repositories with trained models, and that also makes research papers reproducible.
00:30:27
How about the external adoption of TensorFlow? If we look on GitHub at most of the deep learning frameworks that we are familiar with, TensorFlow has twenty-seven thousand stars, or it did when I created these slides, and around ten thousand forks, so it's much more popular than the other frameworks in terms of GitHub stars and forks, even though it was launched only in November 2015. Also in terms of external adoption, in the seventy-two hours after launch there were more than fifty thousand installs, and more than five hundred thousand since November 2015. And despite being launched only in November 2015, it was the most forked new repository on GitHub in 2015. So we think that's pretty cool.
00:31:27
Another strong point of TensorFlow is its tutorials and documentation. It's very hard to start with any framework, especially if you're a beginner with machine learning: if you don't know much about machine learning, or deep learning in particular, you have to learn the framework and at the same time learn how to deal with machine learning. TensorFlow has a really wide variety of tutorials out there, and it caters to both needs. If you are already very familiar with deep learning, you can go to the expert MNIST tutorial, which skips a lot of the deep learning details and just goes into "hey, this is how you use TensorFlow"; or you can use the introductory MNIST tutorial, which goes a lot into the details of how the model actually works. And of course, if you want to find out even more about how the internals of TensorFlow work, there is an excellent white paper released in 2015 that talks a lot about the internal computation engine and even about the optimisations performed by TensorFlow; I definitely recommend it.
00:32:47
Now, about production readiness. It's very important these days, especially with deep learning advancing so fast, to be able to integrate these new models and these new breakthroughs into products, to actually make them available and useful to people who use their phones or their laptops every day. With TensorFlow it's actually very easy: you train models in Python, so this is ideal, very high level, and then developers can use them from C++, for example to serve them in production, and of course C++ is very efficient and much better suited for production code. And because you can reuse the trained models, developers don't have to train models themselves; they can just use the ones that the researchers trained.
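One simple way this hand-off can work, sketched here with the TensorFlow 1.x Python API (the file paths are made up for the example), is to checkpoint the trained variables and write out the graph definition so that another program, for instance a C++ server, can load them later:

    import tensorflow as tf

    # A stand-in for a trained model: in practice this would be the full graph.
    W = tf.Variable(tf.zeros([784, 10]), name="weights")

    saver = tf.train.Saver()                        # checkpoints the variable values
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # ... the training loop would run here ...
        saver.save(sess, "/tmp/my_model.ckpt")      # hypothetical checkpoint path
        # The graph structure itself can also be written out and loaded elsewhere.
        tf.train.write_graph(sess.graph_def, "/tmp", "my_model.pbtxt")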
00:33:36
As a concrete example, going back to Smart Reply in Inbox: in four months it went from a deep learning research project to a launched product that you can all use on your phone now. So definitely, having this short iteration cycle, and having the same tool used by everyone, helps a lot with moving much faster.
00:34:06
So, in conclusion for this first part, and I think I went through it a bit fast: machine learning is definitely changing the world. It's changing how we use our phones, how we use our computers, how we think about which problems we can or cannot solve. A lot of problems that we thought we would not be able to solve are right now becoming easier to crack. And the nice part is that you can be part of it: when you think about solving a problem, you should actually ask yourself, "should I use machine learning for this, can I use machine learning for this?" There are a lot of tools out there, including TensorFlow, that are free, that have a lot of tutorials and a lot of documentation, and they can really help you get started. I think this is the take-home message, especially for those who are not already in this mindset: it's very easy to get started, and it's very easy to make an impact these days with all these available tools. So with that, I will take questions if you have some, and then we'll continue with the second talk.
00:35:39
Okay. Thank you for the talk. I wanted to know: is there any other framework, one that is not open source, that is used internally and was not open sourced?

I think these are difficult questions; I'd rather not comment. I would just say that, yes, you may see that some things take longer to get out there, because there are very high standards for making things open source. For example, the distributed training was not in the first open source release, but it is there now. So that's what I can say: things are getting out.
00:36:25
Thanks for the talk. Another question about the internals of Google: are there any projects where you tried and then decided not to use TensorFlow?

Again, I'd rather not comment, but I don't think TensorFlow has any specific limitations. It's definitely very much used, and if there are problems I'm sure that people are going to fix them, so I'd be very surprised. But again, these are questions about internals and open sourcing that I can't really go into.
00:37:24
Maybe one point I can add: there are a lot of contributions from external people. There are plenty of contributions, for example for reading different data formats, and we will go through this a bit later as well, both to the core repository, where there are plenty of external contributions, and also as feature requests. The idea is: if you want something, don't just assume "it's not there, I'll try again later"; just ask for it. For example, for distributed training, the way you specify your cluster for the distributed computation (and I'll talk about that in the second talk) is a bit cumbersome today, so we are actually asking people what they want to see. So it's not only that we of course accept contributions; if you look at the GitHub repository there are actually a lot of very interesting ones, and people are collaborating, even meeting up: "we want to do this, so let's just meet, code together and send the patch", and the patches get integrated into the repository. So, definitely.
00:38:34
A question on the algorithms that are used in the distributed version, the synchronous one and the asynchronous one. I have some guesses about which they might be, maybe Downpour for the asynchronous one, but which are they?

Well, you can specify the optimiser that you want to use; it's just that the way the weight updates get applied is different. It's not the case that choosing data parallelism fixes the algorithm for you, because when you build the computational graph you specify the optimiser, and it's only how the updates get applied to the parameter server that changes between the two.
00:39:17
Yeah, but there is a constraint on that: depending on whether the parameter server or the executors drive the communication, there is a limitation on which kind of distributed algorithm you can actually apply to get the stochastic gradient right. Downpour, for example, is famous for the fact that it is not only asynchronous, but the executors can communicate between themselves, which gives Downpour an advantage in some cases where the gradient search gets stuck. So if you have a centralised parameter server, with executors talking to it without talking to each other, that is something you have to specify to get it to optimise.
00:40:03
So, actually, let me go back. I think it's less about the optimiser, because the optimiser will just run here; I think the question is more about how you update these parameters. The optimiser runs here, and then the optimiser tells you "this is the update that I need to do". These updates, in the asynchronous case, get applied in the order in which they arrive at the parameter server, and in the synchronous case they get combined, in the usual way.
00:40:46
Thanks for the talk. I'm actually wondering: you mentioned C++ and Python at one point, so what is needed to bring it all onto Android, to run it all on phones? Do you need to rewrite the forward pass in Java as well?

As far as I'm aware, for Android there is an example that you can have a look at. There is some JNI code that you have to deal with, so with Java you do have to deal a bit with this mixing, but there are actually examples online and it's not that much code. So have a look at it; it's not that much, but you definitely have to deal with this way of calling C++ from Java.
00:41:37
So, thanks for being with us today. Since we have a bit of time, I would like to know how you use TensorFlow: whether you are developing it or you use it, and what you are doing with it in your job at Google.

Okay, so I am not developing TensorFlow, but I am a successful and happy user. I specifically work with recurrent neural networks for NLP-related tasks, and this is what I've been using TensorFlow for, quite successfully, for some time now.
00:42:17
I think the thing about it is that it depends a lot on where you come from. I knew Python from before, so that made it very easy for me to adopt TensorFlow. But because we have all these tutorials, I was actually surprised, looking at some of them, that you could easily pick it up even if you don't know machine learning and even if you don't know a prior framework. So it's not only about TensorFlow, and not only about switching from one framework to the other; it's about encouraging people who currently don't use deep learning at all to start doing it, because, as we will see with some examples, it's actually surprisingly easy to use and gives a very good user experience. Because it's in Python, at least the part that you mainly use for training and experimentation, you can easily integrate it with the things that you usually use for Python data analysis. I'm very much a Python person, and I really like that I can use, together with TensorFlow, the visualisation tools that I have been used to for the last five years. So I definitely see the Python part as a plus; it was also discussed yesterday.
00:43:31
So the license is Apache 2.0, yes, so it's kind of business friendly. Do you keep track of which companies are actually putting TensorFlow into products? Or maybe you do not have a way of tracking that.

I'm not aware of anything keeping track of it. The point is just to put it out there and give as much support as possible, so that people start using it.
00:44:01
Hi. I was wondering whether TensorFlow is parallelised on a single machine, via multiple cores, out of the box, or is it something that needs special configuration?

I think this is something that you have to configure yourself, yes.

For example, from previous experience with Caffe, it's simply a matter of using a normal library for parallelisation: you just launch things on multiple cores and they actually run in parallel, so you get the speed improvement directly, like with one line.

By using OpenBLAS, you mean?

Or, well, I don't remember the library name.

Okay. So I don't have an example of specifically how to use multiple processes, but if you have multiple GPUs, for example, on the same machine, that is pretty easy. I will show an example of how to distribute your graph if you have two GPU cards or two CPU cards; that is very easy to do, and it's basically one line.
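As a rough illustration of that one-line device placement (a minimal sketch assuming the TensorFlow 1.x Python API and a machine that actually has a second GPU; this is not code shown in the talk):

    import tensorflow as tf

    # The single extra line: pin this part of the graph to a chosen device.
    with tf.device("/gpu:1"):                      # assumes a second GPU is available
        a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
        b = tf.constant([[1.0, 0.0], [0.0, 1.0]])
        c = tf.matmul(a, b)

    # log_device_placement prints where each op actually ran.
    with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
        print(sess.run(c))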
00:45:23
Okay, so maybe we can go to the coffee break now, because it's only fifteen minutes, which is always a bit short. Thank you again.
