NIPS 2011 Domain Adaptation Workshop: History Dependent Domain Adaptation

>>LAVOIE: Thank you. I'm Allen Lavoie, and I'll be presenting History Dependent Domain Adaptation. This is joint work with co-authors
from Google Pittsburgh: Matt Otey, Nathan Ratliff, and D. Sculley, who's here today.
So the main idea of this paper is that standard methods in machine learning are often too
myopic for the domains they're applied to. Typically we're focused on minimizing
errors now and in the future, but many times machine learning is applied to systems which
have an expectation of consistency over time. So I'll start by giving an example,
then I'll formalize the problem just a little bit, we'll go over some of our proposed
solutions, and then we have some experimental results. You can think of the system
as employing a classifier that gets retrained periodically over time. In between each retraining,
we get a mix of old and new examples. You can think of these as slightly perturbed examples,
or maybe we just wanted to test whether our classifier has learned something new. But
the key is that we have a small additional cost for making the same errors we've already
made, and a much larger cost for making new errors. One intuition for this is
that humans are correcting errors through some process. So if we think of two of these
systems—one with a classifier which makes consistent errors, that is, the errors
between retrainings are more or less the same, and another system which
uses a classifier that makes a distinct set of errors every time we retrain—there's going to
be a big difference in the amount of human effort to maintain the system. Think
of this blue rectangle as our feature space; these pink circles are errors made by
classifiers on previous retrainings, and the red circles are errors made by the current
classifier. If we have a system whose classifier makes distinct errors, over time we're
going to make errors on many more data points, and this represents a huge human
cost for maintaining the classifier. If it's bad enough, we might as well just not
be doing classification. Whereas if we have consistent errors in the classifier, we're
going to make many fewer incremental errors, and over time we get a much smaller
space of misclassified examples, corresponding to a much lower human cost. So in general
we have a loss function which depends on our previous classifications, and specifically
we're going to look at the case where there's a low cost to repeating errors. Now the key
is that we have incomplete feedback: human corrections take
time, so even if we do get corrections for points, they're
going to be delayed quite a bit, and we may need to retrain several times before we have
any feedback. So the question we wanted to ask is: can we learn while minimizing new errors,
even if we don't know what the old errors are? Our first method is quite simple,
and it's a good baseline: we just took a linear combination of previous hypotheses, with exponentially
weighted averaging—and it worked well. We're using linear models, so
averaging weights is equivalent to averaging classifications. Beyond being
a simple baseline, it's also a fairly intuitive method: we can think of classifications
as random variables, and by taking a linear combination we're reducing the variance,
which decreases the chance that we're going to change our classifications. We also
tried several kinds of warm starts. The main idea is that we have some hypothesis and we
want to make some updates to it, but we don't want to update it too much. So we can use
fewer optimization steps, or we can use smaller step sizes. More generally, we can use
any learning algorithm, with the caveat that we don't want to update too much.
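As a rough illustration of these two ideas—not the authors' implementation; the function names, the decay parameter, and the hinge-style update are all assumptions—the exponentially weighted average of past weight vectors and a conservative small-step warm start might look like:

```python
import numpy as np

def ewma_hypothesis(weight_history, decay=0.7):
    """Exponentially weighted average of past weight vectors.

    For linear models, averaging weights is the same as averaging the
    models' scores, so this directly blends past classifications.
    `decay` (hypothetical) controls how quickly old hypotheses fade.
    """
    avg = np.zeros_like(weight_history[-1], dtype=float)
    total = 0.0
    for age, w in enumerate(reversed(weight_history)):  # newest first
        coef = decay ** age
        avg += coef * w
        total += coef
    return avg / total

def warm_start_update(w_prev, X, y, lr=0.01, epochs=1):
    """One conservative pass of SGD on the hinge loss, starting from
    the previous weights with a small step size so the new hypothesis
    stays close to the old one."""
    w = w_prev.copy()
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) < 1.0:  # margin violated: update
                w = w + lr * y_i * x_i
    return w
```

The point of both sketches is the same: the new hypothesis is anchored to the old one, either by explicit averaging or by limiting how far a training run can move the weights.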
In general, machine learning would say, "We have a function and we want to fit it
as well as we can, given this new data." Whereas here we're saying, "We want to fit the function
a little bit better, but we also want to stay consistent." So effectively we're using a different
loss function: we're training with a standard loss function, but our actual loss function
reweights the errors that we make. We also tried a weight-nearness constraint; here
we're doing a full optimization, but with a hard constraint. We start from
a prior hypothesis w_t, train a small update whose magnitude is at most delta, and then
add that to our previous hypothesis to get the constrained update. Finally, we tried
adding several regularization terms to the optimization. We tried two different things.
One was a squared-error term, which says the predictions of the current model should
be as close as possible to the predictions of the previous model. We also have a hinge
term, which says the predictions of the current model should have the same sign as the predictions
of the previous model, with a maximum margin. And this is actually equivalent to virtual
examples in the training set: instead of using only the true classifications of the points
in the training set, we add points with the classifications given by the previous
model. These are weighted, in the sense that whatever coefficient we have
on the regularization term is also the weight we use on the virtual examples.
>>AUDIENCE MEMBER: Where does the variable w_t come from? Why are you able to change w_{t+1}? [INDISTINCT]
>>LAVOIE: So, you can think of w_{t+1} as retraining from scratch. We
have a whole list of weights that we've retrained from scratch, and we just want to make sure
that we're close to the previous weight vector.
>>AUDIENCE MEMBER: And h_t of... the sign? [INDISTINCT]
>>LAVOIE: Oh, that's the—so if you think of hypotheses or weights, the previous hypothesis would include averaging. So
we're not just using the previous weight vector…
>>AUDIENCE MEMBER: I thought of…
>>LAVOIE: …we're using it with averaging—or not averaging, sorry. [INDISTINCT]—but yeah, it's just the classification
of the previous hypothesis, and that's actually a classification inner product. As far as evaluation goes, we actually have two things
we're looking to optimize. First is how well we're modeling—we wanted a discriminative model,
and that's some sort of instantaneous requirement. You can think of it in comparison to a baseline:
we want to make sure that, when we're constraining the model updates, we're not constraining
them to the point that we're inhibiting learning. So we want to do well in comparison to an
unconstrained model in terms of discriminative ability. But at the same time, we care about
the consistency of the system, and we measured this with cumulative unique false positives.
And these false positives are not just cumulative error. There's a great analogy to
drilling oil wells: if you're predicting which places are good to drill and you have a false
positive, this could be a huge cost if you actually drill down and test for oil. But
once you've already drilled, there's no extra cost to having a classifier which tells you
you should drill there again, because you're probably not going to go back to the same
place and drill another well. False negatives also have some cost, but there isn't the same
permanent cost to correct them. So that's the intuition for cumulative false positives. The actual
reason we used it is that at Google, on this data set, we determined that the cost
of false positives was higher—but it's a good analogy. And this is a sort of measure of overall performance
of the system. The exact definition: if we look at the sets of false positives made
by each of our hypotheses, the metric is just the union of those false positives.
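The metric just described can be sketched in a few lines (a hedged illustration, assuming each example has a stable identifier; the function name is made up):

```python
def cumulative_unique_false_positives(fp_sets):
    """Given, for each retraining, the set of example ids the current
    hypothesis marked as false positives on its new evaluation data,
    return the running size of the union: repeating an old false
    positive adds nothing; only never-before-seen ones count."""
    seen = set()
    counts = []
    for fps in fp_sets:
        seen |= set(fps)
        counts.append(len(seen))
    return counts
```

So a classifier that repeats its old mistakes keeps this count flat, while one that makes distinct errors on each retraining drives it up.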
And in either case, we're allowing hypotheses to be trained on all our previous data, but
we're only going to test them on our new data. So it's easy to calculate
just on the most recent slice of data that we've seen, and we only update
cumulative false positives for that newest data. As far as the actual data goes, we used
a Google data set of adversarial advertising. This is a very confidential
data set, which is one reason we used linear models here. Unfortunately our data
set is not public, so we also used a malicious URL identification data set. This is by [INDISTINCT].
And this is qualitatively similar, but it's a public data set. As for the actual results:
in each of these plots, the x-axis is the number of times we've retrained, and the y-axis
is the performance of the methods relative to a control. The thick black line is our
control, and we're plotting percent difference relative to it. On the left, we have the cumulative
unique false positives, which is the measure of the performance of the system as a whole, and the other plot is
our instantaneous discriminative ability. So you can see that the weight-based methods did
quite well: we get about a 40% reduction in cumulative unique false positives on this
data set, and we actually increase AUC slightly. In general, we were just hoping not to change
it too much, so that was a little bit surprising. What was also surprising is that our regularization
terms did not work out—they didn't work at all on this data set. One hypothesis we
have is that it's not difficult to come up with a hypothesis that is close to our previous
hypothesis on the training data, while still giving wildly different results on test data.
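For concreteness, the virtual-example view of the hinge regularizer mentioned earlier might be sketched like this. This is an illustrative sketch, not the paper's code; the function name and `reg_coef` parameter are assumptions. Note that the virtual labels are computed on the training points only, which is exactly why such a regularizer can still disagree with the previous model on unseen data:

```python
import numpy as np

def add_virtual_examples(X, y, w_prev, reg_coef):
    """Augment (X, y) with virtual examples labeled by the previous
    linear model's predictions. Training on the weighted result
    approximates adding a hinge term that asks the new model to match
    the old model's signs with a margin; `reg_coef` plays the role of
    the regularization coefficient."""
    y_prev = np.sign(X @ w_prev)               # previous model's labels
    X_aug = np.vstack([X, X])                  # each point appears twice
    y_aug = np.concatenate([y, y_prev])        # true labels, then virtual
    weights = np.concatenate([np.ones(len(y)),            # true examples
                              reg_coef * np.ones(len(y))])  # virtual ones
    return X_aug, y_aug, weights
```

Any learner that accepts per-example weights could then be trained on `(X_aug, y_aug, weights)` in place of the explicit regularization term.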
So one way to fix this would be to actually use unlabeled data to get a better estimate
of the regularization term and avoid overfitting it. That's filed under future
work. As for malicious URL classification, we see very similar results: the
weight-based methods did quite well, and the regularization terms actually worked out a little better here.
And here we are increasing AUC, which is slightly comforting, but not to an
extent that's non-trivial. This is actually a 50% reduction in the cumulative unique
false positive rate. So if you think of the cost of maintaining this system in terms of
the human time required, we've about cut it in half without seriously changing AUC. So that's
a really interesting result. To summarize: we presented the problem of History Dependent
Domain Adaptation, where our loss depends on our previous classifications, and this has
many real-world applications. We evaluated several solutions and got up to a 50% reduction
in cumulative unique false positives while maintaining a very high AUC. But there's lots
still to do. Obviously there's very little theory in this talk: why do certain methods outperform
others, and can we do better? I don't know of any theory that says there's an absolute
trade-off, and indeed, for a lot of these methods, we found that there wasn't an absolute trade-off
between instantaneous performance and overall system performance. So that would
definitely be an interesting area of study. And also, where can we make use of unlabeled
data? I gave one example with our regularization terms, and we think there are maybe a couple of
other places we can use unlabeled data. So that's it, thank you.
