>>LAVOIE: Thanks. Thank you. I'm Allen Lavoie, and I'll be presenting History Dependent Domain Adaptation. This is joint work with co-authors from Google Pittsburgh: Matt Otey, Nathan Ratliff, and D. Sculley, who's here today.

So the main idea of this paper is that standard methods in machine learning are often too myopic for the domains they're applied to. Typically we're focused on minimizing errors now and in the future, but many times machine learning is applied to systems which have an expectation of consistency over time. So I'll start by giving an example

and then I’ll formalize the problem just a little bit and we’ll go over some of our proposed

solutions, and then we have some experimental results. So you can think of the system we’re

employing a classifier that gets retrained periodically overtime. In between each retraining,

we get a mix of old and new examples. You can think of these as slightly perturbed examples

or maybe we just wanted to test to see if our classifiers learned something new. But

the key is that we had a small additional cost for making the same errors we’ve already

made. But–look much longer it cost for making new errors. And one intuition for this is

that humans are correcting errors through some process. So think of two of these systems: one uses a classifier which makes consistent errors, that is, the errors between retrainings are more or less the same; the other uses a classifier which makes a distinct set of errors every time we retrain. There's going to be a big difference in the amount of human effort needed to maintain these systems. So if we think

of this blue rectangle as our feature space, the pink circles are errors made by classifiers on previous retrainings, and the red circles are errors made by the current classifier. If we have a classifier that makes distinct errors, over time we're going to make errors on many more data points, and this represents a huge human cost for maintaining the classifier. If it's bad enough, we might as well not be doing classification at all. Whereas if the classifier makes consistent errors, we're going to make many fewer incremental errors, and over time we get a much smaller space of misclassified examples, corresponding to a much lower human cost. So in general

we have a loss function which depends on our previous classifications, and specifically we're going to look at the case where there's a low cost to repeating errors. Now the key is that we have incomplete feedback: human corrections take time, so even if we do get corrections for points, they're going to be delayed quite a bit, and we may need to retrain several times before we have any feedback. So the question we want to ask is: can we learn while minimizing new errors,

even if we don’t know what the old errors are. So our first method is quite simple.

It’s a good baseline. We just took a linear combination or previous hypothesis. And exponentially

weighed averaging is–it was too good that it worked. And we’re using linear models so

averaging weights is equivalent to averaging classifications. And this is because it’s

a simple baseline. It’s also a currently intuitive method. And that we can think of classifications

as a random variable. And by taking a linear combination we’re reducing the variants. So

that’s going to decrease the chance that we’re going to change our classifications. We also

tried several kinds of warm-starts. The main idea is that we have some hypothesis and we want to make some updates to it, but we don't want to update it too much. So we can use fewer optimization steps, or we can use smaller step sizes. And more generally, we can use any learning algorithm with the caveat that we don't want to update too much.
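A minimal sketch of these first two methods, exponentially weighted averaging of past hypotheses and a warm-started update (the smoothing parameter, learning rate, and step count below are illustrative assumptions, not values from the talk):

```python
import numpy as np

def exp_weighted_average(hypotheses, alpha=0.5):
    # Exponentially weighted average of past weight vectors.
    # For linear models, averaging weights is equivalent to
    # averaging the classifiers' real-valued outputs.
    avg = hypotheses[0]
    for w in hypotheses[1:]:
        avg = alpha * w + (1 - alpha) * avg
    return avg

def warm_start_update(w_prev, X, y, lr=0.01, steps=10):
    # Warm start: begin from the previous hypothesis and take only
    # a few small gradient steps (logistic loss here), so the new
    # model cannot drift far from the old one.
    w = w_prev.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))  # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)    # average log-loss gradient
    return w
```

Because everything is linear, averaging the weight vectors and averaging the resulting classifications produce the same decisions.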

In general I’m not in learning would say, “We have a function and we want to fit it”,

as well as we can–given this new data. Whereas here, we’re saying, “We want to fit the function

a little bit better but we also want to stay consistent. So effectively we’re using a different

laws function. We’re training with standard laws function and our actual laws function

reweights the errors that we make. We also tried a weight nearness constraint; here we're doing a full optimization, but with a hard constraint. So we start out with a prior hypothesis WT, we train a small update whose magnitude is at most delta, and then we can add that to our previous hypothesis to get the constrained update. Finally we tried

adding several regularization terms to the optimization. There are two different things we tried. One was a squared error term, which says the predictions of the current model should be as close as possible to the predictions of the previous model. We also have a hinge term, which says the predictions of the current model should have the same sign as the predictions of the previous model, with a maximum margin. And this is actually equivalent to adding virtual examples to the training set: instead of using the true classifications of the points in the training set, we would add points with the classifications given by the previous model. And these are weighted, in the sense that if we have some coefficient on the regularization term, that's also the weight we're going to use on the virtual

examples.

>>Did the variable WT come down? Was the variable–why were you able to change WT plus one? [INDISTINCT]

>>LAVOIE: So, you can think of WT plus one as retraining from scratch. So we have a whole list of weights that we've retrained from scratch, and we just want to make sure that we're close to the previous weight vector.

>>And HT of LT they sign probably, three

to ten? They…

>>LAVOIE: Oh, that's the–so if you think of the hypotheses as weights, the previous hypothesis would include the averaging. So we're not just using the previous weight vector…

>>I thought of…

>>LAVOIE: …we're using it with averaging–or not averaging, sorry, depending on the method–but yeah, it's just the classification of the previous hypothesis. And that's actually

a classification inner product. As far as evaluation goes, we actually have two things we're looking to optimize. First is how well we model the data–we want a discriminative model, and that's some sort of instantaneous requirement. So you can think of it in comparison to a baseline: we want to make sure that, when we're constraining the model updates, we're not constraining them to the point that we're inhibiting learning. So we want to do well in comparison to an unconstrained model in terms of discriminative ability. But at the same time we care about the consistency of the system, and we measure this with cumulative unique false positives.

And these false positives are not just cumulative error. There's a great analogy to drilling oil wells. If you're predicting which places are good to drill and you have a false positive, this could be a huge cost if you actually drill down and test for oil. But once you've already drilled, there's no extra cost to having a classifier which tells you you should drill there again, because you're probably not going to go back to the same place and drill another well. False negatives also have some cost, but there's not the same permanent cost to correct them. So that's why we use cumulative false positives. So the actual

reason we used it is that at Google, on this data set, we determined that the cost was higher–but that's a good analogy. And this is a sort of measure of the overall performance of the system. And the exact definition is: if we look at the sets of false positives made by each of our hypotheses, it's just the union of those false positives.
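A minimal sketch of this metric (the function and variable names are mine, not from the talk), where each retraining contributes the set of IDs of examples it falsely flagged:

```python
def cumulative_unique_false_positives(fp_sets):
    # Union of false-positive example IDs across all retrainings:
    # repeating an old false positive adds nothing, while each new
    # false positive permanently grows the set (and the human cost).
    seen = set()
    for fps in fp_sets:
        seen |= set(fps)
    return len(seen)

# A consistent classifier repeats the same two mistakes each round;
# an inconsistent one makes two fresh mistakes every time.
consistent = cumulative_unique_false_positives([{1, 2}, {1, 2}, {1, 2}])    # 2 unique
inconsistent = cumulative_unique_false_positives([{1, 2}, {3, 4}, {5, 6}])  # 6 unique
```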

And in either case, we're allowing hypotheses to be trained on all our previous data, but we're only going to test them on our new data. So it's easy to calculate just on the most recent slice of data that we've seen, and we only update cumulative false positives for that newest data. As far as the actual data goes, we used

a Google data set of adversarial advertising. This is a very confidential data set, which is one reason we used linear models here. Unfortunately our data set is not public, so we also used a malicious URL identification data set. This is by [INDISTINCT], and it's quantitatively similar, but it's a public data set. As far as the actual results,

so in each of these plots, the X axis is the number of times we've retrained and the Y axis is the performance of the baselines relative to a control. The thick black line is our control, and the curves show percent difference relative to it, so less than zero is a reduction. On the left we have the cumulative unique false positives, which is the measure of the performance of the system as a whole, and the other plot is our instantaneous discriminative ability. So you can see that the weight-based methods did quite well: we get about a 40% reduction in cumulative unique false positives on this data set, and we actually increase AUC slightly. But in general we were just hoping not to change it too much, so that was a little bit surprising. What is also surprising is that our regularization

terms did not work out–they didn't work at all on this data set. And one hypothesis we have is that it's not difficult to come up with a hypothesis that is close to our previous hypothesis on the training data while still giving wildly different results on test data.
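For concreteness, here is one way to write down the two consistency regularizers described earlier (a sketch under my own naming; the margin value is illustrative). Both compare the current model's raw scores against the previous model's on some set of points, which is also where unlabeled data could plug in:

```python
import numpy as np

def consistency_penalties(w, w_prev, X, margin=1.0):
    # Raw linear scores of the previous and current hypotheses.
    old = X @ w_prev
    new = X @ w
    # Squared term: new predictions should stay numerically close
    # to the old ones.
    squared = np.sum((new - old) ** 2)
    # Hinge term: new predictions should keep the old sign with a
    # margin. This is what's equivalent to training on "virtual
    # examples" labeled by the previous model, weighted by the
    # penalty coefficient.
    hinge = np.sum(np.maximum(0.0, margin - np.sign(old) * new))
    return squared, hinge
```

Passing unlabeled points as `X` would give the estimate of these terms beyond the training data that the talk suggests.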

So one way to fix this would be to actually use unlabeled data to get a better estimate of the regularization term and avoid overfitting it. So that's filed under future

work. As for malicious URL classification, we see very similar results. The weight-based methods did quite well, and the regularization terms actually worked out a little bit here. And here we are increasing AUC, which is slightly comforting, though not to a non-trivial extent. And this is actually a 50% reduction in the cumulative unique false positive rate. So if you think of the cost of maintaining this system in terms of human time required, we've about cut it in half without seriously changing AUC. So that's

a really interesting result. So, to summarize: we presented the problem of History Dependent Domain Adaptation, where our loss depends on our previous classifications, and this has many real-world applications. We evaluated several solutions and got up to a 50% reduction in cumulative unique false positives while maintaining a very high AUC. But there's lots still to do. Obviously there's very little theory in this talk. Why do certain methods outperform

others? And can we do better? I don't know of any theory that says there's an absolute trade-off, and indeed for a lot of these methods we found that there wasn't an absolute trade-off between instantaneous performance and overall system performance. So that would definitely be an interesting area of study. And also, where can we make use of unlabeled data? I gave one example with our regularization terms, and we think there are maybe a couple of other places we can use unlabeled data. So that's it, thank you.