HTTP Archive – The State of the Web


SPEAKER: You’re a web developer wondering what a reasonable JavaScript bundle size is, or you’re a PhD student researching HTTP/2 adoption. Perhaps you’re on a web standards committee and you’re surveying how an API is used in the wild. Where do you go to find answers to big questions about the web? In this episode, we’re looking at the HTTP Archive and how you can use it to learn more about the state of the web.

[MUSIC PLAYING]

You can think of the HTTP Archive as a data center full of machines continuously testing hundreds of thousands of the most popular websites and recording everything there is to know about them: how many bytes of JavaScript are loaded, how long they took to download, whether any images could have been optimized, and much, much more. As you can imagine, with this much data, we can learn some pretty amazing things about the web. So how do we start making sense of it?
HttpArchive.org is the place to go for web stats and trends at your fingertips. Common questions about the web, like the size of the typical web page or HTTPS adoption, are all answered here. You can even go through seven years of historical data to see how the web has evolved and where the trends are taking us. As a community of web developers, this kind of data is crucial for knowing whether we need any sort of course correction, or for confirming that we are, in fact, heading in the right direction. And this is actually a really good time to be using the HTTP Archive, because there’s a new version being released in early 2018 with an upgraded UI and lots more modern metrics.
But what if the stats you’re interested in are so specific that they’re not available on the website? This is where BigQuery comes in. The HTTP Archive data is like an iceberg, with a hand-picked set of interesting metrics exposed on the website but so much more to be explored beneath the surface. On BigQuery, you can mine terabytes of raw data using simple SQL queries. So let’s dive in and see what kind of insights we can extract.

The Summary Pages dataset contains high-level data for each crawl. Many of the stats surfaced on the website come directly from here.
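As a taste of what those queries look like, here is a minimal sketch that pulls the median page weight and JavaScript payload from one desktop crawl. The crawl date and the bytesTotal/bytesJS column names are assumptions based on the public summary_pages schema, so check the dataset in the BigQuery UI before running it.

    #standardSQL
    -- Median page weight and JavaScript bytes for one desktop crawl.
    -- Table and column names are assumed from the public schema; verify them first.
    SELECT
      APPROX_QUANTILES(bytesTotal, 100)[OFFSET(50)] / 1024 AS median_total_kb,
      APPROX_QUANTILES(bytesJS, 100)[OFFSET(50)] / 1024 AS median_js_kb
    FROM
      `httparchive.summary_pages.2018_01_01_desktop`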
We can go deeper into the raw Lighthouse results to learn more about the progressiveness of the web. For example, we can query for how many websites pass or fail the audit that checks whether a service worker is installed.
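A query along these lines answers that for a single mobile crawl. It assumes the lighthouse tables store each raw Lighthouse report as a JSON string in a report column; adjust to the actual schema if it differs.

    #standardSQL
    -- Sites passing the Lighthouse service worker audit in one mobile crawl.
    -- Assumes each row's `report` column holds the raw Lighthouse JSON.
    SELECT
      COUNTIF(JSON_EXTRACT(report, '$.audits.service-worker.score') = 'true') AS passing,
      COUNT(0) AS total_sites
    FROM
      `httparchive.lighthouse.2018_01_01_mobile`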
According to the data, about 2,400, or 0.6%, of the sites tested actually have one. When we compare this to the available historical data, there is a clear increasing trend.
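One way to sketch that trend is a wildcard query over every available lighthouse crawl, grouped by crawl date. This assumes the per-crawl tables share a compatible schema and that mobile tables follow the _mobile naming suffix.

    #standardSQL
    -- Percentage of sites passing the service worker audit, per mobile crawl.
    SELECT
      _TABLE_SUFFIX AS crawl,
      ROUND(
        COUNTIF(JSON_EXTRACT(report, '$.audits.service-worker.score') = 'true') * 100 / COUNT(0),
        2) AS pct_passing
    FROM
      `httparchive.lighthouse.*`
    WHERE
      _TABLE_SUFFIX LIKE '%_mobile'
    GROUP BY
      crawl
    ORDER BY
      crawl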
Let’s look into another hot topic on the web: the use of cryptocurrency-mining JavaScript. Since every request is logged by the HTTP Archive, we can query for patterns, like whether the request URL includes a known mining library.
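For instance, a sketch of that kind of query, using Coinhive’s script name as the pattern. It assumes the requests tables expose page (the page URL) and url (the request URL) columns; treat any hits as leads to verify rather than proof.

    #standardSQL
    -- Pages that request a script whose URL matches a known mining library.
    -- Assumes the requests tables expose `page` and `url` columns.
    SELECT DISTINCT
      page
    FROM
      `httparchive.requests.2018_01_01_desktop`
    WHERE
      url LIKE '%coinhive.min.js%'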
Of course, there are ways to conceal it from the URL, and there could be false positives, but this gives us a rough idea. Things really heat up when we start exploring the particular websites that include such code. For example, what will we find if we limit our search to .gov or .edu websites?
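Narrowing the previous sketch is just a matter of filtering on the page’s hostname, for example:

    #standardSQL
    -- The same search, limited to .gov and .edu pages.
    SELECT DISTINCT
      page
    FROM
      `httparchive.requests.2018_01_01_desktop`
    WHERE
      url LIKE '%coinhive.min.js%'
      AND (NET.HOST(page) LIKE '%.gov' OR NET.HOST(page) LIKE '%.edu')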
So those are just a few examples of the power of the HTTP Archive. It’s a super useful tool for learning about how the web is built. In the upcoming episodes, HTTP Archive data will form the basis of many more of our insights. Also, be sure to check out Discuss.HttpArchive.org, where people get SQL help, share interesting analyses, and stay up to date with the latest changes. Thanks for watching.
