On June 9, the Wall Street Journal reported
that for the last few years the National Security Agency has been
relying on a software program “with the quirky name Hadoop” to help it
make sense of its enormous collections of data. Named after a toy
elephant that belonged to the child of one of the original developers of
the program, “Hadoop,” reported the Journal, is a crucial part of “a
computing and software revolution … a piece of free software that lets
users distribute big-data projects across hundreds or thousands of
computers.”
“Revolution” is probably the most overused word in the
chronicle of Internet history, but if anything, the Wall Street Journal
undersold the real story. Hadoop’s importance to how we live our lives
today is hard to overstate. By making it economically feasible to
extract meaning from the massive streams of data that increasingly
define our online existence, Hadoop effectively enabled the surveillance
state.
And not just in the narrowest, Big Brother,
government-is-watching-everyone-all-the-time sense of that term. Hadoop
is equally critical to
corporate surveillance.
Facebook, Twitter, Yahoo, Amazon, Netflix — just about every big player
that gathers the trillions of data “events” generated by our everyday
online actions employs Hadoop as a part of their arsenal of Big
Data-crunching tools. Hadoop is everywhere — as one programmer told me,
“it’s taken over the world.”
The Journal’s description of Hadoop
as “a piece of free software” barely scratches the surface of the
significance of this particular batch of code. In the past half-decade
Hadoop has emerged as one of the triumphs of the non-proprietary,
open-source software programming methodology that previously gave us the
Apache Web server, the Linux operating system and the Firefox browser.
Hadoop belongs to nobody. Anyone can copy it, modify it and extend it as they
please. Funny, that: A software program developed collaboratively by
programmers who believe that their code should be shared in as open and
transparent a process as possible has resulted in the creation of tools
that everyone from the NSA to Facebook uses to annihilate any semblance
of individual privacy. But what’s even more ironic, and fascinating, is
the sight of intelligence agencies like the NSA and CIA joining in and
becoming participants
in the world of open source big data software. The NSA doesn’t just
use Hadoop. NSA programmers have improved and extended Hadoop and donated
their changes and additions back to the larger community. The CIA
has invested in a Hadoop start-up. They’re all in it together.
The spooks and the social media titans and the online commerce goliaths
are collaborating to improve data-crunching software tools that enable
the tracking of our behavior in fantastically intimate ways that simply
weren’t possible as recently as four or five years ago. It’s a new
military industrial open source Big Data complex. The gift economy has
delivered us the surveillance state.
Hadoop’s
earliest roots
go back to 2002, when Doug Cutting, then the search director at the
Internet Archive, and Michael Cafarella, a graduate student at the
University of Washington, started working on an open-source search
engine called “Nutch.” But the project did not get serious traction
until Cutting joined Yahoo and began to merge his work into Yahoo’s
larger strategic goal of improving its search engine technology so as to
better compete with Google. Significantly, Yahoo executives decided not
to make the project proprietary. In 2006, they blessed the formation of
Hadoop, an open-source project managed under the auspices of the
Apache Software Foundation. (For a much more detailed look, please read the
four-part history of Hadoop at GigaOm.)
Hadoop is basically
a nifty hack. The definition, per Wikipedia,
is surprisingly simple:
“It supports the running of applications on large clusters of commodity
hardware.” Bottom line, Hadoop provides a means for distributing both
the storage and processing of an enormous amount of data over lots and
lots of relatively inexpensive computers. Hadoop turned out to be cheap,
fast and scalable — meaning it could expand smoothly in capacity as the
flows of data it was crunching burgeoned in size, simply by
plugging extra computers into the network. Hadoop was also
fundamentally modular — different parts of it could be easily replaced
by custom-designed chunks of software, making it seamlessly adaptable to
the individual circumstances of different corporations — or government
agencies.
Hadoop’s debut was timely, addressing not only the
problems Yahoo faced in managing the enormous amounts of data produced
by its users, but also those that the entire Internet industry was
simultaneously struggling to cope with. Basically, the Internet had
become a victim of its own success. The enormous flows of data generated
by users of the likes of Facebook and Twitter far overwhelmed the
ability of those companies to make sense of it. There was too much
coming in too fast. Hadoop helped companies cope with the tsunami — it
was,
in the words of Jeff Hammerbacher, an early employee of Facebook, “our tool for exploiting the unreasonable effectiveness of data.”
Before
Hadoop, you were at the mercy of your data. After Hadoop, you were in
charge. You could figure out all kinds of interesting things. You could
recognize patterns in the data and start to make inferences about what
might happen if you made tweaks to your product. What did users do when
the interface was adjusted like
this? What kinds of ads made
them more likely to pull out their credit cards? What did that batch of
millions of Verizon calls reveal about the formation of a potential
terrorist cell? Facebook wouldn’t be able to exploit the insights of its
so-called
social graph without tools like Hadoop.
“Hadoop
has become the de facto standard tool for cost-effectively processing
Big Data,” says Raymie Stata, who served as chief technology officer at
Yahoo before eventually starting his own Hadoop-focused start-up,
Altiscale.
And the significance of being able to cheaply process Big Data, to
accurately “measure” what your users are doing, he added, is a “big
deal.”
“Once you can measure what’s happening ‘out there’ — [you
can] then use those measurements to understand and ultimately influence
what’s happening out there.”
With engineers at multiple companies
recognizing that Hadoop offered solutions to the specific challenges
they faced on a daily basis, Hadoop quickly secured the critical mass of
cross-industry support necessary for an open-source software program to
become an essential part of Internet infrastructure. Even engineers at
Google chipped in, although Hadoop, at its core, was basically an
attempt to reverse-engineer proprietary Google technology. But that’s
just how the Internet has historically worked. For decades, so-called
gift economy collaboration, in which the community as a whole benefits
from the freely donated contributions of its members, has been a potent
driver of Internet software evolution. As I wrote 16 years ago, when
chronicling the birth of the Apache Web server,
the success of open source software “testifies to the enduring vigor of
the Internet’s cooperative, distributed approach to solving problems.”
Hadoop, which down to its fundamental structural essence
is a distributed approach to solving problems, emblematized this philosophy at its core.
So,
in a sense, Hadoop’s success was just the same old story. But back in
the mid-’90s, around the time that one of the first open source success
stories, the Apache Web server, was taking off, I’m not sure that anyone
would have predicted that the National Security Agency and CIA would
end up becoming stalwart participants in the gift economy. Even though
it makes total sense,
in principle, that the fruits of
government-funded software development should be shared with the general
public, there’s still something cognitively disjunctive about
intelligence agencies that shroud their every activity in great secrecy
contributing to projects built on openness and transparency. On the one
hand, employees of the NSA are appearing at conferences discussing how
they have adapted Hadoop to solve the problems of dealing with
unimaginably huge data sets, but on the other hand, we’re not supposed to know anything about what they are actually doing with that data.
The
intertwining of the intelligence agencies with the larger open source
software community could hardly be more incestuous. In 2008, a group of
Yahoo employees that eventually included Doug Cutting
formed a start-up called Cloudera, designed to commercialize Hadoop.
The CIA, through its In-Q-Tel (named after James Bond’s Q character) venture capital arm,
was an early investor in, and customer of, Cloudera. The NSA built a significant piece of software called
Accumulo that works “on top” of Hadoop, designed to add sophisticated
security controls managing how data could be accessed, and then promptly
donated that code to the Apache Software Foundation. Later, a group of NSA software engineers formed another spinoff company,
Sqrrl, to commercialize Accumulo.
What all this means is that the improvements to tools that the NSA is making, with the aim of
more efficiently catching terrorists,
are propagating into the private sector where they will be used by
Facebook and Netflix and Yahoo to more accurately target ads or
influence our purchasing behavior or provide us with content
algorithmically shaped
to our very specific desires.
And vice versa. Innovations and increased capabilities pioneered by
private companies trickle back to the NSA. The collective boot-strapping
never stops.
Again, in principle, there is nothing necessarily
wrong
going on here. There is no one to blame. Some of the fiercer apologists
for unfettered free markets might complain that government involvement
in open source projects unfairly competes with private sector
proprietary businesses, but a much stronger case can be made that any
software development work that is funded by taxpayer money should
by definition be considered freely sharable with the wider public. The NSA should
probably
be applauded for helping to improve Hadoop. And if the capabilities
unlocked by Hadoop result in the prevention of some horrific terrorist
act, then every programmer who contributed a line of code to the project
justly deserves some congratulation.
But there’s also an intriguing inversion occurring here of what, for better or worse, we might call the
purpose
of the Internet. The Internet was initially created by the U.S.
government to facilitate the sharing of information between
geographically separate research centers. The Internet took off in the
mid-’90s in large part because the general public recognized it as a
phenomenal tool for sharing information with each other. The fact that
so much of the Internet’s infrastructure was also built from code that
was freely shared seemed like a pleasing match of form and function.
Free
software and open-source software evolution is frequently driven not so
much by hope for financial gain but by individuals looking to solve
their immediate engineering problems. Over time, on the Internet at
large, one of those problems has turned out to be the gnarly challenge
of how to manage all the data created by all those people sharing so
promiscuously with each other. Hadoop can justly be seen as the natural
response to all that promiscuous sharing. And it certainly helped solve
the problems faced by engineers at Facebook and elsewhere.
But
what ended up getting enabled by the success of Hadoop is something
significantly different than good old peer-to-peer sharing. The ability
to make sense out of petabytes of data isn’t necessarily useful to you
or me. But it’s god’s gift to the profit-minded corporations and
terrorist-hunting intelligence agencies seeking to leverage the data we
generate for their own purposes, to measure our behavior and ultimately
to influence it. That could mean Netflix figuring out exactly what
combination of plot twists and acting talent proves irresistible to
streaming video watchers or Facebook figuring out exactly how to stock
our newsfeeds with advertisements that generate acceptable click-through
or Twitter knowing exactly where we are on the surface of the planet so
it can pop up a sponsored tweet pushing a coupon for a happy hour at
the bar just down the street — or the NSA spotting a peculiar pattern of
pressure cooker purchases. This is no longer about sharing information
with each other; it’s about manipulation, control and punishment. It’s
about keeping stock prices up. We’re a long, long way here from the
ideal gift economy, where everyone brings their home-cooked delicacy to
the potlatch. We’ve arrived at a destination where the tools offer more
power to
them than to
us.
I posed a version of
this analysis to Michael Cafarella, one of the original authors of
Hadoop, now a computer scientist at the University of Michigan. He
conceded that “there’s a certain irony that the open ideas of open
source have enabled the construction of systems that can undermine
openness so substantially.”
But Raymie Stata, who has been closely
involved with the growth of Hadoop for the last seven years, warned
against “conflating ‘open source software’ with ‘Open Society.’”
“Everyone
involved with Hadoop in the early days certainly did believe that
Hadoop, as a piece of open source software, would make the world a
better place. I can’t say, back then, that we saw Hadoop moving from
cyberspace to the real world, but we did recognize that it would become
foundational to building Internet applications of the future, and we
wanted to contribute to advancing that agenda.
“But individuals
who find common ground in contributing to open source projects do not,
as a whole, share beliefs on what constitutes the ideal ‘Open Society,’”
said Stata. “Is using Big Data to make inferences about people a Bad
Thing at all, no matter who does it? Or is it no big deal? Or does it
depend on who’s doing it, and for what reason (and with what
transparency)? Should we be more worried about Big Business, or Big
Government?”
“I guess in some ways this incident is evidence that
it’s hard to encode ideals in a piece of software,” said Cafarella. “The
right way to do that is via legislation.”
Cafarella’s point is
hard to dispute. Brian Behlendorf, one of the founders of the Apache
Software Foundation, told me that at one juncture, contributors to the
various software projects managed by Apache had argued over whether the
license that determined the rules for how their code could be shared
should include restrictions against organizations using that code for
purposes deemed morally or ethically unacceptable by the open source
software programmer community. But it was relatively quickly determined
that to attempt such restrictions would open up an impossible-to-resolve,
subjective can of worms. Society at large has to figure out what limits
it wants to put on the surveillance state, on what either Facebook or
the NSA is allowed to do.
It’s also important to acknowledge that
as users of online services, we benefit in many ways from our
instant-gratification, access-to-everything, always-on lives. But still:
When we first started to log on, did we realize what the tradeoffs
would be? Did we know that we were entering the Panopticon? That we
would be making it substantially
easier than ever before for governments and businesses to track our behavior and monitor our every whim?
Behlendorf
says we kind of did. He recalls his days, fresh out of college in 1995,
working for HotWired, Wired magazine’s first foray into online
publishing. AT&T was running an ad on HotWired, under the theme
“Imagine the Future,” that pictured an arm with a “wrist-watch phone” on
it.
“Someone printed it out,” said Behlendorf, “put it up on the
wall, and wrote in black marker over the top of the ad, ‘NSA primate
tracking device.’”
And guess what? We went ahead and built it.