A cloud, for our purposes, is a shared-nothing, networked group of
computers that we can use to run some computation in parallel on a
massive dataset.
A typical application is search engine log analysis; see
www.google.com/trends
An example log file, which can grow by terabytes per day:
query?Volcno Bat
query?Island palm tree
images?volcano
...
Workflow to analyze the log:
- Convert to lower case
- Map to correct spelling
- Aggregate and count
We need a dataflow language to manage each step.
Assume each set of words is a list.
- Initial words
["Bat", "Volcno", "bat"]
- Map to lower case
["bat", "volcno", "bat"]
- Map to correct spelling
["bat", "volcano", "bat"]
- Aggregate and count
[(1, "volcano"), (2, "bat")]
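The workflow above can be sketched as a small Haskell pipeline. This is only an illustration: the names corrections, fixSpelling, countWords, and analyze are hypothetical, and the spelling table holds just the one correction needed for the sample words. A real system would use a proper spelling model and would run each step in parallel across machines.

```haskell
import Data.Char (toLower)
import qualified Data.Map as Map

-- Hypothetical spelling table; just enough for the sample data.
corrections :: Map.Map String String
corrections = Map.fromList [("volcno", "volcano")]

-- Replace a word by its correction, or keep it if none is known.
fixSpelling :: String -> String
fixSpelling w = Map.findWithDefault w w corrections

-- Aggregate: count how often each word occurs.
countWords :: [String] -> [(Int, String)]
countWords ws =
  [ (n, w) | (w, n) <- Map.toList (Map.fromListWith (+) [ (x, 1) | x <- ws ]) ]

-- The whole workflow: lower case, correct spelling, aggregate and count.
analyze :: [String] -> [(Int, String)]
analyze = countWords . map fixSpelling . map (map toLower)
```

Applied to ["Bat", "Volcno", "bat"], analyze yields the same counts as the last step above; because Data.Map orders its keys, the pairs come out sorted by word, so (2, "bat") precedes (1, "volcano").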
Map/Reduce
In functional programming, these steps correspond to "map" and "reduce", which are
higher-order functions.
A higher-order function takes a function as a parameter (or produces
a function as a result). The
classic example is map, which applies a function
to every element in a list. The higher-order functions map
and reduce (also known as fold) are built into functional
languages like Haskell.
map (\x -> x * x) [1, 2, 3, 4]
would result in
[1, 4, 9, 16]
In Haskell itself, reduce is spelled foldl (or foldr), so
foldl (*) 1 [1, 2, 3, 4]
would result in
24
which is
1 * 1 * 2 * 3 * 4
So the factorial can also be defined as
factorial n = foldl (*) 1 [1..n]
Of course, if we lacked a specific function, we could always create it.
For instance, here is an implementation of the map function.
mapImpl :: (a -> b) -> [a] -> [b]
mapImpl _ [] = []
mapImpl f (x:xs) = f x : mapImpl f xs
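In the same spirit, reduce can be written by hand. The following is a sketch of a left fold; the names reduceImpl and factorialImpl are chosen here to match mapImpl and are not library functions. Haskell's built-in equivalent is foldl.

```haskell
-- Combine the elements of a list from left to right,
-- starting from an initial accumulator value.
reduceImpl :: (b -> a -> b) -> b -> [a] -> b
reduceImpl _ acc []     = acc
reduceImpl f acc (x:xs) = reduceImpl f (f acc x) xs

-- Factorial expressed with the hand-written reduce.
factorialImpl :: Integer -> Integer
factorialImpl n = reduceImpl (*) 1 [1..n]
```

With these definitions, reduceImpl (*) 1 [1, 2, 3, 4] evaluates to 24, the same value as the built-in fold.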
Pig Latin
Pig Latin is the dataflow language of Apache Pig, which is built on top of a map/reduce
architecture (Hadoop).
Kinds of objects
- a relation is a bag
- a bag is a collection of tuples (duplicates are allowed)
- a tuple is an ordered list of fields
- a field is a piece of data
An alias is a name bound to an object.
The following loads some data:
A = LOAD 'actor.csv' USING PigStorage(',') AS (id:int, name:chararray);
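To look at the data:
DUMP A;
To store the data (STORE needs an output location; 'actor_out' is a placeholder directory name):
STORE A INTO 'actor_out';
Projection: use FOREACH to iterate over the relation and generate a column.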
B = FOREACH A GENERATE name;
Selection: use a filter.
C = FILTER A BY id < 20;
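Projection and selection correspond closely to map and filter over a list of tuples. As a rough Haskell analogy (not how Pig itself is implemented), the sketch below uses a hypothetical in-memory list, actors, standing in for the contents of actor.csv; the sample rows and the function names are made up for illustration.

```haskell
-- An actor tuple: (id, name), mirroring the schema in the LOAD statement.
type Actor = (Int, String)

-- Made-up sample rows; the real data would come from actor.csv.
actors :: [Actor]
actors = [(7, "Ann"), (19, "Bob"), (42, "Cy")]

-- Analogue of: B = FOREACH A GENERATE name;
projectNames :: [Actor] -> [String]
projectNames = map snd

-- Analogue of: C = FILTER A BY id < 20;
selectSmallIds :: [Actor] -> [Actor]
selectSmallIds = filter (\(i, _) -> i < 20)
```

Here projectNames keeps only the name column, and selectSmallIds keeps only the rows whose id is below 20.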
Join: match the tuples of two relations on a common field.
E = LOAD 'address.csv' USING PigStorage(',') AS (name:chararray, address:chararray);
D = JOIN A BY name, E BY name;
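The effect of this join can be sketched in Haskell as a list comprehension over both relations. This is a plain nested-loop illustration under the assumption that both inputs fit in memory; Pig instead compiles the join into map/reduce jobs. The name joinByName is hypothetical.

```haskell
-- Tuples mirroring the two schemas: (id, name) and (name, address).
type Actor   = (Int, String)
type Address = (String, String)

-- Inner equi-join on the name field, written as a nested loop.
-- Each result carries the fields of both inputs, so the name
-- appears twice, once from each side.
joinByName :: [Actor] -> [Address] -> [(Int, String, String, String)]
joinByName actors addrs =
  [ (i, n, n', addr) | (i, n) <- actors, (n', addr) <- addrs, n == n' ]
```

Actors without a matching address, and addresses without a matching actor, simply drop out of the result.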
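Grouping collects the tuples that share a group-by value into a bag, one bag per value.
After GROUP, each tuple has two fields: group (the key) and a bag named after the grouped relation (here M).
M = FOREACH A GENERATE id % 3 AS mod, name;
N = GROUP M BY mod;
X = FOREACH N GENERATE group, COUNT(M);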
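The grouping step can be mimicked in Haskell with Data.Map, which makes the link back to map and reduce explicit: the keys are computed with a map, and the per-key bags are then reduced to counts. This is a sketch only; the names groupByMod and countByMod are hypothetical, and it assumes non-negative ids, since Haskell's mod and Pig's % treat negative numbers differently.

```haskell
import qualified Data.Map as Map

-- An actor tuple: (id, name).
type Actor = (Int, String)

-- Group names by id mod 3, like GROUP M BY mod.
-- flip (++) appends new names after existing ones, keeping input order.
groupByMod :: [Actor] -> Map.Map Int [String]
groupByMod actors =
  Map.fromListWith (flip (++)) [ (i `mod` 3, [n]) | (i, n) <- actors ]

-- Count the tuples in each group, like COUNT over each bag.
countByMod :: [Actor] -> [(Int, Int)]
countByMod = Map.toList . Map.map length . groupByMod
```

Each pair returned by countByMod holds a group key and the number of tuples that fell into that group.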