The Aspect-Oriented Data Project 
Cloud Computing Lecture Notes
 
 
A cloud, for our purposes, is a shared-nothing, networked group of computers that we can use to run some computation in parallel on a massive dataset.

A typical application is search engine log analysis; see
www.google.com/trends

An example log file follows; such logs grow by terabytes per day.

  query?Volcno Bat
  query?Island palm tree
  images?volcano
  ...

Workflow to analyze the log

  1. Convert to lower case
  2. Map to correct spelling
  3. Aggregate and count

We need a dataflow language to manage each step. Assume each set of words is a list.

  1. Initial words
       ["Bat", "Volcno", "bat"]
    
  2. Map to lower case
       ["bat", "volcno", "bat"]
    
  3. Map to correct spelling
       ["bat", "volcano", "bat"]
    
  4. Aggregate and count
       [(1, "volcano"), (2, "bat")]
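
The four steps above can be sketched as a chain of list functions. This is only an illustration, not how a cloud system would run it: the one-entry spelling table and the function names (corrections, lowerAll, correctAll, countAll, analyze) are made up for these notes, and the result is sorted by count so that it lines up with the list in step 4.

```haskell
import Data.Char (toLower)
import Data.List (sort)
import qualified Data.Map.Strict as Map

-- Hypothetical spelling table; a real system would use a far larger dictionary.
corrections :: Map.Map String String
corrections = Map.fromList [("volcno", "volcano")]

-- Step 2: map every word to lower case.
lowerAll :: [String] -> [String]
lowerAll = map (map toLower)

-- Step 3: map known misspellings to their correction; other words pass through.
correctAll :: [String] -> [String]
correctAll = map (\w -> Map.findWithDefault w w corrections)

-- Step 4: aggregate equal words and count them, giving (count, word) pairs.
countAll :: [String] -> [(Int, String)]
countAll ws = sort [ (n, w) | (w, n) <- Map.toList counts ]
  where counts = Map.fromListWith (+) [ (word, 1) | word <- ws ]

-- The whole workflow is the composition of the three steps.
analyze :: [String] -> [(Int, String)]
analyze = countAll . correctAll . lowerAll

main :: IO ()
main = print (analyze ["Bat", "Volcno", "bat"])
-- prints [(1,"volcano"),(2,"bat")]
```

Steps 2 and 3 touch each word independently, so they can run in parallel on different machines; only step 4 has to bring equal words together.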

Map/Reduce

In functional programming, these steps are instances of "map" and "reduce", which are higher-order functions. A higher-order function takes a function as a parameter (or produces a function as a result). The classic example is map, which applies a function to every element in a list. The higher-order functions map and reduce (called fold; in Haskell, foldl and foldr) are built into functional languages like Haskell.
   map (\x -> x * x) [1, 2, 3, 4]
would result in
   [1, 4, 9, 16]
and
   foldl (*) 1 [1, 2, 3, 4]
would result in
   24
which is
     1 * 1 * 2 * 3 * 4
So the factorial can also be defined as
   factorial n = foldl (*) 1 [1..n]

Of course, if we lacked a specific function, we could always create it. For instance, here is an implementation of the map function.

    mapImpl _ []     = []
    mapImpl f (x:xs) = f x : mapImpl f xs
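
In the same spirit, reduce can be written by hand. The sketch below assumes a left-to-right fold that starts from an initial value; the name reduceImpl is made up for these notes, and it is meant to behave like foldl from the Haskell Prelude.

```haskell
-- Hypothetical hand-written reduce; intended to behave like the Prelude's foldl.
reduceImpl :: (b -> a -> b) -> b -> [a] -> b
reduceImpl _ acc []     = acc                         -- empty list: return the accumulator
reduceImpl f acc (x:xs) = reduceImpl f (f acc x) xs   -- combine the head, then recurse on the tail

-- Factorial defined through the hand-written reduce.
factorial :: Integer -> Integer
factorial n = reduceImpl (*) 1 [1..n]

main :: IO ()
main = do
  print (reduceImpl (*) 1 [1, 2, 3, 4])  -- prints 24
  print (factorial 5)                    -- prints 120
```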

Pig Latin

Pig Latin is the dataflow language of Pig, a system built on top of a map/reduce architecture (Hadoop).

Kinds of objects

  • a relation is a bag
  • a bag is a collection of tuples (duplicates are allowed)
  • a tuple is an ordered list of fields
  • a field is a piece of data

An alias is a name bound to an object.
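
One way to picture these objects is as nested types. The sketch below is an assumption made for illustration, not Pig's actual implementation: it models only integer and text fields, uses a plain list for a bag, and the two sample rows are invented.

```haskell
-- A field is a single piece of data; only two field types are modeled here.
data Field = IntField Int | TextField String
  deriving (Show, Eq)

-- A tuple is an ordered list of fields.
type Tuple = [Field]

-- A bag is a collection of tuples; a plain list stands in for it.
type Bag = [Tuple]

-- An alias is a name bound to an object, here a relation (a bag) with invented rows.
actors :: Bag
actors =
  [ [IntField 1, TextField "Ann"]
  , [IntField 2, TextField "Bob"]
  ]

main :: IO ()
main = mapM_ print actors
```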

The following loads some data from a comma-separated file, naming and typing its two columns.

  A = LOAD 'actor.csv' USING PigStorage(',') AS (id:int, name:chararray);

To look at the data, dump it to the screen.

  DUMP A;

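To store the data, name an output location (the directory name 'actor_out' is only an example).

  STORE A INTO 'actor_out' USING PigStorage(',');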

Projection: use FOREACH to iterate over the tuples of a relation and generate the chosen column.

  B = FOREACH A GENERATE name;

Selection: use FILTER to keep only the tuples that satisfy a condition.

  C = FILTER A BY id < 20;
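
Projection and selection correspond to the map and filter functions on lists. The sketch below assumes a relation is a list of (id, name) pairs; the three sample rows are invented stand-ins for the contents of actor.csv.

```haskell
-- One row of the relation: (id, name).
type Actor = (Int, String)

-- Invented sample rows standing in for actor.csv.
a :: [Actor]
a = [(7, "Ann"), (19, "Bob"), (42, "Cy")]

-- B = FOREACH A GENERATE name;   (projection is a map)
b :: [String]
b = map snd a

-- C = FILTER A BY id < 20;       (selection is a filter)
c :: [Actor]
c = filter (\(actorId, _) -> actorId < 20) a

main :: IO ()
main = do
  print b  -- prints ["Ann","Bob","Cy"]
  print c  -- prints [(7,"Ann"),(19,"Bob")]
```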

Join: match the tuples of two relations on a common field.

  E = LOAD 'address.csv' USING PigStorage(',') AS (name:chararray, address:chararray);
  D = JOIN A BY name, E BY name;
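
The meaning of this inner join can be sketched as a list comprehension that keeps every pair of rows with equal names. This is a nested loop written for clarity, not the strategy Pig uses on a cluster, and the sample rows and the name joinByName are invented.

```haskell
type Actor   = (Int, String)     -- (id, name)
type Address = (String, String)  -- (name, address)

-- Invented sample rows.
actors :: [Actor]
actors = [(1, "Ann"), (2, "Bob"), (3, "Cy")]

addresses :: [Address]
addresses = [("Ann", "1 Main St"), ("Cy", "9 Elm Rd")]

-- Inner join on name: pair each actor with each address that has the same name.
-- Both name columns are kept, so each result row has four fields.
joinByName :: [Actor] -> [Address] -> [(Int, String, String, String)]
joinByName xs ys =
  [ (i, n, n', addr) | (i, n) <- xs, (n', addr) <- ys, n == n' ]

main :: IO ()
main = mapM_ print (joinByName actors addresses)
-- prints (1,"Ann","Ann","1 Main St") and then (3,"Cy","Cy","9 Elm Rd")
```

Bob has no matching address, so he does not appear in the result.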

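Grouping produces one tuple per distinct group-by value. Each tuple has a field named group holding that value and a bag, named after the grouped relation (here M), holding the matching tuples.

  M = FOREACH A GENERATE id % 3 AS mod, name;
  N = GROUP M BY (mod);
  X = FOREACH N GENERATE group, COUNT(M);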
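
Grouping by id % 3 and then counting follows the map/reduce pattern: emit a (key, row) pair for every row, collect the rows that share a key, then reduce each collection to its size. The sketch below assumes invented sample rows and uses Data.Map from the containers package; the names groupByMod and countGroups are made up.

```haskell
import qualified Data.Map.Strict as Map

type Actor = (Int, String)  -- (id, name)

-- Invented sample rows.
actors :: [Actor]
actors = [(1, "Ann"), (2, "Bob"), (3, "Cy"), (4, "Di")]

-- GROUP: one entry per key (id mod 3), each holding the list of matching rows.
-- flip (++) appends later rows after earlier ones, preserving input order.
groupByMod :: [Actor] -> Map.Map Int [Actor]
groupByMod rows =
  Map.fromListWith (flip (++)) [ (i `mod` 3, [(i, n)]) | (i, n) <- rows ]

-- COUNT: reduce each group's bag to its size.
countGroups :: Map.Map Int [Actor] -> [(Int, Int)]
countGroups groups = [ (k, length bag) | (k, bag) <- Map.toList groups ]

main :: IO ()
main = print (countGroups (groupByMod actors))
-- prints [(0,1),(1,2),(2,1)]
```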

E-mail questions or comments to Curtis dot Dyreson at usu dot edu