Cloud Computing Lecture Notes

The Aspect-Oriented Data Project

A cloud, for our purposes, is a shared-nothing, networked group of computers that we can use to run some computation in parallel on a massive dataset.

A typical application is search engine log analysis; see
www.google.com/trends

Example log file (such logs grow by terabytes per day):

  query?Volcno Bat
  query?Island palm tree
  images?volcano
  ...

Workflow to analyze the log:

1) Convert to lower case
2) Map to correct spelling
3) Aggregate and count

Need a dataflow language to manage each step. Assume each set of words is a list.

Initially, the words are
```
   ["Bat", "Volcno", "bat"]
```
1) Map to lower case
```
   ["bat", "volcno", "bat"]
```
2) Map to correct spelling
```
   ["bat", "volcano", "bat"]
```
3) Aggregate and count, producing (count, word) pairs

   [(1, "volcano"), (2, "bat")]
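The three steps can be sketched as ordinary list functions. This is only an illustration on a single machine, not how a cloud would run it: the names `corrections`, `lowerCase`, `correctSpelling`, `countWords`, and `analyze` are hypothetical, and the one-entry spelling table stands in for a real dictionary.

```haskell
import Data.Char (toLower)
import Data.List (sort)
import qualified Data.Map as Map

-- Hypothetical spelling table; a real system would use a far larger one.
corrections :: Map.Map String String
corrections = Map.fromList [("volcno", "volcano")]

-- Step 1: map every word to lower case.
lowerCase :: [String] -> [String]
lowerCase = map (map toLower)

-- Step 2: map known misspellings to their correction, leaving other words alone.
correctSpelling :: [String] -> [String]
correctSpelling = map (\w -> Map.findWithDefault w w corrections)

-- Step 3: aggregate and count, producing (count, word) pairs sorted by count.
countWords :: [String] -> [(Int, String)]
countWords ws = sort [ (n, w) | (w, n) <- Map.toList counts ]
  where
    counts = Map.fromListWith (+) [ (w, 1) | w <- ws ]

-- The whole workflow is the composition of the three steps.
analyze :: [String] -> [(Int, String)]
analyze = countWords . correctSpelling . lowerCase
```

Evaluating `analyze ["Bat", "Volcno", "bat"]` in GHCi gives `[(1,"volcano"),(2,"bat")]`.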

Map/Reduce

In functional programming, these steps correspond to "map" and "reduce", which are higher-order functions. A higher-order function takes a function as a parameter (or produces a function as a result). The classic example is map, which applies a function to every element in a list. The higher-order functions map and reduce (fold) are built into functional languages like Haskell, where reduce goes by the names foldl and foldr.

   map (\x -> x * x) [1, 2, 3, 4]

would result in

   [1, 4, 9, 16]

   reduce (*) 1 [1, 2, 3, 4]

would result in

   24

which is

     1 * 1 * 2 * 3 * 4

So the factorial can also be defined as

   factorial n = reduce (*) 1 [1..n]
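Haskell's Prelude has no function named reduce; the fold functions play that role. To try the examples above in GHCi, one option (an assumption here, since the notes do not say which fold is meant) is to define `reduce` as a left fold. The names `squares` and `product4` are hypothetical labels for the two example expressions.

```haskell
-- Assumption: reduce is taken to be a left fold over a list.
reduce :: (b -> a -> b) -> b -> [a] -> b
reduce = foldl

-- The map example: square every element.
squares :: [Integer]
squares = map (\x -> x * x) [1, 2, 3, 4]

-- The reduce example: multiply all elements, starting from 1.
product4 :: Integer
product4 = reduce (*) 1 [1, 2, 3, 4]

-- Factorial as a reduction over [1..n]; an empty range yields the start value 1.
factorial :: Integer -> Integer
factorial n = reduce (*) 1 [1 .. n]
```

Because multiplication is associative, a right fold (`foldr`) would give the same answers for these examples.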

Of course, if we lacked a specific function, we could always create it. For instance, here is an implementation of the map function.

    mapImpl f [] = []
    mapImpl f (x:xs) = f x : mapImpl f xs
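Reduce can be written by hand in the same recursive style. The sketch below is a left-to-right version; the name `reduceImpl` is hypothetical, chosen to mirror mapImpl.

```haskell
-- Hand-written reduce: combine the elements left to right,
-- carrying the running result in the accumulator.
reduceImpl :: (b -> a -> b) -> b -> [a] -> b
reduceImpl _ acc []       = acc                        -- nothing left: return the accumulator
reduceImpl f acc (x : xs) = reduceImpl f (f acc x) xs  -- fold x in, then continue
```

For instance, `reduceImpl (*) 1 [1, 2, 3, 4]` steps through the accumulator values 1, 1, 2, 6, and finally 24.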

Pig Latin

Pig is a dataflow language, built on top of a map/reduce architecture (Hadoop).

Kinds of objects

a relation is a bag
a bag is a collection of tuples (duplicates are allowed)
a tuple is an ordered list of fields
a field is a piece of data

An alias is a name bound to an object.
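One way to picture this nesting is as Haskell types. This is a rough model for intuition only, not Pig's actual implementation: the type and constructor names are hypothetical, only two field types are modeled, and the sample rows are made up.

```haskell
-- A field is a single piece of data (only two kinds are modeled here).
data Field = IntField Int | CharArray String
  deriving (Eq, Show)

-- A tuple is a list of fields.
type Tuple = [Field]

-- A bag holds tuples; a list is used so that duplicates are allowed.
type Bag = [Tuple]

-- A relation is a bag.
type Relation = Bag

-- An alias is then just a name bound to a relation (made-up sample rows).
actors :: Relation
actors =
  [ [IntField 1, CharArray "Alice"]
  , [IntField 2, CharArray "Bob"]
  ]
```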

The following loads some data from a comma-separated file and binds it to the alias A.

  A = LOAD 'actor.csv' USING PigStorage(',') AS (id:int, name:chararray);

To look at the data.

  DUMP A;

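To store the data (STORE needs an INTO clause naming the output location; 'actor_out' is a placeholder).

  STORE A INTO 'actor_out' USING PigStorage(',');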

Projection, iterate over the tuples and keep only the chosen column.

  B = FOREACH A GENERATE name;

Selection, use a filter.

  C = FILTER A BY id < 20;
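Projection and selection line up with the map and filter functions from the functional programming section. The sketch below mimics both statements on an in-memory list; the names `relationA`, `relationB`, `relationC` and the sample rows are hypothetical stand-ins for the Pig aliases and the contents of actor.csv.

```haskell
-- Hypothetical in-memory stand-in for relation A: (id, name) rows.
type Actor = (Int, String)

relationA :: [Actor]
relationA = [(5, "Alice"), (25, "Bob"), (12, "Carol")]

-- B = FOREACH A GENERATE name;   (projection is a map)
relationB :: [String]
relationB = map snd relationA

-- C = FILTER A BY id < 20;       (selection is a filter)
relationC :: [Actor]
relationC = filter (\(actorId, _) -> actorId < 20) relationA
```

With these rows, `relationB` is `["Alice","Bob","Carol"]` and `relationC` is `[(5,"Alice"),(12,"Carol")]`.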

Join

  E = LOAD 'address.csv' USING PigStorage(',') AS (name:chararray, address:chararray);
  D = JOIN A BY name, E BY name;
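The meaning of the join can be sketched as a list comprehension that pairs every actor row with every address row sharing its name. The function name `joinByName` is hypothetical, and the nested loop only describes the result; Pig itself evaluates the join as a map/reduce job.

```haskell
-- Inner join on name. As in Pig, the result keeps the fields of both
-- inputs, so the name appears twice in each output tuple.
joinByName :: [(Int, String)] -> [(String, String)] -> [(Int, String, String, String)]
joinByName actors addresses =
  [ (actorId, actorName, addressName, address)
  | (actorId, actorName)   <- actors
  , (addressName, address) <- addresses
  , actorName == addressName
  ]
```

For example, joining `[(1, "Alice"), (2, "Bob")]` with `[("Alice", "1 Main St"), ("Carol", "2 Oak Ave")]` (made-up rows) yields `[(1,"Alice","Alice","1 Main St")]`; rows without a partner are dropped.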

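Grouping creates one tuple per group-by value. Each tuple has two fields: the key, named group, and a bag, named after the grouped alias (M here), holding the tuples with that key.

  M = FOREACH A GENERATE id % 3 AS mod, name;
  N = GROUP M BY mod;
  X = FOREACH N GENERATE group, COUNT(M);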
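The group-then-count pattern can be sketched with a map from key to bag. The function names `groupByMod` and `countPerGroup` are hypothetical, and ids are assumed to be non-negative, since Haskell's `mod` and Pig's `%` can disagree on negative numbers.

```haskell
import qualified Data.Map as Map

-- Group the names by id mod 3. Each key maps to the "bag" of names
-- in that group, kept in input order.
groupByMod :: [(Int, String)] -> Map.Map Int [String]
groupByMod rows =
  Map.fromListWith (flip (++))
    [ (actorId `mod` 3, [name]) | (actorId, name) <- rows ]

-- Count the members of each group, giving (key, count) pairs in key order.
countPerGroup :: [(Int, String)] -> [(Int, Int)]
countPerGroup = Map.toList . Map.map length . groupByMod
```

With the made-up rows `[(1, "Alice"), (2, "Bob"), (4, "Carol"), (6, "Dave")]`, `countPerGroup` gives `[(0,1),(1,2),(2,1)]`.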

E-mail questions or comments to Curtis dot Dyreson at usu dot edu