The Aspect-Oriented Data Project 
 
Due date: .
Instructions: Upload your homework using the homework turnin page. Turn in a Pig script called script.pig that contains the Pig code to perform the queries listed later in this assignment.

Pig

The following links have information on Pig.

Download the latest release of Pig. Pig runs on Linux/Unix or Windows (with Cygwin). Unpack the downloaded code. Since we don't have access to a cloud, we will use a "local" execution of Pig. No special configuration is needed, though Java 1.6.x or newer must be installed.

Running Pig

The following description assumes you are running Pig on a Linux system. First, be sure your JAVA_HOME environment variable is set. For instance, in bash you would use
   printenv | grep JAVA
to see the environment variables that start with JAVA. If you don't see JAVA_HOME, define it. (The commands below are specific to bash and to where Java is installed; modify them for your system. Note that bash does not allow spaces around the "=" in an assignment.)
   JAVA_HOME=/usr/java/latest
   export JAVA_HOME
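
The check-and-define step above can be combined into one snippet. This is a minimal sketch, assuming bash and the example path /usr/java/latest used on this page; substitute the location of your own Java installation.

```shell
# Define JAVA_HOME only if it is not already set.
# /usr/java/latest is the example path from this page, not a required location.
if [ -z "${JAVA_HOME:-}" ]; then
    JAVA_HOME=/usr/java/latest
    export JAVA_HOME
fi

# Warn (without stopping) if no java executable is found under JAVA_HOME.
if [ ! -x "$JAVA_HOME/bin/java" ]; then
    echo "warning: no java executable under $JAVA_HOME" >&2
fi

echo "JAVA_HOME is $JAVA_HOME"
```

If the warning appears, JAVA_HOME most likely points at the wrong directory for your system.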
Next "cd" to the Pig directory. Then you can run the Pig shell with the following command.
   bin/pig -x local
You should see the following command prompt.
   grunt>
You are now ready to execute Pig statements.

Note that for this assignment you will put all of your commands in a single file, which can be run as follows (collecting the output in foo.txt).

   bin/pig -x local <script.pig >foo.txt
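
When a script fails, it can help to keep Pig's console messages apart from the query output. The following is a sketch only: it assumes you are in the Pig directory, that Pig writes its console messages to standard error, and it uses pig.log as a hypothetical file name chosen for this example.

```shell
# Run script.pig in local mode, keeping query output and messages apart.
# pig.log is a hypothetical file name used only in this sketch.
if [ -x bin/pig ] && [ -f script.pig ]; then
    # DUMP results go to foo.txt; console messages go to pig.log.
    bin/pig -x local < script.pig > foo.txt 2> pig.log
    status=$?
else
    echo "Run this from the Pig directory, with script.pig present." >&2
    status=1
fi

echo "exit status: $status"
```

A nonzero exit status suggests the script did not run to completion; in that case, look in pig.log for the error message.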

Loading the data

The three comma-separated data files to use for this assignment are actor.csv, movie.csv, and casting.csv. The schema (column ordering) for each file is listed below.
Movie (MovieID, Title, Year, Score, Votes)
Actor (ActorID, Name)
Casting (MovieID, ActorID, Ordinal)
In the Movie table, the Score is a measure of a movie's popularity as voted on by internet users. Votes is the number of votes cast. The Casting table relates actors with the movies in which they are cast. The Ordinal column is the cast 'billing order', e.g., the star actor in a movie has an ordinal of 1, the second leading star has an ordinal of 2, a bit player would have an ordinal of 80 (assuming at least 80 actors were in the movie).

The following Pig statement loads the actor data.

  ACTOR = LOAD 'actor.csv' USING PigStorage(',') AS (actorid:int, name:chararray);
It binds the data to the alias ACTOR.

Creating script.pig

Edit the file script.pig to create a single Pig script that does the following (in sequence).
  1. Load the data from the three csv files.
  2. Join the data.
  3. Dump the titles of movies that have a score higher than 8. Below is example output.
    ("2001: A Space Odyssey")
    ("Alien")
    ("Aliens")
    ("Apocalypse Now")
    ("Blade Runner")
    ("Braveheart")
    ("Casablanca")
    ("Clockwork Orange")
    ("Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb")
    ("Fargo")
    ("Monty Python and the Holy Grail")
    ("One Flew Over the Cuckoo's Nest")
    ("Pulp Fiction")
    ("Raiders of the Lost Ark")
    ("Reservoir Dogs")
    ("Se7en")
    ("Star Trek: First Contact")
    ("Star Wars")
    ("Terminator 2: 3-D")
    ("The Empire Strikes Back")
    ("The English Patient")
    ("The Godfather")
    ("The Princess Bride")
    ("The Shawshank Redemption")
    ("The Silence of the Lambs")
    ("The Usual Suspects")
    ("Titanic")
    ("Trainspotting")
    
  4. Dump the title and year of movies made in even years in the 1980s.
    ("Blade Runner",1982)
    ("The Empire Strikes Back",1980)
    ("Aliens",1986)
    ("The Terminator",1984)
    ("Twins",1988)
    ("Conan the Destroyer",1984)
    ("The Money Pit",1986)
    ("Red Heat",1988)
    ("Night Shift",1982)
    ("Raw Deal",1986)
    
  5. Dump the top 5 highest vote-getting movies (use ORDER and LIMIT).
    ("Star Wars",14182)
    ("Pulp Fiction",11693)
    ("Blade Runner",8665)
    ("Titanic",8129)
    ("Braveheart",8074)
    
  6. Dump the names of the actors that have been cast in more than three movies.
    ("Arnold Schwarzenegger")
    ("Harrison Ford")
    ("Shelley Long")
    
  7. For the actors that have been cast in more than three movies, dump their name and their average score.
    ("Arnold Schwarzenegger",6.390000033378601)
    ("Harrison Ford",8.539999961853027)
    ("Shelley Long",6.200000047683716)
    
Be sure to add comments to your file to describe each step, for example:
/* Load the Actor data */
ACTOR = LOAD 'actor.csv' USING PigStorage(',') AS (actorid:int, name:chararray);
/* Load the Movie data */
...
                                                                                                                                     
 
E-mail questions or comments to Curtis dot Dyreson at usu dot edu