Due date: .
Instructions: Upload your homework using the
homework turnin page.
Turn in a Pig script called script.pig that contains
the Pig code to perform the queries listed later in this assignment.
Pig
The following links have information on Pig.
Download the
latest release of Pig.
Pig runs on Linux/Unix or Windows (with Cygwin).
Unpack the downloaded code.
Since we don't have access to a cloud, we will use a "local" execution of
Pig. No special configuration is needed, though Java 1.6.x or newer
must be installed.
Running Pig
The following execution description assumes you are running Pig on a
Linux system.
First, be sure your JAVA_HOME environment variable is set. For instance,
in bash you would use the following command
printenv | grep JAVA
to see the environment variables that start with JAVA. If you don't see
JAVA_HOME, then define it. (The commands below are specific to bash and to
where Java is installed; modify them for your system.)
JAVA_HOME=/usr/java/latest
export JAVA_HOME
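The check and the definition above can also be combined into one step. The following is a minimal sketch, assuming a POSIX-compatible shell such as bash; the path /usr/java/latest is simply the example location used above and may differ on your system.

```shell
# Define JAVA_HOME only if it is not already set.
# Note: a shell assignment must not have spaces around "=".
if [ -z "${JAVA_HOME:-}" ]; then
    # Assumed install location (taken from the example above); adjust as needed.
    JAVA_HOME=/usr/java/latest
fi
export JAVA_HOME

# Show the value that Pig will see.
echo "JAVA_HOME=$JAVA_HOME"
```

Because the assignment is guarded, an existing JAVA_HOME is left untouched.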
Next "cd" to the Pig directory. Then you can run the Pig shell with the
following command.
bin/pig -x local
You should see the following command prompt.
grunt>
You are now ready to execute Pig statements.
Note that for this assignment you will put all of your commands in a
single file which can be run as follows (collecting output
in foo.txt).
bin/pig -x local <script.pig >foo.txt
Loading the data
The three comma-separated data files to use
for this assignment are
actor.csv ,
movie.csv , and
casting.csv .
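Before running Pig, it can help to confirm that the three files are where the script expects them. The following is a small sketch, not part of the assignment: check_inputs is a hypothetical helper name, and it assumes the files sit in the directory you pass to it (the current directory by default).

```shell
# Hypothetical helper: confirm that the three input files are readable
# in the given directory (default: the current directory).
check_inputs() {
    dir="${1:-.}"
    status=0
    for f in actor.csv movie.csv casting.csv; do
        if [ -r "$dir/$f" ]; then
            # Print the first record so its column order can be compared
            # with the schema for that file.
            echo "$f: $(head -n 1 "$dir/$f")"
        else
            echo "$f: not found in $dir" >&2
            status=1
        fi
    done
    return "$status"
}
```

Calling check_inputs with no argument checks the current directory; it returns a non-zero status if any of the three files is missing.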
The schema (column ordering) for each file is listed below.
Movie (MovieID, Title, Year, Score, Votes)
Actor (ActorID, Name)
Casting (MovieID, ActorID, Ordinal)
In the Movie table, the Score is a measure of a movie's popularity
as voted on by internet users. Votes is the number of votes cast.
The Casting table relates actors with the movies in which they are cast.
The Ordinal column is the cast 'billing order', e.g., the star actor in
a movie has an ordinal of 1, the second leading star has an ordinal
of 2, a bit player would have an ordinal of 80 (assuming at least
80 actors were in the movie).
The following Pig statement loads the actor data.
ACTOR = LOAD 'actor.csv' USING PigStorage(',') AS (actorid:int, name:chararray);
It binds the data to the alias ACTOR .
Creating script.pig
Edit the file script.pig to create a single Pig script that
does the following (in sequence).
- Load the data from the three csv files.
- Join the data.
- Dump the titles of movies that have a score higher than 8.
Below is example output.
("2001: A Space Odyssey")
("Alien")
("Aliens")
("Apocalypse Now")
("Blade Runner")
("Braveheart")
("Casablanca")
("Clockwork Orange)
("Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb")
("Fargo")
("Monty Python and the Holy Grail")
("One Flew Over the Cuckoo's Nest")
("Pulp Fiction")
("Raiders of the Lost Ark")
("Reservoir Dogs")
("Se7en")
("Star Trek: First Contact")
("Star Wars")
("Terminator 2: 3-D")
("The Empire Strikes Back")
("The English Patient")
("The Godfather")
("The Princess Bride")
("The Shawshank Redemption")
("The Silence of the Lambs")
("The Usual Suspects")
("Titanic")
("Trainspotting")
- Dump the title and year of movies made in even years in the 1980s.
("Blade Runner",1982)
("The Empire Strikes Back",1980)
("Aliens",1986)
("The Terminator",1984)
("Twins",1988)
("Conan the Destroyer",1984)
("The Money Pit",1986)
("Red Heat",1988)
("Night Shift",1982)
("Raw Deal",1986)
- Dump the top 5 highest vote-getting movies (use ORDER and LIMIT).
("Star Wars",14182)
("Pulp Fiction",11693)
("Blade Runner",8665)
("Titanic",8129)
("Braveheart",8074)
- Dump the names of the actors that have been cast in more than
  three movies.
("Arnold Schwarzenegger")
("Harrison Ford")
("Shelley Long")
- For the actors that have been cast in more than three movies, dump
  their name and the average score of the movies in which they were cast.
("Arnold Schwarzenegger",6.390000033378601)
("Harrison Ford",8.539999961853027)
("Shelley Long",6.200000047683716)
Be sure to add comments to your file to illustrate each step, for
example.
/* Load the Actor data */
ACTOR = LOAD 'actor.csv' USING PigStorage(',') AS (actorid:int, name:chararray);
/* Load the Movie data */
...