Homework #7 (counts as the mid-term exam)

PURPOSE

Replicate the empirical results that demonstrate Zipf's law.

This homework is adapted from Exercise 1.2 in Foundations of Statistical Natural Language Processing (FSNLP).
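
For reference, Zipf's law is commonly stated as follows, where f is the frequency of a word, r is its rank when words are sorted by decreasing frequency, and k is a constant that depends on the corpus (the symbol names are a notational choice made here). Under this statement, a plot of log f against log r should fall roughly on a straight line with slope -1.

```latex
f \propto \frac{1}{r}
\quad\Longleftrightarrow\quad
f \cdot r = k
\quad\Longrightarrow\quad
\log f = \log k - \log r
```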

DATA SOURCE

While FSNLP demonstrates Zipf's law using Tom Sawyer, you are required to replicate similar results on Chinese text selected from the Academia Sinica Balanced Corpus (ASBC), v3.0.

PROCEDURE

  1. Look carefully into the format of ASBC text. (If you're using UltraEdit, you may use the [Edit/Hex Edit] function to help you dig into the binary format.)
  2. Write a Perl program to extract the words and count the occurrences of each one.
  3. Copy/Paste the results of (2) into Excel, and sort them.
  4. Draw statistical charts in Excel.
  5. Review what you've done.
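
The counting and ranking logic behind steps 2 to 4 can be sketched as follows. This is an illustration only, written in Python rather than the Perl required for submission, and it rests on two assumptions that step 1 must confirm against the real files: that the text has already been decoded to Unicode, and that each token has the hypothetical form word(POS) separated by whitespace. The function names are made up for this sketch.

```python
import re
from collections import Counter

# Assumption: each token looks like "word(POS)", e.g. "我(Nh)".
# Verify this against the real ASBC files before relying on it.
TOKEN_RE = re.compile(r"^(.+)\(([^()]+)\)$")


def count_words(lines):
    """Count how often each word occurs, ignoring the part-of-speech tag."""
    counts = Counter()
    for line in lines:
        # str.split() with no argument also splits on full-width spaces.
        for token in line.split():
            match = TOKEN_RE.match(token)
            if match:
                counts[match.group(1)] += 1
    return counts


def rank_table(counts):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    # Ties in frequency are broken by the word itself so the order is stable.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(ordered, start=1)]


def to_tsv(rows):
    """Format rows as tab-separated text that can be pasted into a spreadsheet."""
    header = "rank\tword\tfrequency\trank_x_frequency"
    body = ["\t".join(str(field) for field in row) for row in rows]
    return "\n".join([header] + body)


if __name__ == "__main__":
    # Tiny made-up sample in the assumed format, not real corpus data.
    sample = [
        "我(Nh) 喜歡(VK) 書(Na)",
        "我(Nh) 讀(VC) 書(Na)",
        "我(Nh) 來(VA)",
    ]
    print(to_tsv(rank_table(count_words(sample))))
```

The rank_x_frequency column corresponds to the product f · r, which Zipf's law predicts to be roughly constant. Plotting frequency against rank on logarithmic axes is one way to check this visually in step 4.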

SCORING

Please hand in the following documents (compressed into a single archive if possible):

  1. 40%    Perl program.
  2. 35%    Excel file with data and charts in it.
  3. 25%    A short, concise document describing the difficulties you encountered, the interesting ideas you learned, etc. while working on this task.