Homework #4 (counts as the mid-term exam)

PURPOSE

Replicate the results of Zipf's Law.

The homework is adapted from Exercise 1.2 in Foundations of Statistical Natural Language Processing (FSNLP).
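
For reference, Zipf's law relates a word's frequency f to its rank r in the list of words sorted by descending frequency. The constant k depends on the corpus. The logarithmic form shows why a log-log chart of frequency against rank should come out as roughly a straight line with slope -1:

```latex
f \propto \frac{1}{r}
\quad\Longleftrightarrow\quad
f \cdot r = k
\quad\Longrightarrow\quad
\log f = \log k - \log r
```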

DATA SOURCE

FSNLP derives its results from Tom Sawyer. You are required to replicate similar results on Chinese text selected from the Academia Sinica Balanced Corpus (ASBC), v3.0.

PROCEDURE

  1. Examine the format of the ASBC text carefully. (If you use UltraEdit, the [Edit/Hex Edit] function can help you inspect the raw bytes.)
  2. Write a Perl program that extracts each word and counts its occurrences.
  3. Copy and paste the results of step 2 into Excel, and sort them by frequency in descending order.
  4. Draw charts in Excel.
  5. Review what you've done.
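
The counting and ranking in steps 2 and 3 can be sketched as follows. This is only an illustration of the logic, written in Python; the program you hand in still has to be in Perl. The token layout "word(TAG)" separated by whitespace is an assumption made for this sketch, not a description of the real ASBC format, which you need to work out yourself in step 1. The names TOKEN_RE, count_words and rank_table, as well as the sample sentence and its tags, are made up for the example.

```python
import re
from collections import Counter

# ASSUMPTION: every token looks like "word(TAG)" and tokens are separated by
# whitespace. This layout is hypothetical; verify it against the real corpus
# files (and their character encoding) before relying on it.
TOKEN_RE = re.compile(r"(\S+?)\(([^()\s]+)\)")


def count_words(text):
    """Return a Counter mapping each word to its number of occurrences."""
    return Counter(word for word, _tag in TOKEN_RE.findall(text))


def rank_table(counts):
    """Return (rank, word, freq, freq * rank) rows, most frequent first.

    Ties in frequency are broken by the word itself so the order is stable.
    """
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(r, w, f, f * r) for r, (w, f) in enumerate(ordered, start=1)]


if __name__ == "__main__":
    # Made-up sample text in the assumed layout, not taken from ASBC.
    sample = "我(Nh) 喜歡(VK) 書(Na) 我(Nh) 讀(VC) 書(Na) 我(Nh)"
    for rank, word, freq, product in rank_table(count_words(sample)):
        # Tab-separated columns paste directly into spreadsheet cells.
        print(f"{rank}\t{word}\t{freq}\t{product}")
```

The last column, frequency times rank, is the quantity that Zipf's law predicts to be roughly constant, so it is worth carrying into Excel alongside the raw counts.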

SCORING

Please hand in the following documents (and compress them into a single archive if possible):

  1. 40%    Perl program.
  2. 35%    Excel file with data and charts in it.
  3. 25%    A short, concise plain-text document describing the difficulties you encountered, the interesting ideas you learned, and so on while working on this task.