Homework #7 (as mid-term exam)
PURPOSE
Replicate the results of Zipf's Law.
The homework is adapted from Exercise 1.2 in
Foundations of Statistical Natural Language Processing
(FSNLP).
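For reference, the relation to be replicated can be written as follows. Here f is the frequency of a word, r is its rank when words are sorted by decreasing frequency, and k is a corpus-dependent constant; the logarithmic form is only a restatement, which suggests that a log-log chart of frequency against rank should look roughly like a straight line with slope close to -1.

```latex
f \propto \frac{1}{r}
\quad\Longleftrightarrow\quad
f \cdot r = k
\quad\Longrightarrow\quad
\log f = \log k - \log r
```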
DATA SOURCE
While FSNLP demonstrated Zipf's Law using Tom Sawyer,
you are required to replicate similar results on
Chinese text
selected from the Academia Sinica Balanced Corpus (ASBC), v3.0.
PROCEDURE
- Look carefully into the format of ASBC text.
(If you're using UltraEdit, you may use the [Edit/Hex Edit]
function to help you dig into the binary format.)
- Write a Perl program to extract each word and count
its occurrences.
- Copy and paste the word counts from the previous step into Excel, and sort them.
- Draw statistical charts in Excel.
- Review what you've done.
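The extract, count, and sort steps above can be sketched as follows. This is an illustration of the logic only, written in Python rather than the Perl the assignment asks for. The token layout `word(TAG)` separated by spaces is an assumption made for this sketch and may not match the actual ASBC file format, which you still need to inspect yourself; the sample sentence and the names `count_words` and `rank_table` are likewise hypothetical.

```python
import re
from collections import Counter

# ASSUMPTION: each token looks like "word(TAG)". The real ASBC layout
# (encoding, separators, tag syntax) may differ and must be checked.
TOKEN = re.compile(r"(\S+?)\((\w+)\)")

def count_words(text):
    """Count how often each word occurs, ignoring the tag in parentheses."""
    return Counter(word for word, _tag in TOKEN.findall(text))

def rank_table(counts):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    # Ties in frequency are broken by the word itself so the order is stable.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(r, w, f, r * f) for r, (w, f) in enumerate(ordered, start=1)]

# Hypothetical sample text, not taken from the corpus.
sample = "我(Nh) 喜歡(VK) 書(Na) 我(Nh) 買(VC) 書(Na) 我(Nh)"
rows = rank_table(count_words(sample))

# Tab-separated output pastes directly into spreadsheet columns.
tsv = "\n".join("\t".join(str(x) for x in row) for row in rows)
print(tsv)
```

The last column, rank times frequency, is the quantity that Zipf's Law predicts to be roughly constant on a large corpus; on a seven-token sample like this one it shows only the shape of the table, not the law itself.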
SCORING
Please hand in the following documents
(and compress them into a single archive if possible):
- 40% Perl program.
- 35% Excel file with data and charts in it.
- 25% A short, concise document
describing
the difficulties you encountered,
the interesting ideas you learned, etc.
while carrying out this task.