Google Releases n-gram Data
Thursday, January 4, 2007, 2:24 pm. Category: Computers
Tags: Natural Language Processing
I came across an old piece of news on the Google Research Blog: “All Our N-gram are Belong to You”. Here are the figures for this n-gram data set:
- Number of tokens: 1,024,908,267,229
- Number of sentences: 95,119,665,584
- Number of unigrams: 13,588,391
- Number of bigrams: 314,843,401
- Number of trigrams: 977,069,902
- Number of 4-grams: 1,313,818,354
- Number of 5-grams: 1,176,470,663
Wow. Even after compression the data comes to a full 24 GB; no wonder it takes 6 DVDs to hold it.
Their stated reason for releasing the data:
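To make the "number of unigrams / bigrams / ..." figures above concrete, here is a minimal sketch of how distinct n-grams of order 1 through 5 can be counted. The function name `count_ngrams`, the whitespace tokenization, and the two-sentence toy corpus are all assumptions for illustration; nothing here reflects how Google actually built the data set.

```python
from collections import Counter


def count_ngrams(sentences, max_n=5):
    """Count n-grams of order 1..max_n over pre-tokenized sentences.

    `sentences` is an iterable of token lists. n-grams never cross
    sentence boundaries in this sketch.
    """
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for tokens in sentences:
        for n in range(1, max_n + 1):
            # Slide a window of width n across the sentence.
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts


# Toy corpus (hypothetical), tokenized by splitting on whitespace.
corpus = [
    "all our n-gram are belong to you".split(),
    "all our data are belong to you".split(),
]

counts = count_ngrams(corpus)
for n, table in counts.items():
    # Number of distinct n-grams of each order, analogous to the list above.
    print(f"{n}-grams: {len(table)} distinct")
```

In this sketch the number of distinct n-grams does not keep growing with the order, because short sentences contain fewer long windows. At web scale, a table of every observed n-gram would be enormous, which fits the 24 GB compressed size noted above.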
We believe that the entire research community can benefit from access to such massive amounts of data. It will advance the state of the art, it will focus research in the promising direction of large-scale, data-driven approaches, and it will allow all research groups, no matter how large or small their computing resources, to play together.
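Hmm. The Chinese information-processing community has been working at this for years too, so why have we never seen a comparable data release?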



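February 12, 2007 at 12:37 am
Google opens up data from its analysis of one trillion words
A few days ago I wrote that if a robot roamed the web learning text, the possibilities might be endless. Only today did I read, via William’s Blog, an old piece of news, "Google Releases n-gram Data": Google took the data it had gathered, and the results of the n-gram analysis: [...]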