# WorldLex #

WorldLex provides word frequency tables for 64 languages, estimated from web pages (blogs, Twitter, and newspapers).

The web page corpora were assembled by Hans Christensen and are available at [HC-Corpora](http://corpora.epizy.com/index.html). According to that website:

> The corpora are collected from publicly available sources by a web crawler. The crawler checks for language, so as to mainly get texts consisting of the desired language.
> Once the raw corpus has been collected, it is parsed further, to remove duplicate entries and split into individual lines. Approximately 50% of each entry is then deleted. Since you cannot fully recreate any entries, the entries are anonymised and this is a non-profit venture I believe that it would fall under Fair Use.

The frequency tables were created by [Manuel Gimenes](https://sites.google.com/site/manuelgimeneshomepage/) and [Boris New](http://psycho-usmb.fr/boris.new/).

**Website:** <http://worldlex.lexique.org>

**Publication:**

Gimenes, Manuel, and Boris New. 2016. [Worldlex: Twitter and Blog Word Frequencies for 66 Languages.](https://drive.google.com/file/d/0B-sE9ac1ksCANWFVN3ZacHFWQ0k/view) _Behavior Research Methods_ 48 (3): 963–72. <https://doi.org/10.3758/s13428-015-0621-0>


----

Time-stamp: <2019-10-05 09:39:10 christophe@pallier.org>


