# Open lexical databases #

You will find below a directory of open lexical databases. Click on the name of any database to obtain more information and links to datasets.

## French ##

| Base | Description |
|------|-------------|
| [Lexique3](Lexique382/README-Lexique.md) | _Lexique3_ is a French lexical database that provides, for ~140000 French words: orthographic and phonemic representations, associated lemmas, syllabification, grammatical category, gender and number, frequencies in a corpus of books and in a corpus of film subtitles, etc. |
| [Anagrammes](anagrammes/README-anagrammes.md) | _Anagrammes_ lists more than 25000 sets of French anagrams. |
| [Voisins](Voisins/README-Voisins.md) | _Voisins_ lists the orthographic neighbors obtained by substituting one letter, for 130000 French words. |
| [French Lexicon Project](FrenchLexiconProject/README-FrenchLexiconProject.md) | The _French Lexicon Project_ (FLP) was inspired by the _English Lexicon Project_ (Balota et al. 2007). It provides visual lexical decision times for about 39000 French words and as many pseudowords. The full data set represents 1942000 reaction times from 975 participants. |
| [Megalex](Megalex/README-Megalex.md) | _Megalex_ provides visual and auditory lexical decision times and accuracy rates for several thousand words: visual lexical decision data are available for 28466 French words and the same number of pseudowords, and auditory lexical decision data are available for 17876 French words and the same number of pseudowords. |
| [Chronolex](Chronolex/README-Chronolex.md) | _Chronolex_ provides naming times, lexical decision times and progressive demasking scores for most monosyllabic monomorphemic French words (about 1500 items). Thirty-seven participants were tested in the naming task, 35 additional participants in the lexical decision task, and 33 additional participants in the progressive demasking task. |
| [Brulex](Brulex/README-Brulex.md) | _Brulex_ gives, for about 36,000 French words, the spelling, pronunciation, grammatical class, gender, number and frequency of use. It also contains other information useful for selecting experimental materials (notably uniqueness point, lexical neighbor counts, phonological patterns, and mean digram frequency). |
| [Gougenheim100](Gougenheim100/README-Gougenheim.md) | _Gougenheim100_ gives, for 1064 words, their frequency and their dispersion (the number of texts in which they appear). It is based on a corpus of spoken language drawn from a set of interviews with 275 people. It is therefore a corpus not only of spoken language but also of produced language. The original corpus comprises 163 texts, 312,135 words and 7,995 distinct lemmas. |
| [Chacqfam](chacqfam/README-Chacqfam.md) | CHACQFAM is a database providing the estimated age of acquisition and the familiarity of 1225 French words. |
| [Frantext](Frantext/README-Frantext.md) | _Frantext_ provides the list of all orthographic types obtained after tokenization of the Frantext subcorpus used to compute the "books" frequencies in Lexique. |
| [francais-GUTenberg](Liste-de-mots-francais-Gutenberg/README-liste-francais-Gutenberg.md) | List of 336531 French words obtained from the Français-GUTenberg ispell dictionary. |
| [Morphalou](Morphalou/README-Morphalou.md) | Wide-coverage lexicon of modern French, comprising 159,271 lemmas and 976,570 inflected forms. |
| [Morpholex-fr](Morpholex-fr/README-Morpholex-fr.md) | Lexical database for ~38k French words with morphological variables. |
| [Fr- Familiary660](Robert-Dorot-Mathey/README-RobertDorotMathey2012.md) | Familiarity ratings for 660 words, estimated by young adults and older adults. |

## English (American and British) ##

| Base | Description |
|------|-------------|
| [SUBTLEX-US](SUBTLEX-US/README-SUBTLEXus.md) | _SUBTLEXus_ (Brysbaert, New & Keuleers, 2012) provides two frequency measures based on American movie subtitles (51 million words in total): (a) the frequency per million words, called SUBTLEXWF (word form frequency), and (b) the percentage of films in which a word occurs, called SUBTLEXCD (contextual diversity). |
| [British Lexicon Project](BritishLexiconProject/README-BritishLexiconProject) | The British Lexicon Project (Keuleers et al., 2012) contains lexical decision data for over 28,000 monosyllabic and disyllabic English words. |
| [English Lexicon Project](EnglishLexiconProject/README-ELP.md) | The English Lexicon Project provides a standardized behavioral and descriptive data set for 40,481 words and 40,481 nonwords. Data from 816 participants across six universities were collected in a lexical decision task (approximately 3400 responses per participant), and data from 444 participants were collected in a speeded naming task (approximately 2500 responses per participant). |
| [Morpholex-en](Morpholex-en/README-Morpholex-en.md) | Lexical database for ~70k English words with morphological variables. |

## Chinese ##

| Base | Description |
|------|-------------|
| [SUBTLEX-CH](SUBTLEX-CH/README-subtlex-ch.md) | _SUBTLEX-CH_ (Cai & Brysbaert 2010) is a database of Chinese word and character frequencies based on a corpus of film and television subtitles (46.8 million characters, 33.5 million words). |

## Multilingual ##

| Base | Description |
|------|-------------|
| [WorldLex](WorldLex/README-Worldlex.md) | Worldlex provides word frequencies estimated from web pages collected in 66 languages. |
| [AoA-32lang](AoA-32lang/README-AoA-32lang.md) | AoA-32lang presents a set of subjective Age of Acquisition (AoA) ratings for 299 words (158 nouns, 141 verbs) in 32 languages. |

## Usage ##

* Most datasets are provided in the form of `.tsv` or `.csv` files (tab-separated values or comma-separated values). These are plain-text files that can easily be imported into R, MATLAB or Python, or even [opened with Excel](https://rievent.zendesk.com/hc/en-us/articles/360000029172-FAQ-How-do-I-open-a-tsv-file-in-Excel-). Check out our [script examples](../scripts/README.md).
* Many of these databases can also be explored or queried on-line at , thanks to shiny apps from the [openlexicon project](http://github.com/chrplr/openlexicon).
* Most databases have associated publications listed in their respective `README` files. They must be cited in any derivative work!

## Similar lists or resources ##

- Marc Brysbaert's web site at
- Meiryum Al's [Best 25 Datasets for Natural Language Processing](https://gengo.ai/datasets/the-best-25-datasets-for-natural-language-processing/)

## Contributing ##

If you want to contribute, check out the [OpenLexicon project](http://chrplr.github.io/openlexicon).

Time-stamp: <2019-05-01 11:24:52 christophe@pallier.org>
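
## Example: reading a `.tsv` file in Python ##

The Usage section notes that the `.tsv` and `.csv` files can be imported into Python. The following is a minimal sketch of that step using only the standard library. The helper name `read_delimited`, the column names `word` and `freq`, and the two data rows are made up for illustration; they do not come from any of the databases listed here, so check each database's `README` for its real column names.

```python
import csv
import io


def read_delimited(source, delimiter="\t"):
    """Read a delimited text stream into a list of dicts keyed by the header row."""
    return list(csv.DictReader(source, delimiter=delimiter))


# Toy stand-in for a downloaded .tsv file. The column names `word` and
# `freq` and all values are hypothetical, for illustration only.
toy_tsv = io.StringIO("word\tfreq\nchat\t12.5\nchien\t30.0\n")
rows = read_delimited(toy_tsv)

# The csv module returns every value as a string, so numeric columns
# must be converted explicitly before comparing them.
frequent = [row["word"] for row in rows if float(row["freq"]) > 20]
print(frequent)  # prints ['chien']
```

For a file on disk, the same helper can be given an open file handle, for example `open(path, newline="", encoding="utf-8")`, where `path` points to the downloaded dataset; the UTF-8 encoding is an assumption and may need adjusting for a given file. For `.csv` files, pass `delimiter=","`.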