Download Pre-trained Word Vectors
Oscova has an in-built Word Vector loader that can load Word Vectors from large vector data files generated by a GloVe, Word2Vec, or fastText model.
During development, if you do not have domain-specific data to train on, you can download any of the following pre-trained models. We have gathered them from various external sources and are making the links available here. All of the following word vector data in text file format should be compatible with the Syn.Bot framework version 2.6.0 and above.
Note
The Word Vector loader in Oscova has been heavily optimized for performance and can load 10,000 Word Vectors per second or more depending on your machine configuration.
FastText Models
Below are pre-trained word vectors for 294 languages, trained on Wikipedia using fastText, a word vector model developed by Facebook's research team. The 300-dimensional vectors were obtained using the skip-gram model described in Bojanowski et al. (2016) with default parameters.
Note
We recommend loading these word vectors only in production and not during development, as it may take several minutes for the entire vector data to load, which can hinder fast prototyping.
All word vector data loaded in Oscova is utilized in both Standard and Deep processing modes.
Format
The word vectors below are available in fastText's binary and text formats. In the text format, each line contains a word followed by its embedding, with each value space separated. Words are ordered by frequency in descending order.
These fastText models are provided directly by the team behind fastText. You can visit the fastText docs page for more information on fastText models.
Models
The models can be downloaded from:
Afrikaans: bin, text | Albanian: bin, text | Alemannic: bin, text |
Amharic: bin, text | Arabic: bin, text | Aragonese: bin, text |
Armenian: bin, text | Assamese: bin, text | Asturian: bin, text |
Azerbaijani: bin, text | Bashkir: bin, text | Basque: bin, text |
Bavarian: bin, text | Belarusian: bin, text | Bengali: bin, text |
Bihari: bin, text | Bishnupriya Manipuri: bin, text | Bosnian: bin, text |
Breton: bin, text | Bulgarian: bin, text | Burmese: bin, text |
Catalan: bin, text | Cebuano: bin, text | Central Bicolano: bin, text |
Chechen: bin, text | Chinese: bin, text | Chuvash: bin, text |
Corsican: bin, text | Croatian: bin, text | Czech: bin, text |
Danish: bin, text | Divehi: bin, text | Dutch: bin, text |
Eastern Punjabi: bin, text | Egyptian Arabic: bin, text | Emilian-Romagnol: bin, text |
English: bin, text | Erzya: bin, text | Esperanto: bin, text |
Estonian: bin, text | Fiji Hindi: bin, text | Finnish: bin, text |
French: bin, text | Galician: bin, text | Georgian: bin, text |
German: bin, text | Goan Konkani: bin, text | Greek: bin, text |
Gujarati: bin, text | Haitian: bin, text | Hebrew: bin, text |
Hill Mari: bin, text | Hindi: bin, text | Hungarian: bin, text |
Icelandic: bin, text | Ido: bin, text | Ilokano: bin, text |
Indonesian: bin, text | Interlingua: bin, text | Irish: bin, text |
Italian: bin, text | Japanese: bin, text | Javanese: bin, text |
Kannada: bin, text | Kapampangan: bin, text | Kazakh: bin, text |
Khmer: bin, text | Kirghiz: bin, text | Korean: bin, text |
Kurdish (Kurmanji): bin, text | Kurdish (Sorani): bin, text | Latin: bin, text |
Latvian: bin, text | Limburgish: bin, text | Lithuanian: bin, text |
Lombard: bin, text | Low Saxon: bin, text | Luxembourgish: bin, text |
Macedonian: bin, text | Maithili: bin, text | Malagasy: bin, text |
Malay: bin, text | Malayalam: bin, text | Maltese: bin, text |
Manx: bin, text | Marathi: bin, text | Mazandarani: bin, text |
Meadow Mari: bin, text | Minangkabau: bin, text | Mingrelian: bin, text |
Mirandese: bin, text | Mongolian: bin, text | Nahuatl: bin, text |
Neapolitan: bin, text | Nepali: bin, text | Newar: bin, text |
North Frisian: bin, text | Northern Sotho: bin, text | Norwegian (Bokmål): bin, text |
Norwegian (Nynorsk): bin, text | Occitan: bin, text | Oriya: bin, text |
Ossetian: bin, text | Palatinate German: bin, text | Pashto: bin, text |
Persian: bin, text | Piedmontese: bin, text | Polish: bin, text |
Portuguese: bin, text | Quechua: bin, text | Romanian: bin, text |
Romansh: bin, text | Russian: bin, text | Sakha: bin, text |
Sanskrit: bin, text | Sardinian: bin, text | Scots: bin, text |
Scottish Gaelic: bin, text | Serbian: bin, text | Serbo-Croatian: bin, text |
Sicilian: bin, text | Sindhi: bin, text | Sinhalese: bin, text |
Slovak: bin, text | Slovenian: bin, text | Somali: bin, text |
Southern Azerbaijani: bin, text | Spanish: bin, text | Sundanese: bin, text |
Swahili: bin, text | Swedish: bin, text | Tagalog: bin, text |
Tajik: bin, text | Tamil: bin, text | Tatar: bin, text |
Telugu: bin, text | Thai: bin, text | Tibetan: bin, text |
Turkish: bin, text | Turkmen: bin, text | Ukrainian: bin, text |
Upper Sorbian: bin, text | Urdu: bin, text | Uyghur: bin, text |
Uzbek: bin, text | Venetian: bin, text | Vietnamese: bin, text |
Volapük: bin, text | Walloon: bin, text | Waray: bin, text |
Welsh: bin, text | West Flemish: bin, text | West Frisian: bin, text |
Western Punjabi: bin, text | Yiddish: bin, text | Yoruba: bin, text |
Zazaki: bin, text | Zeelandic: bin, text |
References
If you use these word embeddings, please cite the following paper:
P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information
@article{bojanowski2016enriching,
title={Enriching Word Vectors with Subword Information},
author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
journal={arXiv preprint arXiv:1607.04606},
year={2016}
}
License
The pre-trained word vectors are distributed under the Creative Commons Attribution-ShareAlike 3.0 License.
Word2Vec and GloVe Models
Model file | Number of dimensions | Corpus (size) | Vocabulary size | Author | Architecture
---|---|---|---|---|---
Google News | 300 | Google News (100B) | 3M | Google | word2vec
Freebase IDs | 1000 | Google News (100B) | 1.4M | Google | word2vec, skip-gram
Freebase names | 1000 | Google News (100B) | 1.4M | Google | word2vec, skip-gram
Wikipedia+Gigaword 5 | 50 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe
Wikipedia+Gigaword 5 | 100 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe
Wikipedia+Gigaword 5 | 200 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe
Wikipedia+Gigaword 5 | 300 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe
Common Crawl 42B | 300 | Common Crawl (42B) | 1.9M | GloVe | GloVe
Common Crawl 840B | 300 | Common Crawl (840B) | 2.2M | GloVe | GloVe
Twitter (2B Tweets) | 25 | Twitter (27B) | ? | GloVe | GloVe
Twitter (2B Tweets) | 50 | Twitter (27B) | ? | GloVe | GloVe
Twitter (2B Tweets) | 100 | Twitter (27B) | ? | GloVe | GloVe
Twitter (2B Tweets) | 200 | Twitter (27B) | ? | GloVe | GloVe
Wikipedia dependency | 300 | Wikipedia (?) | 174,015 | Levy & Goldberg | word2vec modified
DBPedia vectors (wiki2vec) | 1000 | Wikipedia (?) | ? | Idio | word2vec
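GloVe text files in the table above use the same word-then-values layout as fastText's text format, but with no header line. As a minimal, generic sketch (again, not Oscova's API), such a file can be loaded and two words compared by cosine similarity as follows:

```python
# Sketch: load GloVe-style text vectors (no header line, unlike
# fastText .vec files) and compare words by cosine similarity.
# Assumed layout: one "word v1 v2 ... vd" line per word.
import math

def load_glove(path):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors

def cosine(a, b):
    # cosine similarity: dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)
```

The missing header line is the main practical difference when switching between GloVe and fastText text files: a loader that expects a header will silently misparse the first GloVe entry.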