Download Pre-trained Word Vectors
Oscova has an in-built Word Vector loader that can load Word Vectors from large vector data files generated by a GloVe, Word2Vec, or fastText model.
During development, if you do not have domain-specific data to train on, you can download any of the following pre-trained models. We have gathered them from various external sources and are making the links available here. All of the following word vector data in text file format should be compatible with the Syn.Bot framework version 2.6.0 and above.
Note
The Word Vector loader in Oscova has been heavily optimized for performance and can load 10,000 Word Vectors per second or more depending on your machine configuration.
FastText Models
Below are pre-trained word vectors for 294 languages, trained on Wikipedia using fastText, a word vector model developed by Facebook's research team. The 300-dimensional vectors were obtained using the skip-gram model described in Bojanowski et al. (2016) with default parameters.
Note
We recommend loading these word vectors only in production and not during development, as it may take several minutes for the entire vector data to load, which can hinder fast prototyping.
All word vector data loaded in Oscova is utilized in both Standard and Deep processing modes.
Format
The word vectors below are available in fastText's binary and text formats. In the text format, each line contains a word followed by its embedding, with each value space separated. Words are ordered by frequency in descending order.
These fastText models are provided directly by the team behind fastText. You can visit the fastText docs page for more information on fastText models.
Models
The models can be downloaded from:
Afrikaans: bin, text | Albanian: bin, text | Alemannic: bin, text |
Amharic: bin, text | Arabic: bin, text | Aragonese: bin, text |
Armenian: bin, text | Assamese: bin, text | Asturian: bin, text |
Azerbaijani: bin, text | Bashkir: bin, text | Basque: bin, text |
Bavarian: bin, text | Belarusian: bin, text | Bengali: bin, text |
Bihari: bin, text | Bishnupriya Manipuri: bin, text | Bosnian: bin, text |
Breton: bin, text | Bulgarian: bin, text | Burmese: bin, text |
Catalan: bin, text | Cebuano: bin, text | Central Bicolano: bin, text |
Chechen: bin, text | Chinese: bin, text | Chuvash: bin, text |
Corsican: bin, text | Croatian: bin, text | Czech: bin, text |
Danish: bin, text | Divehi: bin, text | Dutch: bin, text |
Eastern Punjabi: bin, text | Egyptian Arabic: bin, text | Emilian-Romagnol: bin, text |
English: bin, text | Erzya: bin, text | Esperanto: bin, text |
Estonian: bin, text | Fiji Hindi: bin, text | Finnish: bin, text |
French: bin, text | Galician: bin, text | Georgian: bin, text |
German: bin, text | Goan Konkani: bin, text | Greek: bin, text |
Gujarati: bin, text | Haitian: bin, text | Hebrew: bin, text |
Hill Mari: bin, text | Hindi: bin, text | Hungarian: bin, text |
Icelandic: bin, text | Ido: bin, text | Ilokano: bin, text |
Indonesian: bin, text | Interlingua: bin, text | Irish: bin, text |
Italian: bin, text | Japanese: bin, text | Javanese: bin, text |
Kannada: bin, text | Kapampangan: bin, text | Kazakh: bin, text |
Khmer: bin, text | Kirghiz: bin, text | Korean: bin, text |
Kurdish (Kurmanji): bin, text | Kurdish (Sorani): bin, text | Latin: bin, text |
Latvian: bin, text | Limburgish: bin, text | Lithuanian: bin, text |
Lombard: bin, text | Low Saxon: bin, text | Luxembourgish: bin, text |
Macedonian: bin, text | Maithili: bin, text | Malagasy: bin, text |
Malay: bin, text | Malayalam: bin, text | Maltese: bin, text |
Manx: bin, text | Marathi: bin, text | Mazandarani: bin, text |
Meadow Mari: bin, text | Minangkabau: bin, text | Mingrelian: bin, text |
Mirandese: bin, text | Mongolian: bin, text | Nahuatl: bin, text |
Neapolitan: bin, text | Nepali: bin, text | Newar: bin, text |
North Frisian: bin, text | Northern Sotho: bin, text | Norwegian (Bokmål): bin, text |
Norwegian (Nynorsk): bin, text | Occitan: bin, text | Oriya: bin, text |
Ossetian: bin, text | Palatinate German: bin, text | Pashto: bin, text |
Persian: bin, text | Piedmontese: bin, text | Polish: bin, text |
Portuguese: bin, text | Quechua: bin, text | Romanian: bin, text |
Romansh: bin, text | Russian: bin, text | Sakha: bin, text |
Sanskrit: bin, text | Sardinian: bin, text | Scots: bin, text |
Scottish Gaelic: bin, text | Serbian: bin, text | Serbo-Croatian: bin, text |
Sicilian: bin, text | Sindhi: bin, text | Sinhalese: bin, text |
Slovak: bin, text | Slovenian: bin, text | Somali: bin, text |
Southern Azerbaijani: bin, text | Spanish: bin, text | Sundanese: bin, text |
Swahili: bin, text | Swedish: bin, text | Tagalog: bin, text |
Tajik: bin, text | Tamil: bin, text | Tatar: bin, text |
Telugu: bin, text | Thai: bin, text | Tibetan: bin, text |
Turkish: bin, text | Turkmen: bin, text | Ukrainian: bin, text |
Upper Sorbian: bin, text | Urdu: bin, text | Uyghur: bin, text |
Uzbek: bin, text | Venetian: bin, text | Vietnamese: bin, text |
Volapük: bin, text | Walloon: bin, text | Waray: bin, text |
Welsh: bin, text | West Flemish: bin, text | West Frisian: bin, text |
Western Punjabi: bin, text | Yiddish: bin, text | Yoruba: bin, text |
Zazaki: bin, text | Zeelandic: bin, text |
References
If you use these word embeddings, please cite the following paper:
P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information
@article{bojanowski2016enriching,
title={Enriching Word Vectors with Subword Information},
author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
journal={arXiv preprint arXiv:1607.04606},
year={2016}
}
License
The pre-trained word vectors are distributed under the Creative Commons Attribution-ShareAlike 3.0 License.
Word2Vec and GloVe Models
Model file | Number of dimensions | Corpus (size) | Vocabulary size | Author | Architecture
---|---|---|---|---|---
Google News | 300 | Google News (100B) | 3M | Google | word2vec
Freebase IDs | 1000 | Google News (100B) | 1.4M | Google | word2vec, skip-gram
Freebase names | 1000 | Google News (100B) | 1.4M | Google | word2vec, skip-gram
Wikipedia+Gigaword 5 | 50 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe
Wikipedia+Gigaword 5 | 100 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe
Wikipedia+Gigaword 5 | 200 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe
Wikipedia+Gigaword 5 | 300 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe
Common Crawl 42B | 300 | Common Crawl (42B) | 1.9M | GloVe | GloVe
Common Crawl 840B | 300 | Common Crawl (840B) | 2.2M | GloVe | GloVe
Twitter (2B Tweets) | 25 | Twitter (27B) | ? | GloVe | GloVe
Twitter (2B Tweets) | 50 | Twitter (27B) | ? | GloVe | GloVe
Twitter (2B Tweets) | 100 | Twitter (27B) | ? | GloVe | GloVe
Twitter (2B Tweets) | 200 | Twitter (27B) | ? | GloVe | GloVe
Wikipedia dependency | 300 | Wikipedia (?) | 174,015 | Levy & Goldberg | word2vec modified
DBPedia vectors (wiki2vec) | 1000 | Wikipedia (?) | ? | Idio | word2vec
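GloVe text files in the table above use the same word-then-values layout as fastText's text format, but with no header line. As a minimal, generic sketch (again, not Oscova's API), such a file can be loaded and two words compared by cosine similarity as follows:

```python
# Sketch: load GloVe-style text vectors (no header line, unlike
# fastText .vec files) and compare words by cosine similarity.
# Assumed layout: one "word v1 v2 ... vd" line per word.
import math

def load_glove(path):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors

def cosine(a, b):
    # cosine similarity: dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)
```

The missing header line is the main practical difference when switching between GloVe and fastText text files: a loader that expects a header will silently misparse the first GloVe entry.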