    Download Pre-trained Word Vectors

Oscova has an in-built Word Vector loader that can load Word Vectors from large vector data files generated by GloVe, Word2Vec, or fastText models.

During development, if you do not have domain-specific data to train on, you can download any of the following pre-trained models. We have gathered these from various external sources and are making the links available here. All of the following word vector data in text file format should be compatible with Syn.Bot framework version 2.6.0 and above.
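As a starting point, a pre-trained archive can be fetched with nothing more than the Python standard library. The sketch below downloads and unpacks the Stanford NLP GloVe 6B archive; the URL is the publicly documented location at the time of writing and the file names are assumptions, so verify both before relying on them.

```python
# Minimal sketch: download and unpack a pre-trained GloVe archive.
# GLOVE_URL and the extracted file names are assumptions; verify before use.
import urllib.request
import zipfile

GLOVE_URL = "https://nlp.stanford.edu/data/glove.6B.zip"  # assumed still current
ARCHIVE = "glove.6B.zip"

urllib.request.urlretrieve(GLOVE_URL, ARCHIVE)
with zipfile.ZipFile(ARCHIVE) as zf:
    # Expected contents: glove.6B.50d.txt, 100d, 200d, and 300d text files
    zf.extractall("vectors")
print("extracted to ./vectors")
```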

    Note

    The Word Vector loader in Oscova has been heavily optimized for performance and can load 10,000 Word Vectors per second or more, depending on your machine configuration.

    FastText Models

    Below are pre-trained word vectors for 294 languages, trained on Wikipedia using fastText, a word vector model developed by Facebook's research team. The 300-dimensional vectors were obtained using the skip-gram model described in Bojanowski et al. (2016) with default parameters.

    Note

    We recommend loading these word vectors only in production and not during development, as it may take several minutes for the entire vector data to be loaded, which can hinder fast prototyping.

    All word vector data loaded in Oscova is utilized in both Standard and Deep processing modes.
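One pragmatic way to keep development fast despite the note above is to prototype against a truncated copy of the vector file. Because words are ordered by descending frequency (see the Format section below), keeping only the first N lines retains the most common words. The helper below is a hypothetical sketch, not part of Oscova; the file names are placeholders.

```python
# Hypothetical helper: copy only the first max_words vectors of a
# text-format vector file for fast prototyping. File names are placeholders.
def truncate_vectors(src: str, dst: str, max_words: int = 50_000) -> None:
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        first = fin.readline()
        parts = first.split()
        if len(parts) == 2 and all(p.isdigit() for p in parts):
            # fastText .vec header line: "<vocab_size> <dimensions>";
            # rewrite the count (overstated if the file is shorter than max_words)
            fout.write(f"{max_words} {parts[1]}\n")
        else:
            fout.write(first)  # GloVe-style files have no header; first line is a vector
            max_words -= 1
        for i, line in enumerate(fin):
            if i >= max_words:
                break
            fout.write(line)

truncate_vectors("wiki.en.vec", "wiki.en.small.vec", 50_000)
```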

    Format

    The word vectors below are available in fastText's binary and text formats. In the text format, each line contains a word followed by its embedding, with values separated by spaces. Words are ordered by descending frequency.

    These fastText models are provided directly by the team behind fastText. You can visit the fastText documentation for more information on these models.
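To make the text format concrete, here is a small sketch of reading such a file into a word-to-vector dictionary. NumPy is used for convenience and the file name is a placeholder; this is an illustration of the format, not Oscova's loader.

```python
# Sketch: parse a text-format vector file ("word v1 v2 ... vd" per line).
import numpy as np

def load_text_vectors(path: str, limit: int | None = None) -> dict:
    """Read a text-format vector file into {word: numpy array}."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            parts = line.rstrip().split()
            if i == 0 and len(parts) == 2:
                continue  # skip fastText's "<vocab_size> <dimensions>" header
            if limit is not None and len(vectors) >= limit:
                break
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

emb = load_text_vectors("wiki.en.vec", limit=10_000)
print(len(emb), "vectors loaded")
```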

    Models

    The models can be downloaded from:

    Afrikaans: bin, text Albanian: bin, text Alemannic: bin, text
    Amharic: bin, text Arabic: bin, text Aragonese: bin, text
    Armenian: bin, text Assamese: bin, text Asturian: bin, text
    Azerbaijani: bin, text Bashkir: bin, text Basque: bin, text
    Bavarian: bin, text Belarusian: bin, text Bengali: bin, text
    Bihari: bin, text Bishnupriya Manipuri: bin, text Bosnian: bin, text
    Breton: bin, text Bulgarian: bin, text Burmese: bin, text
    Catalan: bin, text Cebuano: bin, text Central Bicolano: bin, text
    Chechen: bin, text Chinese: bin, text Chuvash: bin, text
    Corsican: bin, text Croatian: bin, text Czech: bin, text
    Danish: bin, text Divehi: bin, text Dutch: bin, text
    Eastern Punjabi: bin, text Egyptian Arabic: bin, text Emilian-Romagnol: bin, text
    English: bin, text Erzya: bin, text Esperanto: bin, text
    Estonian: bin, text Fiji Hindi: bin, text Finnish: bin, text
    French: bin, text Galician: bin, text Georgian: bin, text
    German: bin, text Goan Konkani: bin, text Greek: bin, text
    Gujarati: bin, text Haitian: bin, text Hebrew: bin, text
    Hill Mari: bin, text Hindi: bin, text Hungarian: bin, text
    Icelandic: bin, text Ido: bin, text Ilokano: bin, text
    Indonesian: bin, text Interlingua: bin, text Irish: bin, text
    Italian: bin, text Japanese: bin, text Javanese: bin, text
    Kannada: bin, text Kapampangan: bin, text Kazakh: bin, text
    Khmer: bin, text Kirghiz: bin, text Korean: bin, text
    Kurdish (Kurmanji): bin, text Kurdish (Sorani): bin, text Latin: bin, text
    Latvian: bin, text Limburgish: bin, text Lithuanian: bin, text
    Lombard: bin, text Low Saxon: bin, text Luxembourgish: bin, text
    Macedonian: bin, text Maithili: bin, text Malagasy: bin, text
    Malay: bin, text Malayalam: bin, text Maltese: bin, text
    Manx: bin, text Marathi: bin, text Mazandarani: bin, text
    Meadow Mari: bin, text Minangkabau: bin, text Mingrelian: bin, text
    Mirandese: bin, text Mongolian: bin, text Nahuatl: bin, text
    Neapolitan: bin, text Nepali: bin, text Newar: bin, text
    North Frisian: bin, text Northern Sotho: bin, text Norwegian (Bokmål): bin, text
    Norwegian (Nynorsk): bin, text Occitan: bin, text Oriya: bin, text
    Ossetian: bin, text Palatinate German: bin, text Pashto: bin, text
    Persian: bin, text Piedmontese: bin, text Polish: bin, text
    Portuguese: bin, text Quechua: bin, text Romanian: bin, text
    Romansh: bin, text Russian: bin, text Sakha: bin, text
    Sanskrit: bin, text Sardinian: bin, text Scots: bin, text
    Scottish Gaelic: bin, text Serbian: bin, text Serbo-Croatian: bin, text
    Sicilian: bin, text Sindhi: bin, text Sinhalese: bin, text
    Slovak: bin, text Slovenian: bin, text Somali: bin, text
    Southern Azerbaijani: bin, text Spanish: bin, text Sundanese: bin, text
    Swahili: bin, text Swedish: bin, text Tagalog: bin, text
    Tajik: bin, text Tamil: bin, text Tatar: bin, text
    Telugu: bin, text Thai: bin, text Tibetan: bin, text
    Turkish: bin, text Turkmen: bin, text Ukrainian: bin, text
    Upper Sorbian: bin, text Urdu: bin, text Uyghur: bin, text
    Uzbek: bin, text Venetian: bin, text Vietnamese: bin, text
    Volapük: bin, text Walloon: bin, text Waray: bin, text
    Welsh: bin, text West Flemish: bin, text West Frisian: bin, text
    Western Punjabi: bin, text Yiddish: bin, text Yoruba: bin, text
    Zazaki: bin, text Zeelandic: bin, text

    References

    If you use these word embeddings, please cite the following paper:

    P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information

    @article{bojanowski2016enriching,
      title={Enriching Word Vectors with Subword Information},
      author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
      journal={arXiv preprint arXiv:1607.04606},
      year={2016}
    }
    

    License

    The pre-trained word vectors are distributed under the Creative Commons Attribution-Share-Alike License 3.0.

    Word2Vec and GloVe Models

    | Model file | Dimensions | Corpus (size) | Vocabulary size | Author | Architecture |
    |---|---|---|---|---|---|
    | Google News | 300 | Google News (100B) | 3M | Google | word2vec |
    | Freebase IDs | 1000 | Google News (100B) | 1.4M | Google | word2vec, skip-gram |
    | Freebase names | 1000 | Google News (100B) | 1.4M | Google | word2vec, skip-gram |
    | Wikipedia+Gigaword 5 | 50 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe |
    | Wikipedia+Gigaword 5 | 100 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe |
    | Wikipedia+Gigaword 5 | 200 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe |
    | Wikipedia+Gigaword 5 | 300 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe |
    | Common Crawl 42B | 300 | Common Crawl (42B) | 1.9M | GloVe | GloVe |
    | Common Crawl 840B | 300 | Common Crawl (840B) | 2.2M | GloVe | GloVe |
    | Twitter (2B Tweets) | 25 | Twitter (27B) | ? | GloVe | GloVe |
    | Twitter (2B Tweets) | 50 | Twitter (27B) | ? | GloVe | GloVe |
    | Twitter (2B Tweets) | 100 | Twitter (27B) | ? | GloVe | GloVe |
    | Twitter (2B Tweets) | 200 | Twitter (27B) | ? | GloVe | GloVe |
    | Wikipedia dependency | 300 | Wikipedia (?) | 174,015 | Levy & Goldberg | word2vec modified |
    | DBPedia vectors (wiki2vec) | 1000 | Wikipedia (?) | ? | Idio | word2vec |
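Some of the models above, such as Google News and the Freebase vectors, are distributed in word2vec's binary format rather than plain text. If you need text-format vectors, one option is to convert with the gensim library; the sketch below uses gensim's standard KeyedVectors API, and the file names are placeholders.

```python
# Sketch: convert a word2vec binary model to the space-separated text format
# using gensim. File names are placeholders; any word2vec-aware tool will do.
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True
)
kv.save_word2vec_format("GoogleNews-vectors-negative300.txt", binary=False)
```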