Open Sahara is an open-source framework for text mining, developed by TalkingTrends. It provides scalable functionality for harvesting and annotating content, natural language processing, semantic indexing, storage, and search. It integrates GATE (General Architecture for Text Engineering), Heritrix 3 (the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler), and any semantic database based on Sesame into a single platform for natural language processing, semantic indexing, text mining, and data mining.

The Open Sahara framework is released under the permissive Apache License, Version 2.0. TalkingTrends, the company behind Open Sahara, also develops closed-source extensions for the platform that focus on:
  • Cloud-based (horizontal) scalability of storage and indexing.
  • Cloud-based scalability of NLP pipelines.
  • Processing of text written in Dutch.
  • Harvesting of named entities (locations, persons, etc.).
  • Geospatial knowledge extraction.

