Open Sahara is an open-source framework for text mining, developed by TalkingTrends. It provides scalable functionality for harvesting and annotating content, natural language processing, semantic indexing, storage, and search. It integrates GATE (General Architecture for Text Engineering), Heritrix 3 (the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler), and any semantic database based on Sesame into a single platform for natural language processing, semantic indexing, text mining, and data mining.

The Open Sahara framework is released under the permissive Apache License, Version 2.0. TalkingTrends, the company behind Open Sahara, also develops closed-source extensions for this platform that focus on:
  • Cloud-based (horizontal) scalability of storage and indexing.
  • Cloud-based scalability of NLP pipelines.
  • Processing of text written in Dutch.
  • Harvesting of Named Entities (locations, persons, etc.).
  • Geospatial knowledge extraction.

Get Involved

Before you can report issues or participate in discussions, you must register an account and activate it via email, or log in if you already have an account.

You can help in many ways. Pick what suits your skills best; all contributions are much appreciated by the rest of the community. More information on how to contribute is available on our GetInvolved wiki page.

Issue tracking

View all issues