A solid, secure, and scalable systems architecture
Dcipher Analytics runs in Google Cloud, and users interact with the web application through a browser. The architecture is modularized into microservices running in Kubernetes, which allows load balancing across multiple pods and provides high availability.
The Dcipher API manages access to storage and database units while ensuring high security. It handles all user task requests and is designed to serve millions of users. Dcipher's Message Queue enables the transfer of large volumes of data between the web client and computation clusters. Our resource controllers dynamically allocate computation and storage resources based on demand.
Dcipher Analytics offers fully distributed computation and complete parallelism among machines on top of an Apache Spark framework.
All datasets are cached in memory (RAM) for fast access and processing. Cached operations and data require no recalculation during the session. Thus, unlike most pipeline-based analytics platforms, each operation runs only once, and the user can safely add operations to or remove them from the pipeline. When an operation's parameters are changed, only that operation and the operations downstream of it need to be rerun, rather than the entire pipeline as with conventional solutions.
Overall, the Dcipher Analytics architecture allows computation resources to scale according to need and provides easy integration with upstream resources such as data and pre-trained models.
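The caching behavior described above can be illustrated with a minimal sketch. This is not Dcipher's implementation: the `CachedPipeline` class, its methods, and the three toy operations are hypothetical names introduced only for illustration, and the pipeline is assumed to be strictly linear.

```python
from typing import Any, Callable, Dict, List, Tuple


class CachedPipeline:
    """Toy linear pipeline that keeps every step's output in memory."""

    def __init__(self, source: Any) -> None:
        self.source = source
        self.steps: List[Tuple[Callable[..., Any], Dict[str, Any]]] = []
        self.cache: List[Any] = []  # cache[i] holds the output of steps[i]
        self.runs: List[int] = []   # runs[i] counts executions of steps[i]

    def add(self, func: Callable[..., Any], **params: Any) -> int:
        """Append a step; outputs already cached upstream stay valid."""
        self.steps.append((func, params))
        self.runs.append(0)
        return len(self.steps) - 1

    def set_params(self, index: int, **params: Any) -> None:
        """Change a step's parameters and invalidate it and everything downstream."""
        func, _ = self.steps[index]
        self.steps[index] = (func, params)
        del self.cache[index:]

    def result(self) -> Any:
        """Run only the steps that have no cached output yet."""
        data = self.cache[-1] if self.cache else self.source
        for i in range(len(self.cache), len(self.steps)):
            func, params = self.steps[i]
            data = func(data, **params)
            self.cache.append(data)
            self.runs[i] += 1
        return data


def lowercase(texts):
    return [t.lower() for t in texts]


def tokenize(texts):
    return [t.split() for t in texts]


def drop_short(docs, min_len):
    return [[w for w in d if len(w) >= min_len] for d in docs]


pipeline = CachedPipeline(["The quick brown fox", "A lazy dog"])
pipeline.add(lowercase)
pipeline.add(tokenize)
filter_step = pipeline.add(drop_short, min_len=4)

first = pipeline.result()                    # all three steps run once
pipeline.set_params(filter_step, min_len=2)  # only the last step is invalidated
second = pipeline.result()                   # lowercase and tokenize are not rerun
```

After the parameter change, `pipeline.runs` shows that the two upstream steps ran once each while the filter step ran twice.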
The Dcipher Analytics Query Language and Visual ORM
Text data does not fit well into flat, tabular data structures. Imagine, for example, a dataset containing articles that have been split into paragraphs, which have then been split into sentences, which have in turn been tokenized into words, with each word tagged with its part of speech. To avoid the gloomy prospect of storing this data in multiple tables and constructing complex queries to combine them, such data is best stored in the nested formats that NoSQL databases enable.
But in addition to the aggregations, filters, and joins offered by classic database query languages, text analytics requires data transformations as well (in the example above, splitting of text into paragraphs and sentences, tokenization, and part-of-speech tagging). Moreover, to keep computation fast, data should be kept in memory to avoid repeated reading, writing, and indexing. To solve these issues and enable queries of nested data that incorporate custom transformations, we've created the Dcipher Analytics Query Language (DAQL). It combines complex analytics functions with database-like queries on unstructured data, greatly increasing speed and flexibility while enabling the Dcipher developer team to incorporate the latest NLP algorithms into the platform quickly.
Users of Dcipher Analytics do not need to write DAQL queries to run text analytics operations. Instead, they construct DAQL queries implicitly through simple visual interactions like dragging data from one workbench to another. Thus, Dcipher offers visual object-relational mapping (ORM). Or put more plainly: a visual language for advanced text analytics.
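To make the nesting concrete, here is a small sketch of how such an article might look as a single nested record. The field names (`paragraphs`, `sentences`, `tokens`, `word`, `pos`) and the helper `words_with_pos` are assumptions chosen for illustration, not Dcipher's actual schema or query language.

```python
# One article stored as a single nested record instead of four joined tables.
article = {
    "title": "Example article",
    "paragraphs": [
        {
            "sentences": [
                {
                    "text": "Dogs bark.",
                    "tokens": [
                        {"word": "Dogs", "pos": "NOUN"},
                        {"word": "bark", "pos": "VERB"},
                        {"word": ".", "pos": "PUNCT"},
                    ],
                },
                {
                    "text": "Cats sleep.",
                    "tokens": [
                        {"word": "Cats", "pos": "NOUN"},
                        {"word": "sleep", "pos": "VERB"},
                        {"word": ".", "pos": "PUNCT"},
                    ],
                },
            ]
        },
        {
            "sentences": [
                {
                    "text": "Birds sing songs.",
                    "tokens": [
                        {"word": "Birds", "pos": "NOUN"},
                        {"word": "sing", "pos": "VERB"},
                        {"word": "songs", "pos": "NOUN"},
                        {"word": ".", "pos": "PUNCT"},
                    ],
                },
            ]
        },
    ],
}


def words_with_pos(doc, pos):
    """Walk article -> paragraphs -> sentences -> tokens and filter by tag."""
    return [
        token["word"]
        for paragraph in doc["paragraphs"]
        for sentence in paragraph["sentences"]
        for token in sentence["tokens"]
        if token["pos"] == pos
    ]


nouns = words_with_pos(article, "NOUN")
```

In a flat relational layout, the same lookup would require joining article, paragraph, sentence, and token tables.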
The latest AI frameworks
Dcipher Analytics' machine learning functionality builds on state-of-the-art embedding techniques and language models, including BERT, GloVe, ELMo, Flair, and fastText. These power unsupervised tasks (like clustering of words and documents based on their semantic similarities) as well as semi-supervised tasks (like semi-automatic annotation) and supervised tasks (like sentiment analysis).
To accelerate the process of annotating and training classifiers on text data, Dcipher leverages Active Learning, a semi-supervised machine learning technique where the user is queried for labels for the most informative examples, resulting in efficient learning of concepts with relatively few examples. Dcipher's visual interface enables rapid iterations through an efficient loop between machine (providing informative examples) and user (providing labels for those examples), making annotation and model training possible in minutes or hours rather than days or weeks.
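The loop between machine and user can be sketched as pool-based active learning with uncertainty sampling. This is a deliberately simple illustration rather than Dcipher's algorithm: the model is a one-dimensional threshold classifier, the `oracle` function stands in for the human annotator, and all names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def fit_threshold(labeled: Dict[int, int]) -> float:
    """Fit a 1-D classifier: the midpoint between the two classes."""
    highest_negative = max(x for x, y in labeled.items() if y == 0)
    lowest_positive = min(x for x, y in labeled.items() if y == 1)
    return (highest_negative + lowest_positive) / 2


def most_informative(pool: List[int], threshold: float) -> int:
    """Uncertainty sampling: the unlabeled example closest to the boundary."""
    return min(pool, key=lambda x: abs(x - threshold))


def active_learning(
    pool: List[int],
    labeled: Dict[int, int],
    oracle: Callable[[int], int],
    budget: int,
) -> Tuple[float, Dict[int, int]]:
    pool = list(pool)
    labeled = dict(labeled)
    for _ in range(budget):
        if not pool:
            break
        threshold = fit_threshold(labeled)        # machine: retrain on labels so far
        query = most_informative(pool, threshold)  # machine: choose what to ask
        pool.remove(query)
        labeled[query] = oracle(query)             # user: provide the label
    return fit_threshold(labeled), labeled


# Toy setup: examples are scores from 5 to 95; the concept to learn is "score >= 62".
unlabeled = list(range(5, 100, 5))
seed_labels = {0: 0, 100: 1}  # one labeled example per class to start from
threshold, labels = active_learning(
    unlabeled, seed_labels, oracle=lambda x: int(x >= 62), budget=4
)
```

With only four queries out of nineteen unlabeled examples, the learned threshold lands between 60 and 65, next to the true boundary, because every query is spent near the region where the model is least certain.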
Extensive language coverage and innovative NLP functionality
Natural Language Processing tools available commercially typically cover only a small number of languages. In Dcipher Analytics, language coverage is extensive: tokenization is available for 185 languages, sentiment analysis for 115 languages, concept detection for 45 languages, lemmatization for 27 languages, named entity recognition for 39 languages, and topic and phrase detection for any language. Domain-specific language models are added upon user request. Given sufficient data, Dcipher enables users to train vectorizers and custom text classifiers for any language and domain. The extensive language coverage is the result of combining the latest open-source models with proprietary algorithms.
Multi-language text analysis, cumbersome with previous approaches, is as easy with Dcipher as analyzing text in a single language. Traditionally, the choice has been between machine-translating the raw text (expensive) and running analyses on the text in different languages separately (time-consuming). In addition to text-level auto-translation, Dcipher provides the option to output tokens and text enrichment labels, such as concepts, in English regardless of the language of the input text. This relies on lexicon-based word translation in combination with context-based disambiguation to find the most likely English translation. The result is the ability to analyze text in more than 45 languages simultaneously, without the need for prior text-level translation.
A classic text preprocessing dilemma relates to splitting text into relevant segments. It's easy to split text into sentences and paragraphs, but meaningful segments can span several of these units or be contained within them. Being able to identify relevant segments is particularly important in text where newlines are missing or paragraphs are long. To solve this issue, we've developed a smart segmentation operation that looks for semantic similarity as well as references between adjacent sentences to split the text into coherent segments.
Natural languages are rich and ambiguous. This makes it nearly impossible to capture what is relevant and eliminate what is not through keyword-based filters. Dcipher's semantic filtering solves this by clustering semantically similar texts and allowing the user to interpret and apply filters on these clusters. This type of filtering by meaning rather than keywords is made possible by UMAP projections of document vectors onto a two-dimensional surface, which enable human interpretation. Apart from filtering, such document landscapes are highly useful for exploratory text analysis as well.
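The idea of lexicon-based translation with context-based disambiguation can be sketched in a few lines. The lexicon, the table of related words, and the scoring rule below are toy assumptions made up for illustration. They do not reflect Dcipher's proprietary lexicons or its disambiguation method.

```python
from typing import Dict, List, Set

# Hypothetical Spanish-to-English lexicon; ambiguous words list several candidates.
LEXICON: Dict[str, List[str]] = {
    "banco": ["bank", "bench"],
    "dinero": ["money"],
    "parque": ["park"],
}

# Hypothetical table of English words that tend to appear near each candidate.
RELATED: Dict[str, Set[str]] = {
    "bank": {"money", "account"},
    "bench": {"park", "sit"},
}


def translate_token(token: str, context: List[str]) -> str:
    """Pick the candidate translation best supported by the surrounding words."""
    candidates = LEXICON.get(token, [token])  # unknown words pass through unchanged
    context_words = {t for word in context for t in LEXICON.get(word, [])}
    # On ties, max() keeps the first candidate, i.e. the lexicon's default sense.
    return max(candidates, key=lambda c: len(RELATED.get(c, set()) & context_words))


def translate_tokens(tokens: List[str]) -> List[str]:
    """Translate each token using all other tokens in the text as context."""
    return [
        translate_token(token, tokens[:i] + tokens[i + 1:])
        for i, token in enumerate(tokens)
    ]


finance = translate_tokens(["dinero", "banco"])
outdoors = translate_tokens(["banco", "parque"])
```

The same source word, "banco", comes out as "bank" next to "dinero" and as "bench" next to "parque", without translating the full text first.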
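A rough sketch of this kind of segmentation follows. It is an illustration under simplifying assumptions, not the Dcipher operation itself: bag-of-words cosine similarity stands in for semantic similarity, a sentence-initial pronoun stands in for reference detection, and the `threshold` value and word list are arbitrary choices.

```python
import math
import re
from collections import Counter
from typing import List

# Crude stand-in for reference detection: words that usually point backwards.
REFERRING_WORDS = {"it", "they", "this", "these", "that", "those", "he", "she"}


def bag_of_words(sentence: str) -> Counter:
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def starts_with_reference(sentence: str) -> bool:
    words = re.findall(r"[a-z']+", sentence.lower())
    return bool(words) and words[0] in REFERRING_WORDS


def segment(sentences: List[str], threshold: float = 0.2) -> List[List[str]]:
    """Group adjacent sentences that are similar or linked by a reference."""
    segments: List[List[str]] = []
    for sentence in sentences:
        if segments:
            previous = segments[-1][-1]
            similar = cosine(bag_of_words(previous), bag_of_words(sentence)) >= threshold
            if similar or starts_with_reference(sentence):
                segments[-1].append(sentence)
                continue
        segments.append([sentence])  # topic shift: start a new segment
    return segments


review = [
    "The battery lasts two days.",
    "It charges in one hour.",
    "One hour of charging is enough.",
    "Shipping took three weeks.",
    "The shipping cost was high.",
]
segments = segment(review)
```

Here the first three sentences form a battery segment (joined by a reference and by shared vocabulary) and the last two form a shipping segment, even though the input contains no paragraph breaks.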
These are only some of the features that set Dcipher Analytics apart. For a more complete picture, see this article on Dcipher's feature set.