Artificial intelligence (AI) is an umbrella term for the technologies described under Machine Learning. The term ‘AI’ is primarily a marketing term. The models generated by machine learning are not actually ‘intelligent’; they solve narrowly defined problems using complex statistical calculations.
An annotation is a short note or comment added to a text, image, or another document. In linguistics, this mostly means manually marking certain features of natural language in texts. Example: determining the gender of proper names by marking individual names in texts with labels such as ‘female’, ‘diverse’, ‘neutral’, etc.
automatic text generation
Automatic text generation software is increasingly used to create content for websites and other online applications. This software often uses a combination of machine learning and linguistic algorithms to create text that matches a specific pattern or template. This requires structured data.
Classification describes the task of assigning a data object to one of several previously defined classes. Example: assigning a text to a genre based on its content.
The grouping of data: data within a group should be similar to each other, data in different groups should be different. Such a group is called a cluster. Example: a set of international soccer clubs can be clustered according to their membership in a national league.
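The grouping idea can be illustrated with k-means, one common clustering algorithm. The following is a minimal sketch, not the method of any particular product: the function name `kmeans`, the choice to seed the centers with the first k points, and the sample points are all assumptions made for illustration.

```python
def kmeans(points, k, iterations=10):
    """Tiny k-means for 2D points; the first k points seed the centers (an assumption)."""
    centers = [points[i] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return clusters


# Hypothetical sample data: two visibly separate groups of points.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.1), (0.9, 1.1), (9.2, 8.9)]
clusters = kmeans(points, k=2)
```

Points that end up in the same list are closer to each other than to points in the other list, which is the property the definition asks for.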
Computational linguistics is the interface between computer science and linguistics. It is concerned with processing natural language (both text and audio) by computer, for example in speech recognition and synthesis, machine translation, and dialog systems.
Content marketing is a strategic approach to creating and distributing valuable and relevant content to attract and retain users. The goal is to attract new users and retain existing ones. Content marketing is thus a tool for increasing traffic. The content can take different formats, e.g. blog articles on one’s own website, videos, podcasts or infographics.
A collection of texts that usually has a context in terms of content or structure. For example, a corpus may consist of texts from one source.
A crawler is a program that extracts data from a web page and writes it into a database.
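The extract-and-store step can be sketched with Python's standard library. This is an illustration under stated assumptions: the page is given as a hard-coded HTML string rather than fetched over the network, the extracted data is limited to link targets, and the names `LinkExtractor`, `store_links` and the `links` table are hypothetical.

```python
import sqlite3
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def store_links(html, page_url, connection):
    """Extract links from the HTML and write them into a database table."""
    parser = LinkExtractor()
    parser.feed(html)
    connection.execute("CREATE TABLE IF NOT EXISTS links (page TEXT, href TEXT)")
    connection.executemany(
        "INSERT INTO links VALUES (?, ?)",
        [(page_url, href) for href in parser.links],
    )
    connection.commit()
    return parser.links


# Hypothetical page content; a real crawler would download this first.
SAMPLE_HTML = '<html><body><a href="/about">About</a> <a href="/blog">Blog</a> <a>no link</a></body></html>'
connection = sqlite3.connect(":memory:")
found = store_links(SAMPLE_HTML, "https://example.com/", connection)
rows = connection.execute("SELECT href FROM links ORDER BY href").fetchall()
```

A full crawler would additionally follow the stored links to further pages; that part is left out here.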
Interactive visualization of data. Example: the user can change the time period of the displayed data or zoom into a diagram to view something in more detail.
Data mining refers to the computer-aided evaluation and analysis of large volumes of data with the aim of gaining new insights. Automated procedures for pattern recognition, methods from the field of artificial intelligence, statistics and data analysis are used.
Deep Learning has greatly improved the results of ML in many areas, but it is also much more resource intensive than ML methods that are not ‘deep’, i.e. neural networks with only one layer, or other algorithms that do not use neural networks at all.
Entity extraction or entity recognition is a process in which individual, uniquely identifiable entities such as people, places, things, terms, etc. are extracted from unstructured or semi-structured digital data and stored in a machine-readable format.
Common model (pretrained) for text generation developed by OpenAI. GPT-2 is open source.
Further development of GPT-2 (see above). Both models use approximately the same architecture, but GPT-3 has more layers (see neural network) and has been trained with more data.
There are language models for all areas of computational linguistics. Besides text generation, these are for example speech recognition, handwriting recognition, information recognition and extraction.
The following types of language models are used by Ella:
Sequence-to-sequence language model: This is a type of model used in natural language processing (NLP) where the input and output are both sequences of words or tokens. It’s commonly used for tasks like machine translation, where the model takes a sequence in one language and generates a sequence in another language.
BERT-Variant language model: BERT stands for “Bidirectional Encoder Representations from Transformers”. A BERT-variant language model is a model that is based on the BERT architecture but might have some modifications or improvements, such as different training data, model size, or downstream task fine-tuning.
Large language model (LLM): This refers to a type of neural network-based language model that is designed to understand and generate human language. “Large” indicates that the model has a high number of parameters (weights and connections) in its architecture. These models are capable of performing a wide range of NLP tasks and often require significant computational resources for training and inference.
Machine learning refers to a process in which computers learn from examples instead of being explicitly programmed for each use case. Computer programs process a large number of examples, derive patterns and apply them to new data points. In the process, a statistical model is built from the examples, for instance a model of language. Deep learning is a variant of machine learning. ‘AI’ is often used as a buzzword for machine learning.
Predefined measurement value to indicate quality in relation to a specific criterion.
The sequence of all procedures used to build the model. Neural networks are usually described by the number and function of their layers. A model is created from the combination of architecture and corpus.
Morphology means the study of forms and is a branch of linguistics. It is the science of the change of word forms in a language. Words are not fixed entities and can change their form. Depending on the context, for example, write becomes writes or run becomes runs.
Named Entity Recognition (NER)
The automatic detection and labeling of proper names (entities) in texts. Example: Angela Merkel and Frau Merkel refer to the same person in two sentences.
Natural Language Generation (NLG)
The generation of text (natural language) using machine learning.
Natural Language Processing (NLP)
Natural Language Processing deals with the automatic processing of natural language. Methods from computational linguistics, artificial intelligence and statistics are used to recognize, understand, interpret and generate language. This knowledge can then be used, for example, to translate or rewrite texts.
Natural Language Understanding (NLU)
Natural Language Understanding describes the ability of machines to understand natural language. This includes both reading and writing natural language and analyzing meaning and context.
Neural networks are artificial intelligence models modeled on the human brain. They consist of a series of processing units that are interconnected.
An artificial neural network consists of layers of interconnected units (neurons) that pass information to each other under certain conditions. Each unit processes a specific signal and passes it on to the next unit. The network learns by processing signals and adjusting the connections between units.
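How a signal passes through layers of units can be sketched in a few lines. This is a minimal illustration, assuming a ReLU activation as the "condition" under which a unit passes information on; the weights, biases and the function names `relu` and `layer_forward` are made up for the example and not taken from any real model.

```python
def relu(value):
    """A unit only passes on positive signals; negative ones become 0."""
    return max(0.0, value)


def layer_forward(inputs, weights, biases):
    """Compute the output of one layer: one weight row and one bias per unit."""
    return [
        relu(sum(w * x for w, x in zip(row, inputs)) + bias)
        for row, bias in zip(weights, biases)
    ]


# Hypothetical network: 2 inputs -> 2 hidden units -> 1 output unit.
inputs = [1.0, 2.0]
hidden = layer_forward(inputs, weights=[[0.5, -1.0], [1.0, 1.0]], biases=[0.0, -1.0])
output = layer_forward(hidden, weights=[[1.0, 0.5]], biases=[0.1])
```

Here the first hidden unit receives a negative signal and stays silent, while the second one fires. Learning would consist of adjusting the numbers in `weights` and `biases`, which this sketch does not show.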
normalization of texts
Normalization of texts describes the standardization of text structure and punctuation. For example, all quotation marks and dashes are normalized to one character each, as are section markers such as lines or markers for chapter headings.
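Mapping quotation marks and dashes to one character each can be sketched as a simple character translation. The selection of characters below and the function name `normalize_text` are assumptions for illustration; a real pipeline would choose the target characters to match the model it is preparing text for.

```python
# Typographic variants mapped to one plain character each (selection is an assumption).
QUOTES = {
    "\u201c": '"', "\u201d": '"', "\u201e": '"', "\u00ab": '"', "\u00bb": '"',
    "\u2018": "'", "\u2019": "'", "\u201a": "'",
}
DASHES = {"\u2013": "-", "\u2014": "-", "\u2212": "-"}
TRANSLATION_TABLE = str.maketrans({**QUOTES, **DASHES})


def normalize_text(text):
    """Replace every known quote or dash variant with its plain equivalent."""
    return text.translate(TRANSLATION_TABLE)


# German-style low/high quotes, an en dash and a curly apostrophe.
normalized = normalize_text("\u201eHello\u201c \u2013 world\u2019s")
```

After this step, `normalized` contains only straight quotes and a plain hyphen, so a model sees the same character regardless of how the source text was typeset.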
In computer science, an ontology is the formalization of a field of knowledge so that complex facts can be described in machine-readable form. It specifies a structure of objects and their relationships that computer software can process. Ontologies are used mainly for semantic procedures, in which knowledge is extracted from unstructured data and modeled with the help of machine learning. This makes it possible, for example, to run complex queries against the data.
Before a corpus can be used to train a model, some preprocessing steps have to be performed. For example, unwanted content is removed, normalizations are performed, and texts are adapted to model specifics. If, for instance, a model has only learned one type of quotation marks in pretraining, the quotation marks in the training corpus for finetuning should be of the same type, so that they are recognized correctly right away.
pretraining, pretrained model
A model that suggests additional items based on the user behavior of similar users. Example: ‘Users who viewed this item also viewed the following other items…’
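The ‘users who viewed this item also viewed’ idea can be sketched with simple co-view counts. This is a deliberately naive illustration, not how production recommenders are built; the function name `also_viewed`, the viewing histories and the alphabetical tie-breaking are assumptions.

```python
from collections import Counter


def also_viewed(histories, item, top_n=2):
    """Suggest the items most often viewed by users who also viewed `item`."""
    counts = Counter()
    for viewed in histories.values():
        if item in viewed:
            counts.update(viewed - {item})
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [other for other, _ in ranked[:top_n]]


# Hypothetical viewing histories: user name -> set of viewed items.
histories = {
    "user_a": {"laptop", "mouse", "keyboard"},
    "user_b": {"laptop", "mouse"},
    "user_c": {"laptop", "monitor", "mouse"},
    "user_d": {"phone", "charger"},
}
suggestions = also_viewed(histories, "laptop")
```

Users with no overlap, such as `user_d` here, contribute nothing to the suggestions for `laptop`.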
Robotic journalism is a form of journalism that uses computer-controlled programs to create journalistic content. This content can be news reports, sports scores, weather reports, financial reports, stock market reports, and other forms of journalism.
search engine marketing (SEM)
Search engine marketing (SEM) is one of the most important types of online marketing. It can be divided into two methods. The first is search engine advertising (SEA): advertisers pay for their own website to be listed above other websites. This paid advertising is displayed, for example, on Google in special areas of the search engine results page (SERP) and marked as such. The second is search engine optimization (SEO): here, the aim is to get one’s own website displayed as high up as possible in the organic search results.
A statistical model makes predictions about input data based on learned patterns. Language models, for example, predict the next word in an input sentence. A model has an architecture and must be trained to build the statistical model.
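Next-word prediction can be illustrated with a bigram model, which counts which word follows which. This is a minimal sketch far simpler than the neural language models described elsewhere in this glossary; the training sentences and the names `train_bigram_model` and `predict_next` are made up for the example.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count, for every word, how often each other word directly follows it."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, next_word in zip(words, words[1:]):
            following[current][next_word] += 1
    return following


def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    candidates = model.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Hypothetical miniature corpus.
corpus = ["the cat sat on the mat", "the cat ate the fish", "a dog sat on the rug"]
model = train_bigram_model(corpus)
prediction = predict_next(model, "the")
```

In this corpus "cat" follows "the" more often than any other word, so it becomes the prediction. The counts play the role of the learned patterns.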
Text mining is data mining specifically for written data in natural language. The text mining process involves the use of algorithms and methods to extract valuable information from unstructured or semi-structured text data, identify new patterns, confirm existing patterns, or make predictions. The insights gained can be applied in many fields, such as science, marketing, customer service or finance.
Text spinning or article spinning is a technique aimed at changing a text to make it more appealing to a specific target audience. It involves replacing words, changing sentence structures, and inserting new words. The actual content of the text remains unchanged. This is relevant, for example, when creating new texts for search engine marketing (SEM), especially for search engine optimization (SEO).
During training, a model learns from examples. Based on the examples, the model tries to predict an outcome (for example, filling in a cloze correctly) and compares its results with the real values at the end of each cycle. If the result is wrong, the underlying statistical model is adjusted and a new attempt is started. Training usually runs until the statistical model hardly changes anymore, i.e. the results become stable. This can be the case after a few minutes (classic machine learning) or after weeks or months (deep learning on very large data sets).
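The predict, compare, adjust cycle can be sketched with a model that has a single adjustable number. This is an illustration under stated assumptions: the model is `y = weight * x`, the adjustment rule is plain gradient descent on the mean squared error, and the learning rate, tolerance and example data are chosen for the demonstration.

```python
def train(examples, learning_rate=0.01, tolerance=1e-6, max_cycles=10_000):
    """Fit y = weight * x; stop once an adjustment is smaller than `tolerance`."""
    weight = 0.0
    cycle = 0
    for cycle in range(1, max_cycles + 1):
        # Compare predictions with the real values: gradient of the mean squared error.
        gradient = sum(2 * (weight * x - y) * x for x, y in examples) / len(examples)
        adjustment = learning_rate * gradient
        weight -= adjustment
        # The model hardly changes anymore, so training is considered stable.
        if abs(adjustment) < tolerance:
            break
    return weight, cycle


# Hypothetical examples following the rule y = 2 * x.
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
weight, cycles = train(examples)
```

The loop ends well before `max_cycles` because the adjustments shrink with every cycle, and the learned `weight` ends up close to 2. Large models follow the same pattern with many more adjustable values and far more examples.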