Data science & AI pillar · 38 definitions

Data science & artificial intelligence

Clear, citable definitions of the core concepts of data science, computer science and artificial intelligence — from machine learning, deep learning and neural networks to large language models, generative AI and big data. A neutral, vendor-independent reference written for research, not for marketing.

From data to artificial intelligence

Artificial intelligence, machine learning and deep learning are nested ideas: AI is the broad goal, machine learning is the data-driven approach that now dominates it, and deep learning is the neural-network technique that powers its most striking recent results. Understanding the neural network at the centre of these methods is the key to making sense of the field.

Generative AI and language models

The current wave of generative AI is built on large language models and related architectures, supported by techniques such as natural language processing, tokenisation and prompt engineering. These pages explain how the methods work and where their well-documented limitations lie.

The data and computing foundations

Underneath the models sit the foundations: big data, data analysis and analytics, dimensionality-reduction methods such as principal component analysis, the cloud computing infrastructure that research increasingly relies on, and the emerging field of quantum computing. The definitions below cover the whole stack.

Explore the pillar

Data science & AI definitions

Artificial intelligence

What is artificial intelligence?

Artificial intelligence is the field of computer science concerned with building systems that perform tasks normally associated with human intelligence, such as perception, reasoning, learning, and language, and with the study of the methods that make this possible.

Machine learning

What is machine learning?

Machine learning is the branch of artificial intelligence in which computer systems improve at a task by learning patterns from data, rather than being explicitly programmed with fixed rules for every case.

Deep learning

What is deep learning?

Deep learning is a branch of machine learning that uses neural networks with many layers to learn increasingly abstract representations of data, enabling strong performance on tasks such as image recognition and language understanding.

Neural network

What is a neural network?

A neural network is a computational model made of interconnected nodes, or artificial neurons, arranged in layers that transform input data into an output by adjusting the strengths of their connections during training.

Natural language processing

What is natural language processing?

Natural language processing is the field at the intersection of computer science, artificial intelligence, and linguistics concerned with enabling computers to read, interpret, generate, and respond to human language.

Large language model

What is a large language model?

A large language model is a neural network trained on very large amounts of text to predict the next unit of language, allowing it to generate fluent text and perform a wide range of language tasks.

Generative AI

What is generative AI?

Generative AI refers to artificial-intelligence systems that produce new content — such as text, images, audio, video, or code — by learning the patterns of their training data and sampling from a learned distribution.

GAN

What is a generative adversarial network?

A generative adversarial network is a machine-learning framework in which two neural networks — a generator and a discriminator — are trained in competition, so that the generator learns to produce increasingly realistic synthetic data.

Big data

What is big data?

Big data refers to datasets so large, fast-moving, or varied that they exceed the capacity of traditional data-processing tools, requiring distributed storage and computation to capture, manage, and analyse them.

Data science

What is data science?

Data science is an interdisciplinary field that combines statistics, computer science, and domain knowledge to extract knowledge and actionable insight from data through a structured process of collection, analysis, and interpretation.

Data analysis

What is data analysis?

Data analysis is the process of inspecting, cleaning, transforming, and modelling data in order to discover useful information, draw conclusions, and support decision-making.

Data analytics

What is data analytics?

Data analytics is the discipline of examining data to draw conclusions and inform decisions, spanning descriptive, diagnostic, predictive, and prescriptive approaches that answer progressively more demanding questions.

Prompt engineering

What is prompt engineering?

Prompt engineering is the practice of designing and refining the inputs given to a generative AI model so that it produces more useful, accurate, or appropriately formatted outputs.

PCA

What is principal component analysis?

Principal component analysis is a statistical technique that reduces the dimensionality of a dataset by transforming its variables into a smaller set of uncorrelated components ordered by how much variance they capture.

Computer vision

What is computer vision?

Computer vision is the field of artificial intelligence concerned with enabling computers to interpret and understand visual information from images and video, extracting meaningful descriptions of their content.

Quantum computing

What is quantum computing?

Quantum computing is a form of computation that uses the principles of quantum mechanics — such as superposition and entanglement — to process information in ways that classical computers cannot efficiently replicate for certain problems.

Cloud computing

What is cloud computing?

Cloud computing is the on-demand delivery of computing resources — such as servers, storage, databases, and software — over the internet, allowing users to access scalable capacity without owning the underlying hardware.

Computer science

What is computer science?

Computer science is the study of computation, algorithms, information, and the design of computing systems — spanning the theoretical foundations of what can be computed and the practical engineering of software and hardware.

Supervised vs unsupervised

Supervised vs unsupervised learning

Supervised and unsupervised learning are two of the foundational paradigms of machine learning, distinguished by whether or not the training data is labelled with the correct answers.

Tokenisation

What is tokenisation?

Tokenisation is the process of breaking text into smaller units called tokens (such as words, subwords, or characters) so that a computer can process language numerically, a foundational step in NLP and large language models.

Supervised learning

What is supervised learning?

Supervised learning is the branch of machine learning in which a model is trained on labelled examples, learning a mapping from inputs to known outputs so it can predict outputs, such as class labels or numeric values, for new, unseen data.

Unsupervised learning

What is unsupervised learning?

Unsupervised learning is the branch of machine learning that finds structure in data without labelled outputs, discovering patterns such as clusters or a lower-dimensional representation directly from the inputs.

Reinforcement learning

What is reinforcement learning?

Reinforcement learning is the branch of machine learning in which an agent learns to make decisions by interacting with an environment, choosing actions that maximise a cumulative reward signal through trial and error.

Overfitting

What is overfitting?

Overfitting is when a machine-learning model fits the training data too closely, capturing noise as well as signal, so that it performs well on training examples but generalises poorly to new, unseen data.

Cross-validation

What is cross-validation?

Cross-validation is a resampling method for estimating how well a model will generalise to unseen data, by repeatedly partitioning the dataset into training and validation subsets and averaging the results.

Confusion matrix

What is a confusion matrix?

A confusion matrix is a table that summarises the performance of a classifier by tabulating its predictions against the true labels; for a binary classifier it shows the counts of true positives, true negatives, false positives, and false negatives.

Precision and recall

What are precision and recall?

Precision and recall are two metrics for evaluating a classifier: precision measures how many of the items predicted positive are actually positive, while recall measures how many of the actual positives the model successfully finds.
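
In confusion-matrix terms, the two definitions can be written as below. The symbols TP, FP and FN are shorthand introduced here for the counts of true positives, false positives and false negatives in a binary classification task.

```latex
\[
\text{precision} = \frac{TP}{TP + FP},
\qquad
\text{recall} = \frac{TP}{TP + FN}
\]
```

As an illustration with made-up counts, TP = 8, FP = 2 and FN = 4 give a precision of 8/10 = 0.8 and a recall of 8/12 ≈ 0.67.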

F1 score

What is the F1 score?

The F1 score is a single metric that combines precision and recall by taking their harmonic mean, giving a balanced measure of a classifier's performance that penalises models which neglect either one.
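
Written out, the harmonic mean takes the form below, where P and R are shorthand introduced here for precision and recall.

```latex
\[
F_1 = \frac{2\,P\,R}{P + R}
\]
```

As an illustration with made-up values, P = 0.8 and R = 0.5 give F1 = 0.8 / 1.3 ≈ 0.62, which is lower than the arithmetic mean of 0.65 because the harmonic mean is pulled towards the weaker of the two values.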

Gradient descent

What is gradient descent?

Gradient descent is an iterative optimisation algorithm that minimises a function by repeatedly taking steps in the direction of steepest descent, that is, along the negative gradient, and is widely used to train machine-learning models by reducing their error.
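
A single step of the basic algorithm can be sketched as the update rule below. The notation is assumed for illustration: θ_t is the current parameter vector, f the function being minimised, and η > 0 a step size (often called the learning rate).

```latex
\[
\theta_{t+1} = \theta_t - \eta \, \nabla f(\theta_t)
\]
```

For example, with f(θ) = θ², a starting point θ_0 = 1 and η = 0.1, the gradient at θ_0 is 2, so the first step gives θ_1 = 1 − 0.1 × 2 = 0.8, which is closer to the minimum at 0.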

Decision tree

What is a decision tree?

A decision tree is a machine-learning model that makes predictions by following a branching sequence of simple tests on the input features, splitting the data into ever more specific groups until it reaches a decision.

Random forest

What is a random forest?

A random forest is a machine-learning model that combines the predictions of many decision trees, each trained on a different random sample of the data and features, to produce more accurate and stable results than a single tree.

Clustering

What is clustering?

Clustering is an unsupervised machine-learning task that groups a set of data points so that those in the same group are more similar to each other than to those in other groups, revealing natural structure without labels.

K-means clustering

What is k-means clustering?

K-means clustering is a popular unsupervised algorithm that partitions data into k groups by repeatedly assigning each point to its nearest cluster centre and then recomputing those centres until the assignment stabilises.
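
The two alternating steps can be written as below for the standard form of the algorithm, which uses squared Euclidean distance. The notation is assumed for illustration: x_i is a data point, μ_j the centre of cluster j, c_i the cluster assigned to point i, and C_j the set of points currently assigned to cluster j.

```latex
\[
\text{Assignment:}\quad c_i = \arg\min_{j \in \{1,\dots,k\}} \lVert x_i - \mu_j \rVert^2
\]
\[
\text{Update:}\quad \mu_j = \frac{1}{\lvert C_j \rvert} \sum_{i \in C_j} x_i
\]
```

Neither step can increase the total within-cluster sum of squared distances, which is why the procedure settles into a stable assignment, although the result can depend on the initial choice of centres.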

Support vector machine

What is a support vector machine?

A support vector machine is a supervised machine-learning algorithm that classifies data by finding the boundary that separates the classes with the widest possible margin, and can model non-linear boundaries using kernels.

Transformer model

What is a transformer model?

A transformer is a neural-network architecture, introduced in 2017, that uses a self-attention mechanism to weigh the relationships between all elements of a sequence at once, and now underpins most modern large language models.

Recurrent neural network

What is a recurrent neural network?

A recurrent neural network is a type of neural network with connections that loop back, giving it a form of memory that lets it process sequential data such as text, speech, or time series one element at a time.

Feature engineering

What is feature engineering?

Feature engineering is the process of selecting, creating, and transforming the input variables used by a machine-learning model, with the aim of representing the data in a way that makes patterns easier for the model to learn.

Vector database

What is a vector database?

A vector database is a database designed to store and search high-dimensional vectors, called embeddings, by similarity rather than exact match, enabling semantic search and powering retrieval-augmented generation for AI systems.

Common questions

Data science & AI FAQ

What is the difference between AI, machine learning and deep learning?

Artificial intelligence is the broad goal of building systems that perform tasks associated with human intelligence. Machine learning is a subset of AI in which systems learn patterns from data. Deep learning is a subset of machine learning that uses multi-layer neural networks. Each is nested inside the previous.

What is a large language model?

A large language model is a neural network — typically based on the transformer architecture — trained on very large amounts of text to predict the next token in a sequence. This lets it generate and process language, though it can also produce confident but incorrect output, known as hallucination.

What is data science?

Data science is an interdisciplinary field that combines statistics, computer science and domain knowledge to extract insight and knowledge from data. It spans the whole workflow from collecting and cleaning data to analysis, modelling and communicating results.

Are these pages tutorials or product recommendations?

Neither. They are neutral, vendor-independent definitions that explain how each concept is defined and used in research. They do not recommend tools or provide code tutorials — the focus is the standards-and-methodology view of each concept.

How does this relate to CASRAI standards?

CASRAI is a research-standards body. Data-intensive and AI research depends on good research-data management, reproducibility and transparent reporting, for example treating data and models as FAIR (findable, accessible, interoperable and reusable). This is the standards layer CASRAI maintains for the research ecosystem.
