Notes: Data Models and Query Languages
source: Designing Data-Intensive Applications by Martin Kleppmann (Chapter 2)
Data models are layers of abstraction: each model hides the messy details of the layer below it and presents the data cleanly to the layer above.
Relational vs Document Models
Relational Model
Proposed in 1970, the basis for SQL. Data is organized into:
- relations (tables)
- tuples (rows)
The goal of the relational model is to hide implementation details behind a cleaner interface. It has proven remarkably adaptable, outlasting the competing network model and hierarchical model (both discussed below).
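A minimal sketch in SQL of the vocabulary above (the table and column names are invented for illustration): a relation is a table, and each tuple is a row.

```sql
-- A relation (table); each tuple (row) is one user.
CREATE TABLE users (
    user_id    bigint PRIMARY KEY,
    first_name text,
    last_name  text
);

-- Declarative access: describe the rows you want, not how to find them.
SELECT first_name, last_name FROM users WHERE user_id = 42;
```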
NoSQL
Document Model is the main NoSQL alternative covered here. NoSQL databases will probably continue to exist alongside relational databases; using several kinds of datastore side by side is called polyglot persistence. Advantages of NoSQL:
- a large ecosystem of open source tools
- scalability (for large datasets and for datasets with high write throughput)
- some specialized query operations that relational databases don't support well
- freedom from restrictive schemas
The Object-Relational Mismatch
One disadvantage of relational databases is that they don't play well with object-oriented applications. This disconnect is called the impedance mismatch.
- In order to assemble an object that the application needs, the database often has to do a fair amount of joining (see the SQL sketch after this list)
- Solution: Object-Relational Mapping frameworks (ORMs), tools that reduce the boilerplate of the translation layer
- Another solution: JSON models (or document databases) can reduce the impedance mismatch
- These work best when the data for an object is stored together in one place (locality)
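A hedged sketch of the joining problem (the schema is invented for illustration): one application-level "profile" object is spread across several tables, so reconstructing it requires joins.

```sql
-- Hypothetical schema: a single "user profile" object lives in three
-- tables, so rebuilding it for the application means joining them.
SELECT u.first_name,
       p.job_title,
       e.school_name
FROM users u
LEFT JOIN positions p ON p.user_id = u.user_id
LEFT JOIN education e ON e.user_id = u.user_id
WHERE u.user_id = 42;
```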
Many-to-One and Many-to-Many Relationships
Plenty of fields that can be stored as a string can also be stored as an ID that you can join on. Advantages:
- creates a single source of truth. This reduces redundant data and ambiguity
- Makes updating easier
- this is called normalizing the data (see the SQL sketch after this list)
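A small sketch of normalization (hypothetical tables): the human-readable text lives in exactly one place, and other rows point to it by ID.

```sql
-- Normalized: the display text is stored in exactly one row...
CREATE TABLE regions (
    region_id bigint PRIMARY KEY,
    name      text NOT NULL          -- e.g. 'Greater Seattle Area'
);

-- ...and other tables reference it by ID, so renaming the region
-- means updating a single row instead of many duplicated strings.
CREATE TABLE users (
    user_id   bigint PRIMARY KEY,
    name      text,
    region_id bigint REFERENCES regions (region_id)
);

SELECT u.name, r.name AS region
FROM users u JOIN regions r ON r.region_id = u.region_id;
```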
Hierarchical Models
A model of storing data from the 1960s that represented everything as a tree of records nested within records, similar to JSON. It handled many-to-many relationships poorly and made normalization hard, because you often had to duplicate data
- Denormalization: duplicating data and storing it in multiple places (e.g. storing a string directly instead of an ID plus a join against another table)
The Network Model
A different old data model. Links between records were essentially pointers, and you had to traverse a path of links to reach the information you wanted
- Access path: the route from a root record along a chain of links to the record you want
The network model is essentially dead
Query Optimizers
These belong to relational databases. The data in a relational database isn't structured all that differently from data in a network-model database, but instead of the application developer choosing the access path by hand, the query optimizer chooses it automatically. That means the optimizer decides:
- which parts of the query to execute in which order
- which indexes to use
Note: when you declare a new index, you give the query optimizer a new option for building an access path to the data (see the SQL sketch after this list)
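A hedged example of the note above (table and index names invented): declaring an index changes no queries; it just gives the optimizer another access path to choose from.

```sql
-- Without this index, the optimizer's only option may be a full table scan.
CREATE INDEX users_by_last_name ON users (last_name);

-- The query itself is unchanged; the optimizer now chooses between
-- scanning the table and using the new index.
SELECT * FROM users WHERE last_name = 'Kleppmann';

-- Most databases can show the chosen plan (details vary by system):
EXPLAIN SELECT * FROM users WHERE last_name = 'Kleppmann';
```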
Document Databases and Joins
Document databases handle many-to-one and many-to-many relationships with references (the document-model analogue of foreign keys), which are resolved at read time with follow-up queries or joins done in application code
Optimizing for Simple Code
- Document models: may fit an application better if it loads a manageably sized tree of one-to-many relationships. They reduce the need for joins by denormalizing data (locality)
- disadvantage: it is difficult to refer to items nested deep within a document
- disadvantage: you may have to emulate a join in application code. This performs worse and makes the application more complex
- Relational models: a good fit if you have a lot of many-to-many relationships
- shredding: splitting a document-like structure across several tables (see the SQL sketch after this list). This can be a disadvantage of the relational model
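A sketch of shredding under assumed table names: the tree a document database would store in one place gets split across parent and child tables, and reading it back requires a join.

```sql
-- One "resume" document shredded into a parent row plus child rows.
CREATE TABLE resumes (
    resume_id bigint PRIMARY KEY,
    full_name text
);

CREATE TABLE positions (
    position_id  bigint PRIMARY KEY,
    resume_id    bigint REFERENCES resumes (resume_id),
    job_title    text,
    organization text
);

-- Reading the document back means joining the pieces together again.
SELECT r.full_name, p.job_title, p.organization
FROM resumes r JOIN positions p ON p.resume_id = r.resume_id
WHERE r.resume_id = 1;
```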
Query Languages for Data
declarative vs imperative: in a declarative query language, you just declare what data you want, and the database takes care of how to retrieve it. In an imperative language, you tell the database exactly which steps to perform to get the information.
Declarative Queries
SQL is an example of a declarative language, but not the only one; CSS is also declarative. Advantages of declarative languages (see the example after this list):
- easier to work with and learn
- implementation details are hidden from the user. This means that they can be updated without the user noticing
- easier to parallelize, because the query doesn't fix an order of operations; imperative code must run its operations in a particular sequence
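For example, a declarative SQL query only states the conditions the result must satisfy; it says nothing about scanning order, indexes, or access paths (the animals table here is illustrative):

```sql
-- "Give me all the sharks" -- no loops, no index choice, no access path.
SELECT * FROM animals
WHERE family = 'Sharks';
```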
MapReduce Querying
Made to process large amounts of data in bulk across many machines. It's not quite declarative or imperative, but somewhere in between. It uses a pair of pure functions (map and reduce) to reshape and aggregate data, then presents the result neatly (see the SQL contrast after this list).
- pure functions are required so that the database can run them in any order without messing things up
- in MongoDB's implementation, the map and reduce functions are written in JavaScript
- A little more difficult to learn and use
- alternative: aggregation pipeline (MongoDB)
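For contrast, the kind of aggregation MapReduce expresses with hand-written map and reduce functions can often be written as one declarative query. A hedged sketch (the observations table is invented; date_trunc is PostgreSQL-flavoured):

```sql
-- Monthly count of shark sightings, expressed declaratively.
-- The grouping and summing that map/reduce would do by hand is
-- left to the database to plan and execute.
SELECT date_trunc('month', observation_timestamp) AS observation_month,
       sum(num_animals) AS total_animals
FROM observations
WHERE family = 'Sharks'
GROUP BY observation_month;
```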
Graph-like Data Models
good for handling complex many-to-many relationships, like social networks or links between pages on the internet.
- vertices (nodes, entities)
- edges (relationships, arcs)
Property Graphs
A combination of vertices and edges, each of which can carry arbitrary properties (key-value pairs). There is no schema restricting which vertices can be connected to which, so the arrangement is very flexible: you can build up and traverse a graph that represents complex relationships very simply
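One way to make the vertex/edge structure concrete is to sketch it as two relational tables, roughly along the lines the chapter shows (the column names and the json type are assumptions):

```sql
-- Vertices and edges each carry arbitrary key-value properties.
CREATE TABLE vertices (
    vertex_id  bigint PRIMARY KEY,
    properties json
);

CREATE TABLE edges (
    edge_id     bigint PRIMARY KEY,
    tail_vertex bigint REFERENCES vertices (vertex_id),  -- where the edge starts
    head_vertex bigint REFERENCES vertices (vertex_id),  -- where the edge ends
    label       text,                                    -- kind of relationship
    properties  json
);

-- Any vertex can connect to any other: everything vertex 42 points at.
SELECT * FROM edges WHERE tail_vertex = 42;
```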
The Cypher Query Language
A declarative query language for property graphs. One alternative to Cypher is to store the graph in a relational database and use SQL's recursive queries, but those queries end up much longer and closer to imperative code, so they're harder to read/maintain and tend to perform worse (see the sketch below).
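A hedged sketch of that recursive-SQL alternative, reusing the vertices/edges tables from the property-graph sketch above (WITH RECURSIVE is standard SQL; the vertex ID for "Europe" is assumed): even a simple containment traversal is noticeably more verbose than its Cypher equivalent.

```sql
-- Find every vertex reachable from vertex 1 (assumed to be "Europe")
-- by following edges labelled 'within', to any depth.
WITH RECURSIVE in_europe(vertex_id) AS (
    SELECT vertex_id FROM vertices WHERE vertex_id = 1
  UNION
    SELECT e.tail_vertex
    FROM edges e
    JOIN in_europe ie ON e.head_vertex = ie.vertex_id
    WHERE e.label = 'within'
)
SELECT vertex_id FROM in_europe;
```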
Triple Stores, SPARQL
The triple-store model is similar to the property graph model, but the data is arranged in a slightly different manner: every fact is a (subject, predicate, object) triple. When the object is a plain value, the predicate acts like a property of the subject; when the object is another vertex, the predicate acts like an edge connecting two vertices
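To make the (subject, predicate, object) shape concrete, a hedged sketch of triples laid out as rows in SQL (the facts are invented):

```sql
-- Every fact is one (subject, predicate, object) triple.
CREATE TABLE triples (
    subject   text,
    predicate text,
    object    text
);

INSERT INTO triples VALUES
    ('lucy',  'age',       '33'),      -- predicate acting as a property
    ('lucy',  'marriedTo', 'alain'),   -- predicate acting as an edge
    ('alain', 'bornIn',    'beaune');

-- "Who is married to whom?"
SELECT subject, object FROM triples WHERE predicate = 'marriedTo';
```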
The Semantic Web
The concept of making web pages machine-readable
The RDF Data Model
Resource Description Framework (RDF) - a mechanism for websites to publish data in a consistent format. The goal of RDF is to create a "web of data", a kind of database of everything. RDF data can be written in several formats, e.g. the Turtle language or RDF/XML
- designed for triple stores; subjects, predicates, and objects are often full URIs, so vocabularies defined by different organizations don't clash. This makes it easier for two organizations to combine their data
SPARQL
Query language for triple stores using the RDF data model. Its structure is similar to Cypher's because RDF doesn't distinguish between properties and edges (both are just predicates)
Datalog
An older language from the 1980s whose data model is similar to the triple-store model, generalized slightly: facts are written as predicate(subject, object). Instead of stating a whole query at once, you define rules that derive new predicates from existing data; rules can build on other rules (and on themselves), which makes Datalog well suited to complex queries