Parent categories: Database technology.

Document based databases

How should we use document based databases?


MarkLogic, MongoDB and Couchbase are document databases.

They're often very fast.

Unfortunately, unlike relational databases, they often do not support efficient joins.





How to build a document based database that supports Joins?

A common weakness of document databases is that they often do not support efficient joins. They typically store each document record as an opaque blob.

I created the attached project to describe how I plan to implement document-based storage using a key-value store as the backend. Each field of a document is stored under its own key, and the keys are arranged so that range scans are efficient, which in turn lets hash joins run efficiently.
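
A minimal sketch of why key ordering matters, assuming an in-memory sorted list stands in for the ordered key-value backend. The `OrderedStore` class and the `people.<doc>.<field>` key layout are hypothetical and only for illustration; a real store would supply its own iterator.

```python
import bisect


class OrderedStore:
    """Toy ordered key-value store: keys are kept sorted (hypothetical stand-in)."""

    def __init__(self):
        self._keys = []    # sorted list of keys
        self._values = {}  # key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def prefix_scan(self, prefix):
        """Yield (key, value) for every key starting with `prefix`, in key order."""
        i = bisect.bisect_left(self._keys, prefix)  # seek to the first candidate
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            yield self._keys[i], self._values[self._keys[i]]
            i += 1


store = OrderedStore()
store.put("people.0.0", "Samuel Squire")
store.put("people.0.1.0.0", "God")
store.put("people.0.1.1.0", "databases")
store.put("people.1.0", "Another Person")  # hypothetical second document

# All fields of document 0 are adjacent, so one seek plus a short scan finds them.
doc0 = list(store.prefix_scan("people.0."))
```

Because every field of one document shares a key prefix, reading a document (or one field across many documents, with a different key layout) becomes a single contiguous scan rather than many point lookups.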



    : Bassxn2

The keyword is "efficient". Efficiency falls as computational complexity rises, so I assume you are looking for new algorithms for joins over unstructured data.

First, the problem is already solved in SQL databases, right? Why not take a look at their implementation, and take it from there?

Let's say we have raw data as JSON records (or dictionaries, hashmaps). What you're concerned about then is efficient querying, which is a matter of indexing (query optimization, or query algorithms). We routinely index SQL databases into Elasticsearch, because SQL databases are not good enough, or not flexible enough, at text search and at other kinds of queries that users care about: we use another data system that is good at it, and keep a copy of the data there. It is not very space-efficient, but it works. We could do the same with NoSQL: if you need join-like queries, just "index" the data into a SQL database with a specialized process that interprets the documents and migrates the SQL schema on the fly. It would run as a complementary job in concert with the NoSQL store, always looking for new fields and creating them in the complementary SQL database. Sure, using many databases at once is not an elegant solution, so I agree that we need to improve document-based databases. After all, schemas are not non-existent: every record implies a schema of some sort, and when sufficiently many records share certain fields, that may justify creating a new SQL field or foreign key. Think of it like a brain that realizes new "laws of physics" when it sees sufficiently many examples of a specific type...
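
To make the "promote shared fields into SQL" idea concrete, here is a rough sketch, assuming SQLite as the complementary SQL database and looking only at top-level fields. The helper names `promote_fields` and `sync_to_sql` and the `min_share` threshold are made up for illustration.

```python
import sqlite3
from collections import Counter


def promote_fields(records, min_share=0.5):
    """Return top-level field names present in at least `min_share` of the records."""
    counts = Counter(field for rec in records for field in rec)
    needed = min_share * len(records)
    return sorted(f for f, n in counts.items() if n >= needed)


def sync_to_sql(conn, table, records, fields):
    """Create `table` with one TEXT column per promoted field and copy the values.

    Assumes `fields` is non-empty and that the names are trusted: they are
    interpolated into the SQL text, which is not safe for arbitrary input.
    """
    cols = ", ".join(f'"{f}" TEXT' for f in fields)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
    marks = ", ".join("?" for _ in fields)
    rows = [
        tuple(None if rec.get(f) is None else str(rec[f]) for f in fields)
        for rec in records
    ]
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', rows)


records = [
    {"name": "Ann", "city": "Oslo"},
    {"name": "Bob", "city": "Rome", "nickname": "B"},
    {"name": "Cy"},
]
fields = promote_fields(records)  # "nickname" is too rare to become a column
conn = sqlite3.connect(":memory:")
sync_to_sql(conn, "people", records, fields)
rows = conn.execute("SELECT name, city FROM people ORDER BY name").fetchall()
```

A real synchronizing job would also have to handle nested fields, type changes and schema migrations for fields that become common later; none of that is covered here.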


I've designed a keyspace for JSON that is fast to decode back into JSON and fast to traverse with a range scan in a RocksDB key-value database.

This lets us do a regular hash join as a relational database does.
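
For reference, a regular hash join has a build phase and a probe phase. The sketch below runs over plain Python dictionaries rather than scanned key ranges; the `hash_join` function and the `person_id` field are illustrative assumptions, not part of the project.

```python
from collections import defaultdict


def hash_join(left, right, left_key, right_key):
    """Build/probe hash join over two lists of dicts; returns (left_row, right_row) pairs."""
    # Build phase: index the left rows by their join key.
    buckets = defaultdict(list)
    for row in left:
        if left_key in row:
            buckets[row[left_key]].append(row)

    # Probe phase: look up each right row's key in the hash table.
    pairs = []
    for row in right:
        if right_key not in row:
            continue
        for match in buckets.get(row[right_key], []):
            pairs.append((match, row))
    return pairs


people = [{"_id": "1", "name": "Samuel Squire"}]
hobbies = [
    {"person_id": "1", "hobby": "databases"},
    {"person_id": "1", "hobby": "multicomputer systems"},
    {"person_id": "2", "hobby": "chess"},  # no matching person, dropped by the join
]
joined = hash_join(people, hobbies, "_id", "person_id")
```

Each side is read once, which is why a fast sequential scan of the relevant keys is the main thing the storage layer has to provide.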

These JSON documents:

{
  "name": "Samuel Squire",
  "job": {
    "currentJob": {"company": {"employeeCount": 2500}}
  }
}

{
  "_id": "1",
  "name": "Samuel Squire",
  "hobbies": [
    {"name": "God"}, {"name": "databases"}, {"name": "multicomputer systems"}
  ]
}

are turned into at least the following key-value objects:

0.0 = "Samuel Squire"
0.1.0.0.0 = "2500"

0.0 = "Samuel Squire"
0.1.0.0 = "God"
0.1.1.0 = "databases"
0.1.2.0 = "multicomputer systems"

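Essentially, the document is flattened into a set of keys. Field names are replaced by numeric positions, and the type and field name at each position are recorded in separate entries: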

"type people": "object",
"type people.*": "list",
"type people.*.0": "string",
"type people.*.1": "list",
"type people.*.1.*.0": "string",
"type people.*.1.*": "object",
"type people.*.2": "object",
"type people.*.2.0": "object",
"type people.*.2.0.0": "object",
"type people.*.2.0.0.0": "number",
"field people.*": "LIST",
"field people.*.1.*": "LIST",
"field people.*.0": "name",
"field people.*.1": "hobbies",
"field people.*.1.*.0": "name",
"field people.*.2": "job",
"field people.*.2.0": "currentJob",
"field people.*.2.0.0": "company",
"field people.*.2.0.0.0": "employeeCount",
"field people": "people",
"field people.*": "LIST",
"field people.*.3": "words",
"field people.*.3.*": "LIST",
"field people.*.3.*.*": "LIST",
"field people.*.3.*.*.*": "LIST",
"type people.*.3": "list",
"type people.*.3.*": "list",
"type people.*.3.*.*": "list",
"type people.*.3.*.*.*": "list",
"type people.*.3.*.*.*.*": "number"

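One possible way to produce keys and schema entries of this kind is sketched below. It is a simplified variant under stated assumptions, not the project's actual encoder: field positions are assigned in order of first appearance, list items use their index, every list index is written as `*` in the schema path, and the `flatten` function and its parameters are hypothetical.

```python
def flatten(doc, index=0, root="people"):
    """Flatten one JSON-like document into numeric-path keys.

    Returns (pairs, fields, types):
      pairs  maps keys such as "0.1.0.0" to leaf values,
      fields maps a schema path to the field name stored at that position,
      types  maps a schema path to "object", "list", "string", "number" or "boolean".
    """
    pairs, fields, types = {}, {}, {}
    positions = {}  # schema path -> {field name: numeric position}

    def type_name(value):
        if isinstance(value, dict):
            return "object"
        if isinstance(value, list):
            return "list"
        if isinstance(value, bool):
            return "boolean"
        if isinstance(value, (int, float)):
            return "number"
        return "string"

    def walk(value, key, schema):
        types[schema] = type_name(value)
        if isinstance(value, dict):
            siblings = positions.setdefault(schema, {})
            for name, child in value.items():
                # Reuse the position of a known field, else take the next free one.
                pos = siblings.setdefault(name, len(siblings))
                fields[f"{schema}.{pos}"] = name
                walk(child, f"{key}.{pos}", f"{schema}.{pos}")
        elif isinstance(value, list):
            for i, child in enumerate(value):
                walk(child, f"{key}.{i}", f"{schema}.*")
        else:
            pairs[key] = value  # leaf: store the value under its full numeric path

    walk(doc, str(index), f"{root}.*")
    return pairs, fields, types


doc = {
    "name": "Samuel Squire",
    "job": {"currentJob": {"company": {"employeeCount": 2500}}},
}
pairs, fields, types = flatten(doc)
```

In this sketch the position table lives only for one call; a real store would persist it so that the same field gets the same number in every document.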


    : Mindey

Your idea of synchronizing a SQL database with a document store is similar to my thought of synchronizing a SQL database with DynamoDB, which is a fast key-value store.

I want the best of NoSQL performance but the power of SQL joins.



    : Mindey