MongoDB Cheatsheet 02 — Schema Design

MongoDB schema design.

Embed vs reference

Embed when:

One-to-few (roughly ≤100 sub-docs).
Parent and sub-docs are always read together.
Sub-doc doesn’t outlive its parent.

{
    _id: ObjectId(),
    name: "Alice",
    addresses: [
        { type: "home", street: "..." },
        { type: "work", street: "..." },
    ],
}

Reference when:

Many-to-many.
Sub-docs grow unbounded.
Independent lifecycle.

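// users (ids are 24-hex placeholders)
{ _id: ObjectId("64b000000000000000000001"), name: "Alice" }

// posts (author_id points at users._id)
{ _id: ObjectId("64b0000000000000000000a1"), author_id: ObjectId("64b000000000000000000001"), title: "..." }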
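Reading across a reference takes either a second query or a $lookup. A minimal sketch of the latter, assuming the collections are named users and posts as in the comments above (the variable name is illustrative):

```javascript
// Join each post to its author document via the author_id reference.
const postsWithAuthor = [
    { $lookup: { from: "users", localField: "author_id", foreignField: "_id", as: "author" } },
    // $lookup yields an array; flatten it (posts with no matching user are dropped).
    { $unwind: "$author" },
];

// Runs only inside mongosh, where the global `db` exists.
if (typeof db !== "undefined") {
    printjson(db.posts.aggregate(postsWithAuthor).toArray());
}
```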

Hybrid (cached fields)

Store frequently-read fields denormalized:

// post with cached author name
{ _id: ..., title: ..., author: { id: ..., name: "Alice" } }

The cached copy goes stale when the author is renamed, so re-sync every post that embeds it.
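One way the re-sync could look, as a sketch: renameAuthor is a hypothetical helper, db is the mongosh database handle passed in explicitly, and the users and posts collection names are assumed from the earlier examples.

```javascript
// Hypothetical helper: rename a user, then refresh the cached name on their posts.
function renameAuthor(db, authorId, newName) {
    db.users.updateOne({ _id: authorId }, { $set: { name: newName } });
    // Dot notation reaches into the embedded author sub-document.
    return db.posts.updateMany(
        { "author.id": authorId },
        { $set: { "author.name": newName } }
    );
}
```

In mongosh this would be called as renameAuthor(db, someUserId, "Alicia"). The two writes are not atomic, so readers may briefly see the old name unless both run inside a transaction.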

16MB doc limit

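A single BSON document is capped at 16 MB. If a document is approaching that, split it: move the growing part into its own collection, or use GridFS for large binary payloads.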
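To spot documents drifting toward the limit, something like the following could help. It is a sketch that assumes MongoDB 4.4+ (for $bsonSize) and borrows the feeds collection name from the array example in this sheet; the pipeline scans the whole collection, so it suits occasional checks rather than hot paths.

```javascript
// Rank documents by BSON size so oversized ones surface before they hit the limit.
const largestDocs = [
    { $project: { bytes: { $bsonSize: "$$ROOT" } } },
    { $sort: { bytes: -1 } },
    { $limit: 5 },
];

const MAX_BSON_BYTES = 16 * 1024 * 1024; // the 16 MB document limit, in bytes

// Runs only inside mongosh, where the global `db` exists.
if (typeof db !== "undefined") {
    db.feeds.aggregate(largestDocs).forEach((d) =>
        print(`${d._id}: ${((100 * d.bytes) / MAX_BSON_BYTES).toFixed(1)}% of limit`)
    );
}
```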

Array size

Avoid unbounded arrays. Cap them with $slice on $push, or move the items to a separate collection.

db.feeds.updateOne(
    { _id: user_id },
    { $push: { items: { $each: [new_item], $slice: -100 } } }
)

The negative $slice keeps only the last 100 items after each push.

Indexing-aware schema

If you query by user_id, store it as a plain field in the doc that an index can target directly (don’t bury it inside another structure that hides it).
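A small illustration of the difference, under assumed names (the orders collection and the attrs, status fields are hypothetical):

```javascript
// Buried: user_id sits in a generic key/value array, so queries need
// $elemMatch on attrs and a multikey index instead of a simple field match.
const hidden = { _id: 1, attrs: [{ k: "user_id", v: 42 }, { k: "status", v: "open" }] };

// Visible: user_id is a plain top-level field.
const visible = { _id: 1, user_id: 42, status: "open" };

const userIndex = { user_id: 1 }; // ascending single-field index spec

// Runs only inside mongosh, where the global `db` exists.
if (typeof db !== "undefined") {
    db.orders.createIndex(userIndex);
    printjson(db.orders.find({ user_id: 42 }).toArray());
}
```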

Polymorphic

{ type: "image", url: "..." }
{ type: "video", url: "...", duration: 120 }

Use a discriminator field (type here) to tell the shapes apart, and validate it:

db.createCollection("media", {
    validator: {
        $jsonSchema: {
            bsonType: "object",
            required: ["type", "url"],
            properties: {
                type: { enum: ["image", "video"] },
                duration: { bsonType: "number" },
            },
        },
    },
})
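The validator above types duration but never requires it. If videos must carry one, a oneOf branch per shape is one option. This is a sketch of that variant, not a required form; mediaSchema is an illustrative name.

```javascript
// Variant of the schema above: duration becomes mandatory for videos only.
const mediaSchema = {
    bsonType: "object",
    required: ["type", "url"],
    properties: {
        type: { enum: ["image", "video"] },
        duration: { bsonType: "number" },
    },
    // Exactly one branch must match, selected by the discriminator value.
    oneOf: [
        { properties: { type: { enum: ["image"] } } },
        { properties: { type: { enum: ["video"] } }, required: ["duration"] },
    ],
};

// Runs only inside mongosh; applies the schema to the existing media collection.
if (typeof db !== "undefined") {
    db.runCommand({ collMod: "media", validator: { $jsonSchema: mediaSchema } });
}
```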

Bucketing (time-series)

{
    sensor_id: ...,
    bucket_start: ISODate("2026-01-15T10:00:00Z"),
    measurements: [
        { ts: ..., temp: 20 },
        { ts: ..., temp: 21 },
        ...
    ],
    count: 600,
}

Group a fixed window of readings (e.g. one minute or one hour) into one doc.
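Writing into such buckets is usually a single upsert per reading. A sketch for hourly buckets, where recordReading and the readings collection are hypothetical names and db is the mongosh database handle passed in:

```javascript
// Hypothetical helper: append one reading to its hourly bucket,
// creating the bucket document on the first write of that hour.
function recordReading(db, sensorId, ts, temp) {
    const bucketStart = new Date(ts); // copy, so the caller's Date is untouched
    bucketStart.setUTCMinutes(0, 0, 0); // truncate to the start of the hour (UTC)
    return db.readings.updateOne(
        { sensor_id: sensorId, bucket_start: bucketStart },
        { $push: { measurements: { ts, temp } }, $inc: { count: 1 } },
        { upsert: true }
    );
}
```

In mongosh: recordReading(db, sensorId, new Date(), 20).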

Better: native time-series collections (MongoDB 5.0+), which handle the bucketing for you.

db.createCollection("metrics", {
    timeseries: {
        timeField: "ts",
        metaField: "sensor_id",
        granularity: "minutes",
    },
})

Anti-patterns

Single huge doc with millions of array entries.
“User” collection with 30 different shapes.
Embedding the world (denormalize too much, update nightmare).
Referencing everything, relational-style (every read needs a $lookup, which is slow at scale).
Storing booleans as strings (“true”).

Validators

db.runCommand({
    collMod: "users",
    validator: { $jsonSchema: { required: ["email"] } },
    validationLevel: "strict",
    validationAction: "error",
})

Sparse fields

Optional fields: just omit them; an absent field takes no space in the doc.
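A short sketch of how omitted fields behave in queries and indexes. The nickname field is hypothetical and the users collection name is taken from the validator example.

```javascript
const withNickname = { _id: 1, name: "Alice", nickname: "Al" };
const withoutNickname = { _id: 2, name: "Bob" }; // field omitted, not set to null

// Matches only docs that lack the field entirely.
const missingNickname = { nickname: { $exists: false } };

// A sparse index stores entries only for docs that have the field.
const sparseIndex = [{ nickname: 1 }, { sparse: true }];

// Runs only inside mongosh, where the global `db` exists.
if (typeof db !== "undefined") {
    db.users.createIndex(...sparseIndex);
    printjson(db.users.find(missingNickname).toArray());
}
```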

Common mistakes

$lookup as default join (slow at scale).
16MB limit ignored until it bites.
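Forgetting the built-in _id index (_id is always indexed, so a second index on it is redundant).
Mixing string and ObjectId values for the same field (queries match by type, so lookups silently miss).
Using $push on arrays meant to be sets (no uniqueness; use $addToSet).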
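For the set case, a minimal sketch of the $addToSet form. addTags and the tags field are hypothetical, and db is the mongosh database handle passed in:

```javascript
// Hypothetical helper: add tags to a set-like array without creating duplicates.
function addTags(db, postId, tags) {
    return db.posts.updateOne(
        { _id: postId },
        // $each adds every element individually; values already present are skipped.
        { $addToSet: { tags: { $each: tags } } }
    );
}
```

Note that $addToSet only prevents new duplicates; it does not clean up duplicates already stored in the array.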
