MongoDB schema design.

Embed vs reference

Embed when:

  • The relationship is one-to-few (roughly ≤100 sub-documents).
  • Parent and sub-documents are always read together.
  • The sub-document doesn’t outlive its parent.

{
    _id: ObjectId(),
    name: "Alice",
    addresses: [
        { type: "home", street: "..." },
        { type: "work", street: "..." },
    ],
}
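
Because the addresses live inside the user document, changing one of them is a single-document update. A minimal sketch, assuming the document shape above; the helper name setStreetUpdate is hypothetical and it only builds the arguments you would pass to updateOne:

```javascript
// Hypothetical helper: builds the filter and update for changing one embedded address.
// The positional operator `$` targets the first array element matched by the filter.
function setStreetUpdate(userId, addressType, street) {
    return {
        filter: { _id: userId, "addresses.type": addressType },
        update: { $set: { "addresses.$.street": street } },
    };
}

// In mongosh (assumed usage):
//   const { filter, update } = setStreetUpdate(userId, "home", "1 Main St");
//   db.users.updateOne(filter, update);
```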

Reference when:

  • The relationship is many-to-many.
  • Sub-documents grow without bound.
  • Sub-documents have an independent lifecycle.

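// users (the ObjectId values are placeholder 24-character hex strings)
{ _id: ObjectId("65a000000000000000000001"), name: "Alice" }

// posts (author_id references users._id)
{ _id: ObjectId("65a000000000000000000002"), author_id: ObjectId("65a000000000000000000001"), title: "..." }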
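
Reading referenced data back takes either a second query or a $lookup. The sketch below builds a $lookup pipeline for the users/posts shape above; the helper name postsWithAuthorPipeline is hypothetical:

```javascript
// Hypothetical helper: builds an aggregation pipeline that returns one author's
// posts with the matching user document attached as `author`.
function postsWithAuthorPipeline(authorId) {
    return [
        { $match: { author_id: authorId } },
        {
            $lookup: {
                from: "users",            // collection to join against
                localField: "author_id",  // field on posts
                foreignField: "_id",      // field on users
                as: "author",             // output array field
            },
        },
        // $lookup yields an array; flatten it. Posts with no matching user are dropped.
        { $unwind: "$author" },
    ];
}

// In mongosh (assumed usage): db.posts.aggregate(postsWithAuthorPipeline(authorId))
```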

Hybrid (cached fields)

Keep the reference, but also copy frequently read fields into the referencing document:

// post with cached author name
{ _id: ..., title: ..., author: { id: ..., name: "Alice" } }

Re-sync the cached name whenever the author is renamed.
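
A rough sketch of that re-sync, assuming the post shape above and a db object that behaves like the one in mongosh (synchronous updateOne/updateMany). The helper name renameAuthor is hypothetical:

```javascript
// Hypothetical helper. `db` is assumed to expose `users` and `posts` collections
// with updateOne/updateMany, as the mongosh `db` object does.
function renameAuthor(db, authorId, newName) {
    // 1. Update the source of truth.
    db.users.updateOne({ _id: authorId }, { $set: { name: newName } });
    // 2. Refresh every cached copy embedded in posts.
    return db.posts.updateMany(
        { "author.id": authorId },
        { $set: { "author.name": newName } }
    );
}
```

The two writes are not atomic, so readers can briefly see the old cached name. An index on author.id keeps the second update from scanning the whole collection.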

16MB doc limit

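A single BSON document cannot exceed 16 MB. If a document is approaching that size, split it into several documents or move the growing part to its own collection.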
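
To spot documents drifting toward the limit, one option is the $bsonSize aggregation operator (MongoDB 4.4+). A minimal sketch; the helper name largeDocsPipeline and the 8 MiB threshold are assumptions for illustration:

```javascript
// Hypothetical helper: pipeline listing documents larger than `thresholdBytes`,
// biggest first. "$$ROOT" refers to the whole document being processed.
function largeDocsPipeline(thresholdBytes) {
    return [
        { $project: { size: { $bsonSize: "$$ROOT" } } },
        { $match: { size: { $gt: thresholdBytes } } },
        { $sort: { size: -1 } },
    ];
}

// Assumed threshold: 8 MiB, half of the 16 MiB cap.
const HALF_LIMIT = 8 * 1024 * 1024;

// In mongosh (assumed usage): db.users.aggregate(largeDocsPipeline(HALF_LIMIT))
```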

Array size

Avoid unbounded arrays. Cap them with $slice, or move the items to a separate collection.

db.feeds.updateOne(
    { _id: user_id },
    { $push: { items: { $each: [new_item], $slice: -100 } } }
)

This keeps only the last 100 items.
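
To make the $slice semantics concrete, here is a plain-JavaScript model of what the server does to the array. It is only an illustration (the function name pushWithSlice is hypothetical); the real operator applies the push and the trim atomically:

```javascript
// Plain-JavaScript model of { $push: { items: { $each: newItems, $slice: n } } }.
function pushWithSlice(items, newItems, n) {
    const combined = items.concat(newItems);
    if (n === 0) return [];                // $slice: 0 empties the array
    if (n < 0) return combined.slice(n);   // negative: keep the last |n| elements
    return combined.slice(0, n);           // positive: keep the first n elements
}

const feed = pushWithSlice([1, 2, 3], [4, 5], -3);
console.log(feed); // [ 3, 4, 5 ]
```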

Indexing-aware schema

If you query by user_id, store it as its own top-level field in the document; don’t bury it inside a nested structure where it is easy to overlook when indexing.
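
A small sketch of a shape, index, and query that line up. The events collection, its fields other than user_id, and the helper name recentEventsQuery are all assumptions for illustration:

```javascript
// Hypothetical `events` document: user_id is a top-level field.
const event = {
    user_id: "u1",
    type: "login",
    created_at: new Date("2026-01-15T10:00:00Z"),
};

// Index spec that matches "latest events for one user".
const indexSpec = { user_id: 1, created_at: -1 };

// Filter and sort that the index above can serve.
function recentEventsQuery(userId) {
    return { filter: { user_id: userId }, sort: { created_at: -1 } };
}

// In mongosh (assumed usage):
//   db.events.createIndex(indexSpec);
//   const q = recentEventsQuery("u1");
//   db.events.find(q.filter).sort(q.sort).limit(20);
```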

Polymorphic

{ type: "image", url: "..." }
{ type: "video", url: "...", duration: 120 }

Use a discriminator field (here, type) to tell the shapes apart, and add validation:

db.createCollection("media", {
    validator: {
        $jsonSchema: {
            bsonType: "object",
            required: ["type", "url"],
            properties: {
                type: { enum: ["image", "video"] },
                duration: { bsonType: "number" },
            },
        },
    },
})
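
The validator above still accepts a video without a duration. One way to tighten it, sketched here under the assumption that only videos must carry duration, is a oneOf with one branch per type value. The name mediaSchema is hypothetical:

```javascript
// Hypothetical stricter schema: exactly one branch must match, and the video
// branch additionally requires a numeric duration.
const mediaSchema = {
    bsonType: "object",
    required: ["type", "url"],
    oneOf: [
        { properties: { type: { enum: ["image"] } } },
        {
            properties: {
                type: { enum: ["video"] },
                duration: { bsonType: "number" },
            },
            required: ["duration"],
        },
    ],
};

// In mongosh (assumed usage):
//   db.createCollection("media", { validator: { $jsonSchema: mediaSchema } });
```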

Bucketing (time-series)

{
    sensor_id: ...,
    bucket_start: ISODate("2026-01-15T10:00:00Z"),
    measurements: [
        { ts: ..., temp: 20 },
        { ts: ..., temp: 21 },
        ...
    ],
    count: 600,
}

Group a fixed window of readings per sensor (for example one minute or one hour) into a single document.
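
A minimal sketch of writing into hourly buckets with an upsert, assuming the document shape above. The helper names bucketStart and bucketUpsert are hypothetical, and the sketch does not cap the number of measurements per bucket:

```javascript
const HOUR_MS = 60 * 60 * 1000;

// Floor a timestamp to the start of its bucket (UTC).
function bucketStart(ts, bucketMs) {
    return new Date(Math.floor(ts.getTime() / bucketMs) * bucketMs);
}

// Hypothetical helper: arguments for an upsert that appends one measurement
// to the sensor's hourly bucket, creating the bucket document if it is missing.
function bucketUpsert(sensorId, ts, temp) {
    return {
        filter: { sensor_id: sensorId, bucket_start: bucketStart(ts, HOUR_MS) },
        update: { $push: { measurements: { ts, temp } }, $inc: { count: 1 } },
        options: { upsert: true },
    };
}

// In mongosh (assumed usage, with a hypothetical `readings` collection):
//   const { filter, update, options } = bucketUpsert("s1", new Date(), 20);
//   db.readings.updateOne(filter, update, options);
```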

Better: use native time-series collections (MongoDB 5.0+), which do the bucketing for you.

db.createCollection("metrics", {
    timeseries: {
        timeField: "ts",
        metaField: "sensor_id",
        granularity: "minutes",
    },
})
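
Queries against a time-series collection look like queries against any other collection. As a sketch, the pipeline below computes hourly averages for one sensor; it assumes each measurement has a temp field (borrowed from the bucketing example) and uses $dateTrunc, which needs MongoDB 5.0+. The helper name hourlyAvgPipeline is hypothetical:

```javascript
// Hypothetical helper: average `temp` per hour for one sensor in [from, to).
function hourlyAvgPipeline(sensorId, from, to) {
    return [
        { $match: { sensor_id: sensorId, ts: { $gte: from, $lt: to } } },
        {
            $group: {
                _id: { $dateTrunc: { date: "$ts", unit: "hour" } },
                avgTemp: { $avg: "$temp" },
            },
        },
        { $sort: { _id: 1 } },  // oldest hour first
    ];
}

// In mongosh (assumed usage):
//   db.metrics.aggregate(hourlyAvgPipeline(sensorId, startDate, endDate));
```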

Anti-patterns

  • A single huge doc with millions of array entries.
  • A “User” collection with 30 different shapes.
  • Embedding the world (too much denormalization turns every update into a nightmare).
  • Referencing everything (every read becomes a $lookup join, which is slow).
  • Storing booleans as strings (“true”) instead of real booleans.

Validators

db.runCommand({
    collMod: "users",
    validator: { $jsonSchema: { required: ["email"] } },
    validationLevel: "strict",
    validationAction: "error",
})
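
With validationLevel: "strict", documents that were already invalid are only rejected the next time they are updated, so it can help to look for them first. A minimal sketch using $nor with $jsonSchema; the helper name invalidDocsFilter is hypothetical:

```javascript
// Hypothetical helper: filter matching documents that do NOT satisfy the schema.
function invalidDocsFilter(schema) {
    return { $nor: [{ $jsonSchema: schema }] };
}

const userSchema = { required: ["email"] };

// In mongosh (assumed usage): db.users.find(invalidDocsFilter(userSchema))
```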

Sparse fields

Omit optional fields instead of storing null or empty placeholders; an absent field takes no space in the document.
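
On the application side this usually means stripping empty values before inserting. A small sketch; the helper name omitEmpty is hypothetical and it only looks at top-level fields:

```javascript
// Hypothetical helper: drop top-level fields whose value is null or undefined.
function omitEmpty(doc) {
    return Object.fromEntries(
        Object.entries(doc).filter(([, value]) => value !== undefined && value !== null)
    );
}

const user = omitEmpty({ name: "Alice", nickname: null, phone: undefined });
console.log(user); // { name: 'Alice' }

// In mongosh (assumed usage): db.users.insertOne(user)
// Documents without the field can then be found with { nickname: { $exists: false } }.
```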

Common mistakes

  • Treating $lookup as the default join (slow at scale).
  • Ignoring the 16 MB limit until it bites.
  • Overlooking the built-in _id index (_id is always indexed by default, so don’t add another).
  • Mixing string and ObjectId values in the same field.
  • Using $push for set-like arrays (no uniqueness; use $addToSet).
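
For the last point, a sketch of an update that adds values only if they are not already present. The tags field and the helper name addTagsUpdate are assumptions for illustration:

```javascript
// Hypothetical helper: update that adds tags without inserting duplicates.
// $addToSet skips values already in the array; it does not remove duplicates
// that were stored earlier by other means.
function addTagsUpdate(tags) {
    return { $addToSet: { tags: { $each: tags } } };
}

// In mongosh (assumed usage):
//   db.posts.updateOne({ _id: postId }, addTagsUpdate(["mongodb", "schema"]));
```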

Read this next

If you want my schema patterns, they’re at rajpoot.dev.

