MongoDB schema design.
Embed vs reference
Embed when:
- 1-to-few (≤100).
- Read together always.
- Sub-doc doesn’t outlive parent.
{
_id: ObjectId(),
name: "Alice",
addresses: [
{ type: "home", street: "..." },
{ type: "work", street: "..." },
],
}
Reference when:
- Many-to-many.
- Sub-docs grow unbounded.
- Independent lifecycle.
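// users
{ _id: ObjectId("65a000000000000000000001"), name: "Alice" }
// posts (author_id holds the _id of the user above; ObjectId needs a 24-char hex string)
{ _id: ObjectId("65a000000000000000000002"), author_id: ObjectId("65a000000000000000000001"), title: "..." }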
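With references, reading a post together with its author means resolving the reference at query time. A minimal sketch using `$lookup`, assuming the collections are named `users` and `posts` as in the comments above; the variable name `postsWithAuthor` is only illustrative.

```javascript
// Aggregation pipeline that resolves each post's author_id reference.
const postsWithAuthor = [
  {
    $lookup: {
      from: "users",           // collection holding the referenced docs
      localField: "author_id", // field on the post
      foreignField: "_id",     // field on the user
      as: "author",            // output array field
    },
  },
  // $lookup always produces an array; $unwind turns it into a single
  // sub-document (and drops posts whose author no longer exists).
  { $unwind: "$author" },
];

// In mongosh: db.posts.aggregate(postsWithAuthor)
```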
Hybrid (cached fields)
Store frequently-read fields denormalized:
// post with cached author name
{ _id: ..., title: ..., author: { id: ..., name: "Alice" } }
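Re-sync on author rename: the users collection stays the source of truth, and every post carrying the cached name must be updated too.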
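The re-sync can be sketched as two updates. This assumes a mongosh-style `db` handle, a `users` collection as the source of truth, and posts shaped like the example above (`author.id`, `author.name`); the function name `renameAuthor` is hypothetical.

```javascript
// Rename an author and refresh the cached copy on their posts.
// `db` is assumed to be a mongosh-style database handle (synchronous calls).
function renameAuthor(db, authorId, newName) {
  // 1. Update the source of truth.
  db.users.updateOne({ _id: authorId }, { $set: { name: newName } });

  // 2. Update the denormalized name on every post by that author.
  return db.posts.updateMany(
    { "author.id": authorId },
    { $set: { "author.name": newName } }
  );
}
```

The two writes are not atomic as written; readers may briefly see the old cached name, which is usually the accepted trade-off for this pattern.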
16MB doc limit
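A single document is capped at 16MB (the BSON document limit). If a doc is approaching that, split it into several docs or move the growing part to its own collection.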
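One way to spot documents drifting toward the limit is the `$bsonSize` aggregation operator (available from MongoDB 4.4). A rough sketch; the half-of-limit warning threshold and the `feeds` collection in the comment are assumptions for illustration.

```javascript
// Hard limit for one BSON document: 16 MiB.
const LIMIT_BYTES = 16 * 1024 * 1024;
// Hypothetical warning threshold: flag docs at half the limit.
const WARN_BYTES = LIMIT_BYTES / 2;

// Pipeline listing the largest documents above the threshold.
const oversizedDocs = [
  { $project: { size: { $bsonSize: "$$ROOT" } } }, // size of the whole doc in bytes
  { $match: { size: { $gte: WARN_BYTES } } },
  { $sort: { size: -1 } },
];

// In mongosh: db.feeds.aggregate(oversizedDocs)
```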
Array size
Avoid unbounded arrays. Cap with $slice or move to separate collection.
db.feeds.updateOne(
{ _id: user_id },
{ $push: { items: { $each: [new_item], $slice: -100 } } }
)
Keeps last 100.
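To see why a negative `$slice` keeps the newest entries, here is a small in-memory model of the update above. It is not MongoDB code, only plain JavaScript mirroring the documented `$each` + `$slice` semantics; `pushWithSlice` is a made-up name.

```javascript
// In-memory model of { $push: { items: { $each: newItems, $slice: n } } }.
function pushWithSlice(items, newItems, slice) {
  const combined = items.concat(newItems); // $each appends in order
  if (slice === 0) return [];              // $slice: 0 empties the array
  return slice < 0
    ? combined.slice(slice)     // negative: keep the last |slice| elements
    : combined.slice(0, slice); // positive: keep the first `slice` elements
}

const feed = Array.from({ length: 100 }, (_, i) => i + 1); // 1..100
const updated = pushWithSlice(feed, [101], -100);
console.log(updated.length, updated[0], updated[99]); // 100 2 101
```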
Indexing-aware schema
If you query by user_id, store it in the doc (don’t embed inside another struct that hides it).
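A small illustration of the difference. The `events` collection and the serialized `payload` field are assumptions chosen to show one way a field can end up hidden; a field nested in an ordinary sub-document can still be indexed with dot notation.

```javascript
// Same data, two shapes (hypothetical "events" documents).

// Hard to index: user_id is buried inside an opaque serialized blob.
const hidden = { payload: JSON.stringify({ user_id: "u1", action: "login" }) };

// Index-friendly: user_id is a real top-level field.
const queryable = { user_id: "u1", action: "login" };

// Index spec and query that only the second shape supports.
const indexSpec = { user_id: 1 };
const query = { user_id: "u1" };

// In mongosh:
//   db.events.createIndex(indexSpec)
//   db.events.find(query)
```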
Polymorphic
{ type: "image", url: "..." }
{ type: "video", url: "...", duration: 120 }
Discriminator field. Add validation:
db.createCollection("media", {
validator: {
$jsonSchema: {
bsonType: "object",
required: ["type", "url"],
properties: {
type: { enum: ["image", "video"] },
duration: { bsonType: "number" },
},
},
},
})
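On the read side, application code typically branches on the discriminator. A minimal sketch using the two shapes above; `describeMedia` is a hypothetical helper, not part of any driver.

```javascript
// Dispatch on the `type` discriminator of a media document.
function describeMedia(doc) {
  switch (doc.type) {
    case "image":
      return `image at ${doc.url}`;
    case "video":
      return `video at ${doc.url} (${doc.duration}s)`;
    default:
      // Fail loudly on shapes the code does not know about.
      throw new Error(`unknown media type: ${doc.type}`);
  }
}

console.log(describeMedia({ type: "video", url: "...", duration: 120 }));
// video at ... (120s)
```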
Bucketing (time-series)
{
sensor_id: ...,
bucket_start: ISODate("2026-01-15T10:00:00Z"),
measurements: [
{ ts: ..., temp: 20 },
{ ts: ..., temp: 21 },
...
],
count: 600,
}
Group a fixed window of readings (e.g. one minute or one hour) into one doc.
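Writing into a bucket is usually an upsert keyed on the sensor and the window start. A sketch assuming one-hour buckets and the field names from the example above; `bucketStart`, `bucketUpsert`, and the `readings` collection are hypothetical names.

```javascript
// Truncate a timestamp to the start of its UTC hour.
function bucketStart(ts) {
  const d = new Date(ts);
  d.setUTCMinutes(0, 0, 0); // zero minutes, seconds, milliseconds
  return d;
}

// Build the filter/update pair that appends one measurement to its bucket.
function bucketUpsert(sensorId, ts, temp) {
  return {
    filter: { sensor_id: sensorId, bucket_start: bucketStart(ts) },
    update: {
      $push: { measurements: { ts, temp } },
      $inc: { count: 1 },
    },
    options: { upsert: true }, // create the bucket on first write
  };
}

const op = bucketUpsert("s1", new Date("2026-01-15T10:37:12Z"), 20);
// In mongosh: db.readings.updateOne(op.filter, op.update, op.options)
```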
Better: use native time-series collections (MongoDB 5.0+), which do the bucketing for you.
db.createCollection("metrics", {
timeseries: {
timeField: "ts",
metaField: "sensor_id",
granularity: "minutes",
},
})
Anti-patterns
- Single huge doc with millions of array entries.
- “User” collection with 30 different shapes.
- Embedding the world (denormalize too much, update nightmare).
- All references (joins via $lookup are slow).
- Storing booleans as strings (“true”).
Validators
db.runCommand({
collMod: "users",
validator: { $jsonSchema: { required: ["email"] } },
validationLevel: "strict",
validationAction: "error",
})
Sparse fields
Optional fields: just omit them; an absent field takes no space in the doc.
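In application code this often means stripping empty values before the insert. A small sketch; treating `null` the same as `undefined` is an assumption here (a stored `null` is a real field and does take space), and `omitEmpty` is a made-up helper.

```javascript
// Drop keys whose value is undefined or null so they are never stored.
function omitEmpty(doc) {
  return Object.fromEntries(
    Object.entries(doc).filter(([, v]) => v !== undefined && v !== null)
  );
}

const user = omitEmpty({ name: "Alice", nickname: null, phone: undefined });
console.log(Object.keys(user).join(", ")); // name
```

Queries can then tell the two cases apart with `{ nickname: { $exists: false } }`.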
Common mistakes
- $lookup as default join (slow at scale).
- 16MB limit ignored until it bites.
- No _id index considered (always indexed by default).
- Mixing string + ObjectId for same field.
- Using arrays for sets (no uniqueness; use $addToSet).
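For the last point, a plain-JavaScript model of what `$addToSet` does compared with `$push`. It handles primitive values only (MongoDB also compares whole sub-documents); the `articles` collection and `articleId` in the comment are hypothetical.

```javascript
// In-memory model of $addToSet with $each, for primitive values only.
function addToSet(arr, values) {
  const out = arr.slice();
  for (const v of values) {
    if (!out.includes(v)) out.push(v); // skip values already present
  }
  return out;
}

const tags = addToSet(["mongodb", "schema"], ["schema", "indexing"]);
console.log(tags.join(", ")); // mongodb, schema, indexing

// Equivalent update in mongosh:
//   db.articles.updateOne(
//     { _id: articleId },
//     { $addToSet: { tags: { $each: ["schema", "indexing"] } } }
//   )
```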