
Bulk lead ingestion

Problem: Every org that onboards has an existing contact base (CSV, CRM export, spreadsheet). These contacts need to exist as leads in the system before they can receive personalized dispatches. The current architecture handles organic lead creation (one at a time via events), but not bulk import (thousands or millions at once).

This also affects campaign dispatch: if an org wants to dispatch to 300k contacts that don’t exist as leads yet, those leads need to be created first. Creating them one by one via the API is not viable at scale.

How leads enter the system today:

| Method | Volume | Solved? |
| --- | --- | --- |
| Organic (lead sends a message) | 1 at a time, realtime | Yes: Event Ingester creates automatically |
| Webhook (purchase, form submission) | 1 at a time, realtime | Yes: Event Ingester creates automatically |
| Bulk import (CSV, CRM migration) | Thousands/millions, batch | Not yet |
Options under consideration:

A) Batch import endpoint on the Lead API
POST /organizations/:orgId/leads/import
{ "s3Key": "imports/org-abc/contacts.csv" }
→ Returns jobId, processes asynchronously
→ Lambda reads CSV from S3, bulk inserts Lead + ChannelIdentity + LeadOrganization
→ Notifies when complete
  • Lead domain owns the import pipeline
  • Dispatch waits for import to complete before sending
  • Clean separation: Lead domain manages leads, Messaging dispatches
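As a rough illustration of Option A, the sketch below models the asynchronous flow with in-memory stand-ins: a request registers a job and returns its jobId, and a separate worker step reads the CSV and creates Lead, ChannelIdentity, and LeadOrganization records. Everything named here is an assumption made for the example and not part of the design: the `name` and `phone` CSV columns, the `whatsapp` default channel, the job status values, and the `FAKE_S3` dictionary standing in for the real bucket.

```python
import csv
import io
import uuid
from dataclasses import dataclass

# In-memory stand-ins (assumptions): a real implementation would use S3 and a database.
FAKE_S3 = {}             # s3 key -> CSV text
LEADS = {}               # lead_id -> profile data
CHANNEL_IDENTITIES = {}  # (channel, channelIdentifier) -> lead_id
LEAD_ORGANIZATIONS = set()  # (lead_id, org_id) pairs
JOBS = {}                # job_id -> ImportJob


@dataclass
class ImportJob:
    job_id: str
    org_id: str
    s3_key: str
    status: str = "pending"  # hypothetical states: pending -> processing -> completed
    created: int = 0
    skipped: int = 0


def start_import(org_id, s3_key):
    """Handle the POST request: register the job and return its id immediately."""
    job = ImportJob(job_id=str(uuid.uuid4()), org_id=org_id, s3_key=s3_key)
    JOBS[job.job_id] = job
    return job.job_id


def process_import(job_id, channel="whatsapp"):
    """Worker step: read the CSV and create any leads that do not exist yet."""
    job = JOBS[job_id]
    job.status = "processing"
    reader = csv.DictReader(io.StringIO(FAKE_S3[job.s3_key]))
    for row in reader:
        key = (channel, row["phone"].strip())
        lead_id = CHANNEL_IDENTITIES.get(key)
        if lead_id is None:
            lead_id = str(uuid.uuid4())
            LEADS[lead_id] = {"name": row["name"].strip()}
            CHANNEL_IDENTITIES[key] = lead_id
            job.created += 1
        else:
            job.skipped += 1  # identity already known, reuse the existing lead
        LEAD_ORGANIZATIONS.add((lead_id, job.org_id))
    job.status = "completed"  # the point at which dispatch could be notified
    return job
```

In this sketch, dispatch would poll or be notified on the `completed` status before sending, which mirrors the "Dispatch waits for import to complete" bullet above.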
B) Resolve-or-create batch endpoint
POST /organizations/:orgId/leads/resolve-batch
[
  { "channel": "whatsapp", "channelIdentifier": "+5511999", "name": "Joao" },
  { "channel": "whatsapp", "channelIdentifier": "+5511888", "name": "Maria" },
  ...
]
→ Returns existing leads if found, creates new ones if not
→ Idempotent — safe to call multiple times
  • Handles both new and existing leads uniformly
  • Messaging calls this before dispatching
  • Simpler contract, but requires batching logic on the caller side
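A minimal sketch of the resolve-or-create semantics in Option B, assuming leads are looked up by the (channel, channelIdentifier) pair. The in-memory store, the `created` flag in the response, and the `chunked` helper for caller-side batching are all hypothetical names introduced for illustration; the real response shape is not defined yet.

```python
import uuid
from itertools import islice

# In-memory stand-in (assumption): (channel, channelIdentifier) -> lead record.
_identities = {}


def resolve_batch(org_id, contacts):
    """Return one result per contact, creating a lead only when none exists."""
    results = []
    for contact in contacts:
        key = (contact["channel"], contact["channelIdentifier"])
        lead = _identities.get(key)
        created = lead is None
        if created:
            lead = {
                "leadId": str(uuid.uuid4()),
                "name": contact.get("name"),
                "organizations": set(),
            }
            _identities[key] = lead
        # Linking to the org is a set insert, so repeating it is harmless.
        lead["organizations"].add(org_id)
        results.append({"leadId": lead["leadId"], "created": created})
    return results


def chunked(items, size):
    """Caller-side batching: yield lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch
```

Calling `resolve_batch` twice with the same payload returns the same lead IDs, with `created` set only on the first call. A caller such as Messaging would wrap a large contact list in `chunked(contacts, size)` and call the endpoint once per batch; the batch size is left open here.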
C) Dispatch creates leads implicitly
  • Messaging dispatches to contacts regardless of whether they exist as leads
  • Each campaign_delivered event triggers lead creation in the Event Ingester
  • First dispatch uses data from the CSV (no memories available), subsequent interactions have context
  • No coordination needed between import and dispatch
  • But: leads are created asynchronously after dispatch, so there’s a window where the lead doesn’t exist yet
Key consideration: The first dispatch to a new contact is always “blind”: there are no memories, no features, no history. Template variables can only use the data the org already has (name, phone from CSV). Personalized dispatches with memories only work for leads that have prior interactions.

Recommendation direction: Options A or B are more robust long-term. Bulk import is needed regardless of dispatch, because every onboarding org needs it. Making it a first-class Lead domain operation keeps responsibilities clean. The user-facing flow (backoffice/campaign UI) can abstract the two steps into a single “upload contacts and dispatch” experience.

Status: To be defined.

Lead uniqueness and shared identifiers

Problem: Lead.email and Lead.phone are currently globally unique. But real-world scenarios break this:
  • A lead provides their parent’s email as their own
  • Family members share a phone number
  • A company phone is used by multiple employees
Since identity resolution is exclusively via ChannelIdentity, Lead.email and Lead.phone are profile data, not identifiers. Global uniqueness constraints on them may cause false merges (two different people treated as the same lead because they share an email).

Options:
  • Remove unique constraints on Lead.email and Lead.phone. Two different leads can have the same email. ChannelIdentity UNIQUE(channel, channelIdentifier) remains the sole uniqueness guarantee.
  • Keep unique constraints but handle conflicts gracefully (reject, prompt for resolution).
Status: To be defined.
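To make the first option concrete, here is a small sketch using SQLite from the Python standard library. The table and column names are assumptions chosen for the example, not the actual schema. It shows two leads sharing an email while UNIQUE(channel, channelIdentifier) still rejects a duplicate channel identity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE lead (
        id    INTEGER PRIMARY KEY,
        name  TEXT,
        email TEXT,  -- profile data: no UNIQUE constraint
        phone TEXT   -- profile data: no UNIQUE constraint
    );
    CREATE TABLE channel_identity (
        id                 INTEGER PRIMARY KEY,
        lead_id            INTEGER NOT NULL REFERENCES lead(id),
        channel            TEXT NOT NULL,
        channel_identifier TEXT NOT NULL,
        UNIQUE (channel, channel_identifier)  -- the sole uniqueness guarantee
    );
    """
)


def add_lead(name, email, channel, identifier):
    """Insert a lead and its channel identity in a single transaction."""
    with conn:  # commits on success, rolls back both inserts on error
        cur = conn.execute(
            "INSERT INTO lead (name, email) VALUES (?, ?)", (name, email)
        )
        conn.execute(
            "INSERT INTO channel_identity (lead_id, channel, channel_identifier) "
            "VALUES (?, ?, ?)",
            (cur.lastrowid, channel, identifier),
        )
        return cur.lastrowid
```

Two calls with the same email but different channel identifiers produce two distinct leads. A third call that reuses an existing (channel, identifier) pair raises `sqlite3.IntegrityError`, and the transaction rollback leaves no orphan lead row behind.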

Other open items

| Decision | Status |
| --- | --- |
| Event archival strategy (TTL, S3, keep all) | To be defined with real volume |
| Complete list of normalized event_types | Iterative, grows with new integrations |
| Memory derivation rules per event_type | To be defined per integration |
| Analytics pipeline (S3 + Athena) | Concept defined, details pending |
| Full Webhook Domain contract (EventNormalized) | To be defined separately |
| Events published by the Lead domain | To be defined when other domains need to react |
| AI-powered memory inference from open-ended forms | Model supports it (agent_inferred + confidence), implementation pending |