How to Query Mixpanel Pipeline Data in BigQuery: Deduplication, Timezone, and User Identity

The first time most data teams run a COUNT on their Mixpanel pipeline export in BigQuery, they get a number that doesn’t match Mixpanel’s UI. Then they add a timezone filter and get a different wrong number. Then they try to count unique users and get a third wrong number.

None of this means the pipeline is broken. It means raw pipeline exports require specific handling that Mixpanel’s UI does silently in the background — deduplication, timezone conversion, and identity resolution — and if you don’t replicate that logic in your SQL, your numbers will always be off.

This post walks through the four foundational queries you need when working with Mixpanel data in BigQuery. These are the patterns I use as the starting point for any warehouse analysis on Mixpanel exports, and they address the three most common correctness issues in sequence.

Why Raw Pipeline Data Needs Special Handling

Before getting into the queries, it’s worth understanding why each of these patterns exists. There are three distinct reasons your raw queries will produce wrong results without them.

Duplicates exist in raw exports. Mixpanel’s pipeline exports can contain duplicate event rows. This happens because of how the incremental export system works — the same event can appear in multiple export batches as late-arriving data gets appended. The Mixpanel UI deduplicates automatically before showing you numbers. Your warehouse queries need to do the same thing explicitly.

All data is exported in UTC. Your Mixpanel project is almost certainly configured to a local timezone. The UI converts event timestamps to your project timezone before displaying them. Pipeline exports write everything in UTC. If you filter by date without converting to your project timezone first, you’re cutting your data on midnight UTC rather than midnight local time — which means daily counts will be wrong and date ranges won’t align with what you see in Mixpanel.

Raw events carry unresolved user IDs. Mixpanel’s identity merge system resolves multiple identifiers — anonymous device IDs, pre-login IDs, post-login user IDs — into a single canonical identity for each user. The UI applies this resolution before counting unique users. Pipeline exports carry the original distinct_id that was sent at the time of the event, before resolution. Without joining to the identity mappings table, your unique user counts will overcount because each anonymous session looks like a different user.

Each of the queries below addresses one or more of these issues. Replace <your dataset> with your actual BigQuery dataset name and adjust the timezone string and date range to match your project.

Query 1: Total Daily Event Count With Deduplication and Timezone Adjustment

This is the most foundational query. It answers the question “how many events happened each day?” in a way that actually matches Mixpanel’s UI.

SELECT
  DATE(time, 'America/Los_Angeles') AS event_date,
  COUNT(DISTINCT CONCAT(event_name, time, distinct_id, insert_id)) AS event_count
FROM
  `<your dataset>.mp_master_event`
WHERE
  DATE(time, 'America/Los_Angeles') >= '2025-08-01'
  AND DATE(time, 'America/Los_Angeles') < '2025-09-16'
GROUP BY
  1
ORDER BY
  1 ASC

Two things are happening here that don’t exist in a naive COUNT query.

Timezone conversion on every date reference. The DATE(time, 'America/Los_Angeles') function converts the UTC timestamp to Pacific time before grouping or filtering. This appears three times in the query — once in the SELECT for grouping, and twice in the WHERE clause for date range filtering. All three need the conversion. If you convert in the SELECT but not the WHERE, your date range filtering still runs on UTC and you’ll cut the data at the wrong boundary.

Replace 'America/Los_Angeles' with your project’s actual timezone. Common alternatives include 'America/New_York', 'Europe/London', 'Asia/Singapore', or 'UTC' if your project is already configured in UTC.

Deduplication via COUNT DISTINCT on a composite key. The deduplication logic uses four fields concatenated together: event_name, time, distinct_id, and insert_id. These four fields together uniquely identify a real event occurrence. If the same event appears twice in the export — which happens with incremental pipeline batches — both rows will produce the same concatenated string, and COUNT DISTINCT will count them once.

This is the deduplication approach Mixpanel documents for pipeline data. Don’t use COUNT(*) for event counts from raw pipeline exports. It will overcount on any project that has had late-arriving data or pipeline sync runs.

Query 2: Unique User Count With Identity Resolution

This query answers “how many unique users triggered events each day?” in a way that accounts for Mixpanel’s identity merge and matches what the UI reports.

SELECT
  DATE(time, 'America/Los_Angeles') AS event_date,
  COUNT(DISTINCT resolved_user_id) AS unique_users
FROM (
  SELECT
    time,
    IFNULL(id_mappings.resolved_distinct_id, events.distinct_id) AS resolved_user_id
  FROM
    `<your dataset>.mp_master_event` AS events
  LEFT JOIN
    `<your dataset>.mp_identity_mappings_data_view` AS id_mappings
  ON
    events.distinct_id = id_mappings.distinct_id
  WHERE
    DATE(time, 'America/Los_Angeles') >= '2025-08-01'
    AND DATE(time, 'America/Los_Angeles') < '2025-09-16'
)
GROUP BY
  1
ORDER BY
  1 ASC

The structure here is a subquery that resolves identities before the outer query counts distinct users.

The LEFT JOIN to the identity mappings table. The mp_identity_mappings_data_view maps original distinct_id values to their resolved_distinct_id — the canonical identifier Mixpanel uses for that user after applying identity merge. LEFT JOIN means events with no matching identity mapping are kept in the result rather than dropped.

The IFNULL fallback logic. IFNULL(id_mappings.resolved_distinct_id, events.distinct_id) is the core resolution logic. If a mapping exists for this distinct_id, use the resolved identifier. If no mapping exists — meaning this identifier was never merged with another — use the original distinct_id from the events table. This fallback is essential: not every user in your events table will have a corresponding row in the identity mappings table, and dropping unresolved identifiers would undercount your users.

The outer query then counts DISTINCT resolved identifiers, which now correctly treats a user’s anonymous pre-login sessions and authenticated post-login sessions as the same person.

Without this join, every anonymous session appears as a unique user, inflating your unique user count significantly for any product with a meaningful anonymous-to-authenticated user journey.

Query 3: Top 20 Events by Volume

This query is simpler but still requires timezone handling. It answers “which events fire most frequently?” and is useful for the kind of Lexicon volume audit discussed earlier in this series — identifying high-volume events that are either well-used or candidates for blocking.

SELECT
  event_name,
  COUNT(*) AS event_count
FROM
  `<your dataset>.mp_master_event`
WHERE
  DATE(time, 'America/Los_Angeles') >= '2025-08-01'
  AND DATE(time, 'America/Los_Angeles') < '2025-09-16'
GROUP BY
  1
ORDER BY
  2 DESC
LIMIT
  20

This one uses COUNT(*) rather than the deduplication pattern from Query 1 — which is appropriate here because you’re ranking events by relative volume, not computing an exact deduplicated count. The duplicate rate is roughly uniform across events, so the ranking holds even without deduplication. If you need exact deduplicated counts per event name for billing reconciliation or precise volume comparison with the UI, apply the CONCAT deduplication from Query 1 here as well.

Adjust the LIMIT to see more or fewer events. Sorting by event_count DESC puts your highest-volume events at the top, which is the right starting point for any cleanup or governance exercise.

Query 4: Identifying Duplicate Events in Raw Data

This query is diagnostic rather than analytical. Use it when you want to understand the scale of your duplication problem — how many events have duplicates, which events are affected most, and what the duplicate rows look like side by side.

SELECT
  *,
  COUNT(*) OVER (PARTITION BY event_name, time, distinct_id, insert_id) AS dup_group_size
FROM
  `<your dataset>.mp_master_event`
WHERE
  DATE(time, 'America/Los_Angeles') >= '2025-08-01'
  AND DATE(time, 'America/Los_Angeles') < '2025-09-16'
QUALIFY
  dup_group_size > 1
ORDER BY
  DATE(time, 'America/Los_Angeles'),
  event_name,
  time

The window function COUNT(*) OVER (PARTITION BY event_name, time, distinct_id, insert_id) counts how many rows in the result set share the same four-field combination. This count is attached to every row as dup_group_size.

The QUALIFY dup_group_size > 1 clause filters to only rows where that count is greater than one — meaning only rows that have at least one duplicate. The result is a dataset of all duplicate events, with every copy of each duplicate included so you can inspect the full rows.

Run this query when you’re first setting up a pipeline to understand your baseline duplication rate. Run it again after any pipeline behavior changes to see whether duplication patterns have shifted. The output tells you which events duplicate most frequently and gives you the raw rows to inspect if you need to understand why.

A dup_group_size of 2 means that event appears twice. A value of 3 means it appears three times. Events with higher duplication counts are worth investigating — they may indicate a pipeline configuration issue or a specific ingestion pattern worth understanding.

Building These Into Your Data Models

These four queries are starting points, not one-off scripts. The right way to use them in practice is to embed the deduplication and identity resolution logic into a base data model or view that every downstream query inherits from.

A practical structure for most teams looks like this:

Create a base events view that applies the timezone conversion and deduplication using the CONCAT distinct count approach. Create a resolved events view that joins the base events to the identity mappings table and applies the IFNULL fallback. Point all analyst-facing queries and dashboard connections at the resolved events view rather than the raw mp_master_event table directly.

This way, no analyst has to remember to apply deduplication or identity resolution — it’s already done in the foundation, and any query they write on top of the view gets the right behavior automatically. When Mixpanel updates the identity mappings table or new event types arrive with different duplicate characteristics, you fix it once in the base model and everything downstream updates.

The alternative — where each analyst writes their own deduplication logic in their own queries — guarantees inconsistency. Two dashboards counting the same metric will eventually produce different numbers, and tracking down why will cost more time than building the shared foundation correctly from the start.

A Few Things to Keep in Mind

Timezone strings are case-sensitive in BigQuery. America/Los_Angeles works. america/los_angeles does not. Use the IANA timezone database format exactly.

Date range filtering must use the converted timestamp, not the raw UTC time. If your WHERE clause uses time >= '2025-08-01' without the timezone conversion, you’re filtering on UTC midnight, not local midnight. Always wrap time references in the DATE(time, 'your_timezone') function when filtering by date.

The identity mappings table uses a view, not a base table. The query joins to mp_identity_mappings_data_view, which is a view Mixpanel creates over the underlying identity mappings data. Use the view name, not mp_identity_mappings_data, to get the correctly structured data.

These queries are written for BigQuery syntax. Snowflake, Redshift, and Databricks use different functions for timezone conversion and JSON extraction. The logic is identical but the syntax differs. Mixpanel’s documentation has Snowflake-specific versions of the event counting and identity mapping queries if your warehouse is Snowflake.

The deduplication composite key has no separator. CONCAT(event_name, time, distinct_id, insert_id) concatenates the four fields directly with no delimiter. In theory, if event names or IDs contained substrings that could produce the same concatenation from different underlying values, you’d get false deduplication. In practice this doesn’t happen with Mixpanel’s data — insert_id is a UUID that makes collisions essentially impossible — but it’s worth knowing the approach if you’re reviewing the query logic.

Summary

Raw Mixpanel pipeline data in BigQuery requires three adjustments before your numbers mean anything: timezone conversion so date boundaries match your project’s local time, deduplication using the four-field composite key so duplicate export rows don’t inflate counts, and identity resolution so unique user counts correctly merge anonymous and authenticated sessions.

Every query you build on Mixpanel warehouse data should start from these patterns. The total daily count with deduplication is your baseline event metric. The unique user count with identity resolution is your baseline user metric. The duplicate identification query is your diagnostic tool for understanding data quality. The top events by volume is your starting point for governance work.

Get these four queries working correctly in your environment, embed the deduplication and identity resolution logic into a shared base model, and everything you build on top will produce numbers that actually match what your team sees in the Mixpanel UI.