{ "cells": [ { "cell_type": "markdown", "id": "9c7c1b1e", "metadata": {}, "source": [ "# FiveTran Data Loaders\n", "\n", "This notebook demonstrates how to use PyMC-Marketing's Fivetran data loaders to quickly prepare data for MMM modeling.\n", "\n", "If you're using Fivetran to sync data from various sources (like Shopify, Google Ads, Facebook Ads, etc.) and have implemented the corresponding dbt packages, you can leverage our data loaders to streamline your MMM workflow.\n", "\n", "## Pre-requisites\n", "\n", "1. **Fivetran dbt packages**: Ensure you have the relevant schemas configured:\n", " - [dbt_shopify](https://github.com/fivetran/dbt_shopify?tab=readme-ov-file#step-3-define-database-and-schema-variables) for e-commerce data\n", " - [dbt_ad_reporting](https://github.com/fivetran/dbt_ad_reporting?tab=readme-ov-file#union-multiple-connections) for advertising data across multiple platforms\n", "\n", "2. **Database queries**: Create queries from your database that follow the Fivetran dbt schema structure. The dbt packages standardize the output format, making it easy to work with data from multiple sources.\n", "\n", "## Workflow\n", "\n", "Once your data follows the Fivetran dbt schema, you can:\n", "1. Use our data loaders to automatically process and format your data\n", "2. Generate X (media channels) and y (target variable) datasets\n", "3. Quickly train MMM models with minimal data preprocessing\n", "\n", "Let's see this in action:" ] }, { "cell_type": "code", "execution_count": null, "id": "bd5ee2ed", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "from pymc_marketing.data.fivetran import (\n", " process_fivetran_ad_reporting,\n", " process_fivetran_shopify_unique_orders,\n", ")\n", "from pymc_marketing.mmm.builders.yaml import build_mmm_from_yaml\n", "from pymc_marketing.paths import data_dir" ] }, { "cell_type": "markdown", "id": "508a1e72", "metadata": {}, "source": [ "## Loading Data from Fivetran dbt Schema\n", "\n", "In a real-world scenario, you would typically query your database directly using the Fivetran dbt schema structure. For example:\n", "\n", "```python\n", "import pandas as pd\n", "from sqlalchemy import create_engine\n", "\n", "Create your database connection\n", "engine = create_engine('your_database_connection_string')\n", "\n", "Query the standardized ad reporting table\n", "query = \"SELECT * FROM .ad_reporting__ad_report\"\n", "ad_data = pd.read_sql(query, engine)\n", "```\n", "\n", "As mentioned, the `ad_reporting__ad_report` table is created by the [dbt_ad_reporting](https://github.com/fivetran/dbt_ad_reporting).\n", "\n", "For this demonstration, we're importing a CSV file that follows the exact same schema structure.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "98d2db85", "metadata": {}, "outputs": [], "source": [ "x_data = pd.read_csv(data_dir / \"fivetran_examples/ad_report_schema.csv\")\n", "x_data.head()" ] }, { "cell_type": "code", "execution_count": 3, "id": "fd728890", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 16 entries, 0 to 15\n", "Data columns (total 16 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 source_relation 16 non-null object \n", " 1 date_day 16 non-null object \n", " 2 platform 16 non-null object \n", " 3 account_id 16 non-null object \n", " 4 account_name 16 non-null object \n", " 5 campaign_id 16 non-null object \n", " 6 campaign_name 16 non-null object \n", " 7 ad_group_id 16 non-null object \n", " 8 ad_group_name 16 non-null object \n", " 9 ad_id 16 non-null object \n", " 10 ad_name 16 non-null object \n", " 11 clicks 16 non-null int64 \n", " 12 impressions 16 non-null int64 \n", " 13 spend 16 non-null float64\n", " 14 conversions 16 non-null int64 \n", " 15 conversions_value 16 non-null float64\n", "dtypes: float64(2), int64(3), object(11)\n", "memory usage: 2.1+ KB\n" ] } ], "source": [ "x_data.info()" ] }, { "cell_type": "markdown", "id": "d3b6d647", "metadata": {}, "source": [ "## Data Processing\n", " \n", "From this raw data, we don't need all the columns and granular details. The data is currently in long format (one row per ad/campaign/day combination), but for MMM modeling we need it in wide format with aggregated metrics by platform and date.\n", " \n", "We can quickly transform this using the `process_fivetran_ad_reporting` function, which will:\n", "- Aggregate the data by platform and date\n", "- Convert from long to wide format\n", "- Keep only the essential columns needed for MMM modeling\n", "\n", "> [!NOTE] \n", "> Remember tables like this one usually contain information from all countries and types of media activities, by consequence the idea should be to filter by the type of media activity (e.g: awareness, retargeting, etc) and country you want to analyse before apply the data helper functions, **otherwise you'll aggregate across all type of media and markets.**" ] }, { "cell_type": "code", "execution_count": 4, "id": "04daa712", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datefacebook_ads_impressionsgoogle_ads_impressionstiktok_ads_impressions
02025-07-1030000.08800.021000.0
12025-07-1131000.09200.022000.0
22025-07-1232000.09500.00.0
\n", "
" ], "text/plain": [ " date facebook_ads_impressions google_ads_impressions \\\n", "0 2025-07-10 30000.0 8800.0 \n", "1 2025-07-11 31000.0 9200.0 \n", "2 2025-07-12 32000.0 9500.0 \n", "\n", " tiktok_ads_impressions \n", "0 21000.0 \n", "1 22000.0 \n", "2 0.0 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = process_fivetran_ad_reporting(\n", " df=x_data,\n", ")\n", "x.head()" ] }, { "cell_type": "markdown", "id": "9dfa023a", "metadata": {}, "source": [ "by default we are using impressions, but we could decide to model with **spend** and it's easy to change using a single parameter." ] }, { "cell_type": "code", "execution_count": 5, "id": "f62d7768", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datefacebook_ads_spendgoogle_ads_spendtiktok_ads_spend
02025-07-10381.5185.75229.5
12025-07-11396.0197.50242.6
22025-07-12410.5203.800.0
\n", "
" ], "text/plain": [ " date facebook_ads_spend google_ads_spend tiktok_ads_spend\n", "0 2025-07-10 381.5 185.75 229.5\n", "1 2025-07-11 396.0 197.50 242.6\n", "2 2025-07-12 410.5 203.80 0.0" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x_spend = process_fivetran_ad_reporting(df=x_data, value_columns=\"spend\")\n", "x_spend.head()" ] }, { "cell_type": "markdown", "id": "7043ab53", "metadata": {}, "source": [ "You can modify a few things, changing the columns to pivot, the operation to run at the moment of aggregation, among others. You can take a look to the full function here:" ] }, { "cell_type": "code", "execution_count": 6, "id": "1c3be267", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[31mSignature:\u001b[39m\n", "process_fivetran_ad_reporting(\n", " df: pandas.core.frame.DataFrame,\n", " value_columns: str | collections.abc.Sequence[str] = \u001b[33m'impressions'\u001b[39m,\n", " *,\n", " date_col: str = \u001b[33m'date_day'\u001b[39m,\n", " platform_col: str = \u001b[33m'platform'\u001b[39m,\n", " agg: str = \u001b[33m'sum'\u001b[39m,\n", " fill_value: float | \u001b[38;5;28;01mNone\u001b[39;00m = \u001b[32m0.0\u001b[39m,\n", " include_missing_dates: bool = \u001b[38;5;28;01mFalse\u001b[39;00m,\n", " freq: str = \u001b[33m'D'\u001b[39m,\n", " rename_date_to: str | \u001b[38;5;28;01mNone\u001b[39;00m = \u001b[33m'date'\u001b[39m,\n", ") -> pandas.core.frame.DataFrame\n", "\u001b[31mDocstring:\u001b[39m\n", "Process Fivetran's Ad Reporting schema's.\n", "\n", "Compatible with Fivetran's Ad Reporting schema:\n", "- ad_reporting__account_report: Each record represents daily metrics by account\n", "- ad_reporting__campaign_report: Each record represents daily metrics by campaign and account\n", "- ad_reporting__ad_group_report: Each record represents daily metrics by ad group, campaign and account\n", "- ad_reporting__ad_report: Each record represents daily metrics by ad, ad group, campaign and account\n", "\n", "The input data is expected to contain at least the following columns: a date column\n", "(default: ``date_day``), a platform column (default: ``platform``), and one or more\n", "metric columns such as ``spend`` or ``impressions``.\n", "\n", "Parameters\n", "----------\n", "df\n", " Input DataFrame in long format.\n", "value_columns\n", " A single column name or a sequence of column names to aggregate and pivot. For\n", " example: \"spend\" or [\"spend\", \"impressions\"].\n", "date_col\n", " Name of the date column. Defaults to \"date_day\".\n", "platform_col\n", " Name of the platform column. Defaults to \"platform\".\n", "agg\n", " Aggregation method to apply during groupby (e.g., \"sum\", \"mean\"). Defaults to \"sum\".\n", "fill_value\n", " Value to use to fill missing values in the wide output. If None, missing values\n", " are left as NaN.\n", "include_missing_dates\n", " If True, the output will include a continuous date index from the min to the max\n", " date found in the input, with missing dates filled (using ``fill_value``).\n", "freq\n", " Frequency used when ``include_missing_dates`` is True. Defaults to daily (\"D\").\n", "rename_date_to\n", " If provided, the date column in the result will be renamed to this value (e.g.,\n", " \"date\"). If None, the original ``date_col`` name is kept.\n", "\n", "Returns\n", "-------\n", "pd.DataFrame\n", " A wide-format DataFrame with one row per date and columns for each\n", " ``{platform}_{metric}`` combination.\n", "\u001b[31mFile:\u001b[39m ~/Documents/GitHub/pymc-marketing/pymc_marketing/data/fivetran.py\n", "\u001b[31mType:\u001b[39m function" ] } ], "source": [ "process_fivetran_ad_reporting?" ] }, { "cell_type": "markdown", "id": "1383209f", "metadata": {}, "source": [ "All `ad_reporting` tables can provide information about your drivers (media channels) which affect your target metric. But how can you get information from your target?\n", "\n", "Currently, FiveTran allows you to get information from Shopify, using [dbt_shopify](https://github.com/fivetran/dbt_shopify?tab=readme-ov-file#step-3-define-database-and-schema-variables). You can use the data loaders to access the `shopify__orders` schema and then transform the dataset to get the output ready for your media mix model." ] }, { "cell_type": "code", "execution_count": 7, "id": "75a37210", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
order_iduser_idtotal_discountstotal_discounts_settotal_line_items_pricetotal_line_items_price_settotal_pricetotal_price_settotal_tax_settotal_tax...count_discount_codes_appliedorder_total_shipping_taxorder_tagsorder_url_tagsnumber_of_fulfillmentsfulfillment_servicestracking_companiestracking_numberscustomer_order_seq_numbernew_vs_repeat
0701310219.6{\"shop_money\":{\"amount\":\"19.60\",\"currency_code...96.0{\"shop_money\":{\"amount\":\"96.00\",\"currency_code...97.03{\"shop_money\":{\"amount\":\"97.03\",\"currency_code...{\"shop_money\":{\"amount\":\"8.64\",\"currency_code\"...8.64...01.0Welcome, Refundutm_campaign:summer, utm_source:email2deliverrUSPS, UPSTRK701311new
170141001.2{\"shop_money\":{\"amount\":\"1.20\",\"currency_code\"...24.0{\"shop_money\":{\"amount\":\"24.00\",\"currency_code...28.13{\"shop_money\":{\"amount\":\"28.13\",\"currency_code...{\"shop_money\":{\"amount\":\"1.34\",\"currency_code\"...1.34...20.2Welcome, Refundutm_campaign:summer, utm_source:email2deliverrUSPS, UPSTRK70141, TRK701422repeat
270151017.4{\"shop_money\":{\"amount\":\"7.40\",\"currency_code\"...48.0{\"shop_money\":{\"amount\":\"48.00\",\"currency_code...42.63{\"shop_money\":{\"amount\":\"42.63\",\"currency_code...{\"shop_money\":{\"amount\":\"2.03\",\"currency_code\"...2.03...00.0Promo, Welcomeutm_campaign:summer, utm_source:email2manualUPS, USPSTRK701513repeat
370161005.0{\"shop_money\":{\"amount\":\"5.00\",\"currency_code\"...72.0{\"shop_money\":{\"amount\":\"72.00\",\"currency_code...74.50{\"shop_money\":{\"amount\":\"74.50\",\"currency_code...{\"shop_money\":{\"amount\":\"0.00\",\"currency_code\"...0.00...00.0Promo, Refundutm_campaign:summer, utm_source:email1shippoUSPS, UPSTRK70161, TRK701624repeat
4700110213.6{\"shop_money\":{\"amount\":\"13.60\",\"currency_code...72.0{\"shop_money\":{\"amount\":\"72.00\",\"currency_code...60.40{\"shop_money\":{\"amount\":\"60.40\",\"currency_code...{\"shop_money\":{\"amount\":\"0.00\",\"currency_code\"...0.00...00.0VIP, Giftutm_campaign:summer, utm_source:email1deliverrUSPS, FedExTRK70011, TRK700121new
\n", "

5 rows × 101 columns

\n", "
" ], "text/plain": [ " order_id user_id total_discounts \\\n", "0 7013 102 19.6 \n", "1 7014 100 1.2 \n", "2 7015 101 7.4 \n", "3 7016 100 5.0 \n", "4 7001 102 13.6 \n", "\n", " total_discounts_set total_line_items_price \\\n", "0 {\"shop_money\":{\"amount\":\"19.60\",\"currency_code... 96.0 \n", "1 {\"shop_money\":{\"amount\":\"1.20\",\"currency_code\"... 24.0 \n", "2 {\"shop_money\":{\"amount\":\"7.40\",\"currency_code\"... 48.0 \n", "3 {\"shop_money\":{\"amount\":\"5.00\",\"currency_code\"... 72.0 \n", "4 {\"shop_money\":{\"amount\":\"13.60\",\"currency_code... 72.0 \n", "\n", " total_line_items_price_set total_price \\\n", "0 {\"shop_money\":{\"amount\":\"96.00\",\"currency_code... 97.03 \n", "1 {\"shop_money\":{\"amount\":\"24.00\",\"currency_code... 28.13 \n", "2 {\"shop_money\":{\"amount\":\"48.00\",\"currency_code... 42.63 \n", "3 {\"shop_money\":{\"amount\":\"72.00\",\"currency_code... 74.50 \n", "4 {\"shop_money\":{\"amount\":\"72.00\",\"currency_code... 60.40 \n", "\n", " total_price_set \\\n", "0 {\"shop_money\":{\"amount\":\"97.03\",\"currency_code... \n", "1 {\"shop_money\":{\"amount\":\"28.13\",\"currency_code... \n", "2 {\"shop_money\":{\"amount\":\"42.63\",\"currency_code... \n", "3 {\"shop_money\":{\"amount\":\"74.50\",\"currency_code... \n", "4 {\"shop_money\":{\"amount\":\"60.40\",\"currency_code... \n", "\n", " total_tax_set total_tax ... \\\n", "0 {\"shop_money\":{\"amount\":\"8.64\",\"currency_code\"... 8.64 ... \n", "1 {\"shop_money\":{\"amount\":\"1.34\",\"currency_code\"... 1.34 ... \n", "2 {\"shop_money\":{\"amount\":\"2.03\",\"currency_code\"... 2.03 ... \n", "3 {\"shop_money\":{\"amount\":\"0.00\",\"currency_code\"... 0.00 ... \n", "4 {\"shop_money\":{\"amount\":\"0.00\",\"currency_code\"... 0.00 ... \n", "\n", " count_discount_codes_applied order_total_shipping_tax order_tags \\\n", "0 0 1.0 Welcome, Refund \n", "1 2 0.2 Welcome, Refund \n", "2 0 0.0 Promo, Welcome \n", "3 0 0.0 Promo, Refund \n", "4 0 0.0 VIP, Gift \n", "\n", " order_url_tags number_of_fulfillments \\\n", "0 utm_campaign:summer, utm_source:email 2 \n", "1 utm_campaign:summer, utm_source:email 2 \n", "2 utm_campaign:summer, utm_source:email 2 \n", "3 utm_campaign:summer, utm_source:email 1 \n", "4 utm_campaign:summer, utm_source:email 1 \n", "\n", " fulfillment_services tracking_companies tracking_numbers \\\n", "0 deliverr USPS, UPS TRK70131 \n", "1 deliverr USPS, UPS TRK70141, TRK70142 \n", "2 manual UPS, USPS TRK70151 \n", "3 shippo USPS, UPS TRK70161, TRK70162 \n", "4 deliverr USPS, FedEx TRK70011, TRK70012 \n", "\n", " customer_order_seq_number new_vs_repeat \n", "0 1 new \n", "1 2 repeat \n", "2 3 repeat \n", "3 4 repeat \n", "4 1 new \n", "\n", "[5 rows x 101 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_data = pd.read_csv(data_dir / \"fivetran_examples/shopify_orders_schema.csv\")\n", "y_data.head()" ] }, { "cell_type": "code", "execution_count": 8, "id": "8352c96a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 24 entries, 0 to 23\n", "Data columns (total 101 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 order_id 24 non-null int64 \n", " 1 user_id 24 non-null int64 \n", " 2 total_discounts 24 non-null float64\n", " 3 total_discounts_set 24 non-null object \n", " 4 total_line_items_price 24 non-null float64\n", " 5 total_line_items_price_set 24 non-null object \n", " 6 total_price 24 non-null float64\n", " 7 total_price_set 24 non-null object \n", " 8 total_tax_set 24 non-null object \n", " 9 total_tax 24 non-null float64\n", " 10 source_name 24 non-null object \n", " 11 subtotal_price 24 non-null float64\n", " 12 has_taxes_included 24 non-null bool \n", " 13 total_weight 24 non-null int64 \n", " 14 total_tip_received 24 non-null float64\n", " 15 landing_site_base_url 24 non-null object \n", " 16 location_id 24 non-null int64 \n", " 17 name 24 non-null object \n", " 18 note 11 non-null object \n", " 19 number 24 non-null int64 \n", " 20 order_number 24 non-null int64 \n", " 21 cancel_reason 12 non-null float64\n", " 22 cart_token 24 non-null object \n", " 23 checkout_token 24 non-null object \n", " 24 created_timestamp 24 non-null object \n", " 25 cancelled_timestamp 12 non-null object \n", " 26 closed_timestamp 24 non-null object \n", " 27 processed_timestamp 24 non-null object \n", " 28 updated_timestamp 24 non-null object \n", " 29 currency 24 non-null object \n", " 30 customer_id 24 non-null int64 \n", " 31 email 24 non-null object \n", " 32 financial_status 24 non-null object \n", " 33 fulfillment_status 24 non-null object \n", " 34 referring_site 24 non-null object \n", " 35 billing_address_address_1 24 non-null object \n", " 36 billing_address_address_2 24 non-null object \n", " 37 billing_address_city 24 non-null object \n", " 38 billing_address_company 0 non-null float64\n", " 39 billing_address_country 24 non-null object \n", " 40 billing_address_country_code 24 non-null object \n", " 41 billing_address_first_name 24 non-null object \n", " 42 billing_address_last_name 24 non-null object \n", " 43 billing_address_latitude 24 non-null float64\n", " 44 billing_address_longitude 24 non-null float64\n", " 45 billing_address_name 24 non-null object \n", " 46 billing_address_phone 24 non-null object \n", " 47 billing_address_province 24 non-null object \n", " 48 billing_address_province_code 24 non-null int64 \n", " 49 billing_address_zip 24 non-null int64 \n", " 50 browser_ip 24 non-null object \n", " 51 total_shipping_price_set 24 non-null object \n", " 52 shipping_address_address_1 24 non-null object \n", " 53 shipping_address_address_2 0 non-null float64\n", " 54 shipping_address_city 24 non-null object \n", " 55 shipping_address_company 0 non-null float64\n", " 56 shipping_address_country 24 non-null object \n", " 57 shipping_address_country_code 24 non-null object \n", " 58 shipping_address_first_name 24 non-null object \n", " 59 shipping_address_last_name 24 non-null object \n", " 60 shipping_address_latitude 24 non-null float64\n", " 61 shipping_address_longitude 24 non-null float64\n", " 62 shipping_address_name 24 non-null object \n", " 63 shipping_address_phone 24 non-null object \n", " 64 shipping_address_province 24 non-null object \n", " 65 shipping_address_province_code 24 non-null int64 \n", " 66 shipping_address_zip 24 non-null int64 \n", " 67 token 24 non-null object \n", " 68 app_id 24 non-null int64 \n", " 69 checkout_id 24 non-null int64 \n", " 70 client_details_user_agent 24 non-null object \n", " 71 customer_locale 24 non-null object \n", " 72 order_status_url 24 non-null object \n", " 73 presentment_currency 24 non-null object \n", " 74 is_test_order 24 non-null bool \n", " 75 is_deleted 24 non-null bool \n", " 76 has_buyer_accepted_marketing 24 non-null bool \n", " 77 is_confirmed 24 non-null bool \n", " 78 _fivetran_synced 24 non-null object \n", " 79 source_relation 24 non-null object \n", " 80 orders_unique_key 24 non-null object \n", " 81 shipping_cost 24 non-null float64\n", " 82 order_adjustment_amount 24 non-null float64\n", " 83 order_adjustment_tax_amount 24 non-null float64\n", " 84 refund_subtotal 24 non-null float64\n", " 85 refund_total_tax 24 non-null float64\n", " 86 order_adjusted_total 24 non-null float64\n", " 87 line_item_count 24 non-null int64 \n", " 88 shipping_discount_amount 24 non-null float64\n", " 89 percentage_calc_discount_amount 24 non-null float64\n", " 90 fixed_amount_discount_amount 24 non-null float64\n", " 91 count_discount_codes_applied 24 non-null int64 \n", " 92 order_total_shipping_tax 24 non-null float64\n", " 93 order_tags 24 non-null object \n", " 94 order_url_tags 24 non-null object \n", " 95 number_of_fulfillments 24 non-null int64 \n", " 96 fulfillment_services 24 non-null object \n", " 97 tracking_companies 24 non-null object \n", " 98 tracking_numbers 24 non-null object \n", " 99 customer_order_seq_number 24 non-null int64 \n", " 100 new_vs_repeat 24 non-null object \n", "dtypes: bool(5), float64(24), int64(17), object(55)\n", "memory usage: 18.2+ KB\n" ] } ], "source": [ "y_data.info(verbose=True, show_counts=True)" ] }, { "cell_type": "markdown", "id": "3c8a0d77", "metadata": {}, "source": [ "As with the previous examples, we're using a CSV file for demonstration purposes. However, in a real-world scenario, you would query the proper schema and data table directly from your Fivetran-connected database to get an output like this one.\n", "\n", "```python\n", "# Query the standardized ad reporting table\n", "query = \"SELECT * FROM .shopify__orders\"\n", "shopify_data = pd.read_sql(query, engine)\n", "```\n", "\n", "Once you get the data from it, you can apply the data loaders in order to simplify your media mix model pipeline.\n", "\n", "> [!NOTE] \n", "> Nevertheless, remember tables like this one usually contain information from all countries or type of transactions, by consequence the idea should be to filter by the type of transactions and country you want to analyse before apply the data helper functions, **otherwise you'll aggregate across all type of transactions and markets.**" ] }, { "cell_type": "code", "execution_count": 9, "id": "84b1b6b3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dateorders
02025-07-101
12025-07-112
22025-07-124
32025-07-135
42025-07-145
\n", "
" ], "text/plain": [ " date orders\n", "0 2025-07-10 1\n", "1 2025-07-11 2\n", "2 2025-07-12 4\n", "3 2025-07-13 5\n", "4 2025-07-14 5" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y = process_fivetran_shopify_unique_orders(\n", " df=y_data,\n", ")\n", "y.head()" ] }, { "cell_type": "markdown", "id": "061c3bf1", "metadata": {}, "source": [ "You can check the full function to validate all things which you could possibly adjust." ] }, { "cell_type": "code", "execution_count": 10, "id": "0cb511f7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[31mSignature:\u001b[39m\n", "process_fivetran_shopify_unique_orders(\n", " df: pandas.core.frame.DataFrame,\n", " *,\n", " date_col: str = \u001b[33m'processed_timestamp'\u001b[39m,\n", " order_key_col: str = \u001b[33m'orders_unique_key'\u001b[39m,\n", " rename_date_to: str = \u001b[33m'date'\u001b[39m,\n", ") -> pandas.core.frame.DataFrame\n", "\u001b[31mDocstring:\u001b[39m\n", "Compute daily unique order counts from a (pre-filtered) Shopify orders dataset.\n", "\n", "This function is designed for data following the Fivetran Shopify orders schema\n", "(e.g., ``shopify__orders``). It assumes the input ``df`` is already filtered to\n", "the desired subset (e.g., non-canceled, US-delivery, new-only orders).\n", "\n", "Parameters\n", "----------\n", "df\n", " Input DataFrame following the Shopify orders schema.\n", "date_col\n", " Column to derive the daily bucket from. Defaults to \"processed_timestamp\".\n", "order_key_col\n", " Unique order identifier column. Defaults to \"orders_unique_key\".\n", "rename_date_to\n", " Name of the date column in the result. Defaults to \"date\".\n", "\n", "Returns\n", "-------\n", "pd.DataFrame\n", " A DataFrame with two columns: ``rename_date_to`` and ``orders``, where\n", " ``orders`` is the unique order count per day.\n", "\u001b[31mFile:\u001b[39m ~/Documents/GitHub/pymc-marketing/pymc_marketing/data/fivetran.py\n", "\u001b[31mType:\u001b[39m function" ] } ], "source": [ "process_fivetran_shopify_unique_orders?" ] }, { "cell_type": "code", "execution_count": 11, "id": "b6091ee2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datefacebook_ads_impressionsgoogle_ads_impressionstiktok_ads_impressionsorders
02025-07-1030000.08800.021000.01
12025-07-1131000.09200.022000.02
22025-07-1232000.09500.00.04
\n", "
" ], "text/plain": [ " date facebook_ads_impressions google_ads_impressions \\\n", "0 2025-07-10 30000.0 8800.0 \n", "1 2025-07-11 31000.0 9200.0 \n", "2 2025-07-12 32000.0 9500.0 \n", "\n", " tiktok_ads_impressions orders \n", "0 21000.0 1 \n", "1 22000.0 2 \n", "2 0.0 4 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = x.merge(y, on=\"date\", how=\"left\").fillna(0)\n", "data.head()" ] }, { "cell_type": "markdown", "id": "04e26582", "metadata": {}, "source": [ "lternative way to load data using the data pre-processing functions\n", "```python\n", "# data = process_fivetran_ad_reporting(\n", "# df=x_data,\n", "# ).merge(\n", "# process_fivetran_shopify_unique_orders(\n", "# df=y_data,\n", "# ),\n", "# on=\"date\",\n", "# how=\"left\",\n", "# )\n", "\n", "# data.head()\n", "```" ] }, { "cell_type": "markdown", "id": "49bf19f4", "metadata": {}, "source": [ "Once all your data it's prepare, you can train your model 🙌🏻" ] }, { "cell_type": "code", "execution_count": 13, "id": "235dcbb7", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/carlostrujillo/Documents/GitHub/pymc-marketing/pymc_marketing/mmm/builders/yaml.py:104: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " X[date_column] = pd.to_datetime(X[date_column])\n" ] } ], "source": [ "mmm = build_mmm_from_yaml(\n", " X=data[\n", " [\n", " \"date\",\n", " \"facebook_ads_impressions\",\n", " \"google_ads_impressions\",\n", " \"tiktok_ads_impressions\",\n", " ]\n", " ],\n", " y=data[\"orders\"],\n", " config_path=data_dir / \"config_files\" / \"multi_dimensional_fivetran.yml\",\n", ")" ] }, { "cell_type": "code", "execution_count": 14, "id": "efc9a31f", "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "clusterchannel (3)\n", "\n", "channel (3)\n", "\n", "\n", "clusterdate (3) x channel (3)\n", "\n", "date (3) x channel (3)\n", "\n", "\n", "clusterdate (3)\n", "\n", "date (3)\n", "\n", "\n", "\n", "channel_scale\n", "\n", "channel_scale\n", "~\n", "Data\n", "\n", "\n", "\n", "channel_contribution\n", "\n", "channel_contribution\n", "~\n", "Deterministic\n", "\n", "\n", "\n", "channel_scale->channel_contribution\n", "\n", "\n", "\n", "\n", "\n", "saturation_lam\n", "\n", "saturation_lam\n", "~\n", "Halfnormal\n", "\n", "\n", "\n", "saturation_lam->channel_contribution\n", "\n", "\n", "\n", "\n", "\n", "target_scale\n", "\n", "target_scale\n", "~\n", "Data\n", "\n", "\n", "\n", "total_media_contribution_original_scale\n", "\n", "total_media_contribution_original_scale\n", "~\n", "Deterministic\n", "\n", "\n", "\n", "target_scale->total_media_contribution_original_scale\n", "\n", "\n", "\n", "\n", "\n", "adstock_alpha\n", "\n", "adstock_alpha\n", "~\n", "Beta\n", "\n", "\n", "\n", "adstock_alpha->channel_contribution\n", "\n", "\n", "\n", "\n", "\n", "intercept_contribution\n", "\n", "intercept_contribution\n", "~\n", "Normal\n", "\n", "\n", "\n", "y\n", "\n", "y\n", "~\n", "Normal\n", "\n", "\n", "\n", "intercept_contribution->y\n", "\n", "\n", "\n", "\n", "\n", "saturation_alpha\n", "\n", "saturation_alpha\n", "~\n", "Gamma\n", "\n", "\n", "\n", "saturation_alpha->channel_contribution\n", "\n", "\n", "\n", "\n", "\n", "y_sigma\n", "\n", "y_sigma\n", "~\n", "Halfnormal\n", "\n", "\n", "\n", "y_sigma->y\n", "\n", "\n", "\n", "\n", "\n", "channel_data\n", "\n", "channel_data\n", "~\n", "Data\n", "\n", "\n", "\n", "channel_data->channel_contribution\n", "\n", "\n", "\n", "\n", "\n", "channel_contribution->total_media_contribution_original_scale\n", "\n", "\n", "\n", "\n", "\n", "channel_contribution->y\n", "\n", "\n", "\n", "\n", "\n", "target_data\n", "\n", "target_data\n", "~\n", "Data\n", "\n", "\n", "\n", "y->target_scale\n", "\n", "\n", "\n", "\n", "\n", "y->target_data\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mmm.model.to_graphviz()" ] }, { "cell_type": "code", "execution_count": 15, "id": "7e00bb5b", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Sampling: [adstock_alpha, intercept_contribution, saturation_alpha, saturation_lam, y, y_sigma]\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.Dataset> Size: 24kB\n",
       "Dimensions:  (date: 3, sample: 500)\n",
       "Coordinates:\n",
       "  * date     (date) datetime64[ns] 24B 2025-07-10 2025-07-11 2025-07-12\n",
       "  * sample   (sample) object 4kB MultiIndex\n",
       "  * chain    (sample) int64 4kB 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0\n",
       "  * draw     (sample) int64 4kB 0 1 2 3 4 5 6 7 ... 493 494 495 496 497 498 499\n",
       "Data variables:\n",
       "    y        (date, sample) float64 12kB 2.442 14.47 10.69 ... 3.808 3.132 5.092\n",
       "Attributes:\n",
       "    created_at:                 2025-08-14T19:24:17.732780+00:00\n",
       "    arviz_version:              0.22.0\n",
       "    inference_library:          pymc\n",
       "    inference_library_version:  5.25.1\n",
       "    pymc_marketing_version:     0.15.1
" ], "text/plain": [ " Size: 24kB\n", "Dimensions: (date: 3, sample: 500)\n", "Coordinates:\n", " * date (date) datetime64[ns] 24B 2025-07-10 2025-07-11 2025-07-12\n", " * sample (sample) object 4kB MultiIndex\n", " * chain (sample) int64 4kB 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0\n", " * draw (sample) int64 4kB 0 1 2 3 4 5 6 7 ... 493 494 495 496 497 498 499\n", "Data variables:\n", " y (date, sample) float64 12kB 2.442 14.47 10.69 ... 3.808 3.132 5.092\n", "Attributes:\n", " created_at: 2025-08-14T19:24:17.732780+00:00\n", " arviz_version: 0.22.0\n", " inference_library: pymc\n", " inference_library_version: 5.25.1\n", " pymc_marketing_version: 0.15.1" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mmm.sample_prior_predictive(\n", " X=data[\n", " [\n", " \"date\",\n", " \"facebook_ads_impressions\",\n", " \"google_ads_impressions\",\n", " \"tiktok_ads_impressions\",\n", " ]\n", " ],\n", " y=data[\"orders\"],\n", ")" ] }, { "cell_type": "markdown", "id": "1b2b5e30", "metadata": {}, "source": [ "Just like that, you are ready to fit your model and unlock different insights from your data. If you want to know what to explore next, then take a look to the followinc documents:\n", "\n", "1. [Multidimensional MMM Example (From Zero to Hero)](https://www.pymc-marketing.io/en/latest/notebooks/mmm/mmm_multidimensional_example.html).\n", "2. [Marketing Mix Models and Budget allocation](https://www.pymc-marketing.io/en/latest/notebooks/mmm/mmm_budget_allocation_example.html).\n", "3. [Risk allocation with Marketing Mix Models](https://www.pymc-marketing.io/en/latest/notebooks/mmm/mmm_allocation_assessment.html)." ] }, { "cell_type": "code", "execution_count": 16, "id": "e98d768b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Last updated: Thu Aug 14 2025\n", "\n", "Python implementation: CPython\n", "Python version : 3.12.11\n", "IPython version : 9.4.0\n", "\n", "pytensor: 2.31.7\n", "\n", "pymc_marketing: 0.15.1\n", "pandas : 2.3.1\n", "\n", "Watermark: 2.5.0\n", "\n" ] } ], "source": [ "%load_ext watermark\n", "%watermark -n -u -v -iv -w -p pytensor" ] } ], "metadata": { "kernelspec": { "display_name": "pymc-marketing-dev", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.11" } }, "nbformat": 4, "nbformat_minor": 5 }