Cross-Database Joins

Overview

Pylar allows you to work with Local Dataframes—data that’s already loaded in your Pylar session. This enables you to query and join multiple dataframes using SQL, creating unified views without needing to connect to external databases.

What Are Local Dataframes?

Local Dataframes are:

Data tables loaded into your Pylar session
Accessible through the SQL IDE
Queryable using standard SQL syntax
Joinable across multiple dataframes

Think of dataframes as temporary tables that exist in your current Pylar session. They’re perfect for combining data from multiple sources or working with intermediate results.

Setting Up Dataframe Queries

To query local dataframes:

Select Dataframe Source: In the SQL IDE, set the datasource dropdown to “Local Dataframes”
Reference Dataframe Names: Use the dataframe name as the table name in your SQL query
Write Standard SQL: Use familiar SQL syntax to query and join your dataframes

Dataframe names are case-sensitive and must match exactly. Make sure you know the exact names of your loaded dataframes before querying.

Basic Dataframe Queries

Selecting from a Single Dataframe

Query a specific subset of data from a dataframe:

/* Selecting a subset of events from table0 */
SELECT * 
FROM table0 
WHERE event_type = 'login';

This query:

References table0 as a dataframe in your session
Filters rows where event_type equals 'login'
Returns all columns (*) from matching rows

Selecting Specific Columns

SELECT 
  event_id,
  event_type,
  timestamp,
  user_id
FROM table0 
WHERE event_type = 'login'
ORDER BY timestamp DESC;

Filtering and Sorting

SELECT 
  user_id,
  event_type,
  COUNT(*) as event_count
FROM table0 
WHERE event_type IN ('login', 'logout', 'purchase')
  AND timestamp >= '2024-01-01'
GROUP BY user_id, event_type
ORDER BY event_count DESC;

Joining Multiple Dataframes

One of the most powerful features is joining data from multiple dataframes to create unified views.

Basic Join Syntax

SELECT 
  t1.column1,
  t1.column2,
  t2.column3,
  t2.column4
FROM dataframe1 t1
JOIN dataframe2 t2
  ON t1.common_id = t2.common_id;

Example: Joining on Common ID

Join two dataframes on a common identifier:

/* Joining on a common ID field between two tables */
SELECT 
  t1.event_id,
  t1.event_type,
  t1.timestamp,
  t2.user_id,
  t2.user_name,
  t2.user_email
FROM table1 t1
JOIN table2 t2 
  ON t1.event_id = t2.event_id;

This query:

Joins table1 and table2 dataframes
Matches rows where event_id values are equal
Returns selected columns from both dataframes
Uses table aliases (t1, t2) for clarity

Always use table aliases when joining dataframes. It makes your queries more readable and helps avoid column name conflicts.

Advanced Join Patterns

Multiple Dataframe Joins

Join three or more dataframes:

SELECT 
  events.event_id,
  events.event_type,
  users.user_name,
  users.user_email,
  sessions.session_id,
  sessions.session_duration
FROM table0 events
JOIN table1 users
  ON events.user_id = users.user_id
JOIN table2 sessions
  ON events.session_id = sessions.session_id
WHERE events.event_type = 'purchase';

Left Join for Optional Data

Include all rows from the first dataframe, with optional data from the second:

SELECT 
  customers.customer_id,
  customers.customer_name,
  orders.order_id,
  orders.order_amount
FROM table0 customers
LEFT JOIN table1 orders
  ON customers.customer_id = orders.customer_id;

Full Outer Join

Include all rows from both dataframes:

SELECT 
  COALESCE(contacts.contact_id, leads.lead_id) AS unified_id,
  contacts.email AS contact_email,
  leads.email AS lead_email,
  contacts.engagement_score,
  leads.lead_score
FROM table0 contacts
FULL OUTER JOIN table1 leads
  ON contacts.email = leads.email;

Real-World Examples

Example 1: Customer Event Analysis

Combine customer events with user profiles:

SELECT 
  -- Event data
  e.event_id,
  e.event_type,
  e.timestamp,
  e.event_data,
  
  -- User profile data
  u.user_id,
  u.user_name,
  u.user_email,
  u.signup_date,
  u.plan_type,
  
  -- Calculated fields
  DATEDIFF(day, u.signup_date, e.timestamp) AS days_since_signup

FROM table0 e  -- events dataframe
JOIN table1 u  -- users dataframe
  ON e.user_id = u.user_id

WHERE 
  e.event_type IN ('purchase', 'subscription_upgrade')
  AND e.timestamp >= '2024-01-01'

ORDER BY 
  e.timestamp DESC;

Example 2: Marketing Attribution

Join campaign data with conversion events:

SELECT 
  -- Campaign data
  c.campaign_id,
  c.campaign_name,
  c.channel,
  c.start_date,
  c.budget,
  
  -- Conversion data
  conv.conversion_id,
  conv.conversion_date,
  conv.conversion_value,
  conv.user_id,
  
  -- Attribution metrics
  DATEDIFF(day, c.start_date, conv.conversion_date) AS days_to_convert,
  conv.conversion_value / c.budget AS roi

FROM table0 c  -- campaigns dataframe
LEFT JOIN table1 conv  -- conversions dataframe
  ON c.campaign_id = conv.campaign_id
  AND conv.conversion_date BETWEEN c.start_date AND DATEADD(day, 30, c.start_date)

WHERE 
  c.start_date >= '2024-01-01'

ORDER BY 
  c.start_date DESC, conv.conversion_value DESC;

Example 3: Product Performance Analysis

Combine product catalog with sales and reviews:

SELECT 
  -- Product catalog
  p.product_id,
  p.product_name,
  p.category,
  p.price,
  p.launch_date,
  
  -- Sales data
  s.total_units_sold,
  s.total_revenue,
  s.avg_order_quantity,
  s.last_sale_date,
  
  -- Review data
  r.avg_rating,
  r.review_count,
  r.positive_review_pct,
  
  -- Performance metrics
  s.total_revenue / NULLIF(DATEDIFF(day, p.launch_date, CURRENT_DATE()), 0) AS daily_revenue_rate,
  s.total_units_sold * r.avg_rating AS popularity_score

FROM table0 p  -- products dataframe
LEFT JOIN (
  SELECT 
    product_id,
    SUM(quantity) AS total_units_sold,
    SUM(amount) AS total_revenue,
    AVG(quantity) AS avg_order_quantity,
    MAX(order_date) AS last_sale_date
  FROM table1  -- sales dataframe
  GROUP BY product_id
) s ON p.product_id = s.product_id

LEFT JOIN (
  SELECT 
    product_id,
    AVG(rating) AS avg_rating,
    COUNT(*) AS review_count,
    SUM(CASE WHEN rating >= 4 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS positive_review_pct
  FROM table2  -- reviews dataframe
  GROUP BY product_id
) r ON p.product_id = r.product_id

WHERE 
  p.is_active = true

ORDER BY 
  popularity_score DESC NULLS LAST;

Join Types for Dataframes

INNER JOIN

Returns only rows that have matches in both dataframes.

SELECT *
FROM table1 t1
INNER JOIN table2 t2 
  ON t1.id = t2.id;

Use When: You only want data that exists in both dataframes.

LEFT JOIN

Returns all rows from the left dataframe, matched rows from the right.

SELECT *
FROM table1 t1
LEFT JOIN table2 t2 
  ON t1.id = t2.id;

Use When: You want all data from the primary dataframe, with optional data from the secondary.

RIGHT JOIN

Returns all rows from the right dataframe, matched rows from the left.

SELECT *
FROM table1 t1
RIGHT JOIN table2 t2 
  ON t1.id = t2.id;

Use When: You want all data from the secondary dataframe, with optional data from the primary.

FULL OUTER JOIN

Returns all rows from both dataframes, with NULLs for missing matches.

SELECT *
FROM table1 t1
FULL OUTER JOIN table2 t2 
  ON t1.id = t2.id;

Use When: You want to combine data from both dataframes regardless of matches.

Handling Data Type Mismatches

When joining dataframes, ensure join keys are compatible:

String vs Numeric IDs

-- Convert to compatible types
SELECT *
FROM table1 t1
JOIN table2 t2 
  ON CAST(t1.id AS VARCHAR) = t2.id;

Handling NULL Values

-- Use COALESCE for unified IDs
SELECT 
  COALESCE(t1.id, t2.id) AS unified_id
FROM table1 t1
FULL OUTER JOIN table2 t2 
  ON t1.id = t2.id;

Case-Insensitive Matching

-- Normalize case for matching
SELECT *
FROM table1 t1
JOIN table2 t2 
  ON LOWER(t1.email) = LOWER(t2.email);

Best Practices

1. Use Descriptive Aliases

-- Good
SELECT 
  events.event_id,
  users.user_name
FROM table0 events
JOIN table1 users

-- Avoid
SELECT 
  t1.id,
  t2.name
FROM table0 t1
JOIN table1 t2

2. Select Specific Columns

-- Good: Explicit columns
SELECT 
  events.event_id,
  events.event_type,
  users.user_name

-- Avoid: SELECT * in production
SELECT *

3. Filter Early

-- Good: Filter in WHERE clause
FROM table0 events
WHERE events.event_type = 'purchase'
JOIN table1 users

-- Avoid: Filtering after join when possible
FROM table0 events
JOIN table1 users
WHERE events.event_type = 'purchase'

4. Handle NULLs Explicitly

SELECT 
  COALESCE(column1, 0) AS column1,
  COALESCE(column2, 'Unknown') AS column2
FROM ...

5. Verify Dataframe Existence

Before querying, ensure your dataframes are loaded and named correctly. Test with a simple query first:

SELECT COUNT(*) FROM table0;

Common Issues and Solutions

Issue: “Table not found” Error

Cause: Dataframe doesn’t exist or name is incorrect. Solution:

Verify dataframe is loaded in your session
Check dataframe name matches exactly (case-sensitive)
Ensure datasource is set to “Local Dataframes”

Issue: Join Returns No Results

Cause: Join keys don’t match or have incompatible types. Solution:

Verify join keys exist in both dataframes
Check data types match (use CAST if needed)

Test join keys individually:

SELECT DISTINCT id FROM table1;
SELECT DISTINCT id FROM table2;

Issue: Column Name Conflicts

Cause: Both dataframes have columns with the same name. Solution:

Use table aliases and qualify column names
Use AS to rename columns in SELECT

Example:

SELECT 
  t1.id AS event_id,
  t2.id AS user_id

Issue: Performance Issues

Cause: Large dataframes or complex joins. Solution:

Filter dataframes before joining
Use LIMIT to test with smaller datasets
Consider aggregating dataframes first

Next Steps

Now that you understand cross-database joins:

Writing Your First View - Create comprehensive views using joins
View Best Practices - Optimize your views for performance and maintainability

Create Unified Views

Learn how to combine dataframe queries with database queries

Introduction

Learn

Examples

Help

​Overview

​What Are Local Dataframes?

​Setting Up Dataframe Queries

​Basic Dataframe Queries

​Selecting from a Single Dataframe

​Selecting Specific Columns

​Filtering and Sorting

​Joining Multiple Dataframes

​Basic Join Syntax

​Example: Joining on Common ID

​Advanced Join Patterns

​Multiple Dataframe Joins

​Left Join for Optional Data

​Full Outer Join

​Real-World Examples

​Example 1: Customer Event Analysis

​Example 2: Marketing Attribution

​Example 3: Product Performance Analysis

​Join Types for Dataframes

​INNER JOIN

​LEFT JOIN

​RIGHT JOIN

​FULL OUTER JOIN

​Handling Data Type Mismatches

​String vs Numeric IDs

​Handling NULL Values

​Case-Insensitive Matching

​Best Practices

​1. Use Descriptive Aliases

​2. Select Specific Columns

​3. Filter Early

​4. Handle NULLs Explicitly

​5. Verify Dataframe Existence

​Common Issues and Solutions

​Issue: “Table not found” Error

​Issue: Join Returns No Results

​Issue: Column Name Conflicts

​Issue: Performance Issues

​Next Steps

Create Unified Views

Overview

What Are Local Dataframes?

Setting Up Dataframe Queries

Basic Dataframe Queries

Selecting from a Single Dataframe

Selecting Specific Columns

Filtering and Sorting

Joining Multiple Dataframes

Basic Join Syntax

Example: Joining on Common ID

Advanced Join Patterns

Multiple Dataframe Joins

Left Join for Optional Data

Full Outer Join

Real-World Examples

Example 1: Customer Event Analysis

Example 2: Marketing Attribution

Example 3: Product Performance Analysis

Join Types for Dataframes

INNER JOIN

LEFT JOIN

RIGHT JOIN

FULL OUTER JOIN

Handling Data Type Mismatches

String vs Numeric IDs

Handling NULL Values

Case-Insensitive Matching

Best Practices

1. Use Descriptive Aliases

2. Select Specific Columns

3. Filter Early

4. Handle NULLs Explicitly

5. Verify Dataframe Existence

Common Issues and Solutions

Issue: “Table not found” Error

Issue: Join Returns No Results

Issue: Column Name Conflicts

Issue: Performance Issues

Next Steps