(+351) 21 24 10006  ·  info@bconcepts.pt
Carnaxide, Lisbon
Apache Spark
Apache Spark 4 min

How to join DataFrames in PySpark: step by step

João Barros 04 de July de 2026 4 min read

Combining information from several tables is one of the most common tasks in data engineering. In PySpark, the join method lets you match two DataFrames on a shared column — for example, linking each sale to the name of its customer. A join is essentially the equivalent of SQL's JOIN, but applied to distributed DataFrames. In this guide you will learn, step by step, how to join DataFrames in PySpark, pick the right type and avoid the most common mistakes.

Prerequisites

  • An environment with PySpark available (Databricks, Microsoft Fabric or a local Spark installation).
  • An active SparkSession — in Databricks and Fabric notebooks it already exists under the name spark.
  • Basic DataFrame knowledge: how to create them and how to view data with show().

Step 1: Create two example DataFrames

We will use two small DataFrames: one with customers and one with sales. Each sale has a cliente_id that matches the id of a customer. That shared column is what links the two tables — without a shared key there can be no join.

How to join DataFrames in PySpark: step by step
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

clientes = spark.createDataFrame(
    [(1, "Ana"), (2, "Bruno"), (3, "Carla")],
    ["id", "nome"],
)

vendas = spark.createDataFrame(
    [(101, 1, 250.0), (102, 1, 90.0), (103, 2, 400.0)],
    ["venda_id", "cliente_id", "valor"],
)

Step 2: Join the DataFrames

The join method takes three arguments: the second DataFrame, the join condition and the join type. To link each sale to its customer, we compare vendas.cliente_id with clientes.id. The inner type — the default — keeps only the rows that have a match in both tables.

resultado = vendas.join(
    clientes,
    vendas.cliente_id == clientes.id,
    "inner",
)

resultado.show()

The result combines the columns of both tables: each sale row now also includes the customer's nome. With the example data you get three rows — Ana's two purchases and Bruno's one — each already carrying the matching name. Carla does not appear because she has no sales.

Step 3: Choose the right join type

We do not always want only the matches. A left join keeps every row from the left table, even when there is no match on the right — in those cases the missing columns are filled with null. Just change the third argument:

vendas.join(clientes, vendas.cliente_id == clientes.id, "left").show()

The most common types are inner, left, right and outer. The choice depends on the question you want to answer: "only what has a match" (inner) or "everything from one side, matched or not" (left or right).

Step 4: Avoid duplicate column names

A classic mistake happens when the join columns share the same name in both tables: the result ends up with two identically named columns, and any attempt to select them raises an ambiguity error. The cleanest way to avoid this is to give the key the same name in both tables and pass only that name as text — this way PySpark merges them into a single column:

clientes2 = clientes.withColumnRenamed("id", "cliente_id")

vendas.join(clientes2, "cliente_id", "inner").show()

Now the condition is just "cliente_id" and the result has a single column with that name, with no duplicates.

Tip: when one of the tables is much smaller (like a reference table), wrap it in broadcast to speed up the join and reduce data shuffling between nodes — from pyspark.sql.functions import broadcast and then vendas.join(broadcast(clientes2), "cliente_id").

Check the result

To confirm the join worked, compare the row count before and after. An inner join never returns more matches than actually exist; if the total unexpectedly spikes, it is a sign that the key has repeated values and created a many-to-many relationship.

print("Vendas:", vendas.count())
print("Resultado:", resultado.count())
resultado.printSchema()

The printSchema() lists every final column — you should see the sales columns plus the customer columns (id and nome). If the columns and the count add up, the join is correct.

Conclusion

You can now match two DataFrames in PySpark, choose the right join type and steer clear of the duplicate-column problem — the foundation of almost every data pipeline. The natural next step is to combine the join with groupBy to aggregate the joined data, such as computing total sales per customer. What is the first table you will join in your project?

Share: