How to remove duplicates in PySpark: step by step

João Barros, 4 July 2026, 4 min read

Removing duplicates in PySpark is one of the most common data cleaning tasks: repeated rows inflate counts, distort averages and cause errors in your reports. In this guide you will learn, step by step, how to remove duplicate rows with the dropDuplicates and distinct methods, either across every column or only on the ones that define your key.

Prerequisites

Python 3 and the pyspark library installed (pip install pyspark).
An active SparkSession (we show how to create one below).
Basic knowledge of Python and DataFrames.

Step 1: Create a SparkSession and sample data

Start Spark and create a small DataFrame with repeated rows on purpose. That way you can see the effect of each method before applying it to your real data.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("remover-duplicados").getOrCreate()

dados = [
    (1, "Ana", "Lisboa"),
    (2, "Bruno", "Porto"),
    (2, "Bruno", "Porto"),
    (3, "Carla", "Braga"),
    (3, "Carla", "Faro"),
]
colunas = ["id", "nome", "cidade"]
df = spark.createDataFrame(dados, colunas)
df.show()

Notice that Bruno's row appears twice, exactly identical, while id 3 (Carla) appears with two different cities. These are two distinct kinds of "duplicate", and each one needs its own treatment.

Step 2: Remove duplicates across all columns

To drop fully identical rows, call dropDuplicates with no arguments. It compares every column and keeps only one copy of each repeated row.

sem_duplicados = df.dropDuplicates()
sem_duplicados.show()

Bruno's duplicated row now appears only once. The df.distinct() method does exactly the same and is just an alternative way to write it. Because Spark has to compare rows across partitions, this operation involves a shuffle; on large tables it pays to deduplicate as early as possible in the pipeline.

Step 3: Remove duplicates by specific columns

Often you want "one record per customer" or "one per id", even if the other columns vary. Pass the list of key columns to dropDuplicates.

sem_duplicados_id = df.dropDuplicates(["id"])
sem_duplicados_id.show()

Now you get a single row per id. Watch out for an important detail: when the other columns hold different values (id 3 had "Braga" and "Faro"), Spark does not guarantee which row it keeps. The result may change between runs.

Common error: assuming dropDuplicates(["id"]) always returns the same row. It does not. If you need a stable result, move on to Step 4.

Step 4: Choose which duplicate to keep

When the row you keep matters (for example, keeping the city that comes first alphabetically, or the most recent record), use a window function to sort the rows within each key and keep the first row.

from pyspark.sql import Window
from pyspark.sql.functions import row_number, col

janela = Window.partitionBy("id").orderBy(col("cidade").asc())
df_numerado = df.withColumn("linha", row_number().over(janela))
resultado = df_numerado.filter(col("linha") == 1).drop("linha")
resultado.show()

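Here, for each id, we keep the city that comes first alphabetically. To keep the most recent record instead, change the orderBy, for example to a date column with .desc(). If two rows tie on the ordering column, row_number still picks one of them arbitrarily, so add a second column to the orderBy as a tie-breaker.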

Check the result

The simplest way to confirm is to compare the number of rows before and after. If any duplicates were removed, the second number is smaller.

print("Before:", df.count())
print("After:", sem_duplicados.count())

In our example, we go from 5 to 4 rows after removing Bruno's repetition. Also confirm visually with .show() that there are no identical rows left.

Conclusion

You now know how to remove duplicates in PySpark in three ways: across all columns with dropDuplicates(), by a key with dropDuplicates(["id"]), and in a controlled way with a window function. The natural next step is to handle null values (df.na.drop() or df.na.fill()) and then write the clean data to Parquet. One final tip: before deduplicating, always check how many repeated rows you have and why, because a duplicate often hides a problem in the process that created it. Which columns define a "unique" row in your dataset?