How to aggregate data with groupBy in PySpark
Grouping rows by a category and calculating totals — sales by region, orders by customer, or revenue by product — is one of the most common analyses in data work. In PySpark you solve this task with the groupBy method, which behaves exactly the same over a thousand rows or a thousand million. Throughout this guide you will create a sample DataFrame and aggregate it with counts, sums and averages, ending with a simple check to make sure the numbers are right.
Prerequisites
- Python 3.8 or later installed.
- PySpark installed (
pip install pyspark). - Knowing how to read and create a basic DataFrame.
- An environment to run code: a notebook, Databricks or the terminal.
Step 1: Create the SparkSession and a sample DataFrame
Everything in PySpark starts with a SparkSession, the entry point for working with distributed data. Let's create a session and, from a list of tuples, a small DataFrame with sales by region and product. This gives us concrete data to apply groupBy to.

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("groupby-exemplo").getOrCreate()
dados = [
("Norte", "Portatil", 1200),
("Norte", "Rato", 25),
("Norte", "Portatil", 1100),
("Sul", "Portatil", 1300),
("Sul", "Rato", 30),
("Centro", "Portatil", 1250),
]
colunas = ["regiao", "produto", "valor"]
df = spark.createDataFrame(dados, colunas)
df.show()
createDataFrame takes the data and the list of column names. df.show() displays the six original rows; notice that some regions repeat — that is exactly what groupBy will bring together into a single group.
Step 2: Count rows per group
The simplest way to aggregate is to count how many rows exist in each group. Call groupBy with the column name and then count().
df.groupBy("regiao").count().show()
groupBy("regiao") returns a GroupedData object — not a result yet, just the definition of how to group. Only when you apply a function such as count() do you get a new DataFrame, with one row per region and the number of sales for each.
Step 3: Sum and average with agg
Counting is rarely enough. To compute several metrics at once, use the agg method with functions from the functions module, usually imported as F. alias gives each result column a readable name.
from pyspark.sql import functions as F
df.groupBy("regiao").agg(
F.sum("valor").alias("total"),
F.avg("valor").alias("media"),
F.count("produto").alias("n_vendas"),
).show()
Now each region has, in a single row, the total sold, the average per sale and the number of sales. Without alias, the columns would get automatic names like sum(valor), which are harder to reuse later.
Step 4: Group by several columns and sort
You can pass more than one column to groupBy for finer detail — for example, the total by region and product at once. Add orderBy to see the largest values first.
df.groupBy("regiao", "produto").agg(
F.sum("valor").alias("total")
).orderBy(F.col("total").desc()).show()
The result has one row for each region-and-product combination, sorted from the highest total to the lowest. Swapping desc() for asc() reverses the order.
Verify the result
A good way to confirm the aggregation is correct is to compare the sum of the groups with the sum of every row: the two must be equal.
df.agg(F.sum("valor").alias("total_geral")).show()
The grand total should be 4905. If you add up the per-region totals from Step 3 (Norte 2325, Sul 1330 and Centro 1250) you get exactly 4905 — a sign that no row was left out or counted twice.
Conclusion
With groupBy, agg and alias you can already answer most "totals by category" questions in PySpark, whatever the volume of data. The natural next step is window functions, which compute aggregates without collapsing the rows — for example, a running total per region. Try it now: can you adapt Step 3 to also show the maximum value of each region with F.max?