The Streams API

Project: Data Analysis with Streams

In this capstone lesson you will apply everything from the tutorial — filtering, mapping, collecting, reducing, flatMapping, sorting, and working with Optional — to a single, realistic scenario: analysing a dataset of employee records. By the end you will have one cohesive program that asks ten real business questions and answers each one with a focused stream pipeline.

The Dataset

We start with a simple record to represent each employee. A record gives us final fields plus a generated canonical constructor, accessor methods (name(), not getName()), equals, hashCode, and toString for free.

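import java.util.List;

// One row of the dataset; each component becomes a final field with an accessor.
public record Employee(
    String name,
    String department,
    double salary,
    int yearsOfExperience,
    List<String> skills
) {}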

Then we build a list that we will query throughout the project:

import java.util.*;
import java.util.stream.*;

List<Employee> employees = List.of(
    new Employee("Alice",   "Engineering", 95_000, 7, List.of("Java", "Kotlin", "SQL")),
    new Employee("Bob",     "Engineering", 82_000, 3, List.of("Java", "Python")),
    new Employee("Carol",   "Marketing",   68_000, 5, List.of("SEO", "Analytics")),
    new Employee("David",   "Engineering", 110_000, 12, List.of("Java", "Scala", "Spark")),
    new Employee("Eve",     "HR",          60_000, 2, List.of("Communication", "Excel")),
    new Employee("Frank",   "Marketing",   73_000, 6, List.of("SEO", "PPC", "Analytics")),
    new Employee("Grace",   "HR",          67_000, 8, List.of("Recruiting", "Excel")),
    new Employee("Henry",   "Engineering", 91_000, 5, List.of("Python", "Docker", "SQL")),
    new Employee("Irene",   "Marketing",   78_000, 9, List.of("Analytics", "Branding")),
    new Employee("James",   "Engineering", 99_000, 10, List.of("Java", "Kubernetes", "SQL"))
);

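Why use a record? Records (previewed in Java 14 and finalized in Java 16) are a good fit for plain data carriers like rows in a dataset. Their fields are final, they eliminate boilerplate, and they signal to readers that the class is purely a data holder with no hidden behaviour. The immutability is shallow, though: a List component is only as immutable as the list passed in, which is one reason the dataset above builds every skills list with List.of.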
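As a rough illustration of what a record generates and where its immutability stops, here is a minimal sketch. The class name RecordDemo and the compact constructor that calls List.copyOf are assumptions added for this example; they are not part of the lesson's Employee definition.

```java
import java.util.ArrayList;
import java.util.List;

class RecordDemo {
    // Same components as the lesson's Employee. The compact constructor is an
    // illustrative addition: it stores an unmodifiable copy of the skills list.
    record Employee(String name, String department, double salary,
                    int yearsOfExperience, List<String> skills) {
        Employee {
            skills = List.copyOf(skills);
        }
    }

    public static void main(String[] args) {
        List<String> skills = new ArrayList<>(List.of("Java", "SQL"));
        Employee alice = new Employee("Alice", "Engineering", 95_000, 7, skills);

        // Changing the caller's list afterwards does not leak into the record.
        skills.add("COBOL");
        System.out.println(alice.skills());      // [Java, SQL]

        // Accessors are named after the components.
        System.out.println(alice.name());        // Alice

        // equals and hashCode are derived from the component values.
        Employee copy = new Employee("Alice", "Engineering", 95_000, 7,
                                     List.of("Java", "SQL"));
        System.out.println(alice.equals(copy));  // true
    }
}
```

Without the defensive copy, the record would still compile, but a caller holding a mutable list could change an employee's skills after construction.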

Question 1 — How many employees are in Engineering?

long engineeringCount = employees.stream()
    .filter(e -> e.department().equals("Engineering"))
    .count();

System.out.println("Engineering headcount: " + engineeringCount); // 5

Question 2 — What is the average salary across the whole company?

OptionalDouble avgSalary = employees.stream()
    .mapToDouble(Employee::salary)
    .average();

avgSalary.ifPresent(avg ->
    System.out.printf("Company average salary: $%.2f%n", avg));

Question 3 — Who is the highest-paid employee?

Optional<Employee> topEarner = employees.stream()
    .max(Comparator.comparingDouble(Employee::salary));

topEarner.ifPresent(e ->
    System.out.println("Top earner: " + e.name() + " ($" + e.salary() + ")"));
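When the caller needs a value rather than a side effect, Optional can be transformed and given a fallback instead of being unwrapped. The sketch below assumes a trimmed-down two-field Employee record and a hypothetical helper, topEarnerName; both exist only for this example.

```java
import java.util.Comparator;
import java.util.List;

class TopEarnerDemo {
    // Trimmed-down record for this sketch (the lesson's record has more fields).
    record Employee(String name, double salary) {}

    // Returns the top earner's name, or a fallback label when the list is empty.
    static String topEarnerName(List<Employee> employees) {
        return employees.stream()
            .max(Comparator.comparingDouble(Employee::salary))
            .map(Employee::name)          // Optional<Employee> -> Optional<String>
            .orElse("(no employees)");    // value used when the stream was empty
    }

    public static void main(String[] args) {
        List<Employee> staff = List.of(
            new Employee("Alice", 95_000),
            new Employee("David", 110_000));
        System.out.println(topEarnerName(staff));      // David
        System.out.println(topEarnerName(List.of()));  // (no employees)
    }
}
```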

Question 4 — List all unique skills used in Engineering

flatMap is the right tool here: each employee has a list of skills, so we need to flatten many lists into one stream before deduplicating.

List<String> engineeringSkills = employees.stream()
    .filter(e -> e.department().equals("Engineering"))
    .flatMap(e -> e.skills().stream())
    .distinct()
    .sorted()
    .collect(Collectors.toList());

System.out.println("Engineering skills: " + engineeringSkills);
// [Docker, Java, Kotlin, Kubernetes, Python, SQL, Scala, Spark]

Question 5 — Average salary per department

Collectors.groupingBy combined with a downstream averagingDouble collector answers this in one pass:

Map<String, Double> avgByDept = employees.stream()
    .collect(Collectors.groupingBy(
        Employee::department,
        Collectors.averagingDouble(Employee::salary)
    ));

avgByDept.forEach((dept, avg) ->
    System.out.printf("%-15s avg salary: $%.2f%n", dept, avg));
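The two-argument groupingBy makes no promise about the type of map it returns or its iteration order, so the departments above may print in any order. If a stable, alphabetical report is wanted, one option is the three-argument overload with a map factory. This is a sketch under that assumption; the class name, the helper averageByDepartment, and the three-field record are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class SortedGroupingDemo {
    // Trimmed-down record for this sketch.
    record Employee(String name, String department, double salary) {}

    static Map<String, Double> averageByDepartment(List<Employee> employees) {
        return employees.stream()
            .collect(Collectors.groupingBy(
                Employee::department,
                TreeMap::new,   // map factory: a TreeMap keeps keys sorted
                Collectors.averagingDouble(Employee::salary)));
    }

    public static void main(String[] args) {
        List<Employee> staff = List.of(
            new Employee("Carol", "Marketing", 68_000),
            new Employee("Eve", "HR", 60_000),
            new Employee("Grace", "HR", 67_000),
            new Employee("Alice", "Engineering", 95_000));

        // Departments print in key order: Engineering, HR, Marketing.
        averageByDepartment(staff).forEach((dept, avg) ->
            System.out.printf("%-15s avg salary: $%.2f%n", dept, avg));
    }
}
```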

Question 6 — Names of employees earning above $90,000, sorted alphabetically

List<String> highEarnerNames = employees.stream()
    .filter(e -> e.salary() > 90_000)
    .map(Employee::name)
    .sorted()
    .collect(Collectors.toList());

System.out.println("Earning > $90k: " + highEarnerNames);
// [Alice, David, Henry, James]

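Chain filter before map. Here the order is forced: the filter needs salary(), which is no longer available once each employee has been mapped to a name. More generally, filtering first means fewer elements reach later stages, which matters when those stages are expensive. A stream pipeline runs its intermediate operations in the order you write them and will not swap filter and map for you, so place the selective step first.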
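One way to see how operation order affects the work done is to count how often the mapping function runs. The sketch below is illustrative only: the class and method names are hypothetical, plain Integer salaries stand in for Employee records, and the counter is a side effect used purely for demonstration.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class FilterBeforeMapDemo {
    // Filter first: the mapper only runs for elements that pass the filter.
    static int mapCallsWithFilterFirst(List<Integer> salaries, int threshold) {
        AtomicInteger mapCalls = new AtomicInteger();
        salaries.stream()
            .filter(s -> s > threshold)
            .map(s -> { mapCalls.incrementAndGet(); return "$" + s; })
            .forEach(s -> { });   // terminal operation; nothing runs without one
        return mapCalls.get();
    }

    // Map first: the mapper runs for every element, then the filter discards some.
    static int mapCallsWithMapFirst(List<Integer> salaries, int threshold) {
        AtomicInteger mapCalls = new AtomicInteger();
        salaries.stream()
            .map(s -> { mapCalls.incrementAndGet(); return s; })
            .filter(s -> s > threshold)
            .forEach(s -> { });
        return mapCalls.get();
    }

    public static void main(String[] args) {
        // The ten salaries from the lesson's dataset, as whole numbers.
        List<Integer> salaries = List.of(95_000, 82_000, 68_000, 110_000, 60_000,
                                         73_000, 67_000, 91_000, 78_000, 99_000);
        System.out.println(mapCallsWithFilterFirst(salaries, 90_000)); // 4
        System.out.println(mapCallsWithMapFirst(salaries, 90_000));    // 10
    }
}
```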

Question 7 — Total salary budget per department

Map<String, Double> budgetByDept = employees.stream()
    .collect(Collectors.groupingBy(
        Employee::department,
        Collectors.summingDouble(Employee::salary)
    ));

budgetByDept.forEach((dept, total) ->
    System.out.printf("%-15s total budget: $%.0f%n", dept, total));

Question 8 — The most experienced employee in each department

Collectors.toMap with a merge function picks the winner when two employees map to the same key. Because the comparison uses >=, a tie keeps the employee encountered first:

Map<String, Employee> mostExperienced = employees.stream()
    .collect(Collectors.toMap(
        Employee::department,
        e -> e,
        (a, b) -> a.yearsOfExperience() >= b.yearsOfExperience() ? a : b
    ));

mostExperienced.forEach((dept, e) ->
    System.out.println(dept + " → " + e.name() + " (" + e.yearsOfExperience() + " yrs)"));
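The same question can also be phrased as "group, then take the maximum of each group". The following sketch shows that alternative with groupingBy and maxBy; it is not the lesson's approach, and the class name, helper name, and three-field record are assumptions made for the example. Note that maxBy yields an Optional per department.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

class MostExperiencedDemo {
    // Trimmed-down record for this sketch.
    record Employee(String name, String department, int yearsOfExperience) {}

    static Map<String, Optional<Employee>> mostExperiencedByDept(List<Employee> employees) {
        return employees.stream()
            .collect(Collectors.groupingBy(
                Employee::department,
                Collectors.maxBy(Comparator.comparingInt(Employee::yearsOfExperience))));
    }

    public static void main(String[] args) {
        List<Employee> staff = List.of(
            new Employee("Alice", "Engineering", 7),
            new Employee("David", "Engineering", 12),
            new Employee("Eve", "HR", 2),
            new Employee("Grace", "HR", 8));

        // Map iteration order is unspecified, so the two lines may appear in either order.
        mostExperiencedByDept(staff).forEach((dept, winner) ->
            System.out.println(dept + " -> " + winner.map(Employee::name).orElse("?")));
    }
}
```

The toMap version returns Employee values directly, which is often more convenient; the groupingBy version reads closer to the wording of the question.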

Question 9 — Do any employees know both Java and SQL?

Use anyMatch for a short-circuiting existence check — it stops as soon as a match is found:

boolean javaAndSql = employees.stream()
    .anyMatch(e -> e.skills().containsAll(List.of("Java", "SQL")));

System.out.println("Someone knows Java & SQL: " + javaAndSql); // true
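To make the short-circuiting visible, the sketch below counts how many elements the predicate is evaluated on before anyMatch returns. It uses bare skill lists rather than Employee records, reorders a few rows from the dataset so the first match is not the first element, and the class and method names are hypothetical.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class ShortCircuitDemo {
    // Returns how many skill lists were tested before anyMatch stopped.
    static int elementsExamined(List<List<String>> skillLists, List<String> required) {
        AtomicInteger examined = new AtomicInteger();
        skillLists.stream()
            .anyMatch(skills -> {
                examined.incrementAndGet();          // demonstration-only side effect
                return skills.containsAll(required);
            });
        return examined.get();
    }

    public static void main(String[] args) {
        List<List<String>> skillLists = List.of(
            List.of("Java", "Python"),               // Bob: no SQL
            List.of("SEO", "Analytics"),             // Carol: neither
            List.of("Java", "Kotlin", "SQL"),        // Alice: first match
            List.of("Java", "Scala", "Spark"));      // David: never examined

        System.out.println(elementsExamined(skillLists, List.of("Java", "SQL"))); // 3
    }
}
```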

Question 10 — Summary statistics for Engineering salaries

DoubleSummaryStatistics captures count, sum, min, max, and average in a single terminal operation:

DoubleSummaryStatistics stats = employees.stream()
    .filter(e -> e.department().equals("Engineering"))
    .mapToDouble(Employee::salary)
    .summaryStatistics();

System.out.println("Engineering salary stats:");
System.out.println("  Count : " + stats.getCount());
System.out.printf ("  Min   : $%.0f%n", stats.getMin());
System.out.printf ("  Max   : $%.0f%n", stats.getMax());
System.out.printf ("  Avg   : $%.2f%n", stats.getAverage());
System.out.printf ("  Total : $%.0f%n", stats.getSum());

Putting It All Together — What You Practised

filter + count — headcount by department (Q1).
mapToDouble + average — numeric aggregation with OptionalDouble (Q2).
max with Comparator — finding a single winner via Optional (Q3).
flatMap + distinct + sorted — flattening nested collections (Q4).
groupingBy + averagingDouble / summingDouble — multi-group aggregation (Q5, Q7).
filter + map + sorted + collect — the classic pipeline (Q6).
toMap with merge function — keyed aggregation with conflict resolution (Q8).
anyMatch — short-circuit existence check (Q9).
summaryStatistics — bulk numeric stats in one pass (Q10).

Streams are not a silver bullet. For a short list or a one-off check, a plain for loop can be just as readable and is unlikely to be slower. Choose streams when the declarative style makes the intent clearer, which is usually the case for filtering, grouping, and aggregating a dataset like this one.
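For comparison, here is the Question 1 headcount written both ways. This is only a sketch for judging readability: the class name, the two helper methods, and the two-field record are hypothetical.

```java
import java.util.List;

class LoopVersusStreamDemo {
    // Trimmed-down record for this sketch.
    record Employee(String name, String department) {}

    // Imperative version: explicit counter and condition.
    static long countWithLoop(List<Employee> employees, String department) {
        long count = 0;
        for (Employee e : employees) {
            if (e.department().equals(department)) {
                count++;
            }
        }
        return count;
    }

    // Declarative version: same result as the loop above.
    static long countWithStream(List<Employee> employees, String department) {
        return employees.stream()
            .filter(e -> e.department().equals(department))
            .count();
    }

    public static void main(String[] args) {
        List<Employee> staff = List.of(
            new Employee("Alice", "Engineering"),
            new Employee("Bob", "Engineering"),
            new Employee("Carol", "Marketing"));
        System.out.println(countWithLoop(staff, "Engineering"));   // 2
        System.out.println(countWithStream(staff, "Engineering")); // 2
    }
}
```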

Summary

You have now built a complete data-analysis program using nothing but the Streams API. The key insight is that every business question maps naturally to a pipeline: filter down to the relevant rows, map or flatMap to the values you care about, then collect or reduce to the final answer. Master that mental model and you can query any in-memory dataset fluently in Java.
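The filter, map, reduce shape described in the summary can be sketched in a few lines. The example below totals one department's salaries with an explicit reduce; the class name, the helper totalSalary, and the three-field record are assumptions for illustration, and in practice mapToDouble followed by sum would do the same job without boxing.

```java
import java.util.List;

class PipelineShapeDemo {
    // Trimmed-down record for this sketch.
    record Employee(String name, String department, double salary) {}

    static double totalSalary(List<Employee> employees, String department) {
        return employees.stream()
            .filter(e -> e.department().equals(department))  // keep the relevant rows
            .map(Employee::salary)                           // pick the value of interest
            .reduce(0.0, Double::sum);                       // fold to a single answer
    }

    public static void main(String[] args) {
        List<Employee> staff = List.of(
            new Employee("Alice", "Engineering", 95_000),
            new Employee("Bob", "Engineering", 82_000),
            new Employee("Carol", "Marketing", 68_000));
        System.out.println(totalSalary(staff, "Engineering")); // 177000.0
    }
}
```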