Shell Scripting & Automation

Text Processing: grep, sed & awk

In a production environment, raw text is everywhere — application logs containing thousands of lines, nginx access logs growing at tens of thousands of requests per minute, CSV exports from monitoring systems, configuration files spanning hundreds of servers. The ability to search, extract, transform, and summarise that text from the command line — without writing a Python script, without opening a file in an editor, without waiting for a dashboard to load — is one of the highest-leverage skills a DevOps engineer can possess. Three tools do the heavy lifting: grep, sed, and awk. Each has a distinct purpose and the three compose cleanly through pipes.

Mental model: Think of the three tools as a processing pipeline with increasing power. grep selects lines. sed transforms text (character-level substitution and deletion). awk computes over structured fields (arithmetic, conditionals, aggregates). When in doubt, use the weakest tool that solves the problem — a grep one-liner is faster to write, faster to run, and easier for the next engineer to read than an equivalent awk program.

grep — Searching Text at Scale

grep prints every line from its input that matches a pattern. Its name comes from the ed editor command g/re/p (globally match a regular expression and print). The default pattern language is basic regular expressions (BRE); the -E flag enables extended regular expressions (ERE), and -P enables Perl-compatible regular expressions (PCRE) on systems that support it.
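The difference between BRE and ERE is easiest to see with the + character. The sketch below runs the same pattern under both syntaxes on three invented sample lines; the variable names are illustrative only.

```shell
# Sample input (made up for illustration): one line contains a literal plus sign.
sample=$(printf '%s\n' 'aaa' 'a+b' 'xyz')

# BRE (default): "+" is an ordinary character, so only the literal "a+" matches.
bre_matches=$(printf '%s\n' "$sample" | grep 'a+')

# ERE (-E): "+" is a quantifier meaning "one or more", so any line with an "a" matches.
ere_matches=$(printf '%s\n' "$sample" | grep -E 'a+')

printf 'BRE matched:\n%s\n\nERE matched:\n%s\n' "$bre_matches" "$ere_matches"
```

If a pattern behaves unexpectedly, checking which of the two syntaxes is active is usually the first thing to rule out.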

Flags you will use daily in production:

  • -i — case-insensitive match
  • -v — invert match (print lines that do NOT match)
  • -r / -R — recursive search through directories
  • -l — print only file names that contain a match
  • -n — prefix each output line with its line number
  • -c — print a count of matching lines per file
  • -A N / -B N / -C N — print N lines After / Before / around each match (context)
  • -o — print only the matching portion of the line, not the whole line
  • -E — extended regex (alternation |, quantifiers +, ?, grouping ())
  • --color=auto — highlight matches (set this in your shell profile)

    # Find all ERROR lines in an application log
    grep "ERROR" /var/log/app/app.log

    # Case-insensitive search for "timeout" with 2 lines of context
    grep -i -C 2 "timeout" /var/log/nginx/error.log

    # Count lines with an HTTP 5xx status in an access log
    # (a 3-digit byte count starting with 5 would also match this loose pattern)
    grep -c " 5[0-9][0-9] " /var/log/nginx/access.log

    # Find DB_PASSWORD references in PHP files (useful in security audits)
    grep -rn --include="*.php" "DB_PASSWORD" /var/www/

    # Extended regex: match lines containing "OOM" or "killed" (case insensitive)
    grep -Ei "(oom|killed)" /var/log/syslog

    # Show only the IP addresses from nginx access log lines that produced 500s
    grep " 500 " /var/log/nginx/access.log | grep -oE '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+'

    # Show the last 100 requests whose status is neither 200 nor 304
    grep -Ev " (200|304) " /var/log/nginx/access.log | tail -100

Production tip — use grep -F for literal strings: When your search pattern contains characters that are special in regex (., *, [, $), use grep -F (fixed string) to skip regex interpretation entirely. Searching for 10.0.1.5 with plain grep matches any character where the dots appear. grep -F "10.0.1.5" matches the literal string. This matters when searching logs for IP addresses, stack trace class names (e.g. java.lang.NullPointerException), or any string that looks like a regex.
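A minimal sketch of the difference, using three invented log lines (only the first contains the real address):

```shell
# Made-up sample log; only the first line contains the literal address 10.0.1.5.
log=$(printf '%s\n' \
    'client 10.0.1.5 connected' \
    'client 10.0.105 connected' \
    'client 10x0y1z5 connected')

# As a regex, each "." matches any single character, so all three lines match.
regex_count=$(printf '%s\n' "$log" | grep -c '10.0.1.5')

# With -F the pattern is a literal string, so only the first line matches.
fixed_count=$(printf '%s\n' "$log" | grep -cF '10.0.1.5')

echo "regex matches: $regex_count, fixed-string matches: $fixed_count"
```

Note that -F still matches substrings, so 110.0.1.50 would also be reported; anchoring to the surrounding delimiters is a separate concern.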

sed — Stream Editing for Transformation

sed (stream editor) reads input line by line, applies a script of editing commands, and writes to standard output. It never modifies files in place unless you pass -i. The single most-used sed command is s/pattern/replacement/flags — substitute. Understanding sed means understanding this one command deeply.

The substitution flags that matter:

  • g — replace all occurrences on the line (not just the first)
  • i — case-insensitive match (GNU sed)
  • p — print the line after substitution (useful with -n to print only changed lines)
  • 2, 3 — replace only the Nth occurrence on the line

    # Replace "localhost" with "prod-db.internal" in a config file (print result)
    sed 's/localhost/prod-db.internal/g' database.conf

    # Edit a file in place with no backup (GNU sed syntax; BSD/macOS sed needs -i '')
    sed -i 's/DEBUG=true/DEBUG=false/g' /etc/app/config.env

    # Edit in place WITH a backup (safe for production config changes);
    # creates config.env.bak before modifying config.env
    sed -i.bak 's/DEBUG=true/DEBUG=false/g' /etc/app/config.env

    # Delete all empty lines from a file
    sed '/^$/d' nginx.conf

    # Delete lines 1-5 (e.g. strip a file header)
    sed '1,5d' report.csv

    # Print only lines 10 through 20
    sed -n '10,20p' /var/log/app/app.log

    # Strip ANSI color codes from log files before storing them (\x1b is a GNU sed escape)
    sed 's/\x1b\[[0-9;]*m//g' colored.log > clean.log

    # Uncomment lines that start with "# PROD: " (remove the leading "# PROD: ")
    sed 's/^# PROD: //' config.template > config.prod

    # Multi-command sed: delete comments and blank lines in one pass
    sed -e '/^#/d' -e '/^$/d' nginx.conf.template

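The substitution flags listed above are easier to remember once you see them side by side. This sketch applies each one to a made-up input string, with no files involved:

```shell
line='a-a-a-a'

# No flag: only the first occurrence on the line is replaced.
first=$(printf '%s\n' "$line" | sed 's/a/X/')

# g: every occurrence on the line is replaced.
all=$(printf '%s\n' "$line" | sed 's/a/X/g')

# 3: only the third occurrence is replaced.
third=$(printf '%s\n' "$line" | sed 's/a/X/3')

# -n plus p: print only the lines where a substitution actually happened.
changed=$(printf '%s\n' 'PORT=8080' 'DEBUG=true' \
    | sed -n 's/DEBUG=true/DEBUG=false/p')

printf '%s\n' "$first" "$all" "$third" "$changed"
```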
Production pitfall: sed -i differs between GNU and BSD. On Linux (GNU sed), sed -i '' 's/a/b/' file fails because the empty string is taken as the script and the real script as a file name. On macOS (BSD sed), sed -i '' is the correct in-place edit without backup, and a bare sed -i consumes the next argument as the backup suffix. To write portable scripts, use sed -i.bak (suffix attached to the flag, always creates a backup) or invoke perl -pi -e instead, which behaves consistently. In Dockerfiles and CI pipelines targeting Linux containers this distinction rarely bites you, but it will bite you the moment a teammate runs your script on a Mac.
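A small sketch of the portable form, run against a throwaway file in a temporary directory. The file name and contents are invented for the demonstration.

```shell
# Work in a scratch directory so no real config is touched.
workdir=$(mktemp -d "${TMPDIR:-/tmp}/sed-demo.XXXXXX")
conf="$workdir/config.env"
printf '%s\n' 'DEBUG=true' 'PORT=8080' > "$conf"

# The suffix is attached directly to -i, which both GNU and BSD sed accept.
sed -i.bak 's/DEBUG=true/DEBUG=false/' "$conf"

new_content=$(cat "$conf")          # edited file
backup_content=$(cat "$conf.bak")   # untouched copy of the original

printf 'edited:\n%s\n\nbackup:\n%s\n' "$new_content" "$backup_content"
rm -rf "$workdir"
```

If the backup is not wanted, deleting the .bak file afterwards keeps the script portable at the cost of one extra command.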

awk — Field-Oriented Data Processing

awk is a complete programming language designed around the concept of records and fields. By default, it splits each input line (record) on whitespace into fields ($1, $2, ..., $NF for the last field, $0 for the whole line). It runs a pattern-action program against every record. The canonical form is awk '/pattern/ { action }'.

Built-in variables and special patterns that appear constantly in real scripts:

  • NR — current line (record) number across all files
  • NF — number of fields in the current record
  • FS — field separator (default: whitespace; set with -F)
  • OFS — output field separator (default: space)
  • $0 — the entire current line
  • $1 ... $NF — individual fields
  • BEGIN { } — runs once before any input is read
  • END { } — runs once after all input is consumed

    # Print only the 9th field (HTTP status code) from an nginx access log
    awk '{print $9}' /var/log/nginx/access.log

    # Print IP address and status code for all non-200 responses
    awk '$9 != 200 {print $1, $9}' /var/log/nginx/access.log

    # Sum the bytes sent (field 10) for all requests
    awk '{total += $10} END {print "Total bytes:", total}' /var/log/nginx/access.log

    # Count occurrences of each HTTP status code
    awk '{counts[$9]++} END {for (code in counts) print code, counts[code]}' \
      /var/log/nginx/access.log | sort -k1,1n

    # Parse a colon-separated /etc/passwd and print username + shell
    awk -F: '{print $1, $NF}' /etc/passwd

    # Print lines where field 5 (response time in ms) exceeds 1000
    awk -F',' '$5 > 1000 {print $0}' slow_requests.csv

    # Print the header line plus all lines with status 5xx
    awk 'NR==1 || /,5[0-9][0-9],/' report.csv

    # Compute average response time from a CSV with time in field 4
    # (the count check avoids a division by zero on a header-only file)
    awk -F',' 'NR > 1 {sum += $4; count++} END {if (count) printf "Avg: %.2f ms\n", sum/count}' metrics.csv

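The built-ins combine naturally in a single program. The sketch below is one possible arrangement, run on an invented CSV (hypothetical columns host, status, ms): BEGIN sets OFS, NR skips the header, NF rejects a malformed row, and END prints a per-host average.

```shell
# Made-up sample data; the fourth line is deliberately malformed.
csv=$(printf '%s\n' \
    'host,status,ms' \
    'web01,200,120' \
    'web02,500,950' \
    'truncated line' \
    'web01,200,80')

report=$(printf '%s\n' "$csv" | awk -F',' '
    BEGIN   { OFS = "\t" }                 # tab-separated output
    NR == 1 { next }                       # skip the header record
    NF != 3 { next }                       # skip rows without exactly 3 fields
            { sum[$1] += $3; n[$1]++ }     # accumulate per host
    END     { for (h in sum) print h, n[h], sum[h] / n[h] }
' | sort)

printf '%s\n' "$report"
```

The trailing sort is there because awk does not guarantee any iteration order for "for (h in sum)".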

Composing the Three Tools: Real-World Pipelines

The real power of these tools emerges when you pipe them together. Each tool in the chain is a specialist that does one thing extremely well. You read the chain left-to-right as a data-processing narrative.

[Figure: grep | sed | awk pipeline data flow. Log File (all lines) -> grep (SELECT lines: pattern match, only 5xx errors) -> sed (TRANSFORM text: strip timestamps, normalise format) -> awk (COMPUTE & report: count per IP, sum, average) -> Report (stdout). Each tool processes only the output of the previous stage: no intermediate files, no waiting for the full dataset to load.]
A grep | sed | awk pipeline: lines flow left-to-right, each tool refining the stream further.

    # SCENARIO 1: Rank client IPs by the number of HTTP 500 errors in the current log
    # grep selects the 500 lines, awk extracts the IP, sort+uniq counts them
    grep " 500 " /var/log/nginx/access.log \
      | awk '{print $1}' \
      | sort | uniq -c | sort -rn \
      | head -20

    # SCENARIO 2: Extract slow database queries from a MySQL slow log
    # grep finds the "Query_time" lines, awk keeps those over 2 seconds,
    # sort ranks them slowest first
    grep "Query_time" /var/log/mysql/slow.log \
      | awk '$3 > 2.0 {print $3, $0}' \
      | sort -rn \
      | head -50

    # SCENARIO 3: Summarise disk usage per directory across many servers
    # (output of: ssh server "du -sk /var/log/app/*"; -k prints plain KiB that awk
    # can sum, whereas du -sh prints human-readable sizes such as 1.2G that cannot be added)
    # sed normalises the tab separator to a space, awk sums by name
    echo "Aggregating..."
    for server in web01 web02 web03; do
      ssh "$server" "du -sk /var/log/app/* 2>/dev/null"
    done \
      | sed 's/\t/ /' \
      | awk '{sizes[$2] += $1} END {for (d in sizes) print sizes[d], d}' \
      | sort -rn

    # SCENARIO 4: Redact sensitive fields before shipping logs to a SIEM
    # Replace credit card numbers (simple 16-digit pattern) with [REDACTED]
    grep -v "^#" app.log \
      | sed -E 's/[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}/[REDACTED]/g' \
      > sanitised.log


Performance Considerations at Scale

When processing log files that are gigabytes in size — common in production — tool choice and order have real performance implications:

  • Put grep first in the pipeline. It rejects lines early, reducing the data that sed and awk have to process. Filtering 100 million lines down to 10,000 before handing them to awk is orders of magnitude faster than feeding all 100 million to awk.
  • Use grep -F for fixed strings. Regex matching has overhead; plain string matching (Boyer-Moore algorithm) is significantly faster when you do not need regex.
  • Prefer awk over a shell loop for column arithmetic. A shell loop that calls external commands (cut, expr, bc) forks at least one subprocess per line; awk processes millions of lines in a single process.
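  • For very large files, consider LC_ALL=C. Prefixing a command with LC_ALL=C disables multi-byte character handling and can make grep and awk several times faster on ASCII-dominant logs. The gain depends on the tool version and the pattern, so measure on your own data.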

    # Fastest possible grep on a large ASCII log file
    LC_ALL=C grep -F "ERROR" /var/log/app/app.log | wc -l

    # Process a 10 GB gzipped log file without decompressing to disk
    zcat /var/log/nginx/access.log.gz \
      | LC_ALL=C grep " 500 " \
      | awk '{print $1}' \
      | sort | uniq -c | sort -rn | head -10

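To make the shell-loop point concrete, the sketch below sums the second column of 1,000 generated lines twice: once with a loop that forks cut for every line, and once with a single awk process. The data is synthetic and the variable names are illustrative. Both approaches give the same total; wrapping each in time on a larger input shows the gap.

```shell
# Synthetic input: "req 1" through "req 1000".
data=$(awk 'BEGIN { for (i = 1; i <= 1000; i++) print "req", i }')

# Shell loop: one cut subprocess per line (1,000 forks for 1,000 lines).
loop_total=0
while read -r line; do
    value=$(printf '%s\n' "$line" | cut -d' ' -f2)
    loop_total=$((loop_total + value))
done <<EOF
$data
EOF

# awk: the whole input is handled by one process.
awk_total=$(printf '%s\n' "$data" | awk '{ total += $2 } END { print total }')

echo "loop: $loop_total  awk: $awk_total"
```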
Pro practice — save your pipeline as a script: When you have crafted a multi-stage pipeline that you or your team will run repeatedly (daily report, incident investigation template, cost analysis), save it as a named script in your scripts/ directory. Pipelines that live only in your shell history are operational knowledge that vanishes when you leave the team. Document the input format, the expected output, and the external dependencies at the top of the file.

Common Failure Modes

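  • grep returns exit code 1 (no match) and halts a script running under set -e. In scripts that use set -e (covered in the next lesson), a grep that finds nothing exits with code 1, which aborts the script when grep is a standalone command or the last stage of a pipeline (or any stage, with set -o pipefail). Guard against this with grep ... || true when zero matches is an acceptable outcome. Be aware that || true also hides real errors such as a missing file (exit code 2).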
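  • sed -i on a symlink replaces the link instead of editing its target. GNU sed -i writes a temporary file and renames it over the path you gave, so a symlinked config (common in Ansible-managed environments) becomes a regular file and the original target is left unchanged. Pass --follow-symlinks (GNU sed) to edit the target instead, and verify the result with ls -la after editing.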
  • awk field numbering is 1-based, not 0-based. $0 is the whole line; $1 is the first field. Engineers from Python or Go backgrounds assign the wrong field number and silently get garbage output — always verify with a small sample first.
  • Locale-sensitive regex in awk. Character classes like [[:alpha:]] match differently depending on the locale. Set LC_ALL=C for predictable, ASCII-only matching in production scripts.
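For the first failure mode, a slightly stricter guard than || true is to accept only exit code 1. The helper below is a hypothetical sketch (safe_grep is not a standard command): it treats "no match" as success but still reports real grep errors such as an unreadable file.

```shell
# safe_grep: hypothetical helper. Returns 0 when grep matches OR finds nothing
# (exit 1), and non-zero when grep itself fails (exit 2 or higher).
safe_grep() {
    grep "$@" || [ "$?" -eq 1 ]
}

# Throwaway sample log for the demonstration.
log=$(mktemp "${TMPDIR:-/tmp}/grep-demo.XXXXXX")
printf '%s\n' 'INFO started' 'INFO ready' > "$log"

# No ERROR lines: plain grep would exit 1 and trip set -e; safe_grep returns 0.
no_match_status=0
safe_grep 'ERROR' "$log" || no_match_status=$?

# Missing file: a real error, so the non-zero status is preserved.
missing_status=0
safe_grep 'ERROR' "$log.missing" 2>/dev/null || missing_status=$?

# Normal use: output passes through unchanged.
info_count=$(safe_grep -c 'INFO' "$log")

echo "no match: $no_match_status, missing file: $missing_status, INFO lines: $info_count"
rm -f "$log"
```

The trade-off is one extra function to carry around. For one-off scripts, grep ... || true may be good enough.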

With grep, sed, and awk in your toolkit — and an understanding of how to compose them — you can answer in seconds questions that would otherwise require writing a Python script, loading data into a database, or waiting for a BI dashboard to refresh. At scale, that speed is the difference between a 10-minute incident and a 10-second diagnosis.