I needed to check a website, because some of its links no longer exist: opening certain pages returns a “404”.
#!/bin/bash
echo "Checking broken links on $1"
echo "Extract all links in the main page..."
wget --spider -r -nd -nv -H -l 1 -w 2 -o weblinks.log "$1"
echo "Identify lines that contain sub-pages and drop them into dirs.log..."
grep "index.html" weblinks.log > dirs.log
echo " Extracting URLs..."
awk '{print $3}' dirs.log > dirs2.log
sed 's/URL://g' dirs2.log > dirs3.log
sort dirs3.log | uniq > dirs4.log
echo "Checking every sub-page..."
while read -r line; do
  echo "processing $line"
  fn=$(date +%s).log
  echo "$line" >> summary.log
  wget --spider -r -nd -nv -H -l 1 -w 2 -o "$fn" "$line"
  echo "$fn done"
  tail -n 15 "$fn" >> summary.log
done < dirs4.log > process_results.log
echo "Done"
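To see what the extraction steps actually do, here is a minimal sketch run against fabricated `wget -nv` log lines (the sample lines and URLs below are made up for illustration; the real log format varies a little between wget versions):

```shell
#!/bin/bash
# Fabricated sample of wget -nv spider output.
cat > sample.log <<'EOF'
2024-01-01 10:00:00 URL:http://example.com/a/index.html 200 OK
2024-01-01 10:00:02 URL:http://example.com/b/index.html 200 OK
2024-01-01 10:00:04 URL:http://example.com/a/index.html 200 OK
EOF

# Same pipeline as the script: keep index.html lines, take the third
# field, strip the "URL:" prefix, then sort and de-duplicate.
grep "index.html" sample.log \
  | awk '{print $3}' \
  | sed 's/URL://g' \
  | sort -u
# prints the two unique URLs, one per line
```

Note that `sort -u` does in one step what the script does with `sort | uniq`.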
Explaining the messy code:
$1
    the website URL, passed as an argument to the script

wget --spider -r -nd -nv -H -l 1 -w 2 -o weblinks.log "$1"
    checks the main page ($1), scanning one depth level (-l 1), waiting 2 seconds between every request (-w 2), and dumping its results to weblinks.log

grep "index.html" weblinks.log > dirs.log
    extracts all the sub-pages (identified by "index.html" in the previous file) and drops them into another file (dirs.log)

awk '{print $3}' dirs.log > dirs2.log
sed 's/URL://g' dirs2.log > dirs3.log
    extract only the URL from each line (the third field), then strip the "URL:" prefix

sort dirs3.log | uniq > dirs4.log
    sorts the list and removes duplicates

while read -r line; do ... done < dirs4.log > process_results.log
    - reads every line of the file containing URLs
    - sets up a timestamp as the filename for a per-URL log file (fn=$(date +%s).log)
    - appends the URL to a final log file (echo "$line" >> summary.log)
    - does a second scan on every sub-page
    - appends the last 15 rows of that scan's log to the summary file
    - keeps a parallel log file (> process_results.log)
    - the result is summary.log, containing the URLs and any broken links, plus the log file process_results.log
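One bug worth calling out: the original one-liner wrote to summary.log with `>` inside the loop, so each iteration clobbered the file and only the last URL survived; `>>` appends instead. A runnable sketch of the corrected append logic, with the wget call replaced by a stub so it runs offline (the file names and URLs here are invented):

```shell
#!/bin/bash
# Sample URL list standing in for dirs4.log (URLs are made up).
printf 'http://example.com/a/\nhttp://example.com/b/\n' > urls.txt

: > summary.log                       # start with an empty summary
while read -r line; do
  # Stub standing in for: wget --spider -r -nd -nv -H -l 1 -w 2 -o "$fn" "$line"
  echo "spider results for $line" > per_url.log
  echo "$line" >> summary.log         # '>>' appends; '>' would keep only the last URL
  tail -n 15 per_url.log >> summary.log
done < urls.txt

wc -l < summary.log                   # 4 lines: two URLs, one stub line each
```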