Bash script to check broken links in a website or 404 messages

I needed to check a website, because some of its links no longer exist and opening some pages returns a "404" error.

#!/bin/bash
echo "Checking broken links on $1"
echo "Extracting all links from the main page..."
wget --spider -r -nd -nv -H -l 1 -w 2 -o weblinks.log "$1"
echo "Identifying lines that contain sub-pages and dropping them to dirs.log..."
cat weblinks.log | grep "index.html" > dirs.log
echo "Extracting URLs..."
awk -F " " '{print $3}' dirs.log > dirs2.log
sed 's/URL://g' dirs2.log > dirs3.log
sort dirs3.log | uniq > dirs4.log
echo "Checking every sub-page..."
> summary.log                                              # start with an empty summary file
while read -r line; do
  echo "processing $line"
  fn=$(date +%s).log                                       # per-URL log file named with a timestamp
  echo "$line" >> summary.log                              # record the URL in the summary
  wget --spider -r -nd -nv -H -l 1 -w 2 -o "$fn" "$line"
  echo "$fn done"
  tail -n 15 "$fn" >> summary.log                          # append the tail of that log to the summary
done < dirs4.log > process_results.log
echo "Done"

Explaining the messy code:

$1 : pass the website URL as an argument.
wget --spider -r -nd -nv -H -l 1 -w 2 -o weblinks.log "$1" : checks the main page ($1), scanning one depth level (-l 1), waiting 2 seconds between every request (-w 2), and dumps the results into weblinks.log.
cat weblinks.log | grep "index.html" > dirs.log : extracts all the sub-pages (identified by index.html in the previous file) and drops them into another file (dirs.log).
awk -F " " '{print $3}' dirs.log > dirs2.log and sed 's/URL://g' dirs2.log > dirs3.log : extract only the URLs from that list.
sort dirs3.log | uniq > dirs4.log : sorts the list and removes duplicates.
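
Side note: the same extraction can be collapsed into a single pipeline that writes dirs4.log directly and skips the intermediate dirs2.log and dirs3.log files; this is just an equivalent sketch, not what the script above uses:

# same grep/awk/sed/sort steps, chained together
cat weblinks.log | grep "index.html" | awk -F " " '{print $3}' | sed 's/URL://g' | sort | uniq > dirs4.log
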
while read -r line; do ... done < dirs4.log > process_results.log : reads every line of the file containing the URLs and, for each URL:
- sets up a timestamp as the filename of a per-URL log file (fn=$(date +%s).log)
- adds the URL to the final summary file (echo "$line" >> summary.log)
- does a second scan of that sub-page
- copies the last 15 rows of that scan's log into the summary file
- keeps a parallel log of the loop's own output (> process_results.log)
- ends up creating summary.log, containing the URLs and their broken links, plus the process_results.log log file
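
Once the script finishes, the quickest way to spot the dead pages is to grep the generated logs for the failure messages. The exact wording differs between wget versions, so the pattern below (matching either "404" or "broken link") is only a rough sketch:

# search the summary for lines that look like failures (wording varies by wget version)
grep -i -E "404|broken link" summary.log
# or search every per-URL log at once
grep -i -E "404|broken link" *.log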