I needed to check a website, because some of its links no longer exist: opening certain pages returns a “404”.
#!/bin/bash
echo "Checking broken links on $1"
echo "Extract all links in the main page..."
wget --spider -r -nd -nv -H -l 1 -w 2 -o weblinks.log "$1"
echo "Identify lines that contain sub-pages and drop them into dirs.log..."
grep "index.html" weblinks.log > dirs.log
echo " Extracting URLs..."
awk '{print $3}' dirs.log > dirs2.log
sed 's/URL://g' dirs2.log > dirs3.log
sort dirs3.log | uniq > dirs4.log
echo "Checking every sub-page..."
while read -r line; do
  echo "processing $line"
  fn=$(date +%s).log
  echo "$line" >> summary.log
  wget --spider -r -nd -nv -H -l 1 -w 2 -o "$fn" "$line"
  echo "$fn done"
  tail -n 15 "$fn" >> summary.log
done < dirs4.log > process_results.log
echo "Done"
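To see what the extraction steps actually do, here is a minimal sketch run against fabricated `wget -nv` log lines (the sample lines and URLs below are made up for illustration; the real log format varies a little between wget versions):

```shell
#!/bin/bash
# Fabricated sample of wget -nv spider output.
cat > sample.log <<'EOF'
2024-01-01 10:00:00 URL:http://example.com/a/index.html 200 OK
2024-01-01 10:00:02 URL:http://example.com/b/index.html 200 OK
2024-01-01 10:00:04 URL:http://example.com/a/index.html 200 OK
EOF

# Same pipeline as the script: keep index.html lines, take the third
# field, strip the "URL:" prefix, then sort and de-duplicate.
grep "index.html" sample.log \
  | awk '{print $3}' \
  | sed 's/URL://g' \
  | sort -u
# prints the two unique URLs, one per line
```

Note that `sort -u` does in one step what the script does with `sort | uniq`.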
Explaining the messy code:
$1
    the website URL, passed as an argument to the script

wget --spider -r -nd -nv -H -l 1 -w 2 -o weblinks.log "$1"
    checks the main page ($1), scanning one depth level (-l 1), waiting 2 seconds between every request (-w 2), and dumping its results to weblinks.log

grep "index.html" weblinks.log > dirs.log
    extracts all the sub-pages (identified by "index.html" in the previous file) and drops them into another file (dirs.log)

awk '{print $3}' dirs.log > dirs2.log
sed 's/URL://g' dirs2.log > dirs3.log
    extract only the URL from each line (the third field), then strip the "URL:" prefix

sort dirs3.log | uniq > dirs4.log
    sorts the list and removes duplicates

while read -r line; do ... done < dirs4.log > process_results.log
    - reads every line of the file containing URLs
    - sets up a timestamp as the filename for a per-URL log file (fn=$(date +%s).log)
    - appends the URL to a final log file (echo "$line" >> summary.log)
    - does a second scan on every sub-page
    - appends the last 15 rows of that scan's log to the summary file
    - keeps a parallel log file (> process_results.log)
    - the result is summary.log, containing the URLs and any broken links, plus the log file process_results.log
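One bug worth calling out: the original one-liner wrote to summary.log with `>` inside the loop, so each iteration clobbered the file and only the last URL survived; `>>` appends instead. A runnable sketch of the corrected append logic, with the wget call replaced by a stub so it runs offline (the file names and URLs here are invented):

```shell
#!/bin/bash
# Sample URL list standing in for dirs4.log (URLs are made up).
printf 'http://example.com/a/\nhttp://example.com/b/\n' > urls.txt

: > summary.log                       # start with an empty summary
while read -r line; do
  # Stub standing in for: wget --spider -r -nd -nv -H -l 1 -w 2 -o "$fn" "$line"
  echo "spider results for $line" > per_url.log
  echo "$line" >> summary.log         # '>>' appends; '>' would keep only the last URL
  tail -n 15 per_url.log >> summary.log
done < urls.txt

wc -l < summary.log                   # 4 lines: two URLs, one stub line each
```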