Bash script to check broken links in a website or 404 messages

I needed to check a website because some of its links no longer exist, and opening certain pages returns a "404" error.
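As an aside, a single suspicious page can be checked by hand with curl before crawling the whole site (the URL below is only a placeholder):

# Print only the HTTP status code of one page; 404 means the page is broken
curl -s -o /dev/null -w "%{http_code}\n" https://example.com/missing-page

To check an entire site, the script below does the crawling with wget --spider.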

#!/bin/bash
# Usage: ./check_links.sh <website-url>
echo "Checking broken links on $1"
echo "Extracting all links from the main page..."
wget --spider -r -nd -nv -H -l 1 -w 2 -o weblinks.log "$1"
echo "Identifying lines that contain sub-pages and dropping them into dirs.log..."
grep "index.html" weblinks.log > dirs.log
echo "Extracting URLs..."
awk -F " " '{print $3}' dirs.log > dirs2.log
sed 's/URL://g' dirs2.log > dirs3.log
sort dirs3.log | uniq > dirs4.log
echo "Checking every sub-page..."
: > summary.log   # start each run with a fresh summary file
while read -r line; do
  echo "processing $line"
  fn=$(date +%s).log
  echo "$line" >> summary.log
  wget --spider -r -nd -nv -H -l 1 -w 2 -o "$fn" "$line"
  echo "$fn done"
  tail -n 15 "$fn" >> summary.log
done < dirs4.log > process_results.log
echo "Done"

Explaining the messy code:

$1 - the website URL, passed as an argument.
wget --spider -r -nd -nv -H -l 1 -w 2 -o weblinks.log "$1" - checks the main page ($1), scanning one depth level (-l 1), waiting 2 seconds between every request (-w 2), and dumps the results into a file (weblinks.log).
grep "index.html" weblinks.log > dirs.log - extracts all the sub-pages (identified by index.html in the previous file) and drops them into another file (dirs.log).
awk -F " " '{print $3}' dirs.log > dirs2.log and sed 's/URL://g' dirs2.log > dirs3.log - extract only the URL from each log line (a shorter, single-pipeline version of these extraction steps is sketched after this list).
sort dirs3.log | uniq > dirs4.log - sorts the list and removes duplicates.
while read -r line; do ... done < dirs4.log > process_results.log - reads every line of the file containing the URLs, and for each one:
- sets up a timestamp as the filename of a per-URL log file (fn=$(date +%s).log)
- appends the URL to a final log file (echo "$line" >> summary.log)
- does a second scan of every sub-page
- appends the last 15 rows of that scan to the summary file
- keeps a parallel log file (> process_results.log)
The result is summary.log, containing the URLs and their broken links, plus the log file process_results.log.
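As mentioned above, the extraction steps could probably be collapsed into one pipeline. This is only a sketch, relying on the same assumption the script already makes (that the URL sits in the third field of each matching log line, prefixed with URL:):

# Extract the third field, strip the "URL:" prefix, and de-duplicate in one pass
grep "index.html" weblinks.log | awk '{print $3}' | sed 's/^URL://' | sort -u > dirs4.log

Here sort -u replaces the separate sort | uniq step, and the intermediate dirs*.log files are no longer needed.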