#!/bin/bash
## filter receives Apache-logrecords, produces a 404-report (raw or summarized) -- used by WEBGETLOGS -- Eugene Reimer 2010-03-11;
## PREREQ: counted-uniquify-ordered -- from http://ereimer.net/programs/general-purpose-scripts.htm
## USAGE: 404report FORMAT
## where FORMAT is one of: raw byfile bywho bywhat byfileF bywhoF bywhatF noise
## the ending-in-F formats are filtered having noise removed; FORMAT:noise shows only the noise; byfileF is the default;
[ $# -eq 0 ] && set byfileF ##==default is byfileF; was byfile
case $1 in
raw) grep ' 404 ' ;; ##raw: raw logrecords
byfile) grep ' 404 ' |sed 's|^[^"]*"||; s|".*||' |counted-uniquify-ordered ;; ##byfile: summarized by Filename
bywho) grep ' 404 ' |sed 's| .*||' |counted-uniquify-ordered ;; ##bywho : summarized by IP-address
bywhat) grep ' 404 ' |sed 's|"$||; s|.*"||' |counted-uniquify-ordered ;; ##bywhat: summarized by UserAgent
byfileF) grep ' 404 ' |sed 's|^[^"]*"||; s|".*||' |egrep -vi '\.(php|asp)' |counted-uniquify-ordered ;; ##byfileF: summarized by Filename and filtered
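bywhoF) grep ' 404 ' |egrep -vi '^[^"]*"[^"]*\.(php|asp)' |sed 's| .*||' |counted-uniquify-ordered ;; ##bywhoF: summarized by IP-address and filtered; filter tests the request-field, so must run before sed discards it
bywhatF) grep ' 404 ' |egrep -vi '^[^"]*"[^"]*\.(php|asp)' |sed 's|"$||; s|.*"||' |counted-uniquify-ordered ;; ##bywhatF: summarized by UserAgent and filtered; filter tests the request-field, so must run before sed discards it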
noise) grep ' 404 ' |sed 's|^[^"]*"||; s|".*||' |egrep -i '\.(php|asp)' |counted-uniquify-ordered ;; ##noise: summarized by Filename only the noise ones
*) echo "==404report: unknown arg $1" >&2; exit 1 ;; ##bad arg: message to stderr, nonzero return-code
esac
exit
=========
== NOTES:
=========
cmdline: grep ' 404 ' produces "raw" 404-report; but want to summarize;
considered: this script decides whether "email-alert justified" based on counts etc, passing return-code; decided BETTER done in WEBGETLOGS??
Note: WEBGETLOGS presently uses this to make daily (summarized "byfile") 404reports by PROJ; may want longer-term reports, eg: monthly??
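A rough sketch of the considered-but-rejected idea, where the report side decides whether an email-alert is justified and passes a return-code. The function name alert_needed is hypothetical, and so is the meaning given to the two thresholds: these notes do not say what the pair 5,20 measures, so the sketch simply ASSUMES the first is a per-file count and the second a total count. It reads a summary whose lines begin with a count, which is also an assumption about the output of counted-uniquify-ordered.

```shell
# HYPOTHETICAL helper, not part of 404report or WEBGETLOGS.
# ASSUMPTION: each input line starts with a count (like "uniq -c" output).
# ASSUMPTION: $1 is a per-file threshold, $2 a total threshold.
# Returns 0 (alert justified) if any single count >= $1 or the sum >= $2.
alert_needed() {
    awk -v perfile="$1" -v total="$2" '
        BEGIN { max = 0; sum = 0 }
        { sum += $1; if ($1 + 0 > max) max = $1 + 0 }
        END { exit !(max >= perfile || sum >= total) }'
}
```

Possible use, again only as an illustration: `404report byfileF <logfile | alert_needed 5 20 && echo "alert"`.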
SUMMARIZING:
============
(A) sed-cmd to isolate filename, then counted-uniquify-ordered to make a counted uniquified ordered-by-count summary;
will sometimes want to see who/what is producing 404s, but 'tis easy enough to grep raw logfiles for those details, ie: no need for raw + summarized reports??
how much of this is needed?? ereimer.net is getting a lot of 404s mostly from bots still using pre-directory-reorg filenames...
(B) keep IP-address, UserAgent, as well as Filename; would still get fairly good combining since multiple-404s are often from same who+what...
(C) produce 3 tables: top-10 Filenames, top-10 IP-addresses (who), top-10 UserAgents (what, to spot bots);
==would require revised test-for-unusual in WEBGETLOGS;
NOTE: have written the (trivial) cmdlines needed for each of the above; although (B) would require some rearranging...
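The cmdlines for (B) and (C) are not recorded in these notes; the following is one possible shape for them, as a sketch only. The function names cuo, by_combined and top10_tables are made up for illustration. cuo is a stand-in for counted-uniquify-ordered built from sort and uniq; the real script's output format may differ. The sed patterns assume Apache "combined" logrecords, where the request is the first quoted field and the UserAgent is the last.

```shell
# Stand-in for counted-uniquify-ordered (ASSUMPTION: real output may differ):
# count duplicate lines, most frequent first.
cuo() { sort | uniq -c | sort -rn; }

# (B) one summary keyed on IP-address + request + UserAgent.
# ASSUMPTION: combined log format, UserAgent is the last quoted field.
by_combined() {
    grep ' 404 ' |
    sed -E 's|^([^ ]+) [^"]*"([^"]*)".*"([^"]*)"$|\1  \2  \3|' |
    cuo
}

# (C) three top-10 tables; stdin is saved once so it can be read three times.
top10_tables() {
    local tmp
    tmp=$(mktemp) || return 1
    grep ' 404 ' >"$tmp"
    echo "== top-10 Filenames"
    sed 's|^[^"]*"||; s|".*||' "$tmp" | cuo | head -n 10
    echo "== top-10 IP-addresses"
    sed 's| .*||' "$tmp" | cuo | head -n 10
    echo "== top-10 UserAgents"
    sed 's|"$||; s|.*"||' "$tmp" | cuo | head -n 10
    rm -f "$tmp"
}
```

Possible use: `cat /pix/WEBLOGS/access_log* | top10_tables`. The three sed expressions in (C) are the same ones the byfile, bywho and bywhat flavours use.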
FILTERING:
==========
need to ignore "noise" requests to avoid nuisance email-alerts from WEBGETLOGS?? (Recall reason for 404-report: to help spot Redirect-needed)
to design such filtering, may want long-term summaries which are easily made;
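One way such a long-term summary might be made, as a sketch: key each 404 on month plus request, then count. The function name by_month_file is hypothetical, and sort/uniq stand in for counted-uniquify-ordered. The sed pattern assumes Apache's usual timestamp layout, [dd/Mon/yyyy:hh:mm:ss zone], with the request as the first quoted field.

```shell
# HYPOTHETICAL long-term summary: 404s counted per month + request.
# ASSUMPTION: timestamp looks like [11/Mar/2010:00:00:01 -0600].
by_month_file() {
    grep ' 404 ' |
    sed -E 's|^[^[]*\[[0-9]+/([A-Za-z]+/[0-9]+):[^"]*"([^"]*)".*|\1 \2|' |
    sort | uniq -c | sort -rn
}
```

Possible use: `cat /pix/WEBLOGS/access_log* | by_month_file | head -n 40`. Ordering is by count across all months; piping through `grep ' Mar/2010 '` would narrow it to one month.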
==COMMON NOISE 404s:
.../scripts/setup.php --or possibly: .../scripts/...
.../phpMyAdmin-2.6.3/main.php --also with other phpMyAdmin-versions eg 2.2.3 2.6.2-rc1; prefixed by eg /admin /database /db /mysql /sql
/index.php?...
/add-to-cart.asp?...
/ShoppingCart.aspx?...
/viewcart.asp?...
==consider any ".php|.asp|.aspx" suffix implies noise; "florafind" website will then need to use SUFFIX OTHER THAN .php for its PHP-pages==!!==
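To see how the suffix rule sorts individual requests, here is a small sketch that labels each request line as noise or keep. The function name classify is hypothetical. The pattern is the same \.(php|asp) used by the F flavours; it needs no separate aspx alternative because asp is a prefix of aspx, and the -i flag covers upper-case suffixes. Note the pattern matches anywhere in the line, not only at the end of the filename.

```shell
# HYPOTHETICAL helper: label each request line (as produced by the byfile
# sed step) as "noise" or "keep", tab-separated from the request itself.
classify() {
    local req
    while IFS= read -r req; do
        if printf '%s\n' "$req" | grep -Eqi '\.(php|asp)'; then
            printf 'noise\t%s\n' "$req"
        else
            printf 'keep\t%s\n' "$req"
        fi
    done
}
```

Possible use, with made-up query strings: `printf '%s\n' 'GET /index.php?id=1 HTTP/1.1' 'GET /programs/scalepix HTTP/1.1' | classify` labels the first request noise and the second keep.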
==ANOMALOUS 404s:
/programs/$Click <--from /programs/scalepix bash-program containing "", and very weird parsing by some robot--??--
/naver.js /navER.js <--from /nav1-head.txt containing "" -- though inside an HTML-comment--??--
/navER.css <--from /nav1-head.txt containing " -- though inside an HTML-comment--??--
GOING THROUGH ALL 404s:
/tmp/404report-ALL-php: counted-uniquify-ordered list of all 404s containing .php|.asp, made with:
grep ' 404 ' /pix/WEBLOGS/access_log* |sed 's|^[^"]*"||; s|".*||' |egrep -i '\.(php|asp)' |counted-uniquify-ordered >/tmp/404report-ALL-php
==is there any reasonably simple pattern that matches "most" of them?? ==NOPE, not other than the brutally simple .php|.asp==!!==
==means never using .php suffix on webpages even for site using PHP; solution: will just use .htm==!!==
2010-10-09: added the with-filtering summarizing-flavours (byfileF, bywhoF, bywhatF); and made byfileF the default;
(temp) lowered triggering thresholds, in WEBGETLOGS, from 5,20 to 3,6; but went back to 5,20;