Standardizing the Captions on my Photos:  more Sed One-Liners -- by Eugene Reimer 2011-March

I decided to standardize the captions on my photos of living things.  They had been a mixture of Common-name only, Scientific-name only, and both kinds of name in "Com=Sci" style;  the new style puts the Scientific-name in parentheses after the Common-name.  I began by constructing a table of name-pairs, consisting of lines like:
<I>Acorus calamus</I>  <b>Sweet flag</b>
<I>Canis lupus</I>     <b>Grey wolf</b>

It is a plain-text file, with 2 columns delimited with HTML-like brackets.  For some species I've used more than one Common-name, and for cases where the accepted Scientific-name has changed recently, my table has more than one of those too.  Multiple names are separated with an Equalsign;  for example:
<I>Rubus arcticus ssp acaulis=Rubus acaulis</I>  <b>Dwarf raspberry=Stemless arctic raspberry=Stemless raspberry</b>
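
To make the table format concrete, here is a small sketch that splits one column of a table line into its individual names.  The function name names_in_column is hypothetical and is not part of the scripts described on this page;  it only assumes the tag-delimited columns and Equalsign separators shown above.

```shell
# Hypothetical helper (not from the original scripts): print every name found
# in one column of a sci2com-table line, one name per output line.
# $1 is the tag that delimits the column: I (Scientific-names) or b (Common-names).
names_in_column() {
  sed -n "s|.*<$1>\(.*\)</$1>.*|\1|p" | tr '=' '\n'
}

line='<I>Rubus arcticus ssp acaulis=Rubus acaulis</I>  <b>Dwarf raspberry=Stemless arctic raspberry=Stemless raspberry</b>'

printf '%s\n' "$line" | names_in_column I    # the two Scientific-names
printf '%s\n' "$line" | names_in_column b    # the three Common-names
```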

As of 2011-05-07, my sci2com-table contains 1165 such lines, covering all the species I currently have photos of on my website.  Some Scientific-names are given only to the level of genus, family, or order.

When I began this conversion, my captions were a bit of a mess in another way.  In the early years I wrote proper-nouns (place-names, person-names, and species-names) as one word in the "CamelCase" style, because the way I made the Photos-by-Caption page required a single word as the "major-caption".  After going to the use of a colon to separate major- from minor-caption, I became free to use multiple-word names, but I hadn't finished converting the old photo-pages, and they still had many CamelCase names.  My fix-ER-captions-CamelCase does this standardization in 2 steps.  The 1st step inserts a space between any pair of adjacent lowercase+uppercase letters;  for a place-name or person-name that's all that's needed (except for McSomething names, where the CamelCase form is the norm, so no space is added after a "Mc").  For plant- and animal-names we also want to revise the capitalization, and that's the 2nd step of fix-ER-captions-CamelCase.
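
As an illustration of the 1st step, here is a minimal sketch;  it is not the actual fix-ER-captions-CamelCase script.  It assumes GNU sed (for the "\|" alternation), the function name split_camelcase is hypothetical, and "RidingMountain" and "JohnMcDonald" are made-up sample names.  It splits every lowercase+uppercase pair first, then rejoins any "Mc" that starts a word.

```shell
# Hypothetical sketch of step 1 (assumes GNU sed; not the author's script).
# LC_ALL=C keeps [a-z] and [A-Z] limited to plain ASCII letters.
split_camelcase() {
  LC_ALL=C sed \
    -e 's/\([a-z]\)\([A-Z]\)/\1 \2/g' \
    -e 's/\(^\|[^A-Za-z]\)Mc \([A-Z]\)/\1Mc\2/g'
}

printf '%s\n' 'RidingMountain' | split_camelcase   # Riding Mountain
printf '%s\n' 'JohnMcDonald'   | split_camelcase   # John McDonald
printf '%s\n' 'GreyWolf'       | split_camelcase   # Grey Wolf
```

The last result, "Grey Wolf", still has the wrong capitalization for a species-name;  that is what the 2nd step corrects.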

Then I used the early version of fix-ER-captions-sci2com-UNDO, followed by fix-ER-captions-sci2com-DO:  the first converts "Com=Sci" to "Com";  the second converts "Com" to "Com (Sci)" and "Sci" to "Com (Sci)".  Both scripts were originally written for "Com=Sci" name-pairs, then converted to the "Com (Sci)" style.

Both of these, as well as the 2nd step of fix-ER-captions-CamelCase, use the same approach:  a sed s-cmd rearranges a line of the sci2com-table into a sed s-cmd that makes the desired change to a caption.  In other words, the lines of sci2com-table are piped through sed, and the result is piped into a sed-inplace that modifies all photo-webpages.
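
The sed-writes-sed idea can be sketched as follows.  This is an assumption-laden illustration, not the actual script:  the function name gen_sci_to_both is hypothetical, it handles only table lines with one Scientific-name and one Common-name (no Equalsign alternatives), and it applies the generated command to a sample line on stdin rather than in-place to real photo-pages.  Names containing a slash or regex-special characters would need escaping, which is not attempted here.

```shell
# Hypothetical generator (not the author's script): rewrite each table line
# into an s-cmd that changes "Sci" to "Com (Sci)" on caption lines.
# The doubled backslashes emit literal backslashes into the generated s-cmd;
# \1 and \2 here are the Scientific-name and Common-name from the table line.
gen_sci_to_both() {
  sed 's|^<I>\(.*\)</I> *<b>\(.*\)</b>$|/title=/s/\\([ "+=]\\)\1\\([ :"?+]\\)/\\1\2 (\1)\\2/|'
}

table='<I>Canis lupus</I>  <b>Grey wolf</b>'
script=$(printf '%s\n' "$table" | gen_sci_to_both)

# Show the generated s-cmd:
#   /title=/s/\([ "+=]\)Canis lupus\([ :"?+]\)/\1Grey wolf (Canis lupus)\2/
printf '%s\n' "$script"

# Apply it to a made-up caption line:
#   <img src="x.jpg" title="Grey wolf (Canis lupus): at rest">
printf '%s\n' '<img src="x.jpg" title="Canis lupus: at rest">' | sed "$script"
```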

The s-cmd that modifies captions uses LH-context and RH-context to match only the cases that need revision, and avoid ones that don't.  The "context" is a single character in most cases, however the 1st s-cmd (the "Com-->Com(Sci)" s-cmd) in the "DO" script needs a more complex RH-context, to match either (1) a space plus a character-other-than-Leftparen, or (2) Colon|Quote|Query|Plus.  Here are all the caption-modifying s-cmds for the "<I>Canis lupus</I>  <b>Grey wolf</b>" line of the table:
line from sci2com-table:     <I>Canis lupus</I>  <b>Grey wolf</b>
is modified to (CamelCase):  s/[gG]rey [wW]olf/Grey wolf/
is modified to (UNDO):       s/\([ "+=]\)Grey wolf (Canis lupus)/\1Grey wolf/
is modified to (DO cmd-1):   s/\([ "+=]\)Grey wolf\( [^(]\|[:"?+]\)/\1Grey wolf (Canis lupus)\2/
is modified to (DO cmd-2):   s/\([ "+=]\)Canis lupus\([ :"?+]\)/\1Grey wolf (Canis lupus)\2/
One detail not shown in the s-cmds above:  each of them, except for the CamelCase one, has "/title=/" preceding the s-cmd, so it revises only caption-lines.  (Some photo-pages also contain prose, and I decided against applying these revisions to the prose.)
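
To see what the contexts and the "/title=/" address accomplish, here is a small test of the DO cmd-1 shown above, with the "/title=/" prefix added.  It assumes GNU sed (for "\|");  the function name apply_do_cmd1 and the sample input lines are made up for illustration.

```shell
# DO cmd-1 from the table above, restricted to caption lines by /title=/.
do_cmd1='/title=/s/\([ "+=]\)Grey wolf\( [^(]\|[:"?+]\)/\1Grey wolf (Canis lupus)\2/'

# Hypothetical wrapper: run that single command over stdin.
apply_do_cmd1() { sed "$do_cmd1"; }

# A bare Common-name gets the Scientific-name added:
#   <img title="Grey wolf (Canis lupus): resting">
printf '%s\n' '<img title="Grey wolf: resting">' | apply_do_cmd1

# An already-converted caption is left alone, because the RH-context
# rejects a space followed by a Leftparen:
#   <img title="Grey wolf (Canis lupus): resting">
printf '%s\n' '<img title="Grey wolf (Canis lupus): resting">' | apply_do_cmd1

# Prose lines have no "title=", so they are not touched:
#   <p>The Grey wolf is shy.</p>
printf '%s\n' '<p>The Grey wolf is shy.</p>' | apply_do_cmd1
```

Because a converted caption no longer matches, running the command a second time over the same pages changes nothing.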

Here are the scripts:



Send your questions, suggestions, corrections to ereimer@shaw.ca.