Add "`noarchive`" to all HTML -files in a directory

This is unfinished experimental code.

Try and use completely on your own risk.

For example, when I have the following file "source.html" in the directory "C:\noarchive", the program generates the file "source.html.tmp". Removing two comment characters in the Perl source code, the original file could be replaced. The program treats all files in the directory and all subdirectories with names ending in "source.html".

source.html

<?xml version="1.0" encoding="UTF-8"?> 
<!DOCTYPE html  
     PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> 
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> 
  <head> 
    <title>Virtual Library</title> 
  </head> 
  <body> 
    <p>Moved to <a href="http://example.org/">example.org</a>.</p> 
  </body> 
</html>

source.html.tmp

<?xml version="1.0" encoding="UTF-8"?> 
<!DOCTYPE html  
     PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> 
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> 
  <head> 
    <title>Virtual Library</title> 
  <meta name="robots" content="noarchive" /> 
</head> 
  <body> 
    <p>Moved to <a href="http://example.org/">example.org</a>.</p> 
  </body> 
</html>

When run on the output of the previous phase, it generates the following output. The "noarchive" is not added twice.

source.html.tmp (1)

<?xml version="1.0" encoding="UTF-8"?> 
<!DOCTYPE html  
     PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> 
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> 
  <head> 
    <title>Virtual Library</title> 
  <meta name="robots" content="noarchive" /> 
</head> 
  <body> 
    <p>Moved to <a href="http://example.org/">example.org</a>.</p> 
  </body> 
</html>

Here is a source file, that already has meta-elements

source.html (2)

<?xml version="1.0" encoding="UTF-8"?> 
<!DOCTYPE html  
     PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> 
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> 
  <head> 
    <title>Virtual Library</title> 
  <meta name="alpha" content="beta" /> 
  <meta name="robots" content="index, follow" /> 
  <meta name="gamma" content="delta" /> 
</head> 
  <body> 
    <p>Moved to <a href="http://example.org/">example.org</a>.</p> 
  </body> 
</html>

And the output is:

source.html.tmp (2)

<?xml version="1.0" encoding="UTF-8"?> 
<!DOCTYPE html  
     PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> 
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> 
  <head> 
    <title>Virtual Library</title> 
  <meta name="alpha" content="beta" /> 
  <meta name="robots" content="index, follow, noarchive" /> 
  <meta name="gamma" content="delta" /> 
</head> 
  <body> 
    <p>Moved to <a href="http://example.org/">example.org</a>.</p> 
  </body> 
</html>

However, beware! This program has not been extensively tested or used and probably still has many bugs!

edit.pl

# 
# Copyright 2006 Stefan Ram. 
# 
# This is unfinished experimental code that still needs to be 
# reviewed, tested, and edited before used for any purpose. 
# 
# This program is free software; you can redistribute it and/or modify 
# it under the terms of the GNU General Public License as published by 
# the Free Software Foundation; either version 2 of the License, or 
# (at your option) any later version. 
# 
# This program is distributed in the hope that it will be useful, 
# but WITHOUT ANY WARRANTY; without even the implied warranty of 
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the 
# GNU General Public License for more details. 
# 
# You should have received a copy of the GNU General Public License 
# along with this program; if not, write to the Free Software 
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA 
# 
 
 
use strict; 
use warnings; 
 
use IO::File; 
 
sub assertrobots($) 
{ if( $_[ 0 ] !~ /<\s*meta[^>]+name\s*=\s*["']?robots/ ) 
  { $_[ 0 ] =~ s{</head>}{<meta name="robots" content="" />\n</head>}; } 
  print $_[ 0 ], "\n"; } 
 
sub addnoarchive_($) 
{  
  print "addnoarchive_ called( \"", $_[ 0 ], "\" )\n"; 
  for my $q ( '"', "'" ) 
  { 
    $_[ 0 ] =~ s{ 
      content\s*=\s*$q([^$q]*)$q 
    }{ 
      my( $values )=( $1 ); 
      print "addnoarchive_ matched( $values )", "\n"; 
      $values =~ s/^ //g; $values =~ s/ $//g; 
      my $text = $values; 
      if( length $text ) 
      { if( $text !~ /noarchive/ ) 
        { $text .= ', noarchive'; }} 
      else{ $text = 'noarchive'; } 
      "content=\"$text\"" 
      }egisx; 
    }  	 
   
  } 
   
# todo: this must clear the attribute first 
sub addnoarchive($) 
{  
  for my $q ( '"', "'" ) 
  { 
    $_[ 0 ] =~ s{ 
      <meta([^>]+)name\s*=\s*${q}robots${q}?([^>]*)> 
    }{ 
      print STDERR "addnoarchive matched", "\n"; 
      my( $left, $right )=( $1, $2 ); 
      addnoarchive_($left); 
      addnoarchive_($right); 
      $left =~ s/^ //g; $left =~ s/ $//g; 
      $right =~ s/^ //g; $right =~ s/ $//g; 
      my $text = $left; 
      if( length $text ){ $text .= ' ' . $right; } 
      else{ $text = $right; } 
      "<meta name=\"robots\" $text>" 
      }egisx; 
    } 
    print $_[ 0 ]; 
  } 
 
sub noarchive($$) 
{ my( $text, $status )= @_; 
  print "noachive", $status, "\n"; 
  if( $status ) 
  { if( $status > 0 ){ assertrobots( $_[ 0 ]); addnoarchive( $_[ 0 ]); } 
    if( $status < 0 ){ removenoarchive( $_[ 0 ]); hiderobots( $_[ 0 ]); }} 
  return 1; } 
 
sub modify($) 
{ return noarchive( $_[ 0 ], 1 ); } 
 
sub srp($$) 
{ my( $source, $target )= @_; 
  my ${kv_in}  = IO::File->new(); 
  { if( open( $kv_in, "<$source" )) 
    { my $infile; undef $/;$infile = <$kv_in>; 
      if( close( $kv_in )) 
      { { my $directive; 
          if( modify( $infile )) 
          { my ${kv_out} = IO::File->new(); 
            if( open( $kv_out, ">$target" )) 
            { print $kv_out $infile; 
              if(close( $kv_out )) 
              { # unlink( $source ); 
                # rename( $target, $source ); 
                print $source; }}}}}}}} 
 
sub file($) 
{ my( $path )= @_; 
  if( $path =~ /source\.html$/ ) 
  { print "srp " . $path . "\n"; 
    srp( $path, $path . ".tmp" ); 
    print $path . "\n"; }} 
 
sub dodir($) 
{ my( $entries )= @_; 
  foreach( @$entries ) 
  { if( -f ) # file 
    { file( $_ ); } 
    if( -d ) # directory 
    { directory( $_ ); }}} 
 
sub directory($) 
{ my( $path )= @_; 
  if( opendir( DIR, $path )) 
  { my @entries = map{"$path/$_"} grep !/^(\.|\.\.)$/, readdir DIR; 
    dodir( \@entries ); 
    closedir DIR; }} 
 
sub start 
{ my $path = shift @ARGV if @ARGV;  
  $path ||= 'C:\\noarchive'; 
  directory( $path ); return 0; } 
 
start; 
 
 
# $_[0] =~ s/example/example/g; 
 
############################################################################## 
 
=head2 Name 
 
assertrobots - assert a meta-robots element  
 
=head2 Usage 
 
  assertrobots( $html_document ); 
  # $html_document an HTML document 
 
=head2 Description 
 
This sub asserts that an HTML document contains a robot-meta-Element: 
If none is found, it is created and inserted at the end of the 
head element. 
 
This sub uses heuristics, it does not parse the HTML properly, so 
it will fail under some conditions. 
 
=head2 Parameters 
 
=over 4 
 
=item $html_document 
 
A html_document text, which might be modified by the sub invocation. 
 
=back 
 
=head2 Result 
 
(none) 
 
=cut

Add "noarchive" to all HTML -files in a directory

Add "`noarchive`" to all HTML -files in a directory