Add "noarchive" to all HTML -files in a directory
This is unfinished experimental code.
Try and use completely on your own risk.
For example, when I have the following file "source.html" in the directory "C:\noarchive", the program generates the file "source.html.tmp". Removing two comment characters in the Perl source code, the original file could be replaced. The program treats all files in the directory and all subdirectories with names ending in "source.html".
source.html
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Virtual Library</title>
</head>
<body>
<p>Moved to <a href="http://example.org/">example.org</a>.</p>
</body>
</html>source.html.tmp
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Virtual Library</title>
<meta name="robots" content="noarchive" />
</head>
<body>
<p>Moved to <a href="http://example.org/">example.org</a>.</p>
</body>
</html>
When run on the output of the previous phase, it generates the following output. The "noarchive" is not added twice.
source.html.tmp (1)
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Virtual Library</title>
<meta name="robots" content="noarchive" />
</head>
<body>
<p>Moved to <a href="http://example.org/">example.org</a>.</p>
</body>
</html>
Here is a source file, that already has meta-elements
source.html (2)
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Virtual Library</title>
<meta name="alpha" content="beta" />
<meta name="robots" content="index, follow" />
<meta name="gamma" content="delta" />
</head>
<body>
<p>Moved to <a href="http://example.org/">example.org</a>.</p>
</body>
</html>
And the output is:
source.html.tmp (2)
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Virtual Library</title>
<meta name="alpha" content="beta" />
<meta name="robots" content="index, follow, noarchive" />
<meta name="gamma" content="delta" />
</head>
<body>
<p>Moved to <a href="http://example.org/">example.org</a>.</p>
</body>
</html>
However, beware! This program has not been extensively tested or used and probably still has many bugs!
edit.pl
#
# Copyright 2006 Stefan Ram.
#
# This is unfinished experimental code that still needs to be
# reviewed, tested, and edited before used for any purpose.
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
#
use strict;
use warnings;
use IO::File;
sub assertrobots($)
{ if( $_[ 0 ] !~ /<\s*meta[^>]+name\s*=\s*["']?robots/ )
{ $_[ 0 ] =~ s{</head>}{<meta name="robots" content="" />\n</head>}; }
print $_[ 0 ], "\n"; }
sub addnoarchive_($)
{
print "addnoarchive_ called( \"", $_[ 0 ], "\" )\n";
for my $q ( '"', "'" )
{
$_[ 0 ] =~ s{
content\s*=\s*$q([^$q]*)$q
}{
my( $values )=( $1 );
print "addnoarchive_ matched( $values )", "\n";
$values =~ s/^ //g; $values =~ s/ $//g;
my $text = $values;
if( length $text )
{ if( $text !~ /noarchive/ )
{ $text .= ', noarchive'; }}
else{ $text = 'noarchive'; }
"content=\"$text\""
}egisx;
}
}
# todo: this must clear the attribute first
sub addnoarchive($)
{
for my $q ( '"', "'" )
{
$_[ 0 ] =~ s{
<meta([^>]+)name\s*=\s*${q}robots${q}?([^>]*)>
}{
print STDERR "addnoarchive matched", "\n";
my( $left, $right )=( $1, $2 );
addnoarchive_($left);
addnoarchive_($right);
$left =~ s/^ //g; $left =~ s/ $//g;
$right =~ s/^ //g; $right =~ s/ $//g;
my $text = $left;
if( length $text ){ $text .= ' ' . $right; }
else{ $text = $right; }
"<meta name=\"robots\" $text>"
}egisx;
}
print $_[ 0 ];
}
sub noarchive($$)
{ my( $text, $status )= @_;
print "noachive", $status, "\n";
if( $status )
{ if( $status > 0 ){ assertrobots( $_[ 0 ]); addnoarchive( $_[ 0 ]); }
if( $status < 0 ){ removenoarchive( $_[ 0 ]); hiderobots( $_[ 0 ]); }}
return 1; }
sub modify($)
{ return noarchive( $_[ 0 ], 1 ); }
sub srp($$)
{ my( $source, $target )= @_;
my ${kv_in} = IO::File->new();
{ if( open( $kv_in, "<$source" ))
{ my $infile; undef $/;$infile = <$kv_in>;
if( close( $kv_in ))
{ { my $directive;
if( modify( $infile ))
{ my ${kv_out} = IO::File->new();
if( open( $kv_out, ">$target" ))
{ print $kv_out $infile;
if(close( $kv_out ))
{ # unlink( $source );
# rename( $target, $source );
print $source; }}}}}}}}
sub file($)
{ my( $path )= @_;
if( $path =~ /source\.html$/ )
{ print "srp " . $path . "\n";
srp( $path, $path . ".tmp" );
print $path . "\n"; }}
sub dodir($)
{ my( $entries )= @_;
foreach( @$entries )
{ if( -f ) # file
{ file( $_ ); }
if( -d ) # directory
{ directory( $_ ); }}}
sub directory($)
{ my( $path )= @_;
if( opendir( DIR, $path ))
{ my @entries = map{"$path/$_"} grep !/^(\.|\.\.)$/, readdir DIR;
dodir( \@entries );
closedir DIR; }}
sub start
{ my $path = shift @ARGV if @ARGV;
$path ||= 'C:\\noarchive';
directory( $path ); return 0; }
start;
# $_[0] =~ s/example/example/g;
##############################################################################
=head2 Name
assertrobots - assert a meta-robots element
=head2 Usage
assertrobots( $html_document );
# $html_document an HTML document
=head2 Description
This sub asserts that an HTML document contains a robot-meta-Element:
If none is found, it is created and inserted at the end of the
head element.
This sub uses heuristics, it does not parse the HTML properly, so
it will fail under some conditions.
=head2 Parameters
=over 4
=item $html_document
A html_document text, which might be modified by the sub invocation.
=back
=head2 Result
(none)
=cut