specification                                                     S. Ram
                                                           Stefan L. Ram
                                                         August 13, 2004


               Filode - A Filename Representation of Text

Abstract

   This memo describes a notation for Unicode text.  The notation uses
   only the uppercase latin letters from "A" to "Z" and the arabic
   digits from "0" to "9".  It can be used to represent any Unicode
   text when only a limited set of characters is available.

   For example, one might want to save a Usenet article to a file using
   the message-ID as a filename.  A message-ID, however, might contain
   characters that are not allowed within a filename.  Filode can be
   used to convert the message-ID to a representation using only
   characters that are allowed in a filename, so that a filename can be
   derived from a message-ID in a unique way.

   For example, the message-ID
   "<6d7814d4.0408061939.3856e8a4@posting.google.com>" is represented
   by the text
   "XL6D7814D4XP0408061939XP3856E8A4XVPOSTINGXPGOOGLEXPCOMXR".

   The representation is still somewhat readable: the left and right
   angle brackets are represented by "XL" and "XR", a point by "XP",
   digits by themselves, and most lowercase letters by their uppercase
   counterparts.

Table of Contents

   1.  Interpretation of Filode
   1.1 Arabic Digits From "0" To "9"
   1.2 Uppercase Latin Letters From "A" To "W"
   1.3 Sequences Starting With "Y"
   1.4 Sequences Starting With "X" And an Uppercase Letter From "G"
       To "W"
   1.5 Sequences Starting With "X" And an Uppercase Hexadecimal Digit
   1.6 Other Rules For Filode-Text
       Author's Address
   A.  Examples Of Texts And Their Filode-Representation
   B.  A Perl-Subroutine to convert Text to Filode
   C.  A Java-Method to convert Text to Filode
   D.  Acknowledgements

1.  Interpretation of Filode

   A Filode-interpreter starts to read a Filode-text in its start
   state.  Each of the characters and character sequences described
   below may occur in the start state; once such a sequence has been
   read completely, the interpreter is in the start state again.

1.1  Arabic Digits From "0" To "9"

   The arabic digits from "0" to "9" (Unicode code points from 48 to
   57) represent themselves.

1.2  Uppercase Latin Letters From "A" To "W"

   The uppercase latin letters from "A" to "W" (Unicode code points
   from 65 to 87) represent their lowercase counterparts from "a" to
   "w" (Unicode code points from 97 to 119).

1.3  Sequences Starting With "Y"

   An uppercase "Y" followed by an uppercase latin letter from "A" to
   "W" (Unicode code points from 65 to 87) represents that second
   uppercase letter.

1.4  Sequences Starting With "X" And an Uppercase Letter From "G" To "W"

   An uppercase "X" followed by an uppercase latin letter from "G" to
   "W" (Unicode code points from 71 to 87) represents a character
   according to the following table.  The column "code" gives the
   Unicode code point of the represented character in hexadecimal
   notation.

   Filode  name               chr  code  mnemonic
   XG      space              " "  20    ghost
   XH      hyphen-minus       "-"  2D    hyphen
   XI      equals sign        "="  3D    is
   XJ      ampersand          "&"  26
   XK      question mark      "?"  3F    kwestion mark
   XL      less-than sign     "<"  3C    left angle bracket
   XM      asterisk           "*"  2A    multiply
   XN      apostrophe         "'"  27    acute acceNt
   XO      colon              ":"  3A    cOlOn
   XP      full stop          "."  2E    point
   XQ      dollar sign        "$"  24
   XR      greater-than sign  ">"  3E    right angle bracket
   XS      solidus            "/"  2F    slash
   XT      tilde              "~"  7E    tilde
   XU      low line           "_"  5F    underscore
   XV      commercial at      "@"  40    vortex
   XW      exclamation mark   "!"  21    wow!
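
   The table above can be held in two parallel strings, so that a
   lookup works in both directions.  The following Java sketch is an
   illustration only and not part of the notation; the class name
   "FilodeTable" and the method names "charFor" and "letterFor" are
   hypothetical.

```java
/** Hypothetical helper: the table of section 1.4 as two parallel strings. */
final class FilodeTable
{
    /** Second letters of the two-letter sequences, "G" to "W". */
    static final String LETTERS = "GHIJKLMNOPQRSTUVW";

    /** The represented characters, in the same order as LETTERS. */
    static final String CHARS = " -=&?<*':.$>/~_@!";

    /** Character represented by "X" + letter, or -1 if letter is not "G".."W". */
    static int charFor( final char letter )
    {
        final int index = LETTERS.indexOf( letter );
        return index < 0 ? -1 : CHARS.charAt( index );
    }

    /** Letter that follows "X" for a character of the table, or -1. */
    static int letterFor( final char c )
    {
        final int index = CHARS.indexOf( c );
        return index < 0 ? -1 : LETTERS.charAt( index );
    }
}
```

   With this sketch, "FilodeTable.charFor( 'L' )" yields the less-than
   sign and "FilodeTable.letterFor( '.' )" yields "P".  A character
   that is not in the table gives -1 and needs another representation.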

1.5  Sequences Starting With "X" And an Uppercase Hexadecimal Digit

   An uppercase latin letter from "A" to "F" (Unicode code points from
   65 to 70) or an arabic digit from "0" to "9" (Unicode code points
   from 48 to 57) is called a hexadecimal digit.

   An uppercase "X" followed by a hexadecimal digit starts a general
   code point representation.  This start pair may be directly followed
   by more hexadecimal digits.  The last hexadecimal digit must be
   followed by the uppercase letter "Z" (Unicode code point 90), which
   terminates the general code point representation.

   The general code point representation indicates a number, which is
   obtained by interpreting the sequence of hexadecimal digits between
   the "X" and the "Z" as a non-negative hexadecimal number.  The
   general code point representation represents the character with
   that number as its code point.

   For example, the sequence "X7CZ" represents the vertical line "|"
   (Unicode code point 124).

1.6  Other Rules For Filode-Text

   Every sequence of characters that begins in the start state must be
   one of the sequences described above.  Consequently, a text that
   contains any other character, for example a lowercase letter, is
   not a Filode-text.

Author's Address

   Stefan L. Ram

   EMail: ram@zedat.fu-berlin.de
   URI:   http://www.purl.org/stefan_ram/

Appendix A.  Examples Of Texts And Their Filode-Representation

   Text    Filode
   =       XI
   ==      XIXI
   a       A
   aa      AA
   A       YA
   AA      YAYA
   aA      AYA
   Aa      YAA
   a=      AXI
   9k-v    9KXHV
   ,r3     X2CZR3
    !"#    XGXWX22ZX23Z
   0123    0123
   :;<=    XOX3BZXLXI
   ABCD    YAYBYCYD
   abcd    ABCD

   In the example with the Filode "XGXWX22ZX23Z", the text starts with
   a space; it consists of a space, "!", a quotation mark, and "#".

Appendix B.  A Perl-Subroutine to convert Text to Filode

   sub filode($)
   { my( $t ) = @_;

     # Escape every character outside "a".."w", "A".."W" and
     # "0".."9" as "X", hexadecimal digits, "Z".  "%x" writes the
     # hexadecimal digits in lowercase, so the next step does not
     # mistake them for uppercase letters of the text.
     $t =~ s/([^a-wA-W0-9])/sprintf("X%xZ",ord($1))/eg;

     # Prefix every uppercase letter of the text with "Y".
     $t =~ s/([A-W])/Y$1/g;

     # Convert lowercase letters, including the hexadecimal digits
     # "a" to "f", to uppercase.
     $t =~ s/([a-w])/uc $1/ge;

     # Replace the general representations of the characters from
     # section 1.4 by their two-letter sequences.
     $t =~ s/X20Z/XG/g;
     $t =~ s/X2DZ/XH/g;
     $t =~ s/X3DZ/XI/g;
     $t =~ s/X26Z/XJ/g;
     $t =~ s/X3FZ/XK/g;
     $t =~ s/X3CZ/XL/g;
     $t =~ s/X2AZ/XM/g;
     $t =~ s/X27Z/XN/g;
     $t =~ s/X3AZ/XO/g;
     $t =~ s/X2EZ/XP/g;
     $t =~ s/X24Z/XQ/g;
     $t =~ s/X3EZ/XR/g;
     $t =~ s/X2FZ/XS/g;
     $t =~ s/X7EZ/XT/g;
     $t =~ s/X5FZ/XU/g;
     $t =~ s/X40Z/XV/g;
     $t =~ s/X21Z/XW/g;
     return $t; }

Appendix C.  A Java-Method to convert Text to Filode

   public static java.lang.String filode
   ( final java.lang.String text )
   { final java.lang.StringBuilder buffer =
     new java.lang.StringBuilder( "" );

     // Escape every code point outside "a".."w", "A".."W" and
     // "0".."9" as "X", hexadecimal digits, "Z".  The hexadecimal
     // digits are lowercase here, so the next step skips them.
     { final int length = text.length();
       for( int i = 0; i < length; )
       { final int code = text.codePointAt( i );
         i += java.lang.Character.charCount( code );
         if(( code < 'a' || code > 'w' )&&
            ( code < 'A' || code > 'W' )&&
            ( code < '0' || code > '9' ))
         { buffer.append( "X" );
           buffer.append( java.lang.Integer.toString( code, 16 ) );
           buffer.append( "Z" ); }
         else buffer.appendCodePoint( code ); }}

     int length = buffer.length();

     // Prefix every uppercase letter of the text with "Y".
     { for( int i = 0; i < length; ++i )
       { final char s = buffer.charAt( i );
         if( s >= 'A' && s <= 'W' )
         { buffer.insert( i, 'Y' ); ++i; ++length; }}}

     // Convert lowercase letters, including the hexadecimal digits
     // "a" to "f", to uppercase.
     { for( int i = 0; i < length; ++i )
       { final char s = buffer.charAt( i );
         if( s >= 'a' && s <= 'w' )
         { buffer.setCharAt
           ( i, java.lang.Character.toUpperCase( s )); }}}

     // Replace, in place, the general representations of the
     // characters from section 1.4 by their two-letter sequences.
     // "source" is the read position, "target" the write position.
     int target = 0;
     { int source = 0;
       while( source < length )
       { final char s = buffer.charAt( source );
         if( s == 'X' )
         { if( source + 3 < length )
           { final char t = buffer.charAt( source + 1 );
             if( t >= '2' && t <= '7' )
             { final String string = buffer.substring
               ( source + 1, source + 4 );
               final char replacementChar =
               ( t == '2' )?
               ( string.equals( "20Z" )? 'G' :
                 string.equals( "21Z" )? 'W' :
                 string.equals( "24Z" )? 'Q' :
                 string.equals( "26Z" )? 'J' :
                 string.equals( "27Z" )? 'N' :
                 string.equals( "2AZ" )? 'M' :
                 string.equals( "2DZ" )? 'H' :
                 string.equals( "2EZ" )? 'P' :
                 string.equals( "2FZ" )? 'S' : 0 ):
               ( string.equals( "3AZ" )? 'O' :
                 string.equals( "3CZ" )? 'L' :
                 string.equals( "3DZ" )? 'I' :
                 string.equals( "3EZ" )? 'R' :
                 string.equals( "3FZ" )? 'K' :
                 string.equals( "40Z" )? 'V' :
                 string.equals( "5FZ" )? 'U' :
                 string.equals( "7EZ" )? 'T' : 0 );
               if( replacementChar == 0 )
               { if( target != source )buffer.setCharAt( target, s ); }
               else
               { if( target != source )buffer.setCharAt( target, s );
                 ++source; ++target;
                 buffer.setCharAt( target, replacementChar );
                 source += 2; }
             }
             else
             { if( target != source )buffer.setCharAt( target, s ); }
           }
           else
           { if( target != source )buffer.setCharAt( target, s ); }
         }
         else
         { if( target != source )buffer.setCharAt( target, s ); }
         ++source; ++target; }}

     return buffer.substring( 0, target ); }

Appendix D.  Acknowledgements

   The author gratefully acknowledges the contributions of
   contributors.
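
   The appendices convert text to Filode only.  As a supplement, the
   interpretation rules of section 1 might be sketched as a decoder as
   follows.  This is an illustration under stated assumptions and not
   part of the memo: the class name "FilodeDecoder" and the method name
   "decode" are hypothetical, and throwing an IllegalArgumentException
   for text that violates section 1.6 is an assumption, since the memo
   does not say how an interpreter should react to such text.

```java
/** Hypothetical decoder sketch for the interpretation rules of section 1. */
final class FilodeDecoder
{
    // The table of section 1.4 as two parallel strings.
    private static final String LETTERS = "GHIJKLMNOPQRSTUVW";
    private static final String CHARS   = " -=&?<*':.$>/~_@!";

    private static boolean isHexDigit( final char c )
    { return ( c >= '0' && c <= '9' ) || ( c >= 'A' && c <= 'F' ); }

    private static IllegalArgumentException invalid( final int position )
    { return new IllegalArgumentException
      ( "not a Filode-text, error near position " + position ); }

    /** Assumption: invalid input is reported by IllegalArgumentException. */
    static String decode( final String filode )
    {
        final StringBuilder result = new StringBuilder();
        final int length = filode.length();
        int i = 0;
        while( i < length )                  // each pass begins in start state
        {
            final char c = filode.charAt( i++ );
            if( c >= '0' && c <= '9' )       // section 1.1
                result.append( c );
            else if( c >= 'A' && c <= 'W' )  // section 1.2
                result.append( (char)( c - 'A' + 'a' ));
            else if( c == 'Y' )              // section 1.3
            {
                if( i >= length ) throw invalid( i );
                final char d = filode.charAt( i++ );
                if( d < 'A' || d > 'W' ) throw invalid( i - 1 );
                result.append( d );
            }
            else if( c == 'X' )
            {
                if( i >= length ) throw invalid( i );
                final char d = filode.charAt( i );
                final int index = LETTERS.indexOf( d );
                if( index >= 0 )             // section 1.4
                {
                    result.append( CHARS.charAt( index ));
                    ++i;
                }
                else if( isHexDigit( d ))    // section 1.5
                {
                    final int start = i;
                    while( i < length && isHexDigit( filode.charAt( i ))) ++i;
                    if( i >= length || filode.charAt( i ) != 'Z' )
                        throw invalid( i );
                    // parseInt and appendCodePoint throw (subclasses of)
                    // IllegalArgumentException for out-of-range numbers.
                    result.appendCodePoint
                    ( Integer.parseInt( filode.substring( start, i ), 16 ));
                    ++i;                     // skip the terminating "Z"
                }
                else throw invalid( i );
            }
            else throw invalid( i - 1 );     // section 1.6
        }
        return result.toString();
    }
}
```

   Applied to the examples of Appendix A, "FilodeDecoder.decode" turns
   the Filode column back into the text column, for example "X2CZR3"
   into ",r3".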