Wrocco
Introduction to Wrocco
Wrocco converts the text of a Microsoft® Word document to an XML representation. Beyond the raw text, it writes the names of paragraph styles and character styles, as well as document properties and variables, to the XML file. Wrocco does not export direct formatting (formatting applied without named styles). Tables, pictures, and similar page elements are not supported.
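To make the idea concrete, here is a minimal Python sketch of the kind of output described above: paragraphs that carry a paragraph style name and contain runs that may carry a character style name. It is not Wrocco's code (Wrocco is written in VBA), and the element and attribute names (`document`, `par`, `run`, `style`) are assumptions chosen for illustration, not Wrocco's actual vocabulary.

```python
import xml.etree.ElementTree as ET

def to_xml(paragraphs):
    """Serialize styled paragraphs to an XML string.

    `paragraphs` is a list of (paragraph_style, runs) pairs, where `runs`
    is a list of (character_style_or_None, text) pairs.
    All element and attribute names used here are hypothetical.
    """
    root = ET.Element("document")
    for par_style, runs in paragraphs:
        par = ET.SubElement(root, "par", style=par_style)
        for char_style, text in runs:
            run = ET.SubElement(par, "run")
            if char_style is not None:
                run.set("style", char_style)  # only named styles are recorded
            run.text = text
    return ET.tostring(root, encoding="unicode")

sample = [
    ("Heading 1", [(None, "Introduction")]),
    ("Body Text", [(None, "Press "), ("Keyboard", "F5"), (None, " to run.")]),
]
xml_text = to_xml(sample)
print(xml_text)
```

Because only style names are written, a later processing step can decide how each style is rendered, which is why direct formatting has no place in such a representation.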
Wrocco was written in VBA for Microsoft® Word 2000. It may also work with later versions or with some earlier ones, but this has not been tested.
An Example for Wrocco
The page you are reading was written with Microsoft® Word. It was converted to XML with Wrocco and then processed further to produce this XHTML page. The example XML output that Wrocco generated from the Word document for this page is available at the following URI. (The XML file might reflect a previous version of this page, and some elements were removed.)
https://www.purl.org/stefan_ram/utf-8/720979_doc.xml
Obtaining the Wrocco Source Code
Legal information
Wrocco is an experimental software project in a pre-alpha state. It is intended only for experienced programmers who can read and understand VBA source code and estimate the risks involved. By using Wrocco, the user agrees to use it entirely at their own risk and to back up all data before using Wrocco. Wrocco is not in the public domain. It is copyright 2002–2005 by Stefan Ram. Wrocco may be used for free, but neither Wrocco nor any of its parts may be redistributed, as it is or as part of other software, by anyone other than Stefan Ram, and it may not be mirrored on any other server. It also may not be redistributed via software collections in any form.
Wrocco can be obtained via an HTTP GET request as a simple text file containing its VBA source code, using the following URI.
https://www.purl.org/stefan_ram/utf-8/wrocco
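For readers who prefer to script the download, the following is a small Python sketch of such an HTTP GET request. The function name `fetch_text` and its `opener` parameter are my own illustrative choices, and the UTF-8 decoding is an assumption suggested only by the "utf-8" segment in the URI.

```python
import urllib.request

# URI taken from the text above.
WROCCO_URL = "https://www.purl.org/stefan_ram/utf-8/wrocco"

def fetch_text(url, opener=urllib.request.urlopen, encoding="utf-8"):
    """Return the body of an HTTP GET response as text.

    `opener` defaults to urllib.request.urlopen, which issues a GET request
    when no data is passed. It is a parameter so that the function can be
    exercised without network access.
    The default encoding is an assumption, not something the server promises.
    """
    with opener(url) as response:
        return response.read().decode(encoding)
```

Calling `fetch_text(WROCCO_URL)` performs the request and returns the source as one string, which can then be pasted into a VBA module as described under "Installing Wrocco".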
Installing Wrocco
In Word, press Alt-F11 to open the VBA editor. In the VBA editor, press Ctrl-R to show the project explorer window. There, open the context menu of your document template and choose Insert/Module. A new, empty module window should open. Paste the complete Wrocco source code into this window and save your document template.
Configuring Wrocco
In the source code, search for "Sub Main". On the following line, change the text "c:\" to the desired output directory.
Running Wrocco
To start Wrocco, place the cursor in the text "Sub Main" and press F5. The active document will be written out as an XML file. VBA offers more convenient ways to start macros, which you will discover as you learn more about it, but this is not the right place to teach VBA.
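Once an XML file has been written, a quick way to get an overview of it is to count which element names occur. The Python sketch below does this without assuming anything about Wrocco's element names; the sample string is a made-up stand-in, not real Wrocco output.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def count_elements(xml_text):
    """Count how often each element name occurs in an XML document."""
    root = ET.fromstring(xml_text)
    return Counter(element.tag for element in root.iter())

# Stand-in for a generated file; these element names are invented.
example = "<doc><p>One</p><p>Two <c>styled</c></p></doc>"
for name, count in sorted(count_elements(example).items()):
    print(name, count)
```

For a file on disk, `ET.parse(path).getroot()` can replace `ET.fromstring(xml_text)`; it reads the file and honors the encoding given in the XML declaration.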
Comments?
I appreciate any bug reports or comments. Contact information is below.
See Also
- Two VBA-macros
- http://www.4haus.de/tips/wordtohtml.html
- http://www.4haus.de/test/htmltags.txt
- These (German-language) pages describe tools to remove Office-specific parts from HTML files created by Microsoft® Office.
- http://office.microsoft.com/germany/downloads/2000/Msohtmf2.aspx
- http://office.microsoft.com/germany/assistance/2000/wDosPeeler.aspx
- SGML Author for Word
- This Microsoft® product was created in 1994, was available for Word 6.0 and Word 97, and was sold for $599. It is no longer maintained.
- Office 2003 XML Reference Schemas
- http://www.microsoft.com/office/xml/default.mspx
- FAQ-Entry (German language)
- http://www.netandmore.de/faq/fom-serve/cache/857.html
- MajiX transforms RTF to XML.
- http://www.tetrasix.com
- http://perso.wanadoo.fr/tetrasys/docs-1.2.2/default.html
- R2Net converts RTF to HTML/XML
- http://www.logictran.net/products/
- Software (upCast and downCast) to convert from Word to XML
- http://www.infinity-loop.de
- Word as HTML/XML/SGML editor with the MarkupKit
- http://www.schema.de/sitehtml/site-d/htmlexpo.htm
- German-language report on a Microsoft patent related to WordML
- heise "hps-23.01.04-000", heise "43948"
- Patent application by Microsoft: Word-processing document stored in a single XML file
- http://v3.espacenet.com/origdoc?DB=EPODOC&IDX=EP1376387&QPN=EP1376387
- WordML viewer for Microsoft® Internet Explorer
- http://www.microsoft.com/downloads/details.aspx?FamilyID=19676b18-1bcd-4852-93ba-0b5a203ea731&DisplayLang=en
- Requires Microsoft® Word 2003.
- Microsoft Office Assistance: Use Office HTML Filter to Create Web Pages that Download Faster
- http://office.microsoft.com/assistance/preview.aspx?AssetID=HA010548651033
Microsoft® Office documents can be imported into the free word processor OpenOffice Writer, that is, converted to OpenOffice Writer's own format. Software that converts from the OpenOffice Writer format to other formats can then be applied. This is yet another way to create XML from a Microsoft® Word document.
- Writer2LaTeX, Writer2BibTeX and Writer2xhtml
- http://www.hj-gym.dk/~hj/writer2latex/
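The OpenOffice Writer route can also be explored directly, because a Writer document is a ZIP package whose `content.xml` member holds the text. The Python sketch below assumes the OpenDocument (.odt) flavor of the format and lists the style name and text of each paragraph and heading. The function name `paragraph_styles` is hypothetical, and the demonstration package is hand-made and deliberately incomplete.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of OpenDocument text elements such as text:p and text:h.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

def paragraph_styles(odt_file):
    """Yield (style_name, text) for each paragraph and heading.

    `odt_file` is a path or a binary file object holding an ODT package.
    """
    with zipfile.ZipFile(odt_file) as package:
        root = ET.fromstring(package.read("content.xml"))
    wanted = (f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h")
    for element in root.iter():
        if element.tag in wanted:
            style = element.get(f"{{{TEXT_NS}}}style-name")
            yield style, "".join(element.itertext())

# Offline demonstration with a tiny hand-made package (not a complete ODT file).
content = (
    '<office:document-content '
    'xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" '
    f'xmlns:text="{TEXT_NS}"><office:body><office:text>'
    '<text:h text:style-name="Heading_20_1">Title</text:h>'
    '<text:p text:style-name="Text_20_body">Hello <text:span>world</text:span></text:p>'
    '</office:text></office:body></office:document-content>'
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("content.xml", content)
buffer.seek(0)
styles = list(paragraph_styles(buffer))
print(styles)
```

In real files the `text:style-name` attribute often refers to an automatic style such as `P1`, whose parent is the named style, so a fuller tool would resolve that indirection. The older .sxw files use different namespace URIs and would need a changed `TEXT_NS`.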
It is also possible to write a customized RTF translator in C. An RTF reader by Microsoft®, which is available as C source code, can serve as a starting point.
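To give an impression of what the first stage of such a translator involves, here is a minimal tokenizer sketch. It is written in Python rather than C for brevity, it is not derived from Microsoft's reader, and the names `tokenize` and `plain_text` are my own. It splits RTF source into control words, control symbols, group braces, and text; it does not interpret hex escapes such as `\'e4` or skip destination groups such as the font table.

```python
import re

TOKEN = re.compile(r"""
    \\(?P<word>[a-zA-Z]+)(?P<param>-?\d+)?\ ?   # control word, optional number, optional delimiter space
  | \\(?P<symbol>[^a-zA-Z])                     # control symbol: backslash plus one non-letter
  | (?P<group>[{}])                             # start or end of a group
  | (?P<text>[^\\{}\r\n]+)                      # plain text
  | [\r\n]+                                     # bare line breaks carry no meaning in RTF
""", re.VERBOSE)

def tokenize(rtf):
    """Yield (kind, value) pairs for a string of RTF source."""
    for match in TOKEN.finditer(rtf):
        if match.group("word"):
            param = match.group("param")
            yield "control", (match.group("word"), int(param) if param else None)
        elif match.group("symbol"):
            yield "symbol", match.group("symbol")
        elif match.group("group"):
            yield "group", match.group("group")
        elif match.group("text"):
            yield "text", match.group("text")

def plain_text(rtf):
    """Concatenate the text tokens, ignoring all formatting.

    A real reader must also skip destination groups (font table, etc.).
    """
    return "".join(value for kind, value in tokenize(rtf) if kind == "text")

sample = r"{\rtf1 Hello, {\b bold} world\par}"
tokens = list(tokenize(sample))
for token in tokens:
    print(token)
```

A translator to XML would keep a stack that is pushed on `{` and popped on `}`, since formatting set by control words inside a group ends with that group.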
- Other Problem Areas in RTF
- http://msdn.microsoft.com/library/en-us/dnrtfspec/html/rtfspec_53.asp?frame=true
- Appendix A. How to Write an RTF Reader
- http://latex2rtf.sourceforge.net/rtfspec_45.html
- Survey of Word to HTML conversion solutions
- http://web.archive.org/web/20050316004851/http://www.e.govt.nz/web-guidelines/word-to-html-conversion.asp