OVERVIEW-Part 1
Microsoft introduced the Rich Text Format for specifying simple formatted text with embedded graphics. Initially intended to transfer such data between different applications on different operating systems, the format is today commonly used in Windows for enhanced editing capabilities. The XHTML to RTF converter consists of an XSL stylesheet for parsing XHTML tags and generating their RTF equivalents.
New challenges of RTF:
- Extraction of text without consideration of any format information;
- Extraction and conversion of embedded image information;
- Conversion of the RTF layout and/or data into another format such as XML or HTML;
- Transferring RTF data into a custom data model.
Goals of designing the component:
To develop an application that can convert RTF to text, XML, and HTML:
- Support for the current RTF specification;
- Open source C#.NET code;
- Unlimited usage in console, WinForms, WPF, and ASP.NET applications;
- Independence of third party components;
- Unicode support;
- Separation of parsing and the actual interpretation of the RTF data;
- Providing simple predefined conversion modules for text, images, XML, and HTML;
- Ready-to-Use RTF converter applications for text, images, XML, and HTML;
- Open architecture for simple creation of RTF converters.
Weak points
- The component offers no high-level functionality to create RTF content;
- The present RTF interpreter is restricted to content data and basic formatting options;
- There is no special support for the following RTF layout elements:
  - Tables
  - Lists
  - Automatic numbering
  - All features whose interpretation requires knowledge of how Microsoft Word treats them.
In general, this should not pose a big problem for many areas of use. A conforming RTF writer should always write content with readers in mind that do not know about tags and features introduced later in the standard's history. As a consequence, much of the content in an RTF document is stored several times (at least if the writer cares about other applications). The interpreter takes advantage of this by focusing only on the visual content. Some writers in common use, however, support this alternate representation improperly, which results in differences in the output.
Thanks to its open architecture, the RTF parser is a solid base for development of an RTF converter which focuses on layout.
2wayRTF2XML/XHTML - RTF Parser and conversion from RTF to XML, XHTML
The actual parsing of the data is done by the class RtfParser. Apart from the tag recognition, it also handles (a first level of) character encoding and Unicode support. The RTF parser classifies the RTF data into the following basic elements:
- RTF Group: A group of RTF elements;
- RTF Tag: The name and value of an RTF tag;
- RTF Text: Arbitrary text content (not necessarily visible!).
Figure 1.
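To make the three element kinds more concrete, the following is a minimal, self-contained sketch of such a classification. It is not the component's RtfParser: the class RtfElementSketch and its method Classify are hypothetical names introduced only for illustration, character encoding and Unicode handling are left out, line breaks are kept as text, and any control symbol (such as \{ or \\) is simply kept as a literal character.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical illustration only; this class is not part of the component.
public static class RtfElementSketch
{
  // Splits RTF text into "GroupBegin", "GroupEnd", "Tag:<name><value>", and "Text:<chars>".
  public static List<string> Classify( string rtf )
  {
    List<string> elements = new List<string>();
    StringBuilder text = new StringBuilder();
    int i = 0;
    while ( i < rtf.Length )
    {
      char c = rtf[ i ];
      if ( c == '{' || c == '}' )
      {
        Flush( text, elements );
        elements.Add( c == '{' ? "GroupBegin" : "GroupEnd" );
        i++;
      }
      else if ( c == '\\' && i + 1 < rtf.Length && IsAsciiLetter( rtf[ i + 1 ] ) )
      {
        Flush( text, elements );
        int start = ++i; // skip the backslash
        while ( i < rtf.Length && IsAsciiLetter( rtf[ i ] ) ) i++;
        // optional numeric value, possibly negative
        if ( i + 1 < rtf.Length && rtf[ i ] == '-' && IsAsciiDigit( rtf[ i + 1 ] ) ) i++;
        while ( i < rtf.Length && IsAsciiDigit( rtf[ i ] ) ) i++;
        elements.Add( "Tag:" + rtf.Substring( start, i - start ) );
        if ( i < rtf.Length && rtf[ i ] == ' ' ) i++; // a single space only delimits the tag
      }
      else if ( c == '\\' && i + 1 < rtf.Length )
      {
        text.Append( rtf[ i + 1 ] ); // simplification: control symbol kept as a literal char
        i += 2;
      }
      else
      {
        text.Append( c );
        i++;
      }
    }
    Flush( text, elements );
    return elements;
  }

  // Emits the pending text run, if any, as one text element.
  private static void Flush( StringBuilder text, List<string> elements )
  {
    if ( text.Length > 0 )
    {
      elements.Add( "Text:" + text.ToString() );
      text.Length = 0;
    }
  }

  private static bool IsAsciiLetter( char c )
  {
    return ( c >= 'a' && c <= 'z' ) || ( c >= 'A' && c <= 'Z' );
  }

  private static bool IsAsciiDigit( char c )
  {
    return c >= '0' && c <= '9';
  }
}
```

Under these assumptions, RtfElementSketch.Classify( @"{\rtf1 foo\b bar}" ) yields the sequence GroupBegin, Tag:rtf1, Text:foo, Tag:b, Text:bar, GroupEnd.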
The actual parsing process can be monitored by ParserListeners (Observer Pattern), which offer an opportunity to react to specific events and perform corresponding actions.
The integrated parser listener RtfParserListenerFileLogger can be used to write the structure of the RTF elements into a log file (mainly intended for use during development). The produced output can be customized using its RtfParserLoggerSettings. The additional RtfParserListenerLogger parser listener can be used to log the parsing process to any ILogger implementation (see System functions).
The parser listener RtfParserListenerStructureBuilder generates the Structure Model from the RTF elements encountered during parsing. That model represents the basic elements as instances of IRtfGroup, IRtfTag, and IRtfText. Access to the hierarchical structure can be gained through the RTF group available in RtfParserListenerStructureBuilder.StructureRoot. Based on the Visitor Pattern, it is easily possible to examine the structure model via any IRtfElementVisitor implementation:
using System; // Console

// ------------------------------------------------------------------------
public class MyVisitor : IRtfElementVisitor
{

  // ----------------------------------------------------------------------
  public void RtfWriteStructureModel()
  {
    // log the parser events to a file while the structure model is built
    RtfParserListenerFileLogger logger =
      new RtfParserListenerFileLogger( @"c:\temp\RtfParser.log" );
    IRtfGroup structureRoot =
      RtfParserTool.Parse( @"{\rtf1foobar}", logger );
    structureRoot.Visit( this );
  } // RtfWriteStructureModel

  // ----------------------------------------------------------------------
  void IRtfElementVisitor.VisitTag( IRtfTag tag )
  {
    Console.WriteLine( "Tag: " + tag.FullName );
  } // IRtfElementVisitor.VisitTag

  // ----------------------------------------------------------------------
  void IRtfElementVisitor.VisitGroup( IRtfGroup group )
  {
    Console.WriteLine( "Group: " + group.Destination );
    foreach ( IRtfElement child in group.Contents )
    {
      child.Visit( this ); // recursive
    }
  } // IRtfElementVisitor.VisitGroup

  // ----------------------------------------------------------------------
  void IRtfElementVisitor.VisitText( IRtfText text )
  {
    Console.WriteLine( "Text: " + text.Text );
  } // IRtfElementVisitor.VisitText

} // MyVisitor
Figure 2.
Note, however, that the same result for such simple functionality could be achieved by writing a custom IRtfParserListener (see below). This can, in some cases, be useful to avoid the overhead of creating the structure model in memory.
The utility class RtfParserTool can receive RTF data from a multitude of sources, such as string, TextReader, and Stream. Via its IRtfSource interface, it also allows all these (and even other) scenarios to be handled in a uniform way.
The interface IRtfParserListener with its base utility implementation RtfParserListenerBase offers a way to react in custom ways to specific events during the parsing process:
using System; // Console

// ------------------------------------------------------------------------
public class MyParserListener : RtfParserListenerBase
{

  // ----------------------------------------------------------------------
  protected override void DoParseBegin()
  {
    Console.WriteLine( "parse begin" );
  } // DoParseBegin

  // ----------------------------------------------------------------------
  protected override void DoGroupBegin()
  {
    Console.WriteLine( "group begin - level " + Level.ToString() );
  } // DoGroupBegin

  // ----------------------------------------------------------------------
  protected override void DoTagFound( IRtfTag tag )
  {
    Console.WriteLine( "tag " + tag.FullName );
  } // DoTagFound

  // ----------------------------------------------------------------------
  protected override void DoTextFound( IRtfText text )
  {
    Console.WriteLine( "text " + text.Text );
  } // DoTextFound

  // ----------------------------------------------------------------------
  protected override void DoGroupEnd()
  {
    Console.WriteLine( "group end - level " + Level.ToString() );
  } // DoGroupEnd

  // ----------------------------------------------------------------------
  protected override void DoParseSuccess()
  {
    Console.WriteLine( "parse success" );
  } // DoParseSuccess

  // ----------------------------------------------------------------------
  protected override void DoParseFail( RtfException reason )
  {
    Console.WriteLine( "parse failed: " + reason.Message );
  } // DoParseFail

  // ----------------------------------------------------------------------
  protected override void DoParseEnd()
  {
    Console.WriteLine( "parse end" );
  } // DoParseEnd

} // MyParserListener
Note that the base class used here already provides (empty) implementations for all the interface methods, so only the ones required for a specific purpose need to be overridden.
RTF Interpreter
Once an RTF document has been parsed into a structure model, it is subject to interpretation through the RTF interpreter. One obvious way to interpret the structure is to build a Document Model which provides high-level access to the meaning of the document's contents. A very simple document model is part of this component, and consists of the following building blocks:
- Document info: title, subject, author etc.
- User properties
- Color information
- Font information
- Text formats
- Visuals:
  - Text with associated formatting information
  - Breaks: line, paragraph, section, page
  - Special characters: tabulator, paragraph begin/end, dash, space, bullet, quote, hyphen
  - Images
Figure 3. RTF Converter for WPF
The various Visuals represent the recognized visible RTF elements, and can be examined with any IRtfVisualVisitor implementation.
Analogous to the possibilities of the RTF parser, the provided RtfInterpreter supports monitoring the interpretation process with InterpreterListeners for specific purposes.
Analyzing documents might be simplified by using the RtfInterpreterListenerFileLogger interpreter listener, which writes the recognized RTF elements into a log file. Its output can be customized through its RtfInterpreterLoggerSettings. The additional RtfInterpreterListenerLogger interpreter listener can be used to log the interpretation process to any ILogger implementation (see System functions).
Figure 4. RTF Converter for Windows Forms
Construction of the document model is also achieved through such an interpreter listener (RtfInterpreterListenerDocumentBuilder) which, in the end, delivers an instance of an IRtfDocument.
The following example shows how to make use of the high-level API of the document model:
// ----------------------------------------------------------------------
void RtfWriteDocumentModel( Stream rtfStream )
{
  RtfInterpreterListenerFileLogger logger =
    new RtfInterpreterListenerFileLogger( @"c:\temp\RtfInterpreter.log" );
  IRtfDocument document = RtfInterpreterTool.BuildDoc( rtfStream, logger );
  RtfWriteDocument( document );
} // RtfWriteDocumentModel

// ----------------------------------------------------------------------
void RtfWriteDocument( IRtfDocument document )
{
  Console.WriteLine( "RTF Version: " + document.RtfVersion.ToString() );

  // document info
  Console.WriteLine( "Title: " + document.DocumentInfo.Title );
  Console.WriteLine( "Subject: " + document.DocumentInfo.Subject );
  Console.WriteLine( "Author: " + document.DocumentInfo.Author );
  // ...

  // fonts
  foreach ( IRtfFont font in document.FontTable )
  {
    Console.WriteLine( "Font: " + font.Name );
  }

  // colors
  foreach ( IRtfColor color in document.ColorTable )
  {
    Console.WriteLine( "Color: " + color.AsDrawingColor.ToString() );
  }

  // user properties
  foreach ( IRtfDocumentProperty documentProperty in document.UserProperties )
  {
    Console.WriteLine( "User property: " + documentProperty.Name );
  }

  // visuals (preferably handled through an according visitor)
  foreach ( IRtfVisual visual in document.VisualContent )
  {
    switch ( visual.Kind )
    {
      case RtfVisualKind.Text:
        Console.WriteLine( "Text: " + ((IRtfVisualText)visual).Text );
        break;
      case RtfVisualKind.Break:
        Console.WriteLine( "Break: " +
          ((IRtfVisualBreak)visual).BreakKind.ToString() );
        break;
      case RtfVisualKind.Special:
        Console.WriteLine( "Special char: " +
          ((IRtfVisualSpecialChar)visual).CharKind.ToString() );
        break;
      case RtfVisualKind.Image:
        IRtfVisualImage image = (IRtfVisualImage)visual;
        Console.WriteLine( "Image: " + image.Format.ToString() +
          " " + image.Width.ToString() + "x" + image.Height.ToString() );
        break;
    }
  }
} // RtfWriteDocument
As with the parser, the class RtfInterpreterTool offers convenience functionality for easy interpretation of RTF data and creation of a corresponding IRtfDocument. In case no IRtfGroup is yet available, it also provides for passing any source to the RtfParserTool for automatic on-the-fly parsing.
The interface IRtfInterpreterListener, with its base utility implementation RtfInterpreterListenerBase, offers the necessary foundation for a custom interpreter listener:
// ------------------------------------------------------------------------
public class MyInterpreterListener : RtfInterpreterListenerBase
{

  // ----------------------------------------------------------------------
  protected override void DoBeginDocument( IRtfInterpreterContext context )
  {
    // custom action
  } // DoBeginDocument

  // ----------------------------------------------------------------------
  protected override void DoInsertText( IRtfInterpreterContext context, string text )
  {
    // custom action
  } // DoInsertText

  // ----------------------------------------------------------------------
  protected override void DoInsertSpecialChar( IRtfInterpreterContext context,
    RtfVisualSpecialCharKind kind )
  {
    // custom action
  } // DoInsertSpecialChar

  // ----------------------------------------------------------------------
  protected override void DoInsertBreak( IRtfInterpreterContext context,
    RtfVisualBreakKind kind )
  {
    // custom action
  } // DoInsertBreak

  // ----------------------------------------------------------------------
  protected override void DoInsertImage( IRtfInterpreterContext context,
    RtfVisualImageFormat format,
    int width, int height, int desiredWidth, int desiredHeight,
    int scaleWidthPercent, int scaleHeightPercent,
    string imageDataHex
  )
  {
    // custom action
  } // DoInsertImage

  // ----------------------------------------------------------------------
  protected override void DoEndDocument( IRtfInterpreterContext context )
  {
    // custom action
  } // DoEndDocument

} // MyInterpreterListener
The IRtfInterpreterContext passed to all of these methods contains the document information which is available at the very moment (colors, fonts, formats, etc.) as well as information about the state of the interpretation.
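The DoInsertImage callback shown above receives the picture as a hexadecimal string (imageDataHex). A custom listener that wants the raw bytes has to decode that string first. The following is a minimal sketch of such a decoding step; the class ImageHexSketch and its method ToBytes are hypothetical names and not part of the component, and the sketch assumes two hex digits per byte with any other characters (such as line breaks) ignored.

```csharp
using System;
using System.Text;

// Hypothetical illustration only; this class is not part of the component.
public static class ImageHexSketch
{
  // Decodes a hex string such as "89504e47" into its bytes.
  public static byte[] ToBytes( string imageDataHex )
  {
    // keep only hex digits (assumption: anything else is formatting, e.g. line breaks)
    StringBuilder hex = new StringBuilder( imageDataHex.Length );
    foreach ( char c in imageDataHex )
    {
      if ( Uri.IsHexDigit( c ) )
      {
        hex.Append( c );
      }
    }
    if ( hex.Length % 2 != 0 )
    {
      throw new FormatException( "odd number of hex digits" );
    }

    // two hex digits form one byte
    byte[] data = new byte[ hex.Length / 2 ];
    for ( int i = 0; i < data.Length; i++ )
    {
      data[ i ] = Convert.ToByte( hex.ToString( i * 2, 2 ), 16 );
    }
    return data;
  }
}
```

The resulting byte array could then, for example, be wrapped in a MemoryStream and handed to whatever image handling the custom listener uses.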
RTF Base Converters
As a foundation for the development of more complex converters, there are four base converters available for text, images, XML, and HTML. They are designed to be extended by inheritance.
Figure 5.
Text Converter
The RtfTextConverter can be used to extract plain text from an RTF document. Its RtfTextConvertSettings determines how to represent special characters, tabulators, white space, breaks (line, page, etc.), and what to do with them.
// ----------------------------------------------------------------------
void ConvertRtf2Text( Stream rtfStream )
{
  // logger
  RtfInterpreterListenerFileLogger logger =
    new RtfInterpreterListenerFileLogger( @"c:\temp\RtfInterpreter.log" );

  // text converter
  RtfTextConvertSettings textConvertSettings = new RtfTextConvertSettings();
  textConvertSettings.BulletText = "-"; // replace default bullet text '°'
  RtfTextConverter textConverter = new RtfTextConverter( textConvertSettings );

  // interpreter
  RtfInterpreterTool.Interpret( rtfStream, logger, textConverter );
  Console.WriteLine( textConverter.PlainText );
} // ConvertRtf2Text
Image Converter
The RtfImageConverter offers a way to extract images from an RTF document. The size of the images can remain unscaled or as they appear in the RTF document. Optionally, the format of the image can be converted to another ImageFormat. File name, type, and size can be controlled by an IRtfVisualImageAdapter. The RtfImageConvertSettings determines the storage location as well as any scaling.
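As an illustration of what "as they appear in the RTF document" can mean numerically: the RTF specification expresses the desired picture size in twips (1440 twips per inch) and the picture scaling as a percentage. The following sketch converts such values into a pixel size. It is not the component's scaling code; the class ImageScaleSketch, its method ToPixels, and the resolution passed in are assumptions made only for this example.

```csharp
using System;

// Hypothetical illustration only; this class is not part of the component.
public static class ImageScaleSketch
{
  public const int TwipsPerInch = 1440;

  // Converts a desired size in twips and a scale percentage to pixels at the given DPI.
  public static int ToPixels( int desiredTwips, int scalePercent, double dpi )
  {
    double inches = (double)desiredTwips / TwipsPerInch;
    return (int)Math.Round( inches * dpi * scalePercent / 100.0 );
  }
}
```

For example, a desired width of 1440 twips at 100 percent corresponds to 96 pixels on a 96 DPI display, and 2880 twips scaled to 50 percent gives the same 96 pixels.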
// ----------------------------------------------------------------------
void ConvertRtf2Image( Stream rtfStream )
{
  // logger
  RtfInterpreterListenerFileLogger logger =
    new RtfInterpreterListenerFileLogger( @"c:\temp\RtfInterpreter.log" );

  // image converter
  // convert all images to JPG
  RtfVisualImageAdapter imageAdapter = new RtfVisualImageAdapter( ImageFormat.Jpeg );
  RtfImageConvertSettings imageConvertSettings =
    new RtfImageConvertSettings( imageAdapter );
  imageConvertSettings.ImagesPath = @"c:\temp\images\";
  imageConvertSettings.ScaleImage = true; // scale images
  RtfImageConverter imageConverter = new RtfImageConverter( imageConvertSettings );

  // interpreter
  RtfInterpreterTool.Interpret( rtfStream, logger, imageConverter );
  // all images are saved to the path 'c:\temp\images\'
} // ConvertRtf2Image
XML Converter
The RtfXmlConverter converts the recognized RTF visuals into an XML document. Its RtfXmlConvertSettings allows specifying the XML namespace to use and the corresponding prefix.
// ----------------------------------------------------------------------
// requires System.Xml (XmlWriter, XmlWriterSettings)
void ConvertRtf2Xml( Stream rtfStream )
{
  // logger
  RtfInterpreterListenerFileLogger logger =
    new RtfInterpreterListenerFileLogger( @"c:\temp\RtfInterpreter.log" );

  // interpreter
  IRtfDocument rtfDocument = RtfInterpreterTool.BuildDoc( rtfStream, logger );

  // XML convert
  XmlWriterSettings xmlWriterSettings = new XmlWriterSettings();
  xmlWriterSettings.Indent = true;
  xmlWriterSettings.IndentChars = ( " " );
  string fileName = @"c:\temp\Rtf.xml";
  using ( XmlWriter writer = XmlWriter.Create( fileName, xmlWriterSettings ) )
  {
    RtfXmlConverter xmlConverter = new RtfXmlConverter( rtfDocument, writer );
    xmlConverter.Convert();
    writer.Flush();
  }
} // ConvertRtf2Xml
HTML Converter
The RtfHtmlConverter converts the recognized RTF visuals into an HTML document. File names, type, and size of any encountered images can be controlled through an IRtfVisualImageAdapter, while the RtfHtmlConvertSettings determines storage location, stylesheets, and other HTML document information.
// ----------------------------------------------------------------------
void ConvertRtf2Html( Stream rtfStream )
{
  // logger
  RtfInterpreterListenerFileLogger logger =
    new RtfInterpreterListenerFileLogger( @"c:\temp\RtfInterpreter.log" );

  // image converter
  // convert all images to JPG
  RtfVisualImageAdapter imageAdapter = new RtfVisualImageAdapter( ImageFormat.Jpeg );
  RtfImageConvertSettings imageConvertSettings =
    new RtfImageConvertSettings( imageAdapter );
  imageConvertSettings.ScaleImage = true; // scale images
  RtfImageConverter imageConverter =
    new RtfImageConverter( imageConvertSettings );

  // interpreter: build the document model (as in the earlier examples)
  IRtfDocument rtfDocument = RtfInterpreterTool.BuildDoc( rtfStream,
    logger, imageConverter );

  // html converter
  RtfHtmlConvertSettings htmlConvertSettings =
    new RtfHtmlConvertSettings( imageAdapter );
  htmlConvertSettings.StyleSheetLinks.Add( "default.css" );
  RtfHtmlConverter htmlConverter = new RtfHtmlConverter( rtfDocument,
    htmlConvertSettings );
  Console.WriteLine( htmlConverter.Convert() );
} // ConvertRtf2Html
HTML Styles can be integrated in two ways:
- Inline through RtfHtmlCssStyle in RtfHtmlConvertSettings.Styles
- Link through RtfHtmlConvertSettings.StyleSheetLinks
The RtfHtmlConvertScope allows restricting the target range:
- RtfHtmlConvertScope.All: complete HTML document (default)
- ...
- RtfHtmlConvertScope.Content: only paragraphs
RTF Converter Applications
The console applications Rtf2Raw, Rtf2Xml, and Rtf2Html demonstrate the range of functionality of the corresponding base converters, and offer a starting point for the development of your own RTF converter.
Rtf2Raw
The command line application Rtf2Raw converts an RTF document into plain text and images:
Rtf2Raw source-file [destination] [/IT:format] [/CE:encoding] [/IS+] [/ST-] [/SI-] [/LD:path] [/LP] [/LI] [/D] [/O] [/?]
source-file       source rtf file
destination       destination directory (default=source-file directory)
/IT:format        images type format: bmp, emf, exif, gif, icon, jpg, png, tiff or wmf (default=original)
/CE:encoding      character encoding: ASCII, UTF7, UTF8, Unicode, BigEndianUnicode, UTF32, OperatingSystem (default=UTF8)
/IS+              image scale (default=off)
/ST-              don't save text to the destination (default=on)
/SI-              don't save images to the destination (default=on)
/LD:path          log file directory (default=destination directory)
/LP               write rtf parser log (default=off)
/LI               write rtf interpreter log (default=off)
/D                write text to screen (default=off)
/O                open text in associated application (default=off)
/?                this help
Samples:
Rtf2Raw MyText.rtf
Rtf2Raw MyText.rtf c:\temp
Rtf2Raw MyText.rtf c:\temp /CSS:MyCompany.css
Rtf2Raw MyText.rtf c:\temp /CSS:MyCompany.css,ThisProject.css
Rtf2Raw MyText.rtf c:\temp /CSS:MyCompany.css,ThisProject.css /IT:png
Rtf2Raw MyText.rtf c:\temp /CSS:MyCompany.css,ThisProject.css /IT:png /LD:log /LP /LI