
Re: MS Word to XHTML

__/ [SpaceGirl] on Sunday 11 September 2005 20:46 \__

> Roy Schestowitz wrote:
>> __/ [Alan J. Flavell] on Sunday 11 September 2005 11:19 \__
>> 
>> 
>>>On Sun, 11 Sep 2005, SpaceGirl wrote:
>>>
>>>
>>>>Alan J. Flavell wrote:
>>>
>>>[comprehensive quote of my posting, without apparently having anything
>>>relevant to say about it.]
>>>
>>>
>>>>Word XP and upwards stores its documents in XML format doesn't it?
>>>
>>>So what?  XML is only a format for defining markup.  If the markup
>>>doesn't do anything meaningful (specifically - if it only creates a
>>>visual result on a printed page, without having any significant
>>>structure) then it's not going to turn into effective HTML: it'd just
>>>be the usual garbage in / garbage out that we're accustomed to with
>>>Word conversions to soi-disant "web" format.
> 
> Word documents, being style based, are easy to convert. Use XSLT to
> strip out all the crap so that all you end up with is basic HTML - <p>'s
> and <h>'s. I wasn't suggesting that anything more complicated than that
> should be attempted - but I HAVE seen it done pretty successfully with
> Word 2003 files. In the case of that client (although I wasn't part of
> the team who wrote those tools), their customers would submit Word
> documents and the XSLT would convert them into both HTML and PDFs, and
> the reproduction was almost perfect (styling and colours anyway).
> 
>>>>You could probably write your own XSLT to turn it into HTML fairly
>>>>easily.
>>>
>>>There seems to be some kind of conceptual disconnect here. Most Word
>>>documents (in my experience) simply don't contain the necessary
>>>structure for useful conversion to HTML: they've been created as a
>>>purely visual construction for printing onto paper.  It's irrelevant
>>>what underlying technology you use (RTF, XML, SGML, whatever) - the
>>>problem is that the source material simply does not represent the
>>>needed structures, *because the document authors do not put it there*.
> 
> That wasn't what I saw, but like I said I wasn't on that team. As far as
> I could tell they wrote a simple parser.


I believe that's possible, but it depends on the conventions that the
author sticks to. Word does not /force/ the author to add structural
information, so visual hacks (bold text standing in for a heading, say)
are allowed, and they leave parts of the document with no structure to
convert.
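
To make that concrete, here is a minimal sketch, in Python rather than
XSLT, of the kind of "simple parser" described above. It is not the tool
that team wrote. It assumes Word 2003 XML (WordprocessingML) with the
namespace shown below, and the style names in STYLE_TO_TAG and the SAMPLE
document are made up for illustration. Paragraphs that carry a heading
style become <h1> to <h3>; everything else, including a "heading" that the
author faked with bold text, can only come out as <p>.

```python
import xml.etree.ElementTree as ET
from html import escape

# Assumed namespace for Word 2003 XML (WordprocessingML).
W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W}

# Hypothetical mapping from Word paragraph style IDs to HTML elements.
STYLE_TO_TAG = {"Heading1": "h1", "Heading2": "h2", "Heading3": "h3"}


def wordml_to_html(xml_text):
    """Reduce a WordprocessingML document to bare headings and paragraphs."""
    root = ET.fromstring(xml_text)
    out = []
    for para in root.iter(f"{{{W}}}p"):
        # The paragraph style is the only structural hint available.
        style = para.find("w:pPr/w:pStyle", NS)
        style_id = style.get(f"{{{W}}}val") if style is not None else None
        # Concatenate the text runs; run-level formatting (bold etc.) is dropped.
        text = "".join(t.text or "" for t in para.iter(f"{{{W}}}t"))
        if not text.strip():
            continue  # skip empty paragraphs used as vertical spacing
        tag = STYLE_TO_TAG.get(style_id, "p")
        out.append(f"<{tag}>{escape(text)}</{tag}>")
    return "\n".join(out)


# Made-up sample: one real heading, one bold "heading", one body paragraph.
SAMPLE = """<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Report</w:t></w:r></w:p>
    <w:p><w:r><w:rPr><w:b/></w:rPr><w:t>Fake heading</w:t></w:r></w:p>
    <w:p><w:r><w:t>Body text &amp; more.</w:t></w:r></w:p>
  </w:body>
</w:wordDocument>"""

if __name__ == "__main__":
    print(wordml_to_html(SAMPLE))
```

The styled paragraph converts cleanly, while the bold one is
indistinguishable from body text once the formatting is stripped. One
could guess at headings from bold runs or font sizes, but that is
heuristics, not conversion.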


>>>You might as well try to convert cheese into fresh cream: both are
>>>fine milk products, it's true, but instead of trying to convert the
>>>one into the other, you'd do better to produce them both starting from
>>>fresh milk.  And the kind of "fresh milk" that's needed here is
>>>logically structured text markup.  Not visual formatting.  Until the
>>>authors of Word documents can grasp that, the prospects for conversion
>>>of Word to web formats are poor, IMHO.
> 
> Strange, as I've never had a problem. Generally I have to do it in a
> sort of round-robin of programs: first save your Word documents as PDF,
> then save the PDF as a web page. It works just fine.


I have had bad experiences converting PDFs to HTML. I even wrote about this
particular conversion because I found it so frustrating:
<http://schestowitz.com/Weblog/archives/2005/05/24/pdf-to-html/>
PDF lays objects out to fit the target medium, e.g. A4 paper, so it is
bound to lose the information needed for a good conversion.


> <snip stuff I can't be bothered to read, seeing as everyone else is being
> so fucking rude>


Are you referring to me? Did I say anything rude? Please clarify if
possible.

Roy
