specialized HTML generation

Does anybody know whether I can use LibreOffice to convert an "ordinary document" into 300-word chunks, each chunk in its own HTML page or, preferably, HTML fragment?

--- eric

Eric,

Writer will save a document as HTML if you use Save As and select HTML.
I am not sure how easy it would be to break up a document into 300-word
chunks and get the results you want. You could break the document up using
page breaks, then copy and paste each page into another document and save
that second document as HTML.

I am not sure how clean the HTML will be. Most programs insert extra
formatting that makes it difficult to understand what is going on.

Another way to think of this problem is formatting pages as if you were going to make a paperback book. Once a hard break is put in (which the word processor should do automatically), you generate the next page (HTML or paper). I expect one could get the approximate word count by using a stylesheet with a restricted page size. The next trick is splitting the result into separate pages.

I did something like that using WordStar years ago, to make help files of pages that fit on 24 x 80 screens. I also used database reports with markup around the fields that turned the text into RTF when I printed the text to disk. Here are some observations that may give you some ideas:

1. The hard part is the 300-word chunks. If you really mean exactly 300 words (and not the average writing assumption of the space that 300 5-character words take), you need to process the text file in some sort of program that inserts separators of some kind every 300 words.

2. If you mean 300 words on the average, what is considered an average amount of text on a page in a paper, you can set up the document to have pages that fit 300 words of text on the average. Just fiddle with the margins until the right amount is on a page after you load up the text as a document.

3. Then you can experiment with File | Send > Create HTML Document to see if it will make a set of web pages for you. My experience is that this converter is very flaky, but it might work for your text if it is simple.

4. Another solution is to get the page size to where it holds 300 words of text and then save it as plain text (the plain Text (.txt) filter; you can also try Text - Choose Encoding).

5. Then, to make separators, you can create a page header and footer that insert something like the literal characters "<p>" at the top of the page and "</p>" at the bottom of the page. (Pray that Save As ... .txt keeps headers and footers.)

6. Rename as .html and look at it in your browser. Most of the extra white space will go away and you'll have those blocks of words as HTML paragraphs. You might be able to load the .html into LibreOffice Writer and resave it as HTML to clean it up.

(Warning: If your text has "&", "<", and ">" characters in it, you'll have to do more to prevent them appearing in the HTML in a way that has them be mistaken as markup and treated incorrectly.)
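
For the exact-count case in item 1, the separate program can be quite small. The following is only a rough Python sketch, not anything Writer provides: it assumes the document has already been saved as plain text, treats every whitespace-separated token as a word, and escapes the characters mentioned in the warning above. The function names chunk_words and to_fragments are made up for illustration.

```python
import html


def chunk_words(text, size=300):
    """Yield successive lists of at most `size` whitespace-separated words."""
    words = text.split()
    for start in range(0, len(words), size):
        yield words[start:start + size]


def to_fragments(text, size=300):
    """Return one HTML paragraph fragment per chunk of `size` words.

    html.escape turns "&", "<" and ">" into entities so they cannot be
    mistaken for markup.
    """
    return [
        "<p>" + html.escape(" ".join(chunk)) + "</p>"
        for chunk in chunk_words(text, size)
    ]


if __name__ == "__main__":
    sample = "Fish & chips cost < 5 pounds. " * 100
    for number, fragment in enumerate(to_fragments(sample), start=1):
        # Show the chunk number and the start of each fragment.
        print(number, fragment[:40])
```

Paragraph breaks in the source are flattened by this approach, so it only suits the plain-text reading of the problem.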

- Dennis

  1. The hard part is the 300-word chunks. If you really mean exactly 300 words (and not the average writing assumption of the space that 300 5-character words take), you need to process the text file in some sort of program that inserts separators of some kind every 300 words.

  2. If you mean 300 words on the average, what is considered an average amount of text on a page in a paper, you can set up the document to have pages that fit 300 words of text on the average. Just fiddle with the margins until the right amount is on a page after you load up the text as a document.

That was a good suggestion. When I moved the lower margin up to about 7 inches, I got almost exactly 300 words on a page, close enough to perfect.

  3. Then you can experiment with the File | Send > Create HTML Document to see if it will make a set of web pages for you. My experience is that the converter for this is very flaky, but it might work for the text you have if it is simple.

That doesn't work. I can save as HTML and get a single file, but I'll need to run a preprocessor over it to clean it up. Unfortunately, I also lose the soft page breaks and header information. Converting those soft breaks to hard breaks during export is important because it should maintain the style across page boundaries.

  4. Another solution is to get the page size to where it holds 300 words of text and then save it as plain text (the plain Text (.txt) filter; you can also try Text - Choose Encoding).

  5. Then, to make separators, you can create a page header and footer that insert something like the literal characters "<p>" at the top of the page and "</p>" at the bottom of the page. (Pray that Save As ... .txt keeps headers and footers.)

The problem with this is that it loses style information along with images, etc. Other than that, it's a good suggestion.

Hi Eric,

Eric S. Johansson wrote:
[..]

3. Then you can experiment with the File | Send > Create HTML Document
to see if it will make a set of web pages for you. My experience is
that the converter for this is very flaky, but it might work for the
text you have if it is simple.

That doesn't work. I can save as HTML and get a single file, but I'll
need to run a preprocessor over it to clean it up. Unfortunately, I also
lose the soft page breaks and header information. Converting those soft
breaks to hard breaks during export is important because it should
maintain the style across page boundaries.

This export uses the content to divide it into several files. When you choose "Heading 1" from the drop-down list, you will get one html-file for each chapter.

Kind regards
Regina

That works nicely per chapter, but I need it per page. If I could stuff something into the page header, it would be ideal.

See if you can create a page header that has a Heading paragraph style. I know that is strange, but it is technically allowed in the ODF. It may not work, or the Send to HTML may not do anything with it, ... but you never know.

- Dennis

PS: I always found, except for the lucky case of exporting from a database to text that was really HTML/RTF, that post-processing of some sort was required.

PPS: When you said text in the original problem statement, I assumed plaintext and not formatted text. Preserving layout beyond paginated text flow is obviously more difficult.

PPPS: Are you trying to make a paperback book (on paper)? Or the equivalent as a set of web pages (i.e., there is navigation among the pages for page-turning, getting back to the index, tops of sections/chapters, etc.)? It seems to me that specialized tools for this sort of thing might be preferable to trying to shoehorn it into Writer. Also, you might look at how slide shows are done on the web. I am not sure that Impress provides any kind of import that would help with this, though. -- Just random thoughts from me at this point.

Eric

If you are trying to give each 300-word block its own page, one way to get consistent formatting across web pages is to use an external CSS sheet with the default formatting you want.

The CSS (cascading style sheet) will have the format information. Each
web page must reference the style sheet for the formatting to work. I have a
link with more information about creating web pages and CSS style sheets:

http://www.w3schools.com/default.asp
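
To show what sharing one external CSS file across the generated pages could look like, here is a small Python sketch that wraps a single chunk in a complete page whose head links to the stylesheet. The file name magazine.css and the function name wrap_fragment are placeholders chosen for this example, not anything Writer produces.

```python
import html
from string import Template

# Minimal page skeleton; every generated page links to the same CSS file.
PAGE = Template("""<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>$title</title>
<link rel="stylesheet" href="$stylesheet">
</head>
<body>
$body
</body>
</html>
""")


def wrap_fragment(fragment, title, stylesheet="magazine.css"):
    """Return a full HTML page containing `fragment` (already valid HTML)."""
    return PAGE.substitute(
        title=html.escape(title),
        stylesheet=html.escape(stylesheet),
        body=fragment,
    )


if __name__ == "__main__":
    print(wrap_fragment("<p>First 300 words ...</p>", "Page 1"))
```

Changing the layout (column width, single or dual column) then means editing only the one CSS file rather than every page.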

Good idea, but wrong context. For future reference, how do you tell Writer to use a particular stylesheet when you are working on a document and producing the HTML output?

What I was talking about was a *different problem ^entirely. This example only makes* sense in HTML. The caret marks the 300-word mark. If you are working in a word processor, the bold section would continue from one page to the next automatically. In HTML, I would need to close off all formatting and then reopen it on the next page, just like a word processor does.
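
The close-and-reopen behaviour described here can be sketched with Python's standard html.parser module. This is an illustration under simplifying assumptions, not a finished tool: it counts every whitespace-separated token in the text as a word, only knows a few void elements, and the class name PageSplitter is invented for the example.

```python
import re
from html import escape
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr"}  # elements that never get a closing tag


class PageSplitter(HTMLParser):
    """Split HTML into pages of at most `size` words, closing any open
    tags at each break and reopening them at the top of the next page."""

    def __init__(self, size=300):
        super().__init__(convert_charrefs=True)
        self.size = size
        self.pages = []      # finished pages
        self.current = []    # pieces of the page being built
        self.open_tags = []  # stack of (tag name, raw start-tag text)
        self.count = 0       # words on the current page

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text()
        self.current.append(raw)
        if tag not in VOID_TAGS:
            self.open_tags.append((tag, raw))

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: copy it through unchanged.
        self.current.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.current.append(f"</{tag}>")
        # Drop the innermost matching open tag, if there is one.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                del self.open_tags[i]
                break

    def handle_data(self, data):
        for token in re.findall(r"\s+|\S+", data):
            if token.isspace():
                self.current.append(token)
                continue
            if self.count == self.size:
                self._break_page()
            self.current.append(escape(token))
            self.count += 1

    def _break_page(self):
        # Close what is still open, innermost first.
        closers = "".join(f"</{tag}>" for tag, _ in reversed(self.open_tags))
        self.pages.append("".join(self.current).rstrip() + closers)
        # Reopen the same tags, outermost first, on the new page.
        self.current = [raw for _, raw in self.open_tags]
        self.count = 0

    def finish(self):
        """Flush buffered input and return the list of pages."""
        self.close()
        if self.count:
            self.pages.append("".join(self.current).rstrip())
        return self.pages


if __name__ == "__main__":
    splitter = PageSplitter(size=3)
    splitter.feed("one <b>two three four</b> five")
    for page in splitter.finish():
        print(page)
```

Because the raw start tag is saved, attributes such as class or style are carried over to the next page as well. A break that falls exactly between two paragraphs can leave an empty element at the end of a page, which a cleanup pass would need to remove.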

I should probably explain why I'm trying to do this so it makes more sense. I'm doing an experiment for an online literary magazine. One of the problems with putting writing on the net is that HTML is not formatted for reading. People's eyes need to take a break, and we have become accustomed to 300-word chunks, as found in most books. I don't know if that is an artifact of human wiring or of the mechanics of the printing process, but it seems to work. Putting writing into HTML ends up with a page that is both too wide and too long for easy reading.

My experiment involves automatically producing 300-word pages that can be lightly massaged into HTML for presentation online in a variety of formats (traditional or tabloid width, single column or dual column) to see which works well.

Yes, I could take the page structure I have now and cut and paste each page into an HTML editor, but I'm not doing this once or twice. I'm going to be doing this multiple times over a series of months, and I'd like something to automate the process. In the future, if the experiment pans out, it'll be worth it to write explicit code to do the parsing, the format checking, etc., and I'd probably start from books in the EPUB format. But today, it seemed like it would be so simple to use Writer to do most of the heavy lifting for me. It would be really nice if one could simply tell the writer to use Writer (need to come up with better names :-), hand the document to the editor who makes the work readable, and then have them run a macro that converts the document to HTML and an automated process that pushes the HTML online.

I hope that gives you a better understanding of why I'm trying to do this 300-words-per-page breakup. It's probably a horrible abuse of Writer to use it as document prep. I'm open to other tools that could be used to do last-minute adjustments and then automatic preparation.
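
As one possible starting point for the automated step, LibreOffice can be run without its user interface to do the HTML export from a script. The Python sketch below assumes the soffice executable is on the PATH; the function names and the story.odt file name are hypothetical. It only reproduces the single-file Save As HTML result mentioned earlier, so the cleanup and per-page splitting would still have to follow.

```python
import subprocess
from pathlib import Path


def build_convert_command(document, outdir, soffice="soffice"):
    """Build the command line for a headless LibreOffice HTML export."""
    return [
        soffice,
        "--headless",             # run without opening a window
        "--convert-to", "html",   # target format
        "--outdir", str(outdir),  # where the converted file is written
        str(document),
    ]


def convert_to_html(document, outdir):
    """Run the conversion and return the expected path of the HTML file."""
    subprocess.run(build_convert_command(document, outdir), check=True)
    return Path(outdir) / (Path(document).stem + ".html")


if __name__ == "__main__":
    # Print the command instead of running it, so this works without LibreOffice.
    print(" ".join(build_convert_command("story.odt", "out")))
```

An editor could finish the document in Writer as usual, and a scheduled job or upload hook could then call convert_to_html before the post-processing stage.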

Eric

OK, so the experiment is really about how good the HTML code produced by
Writer is for use in a web page with limited final editing of the HTML.

Ironically, I am working on a couple projects that are similar to what
you are describing. The projects are to convert a few out-of-print books
to web pages for a very elderly author.

If you want, we can talk off list about more of the details of how to do
this.