kindle thread

You are currently reading a thread in /lit/ - Literature

Thread replies: 8
Thread images: 1
File: NSE6h36[1].png (17 KB, 579x530)
How do I convert a PDF to AZW3/MOBI in such a way that I get normal paragraphs? I converted it with the default settings and got this. It's readable, but I still wish it weren't so messy.
>>
>>8277193
As far as I know a PDF doesn't contain any explicit information about how it was formatted; in this regard it works more like an image than a text file. The only way of extracting the raw text is through optical character recognition (OCR), where each character is isolated and identified by comparison with known characters. Often the user of the OCR software needs to manually identify the first occurrences of some characters before the software can reliably recognise them.

As you can imagine the structure of a document is even less readily available to a digital interpreter than the characters are. For instance, say you press enter as you reach the end of a row in a text editor. The next word would have appeared on the next row either way, but now there's a forced line break that can't possibly be identified optically. Similarly, all formatting methods need to be inferred from the clues given by the optical layout of the actual text.

This process is called document layout analysis and is to some extent used in all OCR software, but the accuracy with which it can analyse a text obviously varies. I imagine the automatic OCR that took place when you "converted" your PDF didn't employ a very sophisticated form of layout analysis, so you should try to find software that is more powerful in that regard, perhaps one that also gives you more control over the interpretations it makes.
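
The forced-line-break point above can be shown in a few lines. This is only a sketch using Python's standard `textwrap` module; the `render` helper, the sample sentence, and the row width of 20 are made up for illustration. One string has no newline in it, the other has a newline exactly where the row would have wrapped anyway, and both come out as the same rows.

```python
import textwrap

WIDTH = 20  # hypothetical row width, chosen only for this demo


def render(text: str, width: int = WIDTH) -> list[str]:
    """Lay text out as fixed-width rows, honouring explicit newlines."""
    rows = []
    for piece in text.split("\n"):
        # An empty piece (blank line) still takes up one row.
        rows.extend(textwrap.wrap(piece, width) or [""])
    return rows


# Same words; the second string has a forced break after "fox".
soft = "The quick brown fox jumps over the lazy dog"
hard = "The quick brown fox\njumps over the lazy dog"

for row in render(soft):
    print(row)

# The rendered rows are identical, so the forced break is invisible on the page.
print(render(soft) == render(hard))
```

Anything that only sees the rendered rows has to guess which row ends were typed and which were just wrapping.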
>>
>>8277308
>As far as I know a PDF doesn't contain any explicit information about how it was formatted
I should have written "...about how it was originally constructed"
>>
>>8277193

What book is it? I bet libgen has an epub which is very easy to convert to mobi using calibre.
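
If an EPUB does turn up, the calibre conversion can also be scripted through calibre's `ebook-convert` command-line tool, which picks the output format from the output file's extension. A minimal sketch, assuming calibre is installed and `ebook-convert` is on your PATH; the function names and file names are placeholders.

```python
import shutil
import subprocess


def build_command(src: str, dst: str) -> list[str]:
    """Build the argument list for calibre's ebook-convert tool."""
    # ebook-convert takes the input file, then the output file.
    return ["ebook-convert", src, dst]


def convert(src: str, dst: str) -> None:
    """Run the conversion; raises if calibre's tool is missing or fails."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found; is calibre installed?")
    subprocess.run(build_command(src, dst), check=True)


# Placeholder file names; call convert("book.epub", "book.mobi") to run it.
print(build_command("book.epub", "book.mobi"))
```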
>>
>>8277308
Do you ever think about not opening your mouth when you don't know wtf you are talking about? I hope you don't do this offline too.

A PDF can contain many kinds of media, like text, images, sound, 3D, and more. If the input media is text, then you should be able to convert that text fairly easily to another format. It has NOTHING to do with OCR.

If the original input media is an image or a picture of text, however, you will have to use some kind of OCR (optical character recognition) software to convert the image text to ASCII text or UTF or whatever.

If the PDF contains text (not an image of text) you can just select the text in the viewer and copy & paste it somewhere else.
>>
>>8278081
*-sound.
>>
>>8277193
1. Get your PDF OCR'd. If you can select text, it already is. If you can't, pirate ABBYY FineReader and run it through.
2. Crop the PDF. Use Adobe portable to crop out page numbers, headers, and footers if possible. Use the Adobe tools to remove hidden information. Do NOT remove hidden text; that de-OCRs the document.
3a. Proof the OCR. This spoils the book. Otherwise:
4. Using Acrobat, save as .rtf.
5. Edit the document. Use find-and-replace to remove paragraph breaks unless the line ends with a period, in which case add an extra paragraph break.
6. Use calibre to create an EPUB from your RTF. This should be a readable version of the PDF.

Alternatively, if you have a large-screen model you could just save the PDF to EPUB as images, but the margins never work for me.
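
The find-and-replace pass in step 5 can be sketched as a small script if you would rather run it on a plain-text export than do it by hand. This is only one reading of that step, not necessarily what the poster does in their editor: the `unwrap_paragraphs` name and the sample text are made up, and the "line ends with a period" rule is taken literally.

```python
def unwrap_paragraphs(text: str) -> str:
    """Join wrapped lines; start a new paragraph after a line ending in '.'."""
    paragraphs = []
    current = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines are ignored; only line endings decide breaks
        current.append(line)
        if line.endswith("."):
            # A period at the end of a line is treated as the paragraph's end.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        # Keep trailing text that never reached a period.
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)


sample = (
    "It was a dark and\n"
    "stormy night. The rain\n"
    "fell in torrents.\n"
    "A new paragraph starts\n"
    "here and ends here.\n"
)
print(unwrap_paragraphs(sample))
```

The rule is crude: a line that happens to end on a sentence in the middle of a paragraph gets split, and paragraphs ending in a quote mark or question mark get merged with the next one, so expect to fix some by hand.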
>>
>>8278678
Have you done this on many books? I am interested in this. How well does ABBYY perform? Is there a need for much editing? Does it get a lot of words/characters wrong?
