kindle thread

Thread replies: 8
Thread images: 1

Anonymous
kindle thread 2016-07-14 10:25:38 Post No. 8277193

File: NSE6h36[1].png (17 KB, 579x530)

How do I convert a PDF to azw3/mobi in such a way that I get normal paragraphs? I converted it with default settings and got this. It's readable, but I still wish it wasn't so messy.

>>

Anonymous 2016-07-14 11:44:18 Post No.8277308

>>8277193
As far as I know a PDF doesn't contain any explicit information about how it was formatted; in this regard it works more like an image than a text file. The only way of extracting the raw text is through optical character recognition (OCR), where each character is isolated and identified by comparison with known characters. Often the user of the OCR software needs to manually identify the first occurrences of some characters before the software can reliably recognise them.

As you can imagine, the structure of a document is even less readily available to a digital interpreter than the characters are. For instance, say you press enter as you reach the end of a line in a text editor. The next word would have appeared on the next line either way, but now there's a forced line break that can't possibly be identified optically. Similarly, all formatting needs to be inferred from the clues given by the optical layout of the actual text.

This process is called document layout analysis and is used to some extent in all OCR software, but the accuracy with which it can analyse a text obviously varies. I imagine the automatic OCR that took place when you "converted" your PDF didn't employ a very sophisticated form of layout analysis, so you should try to find software that is more powerful in that regard, perhaps something that also gives you more control over the interpretations it makes.
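
To make "inferring structure from layout clues" concrete, here is a minimal sketch in Python of one very crude clue: line length. It assumes a line that almost fills the page width was wrapped by the layout, while a noticeably shorter line ends a paragraph. The function name `join_wrapped_lines` and the `unwrap_factor` threshold are made up for this illustration, and real layout analysis uses far more signals (indentation, spacing, fonts).

```python
def join_wrapped_lines(lines, unwrap_factor=0.8):
    """Merge visually wrapped lines into paragraphs using line length only.

    A line at least `unwrap_factor` times as long as the longest line is
    assumed to have been wrapped by the page width, so it is joined to the
    next line. A shorter line is assumed to end a paragraph.
    """
    stripped = [line.strip() for line in lines]
    longest = max((len(s) for s in stripped), default=0)
    paragraphs, current = [], []
    for s in stripped:
        if not s:
            # A blank line is always treated as a paragraph break.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        current.append(s)
        if len(s) < unwrap_factor * longest:
            # Short line: assume the paragraph ends here.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

A last line that happens to fill the full width would be merged into the next paragraph by mistake, which is exactly the kind of ambiguity described above.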

>>

Anonymous 2016-07-14 11:49:09 Post No.8277317

>>8277308
>As far as I know a PDF doesn't contain any explicit information about how it was formatted
I should have written "...about how it was originally constructed"

>>

Anonymous 2016-07-14 16:08:54 Post No.8277923

>>8277193

What book is it? I bet libgen has an epub which is very easy to convert to mobi using calibre.
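
If you'd rather script that conversion than click through the GUI, calibre ships a command-line tool called `ebook-convert` that takes an input file and an output file. A minimal Python sketch, assuming calibre's command-line tools are installed and on your PATH; the file names and the helper names `build_convert_command` and `convert` are placeholders made up for this example:

```python
import shutil
import subprocess


def build_convert_command(source, target):
    """Return the argument list for calibre's ebook-convert CLI."""
    return ["ebook-convert", source, target]


def convert(source, target):
    """Run the conversion; return True on success, False otherwise."""
    if shutil.which("ebook-convert") is None:
        # calibre's command-line tools are not installed or not on PATH.
        return False
    result = subprocess.run(build_convert_command(source, target))
    return result.returncode == 0


if __name__ == "__main__":
    # Hypothetical file names, replace with your own.
    print(convert("book.epub", "book.mobi"))
```

The output format is picked from the target file's extension, so the same call should work for azw3 by changing the target name.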

>>

Anonymous 2016-07-14 17:01:29 Post No.8278081

>>8277308
Do you ever think about not opening your mouth when u don't know wtf you are talking about? I hope u dont do this offline also.

A pdf can contain many kinds of media, like text, images, sound, 3d, and more. If the input media is text, then you should be able to convert that text fairly easily to another format. It has NOTHING to do with OCR.

If the original input media is an image or a picture of text however, you will have to use some kind of OCR (optical character recognition) software to convert the image text to ASCII or UTF-8 text or whatever.

If the pdf contains text (not an image of text) you can just select the text in the editor and copy and paste it somewhere else.
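
The same "can you select the text?" check can be done per page in a script. A rough Python sketch, assuming you have already pulled the text layer of each page into a list of strings with whatever extractor you like (none is called here); the function name `pages_needing_ocr` and the `min_chars` cutoff of 20 are arbitrary choices for this illustration:

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 0-based indices of pages with too little extractable text.

    `page_texts` holds the text extracted from each page's text layer.
    Pages with fewer than `min_chars` non-whitespace characters are
    assumed to be images of text and flagged for OCR.
    """
    flagged = []
    for index, text in enumerate(page_texts):
        visible = "".join(text.split())  # drop all whitespace
        if len(visible) < min_chars:
            flagged.append(index)
    return flagged
```

An empty list back means every page already has a usable text layer and OCR can be skipped. Genuinely blank pages or pages with only a page number will be flagged too, so treat the result as a hint.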

>>

Anonymous 2016-07-14 19:56:42 Post No.8278653

>>8278081
*-sound.

>>

Anonymous 2016-07-14 20:06:35 Post No.8278678

>>8277193
1. Get your pdf OCR'd. If you can select text, it already is. If you can't, pirate ABBYY FineReader and run it through.
2. Crop the PDF. Use Adobe portable, crop out page numbers, headers, and footers if possible. Use Adobe tools to remove hidden information. Do NOT remove hidden text; that de-OCRs the document.
3a. Proof OCR. This spoils the book. Otherwise:
4. Using Acrobat, save as .rtf.
5. Edit the document. Use find-and-replace to remove paragraph breaks unless the line ends with a period, in which case add another paragraph break.
6. Use calibre to create an epub from your RTF. This should be a readable version of the PDF.
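
Step 5 can also be done with a short script instead of manual find-and-replace. A minimal Python sketch, assuming the text has been exported to plain text with one line break per visual line; the names `ends_sentence` and `reflow` are made up for this example, and treating `!` and `?` (plus trailing quotes or brackets) like a period is an extension of the rule above, not part of it:

```python
SENTENCE_END = (".", "!", "?")


def ends_sentence(line):
    """True if the line ends with sentence punctuation, ignoring closing quotes/brackets."""
    return line.rstrip("\"')]").endswith(SENTENCE_END)


def reflow(text):
    """Join lines into paragraphs, breaking only after a sentence-ending line."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    paragraphs, current = [], []
    for line in lines:
        current.append(line)
        if ends_sentence(line):
            # Line ends a sentence: close the paragraph here.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        # Leftover lines that never reached a sentence end.
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)
```

This over-splits whenever a wrapped line happens to end on a period mid-paragraph and does not rejoin words hyphenated across lines, so some manual cleanup is still likely.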

Alternatively, if you have a large-screen model, you could just save the pdf as images in an epub, but the margins never work for me.

>>

Anonymous 2016-07-14 21:57:33 Post No.8279041

>>8278678
Have you done this on many books? I am interested in this. How well does ABBYY perform? Is there a need for much editing? Does it get a lot of words/characters wrong?