Introduction
I found a manuscript earlier this year (2019), written by my dad about his time in Antarctica between 1958 and 1965. Although he passed away in 1996, I thought it would be nice to publish it, primarily for the benefit of surviving family members. It had been typewritten rather than written longhand, so I didn’t think it would be too difficult.
The manuscript was in an envelope postmarked 28 October 1965, so I’m reasonably confident it was over 50 years old. The paper was certainly yellowing and looking old. It also didn’t look like the original, given the type was slightly faded, fuzzy and indistinct, so I assume it was either a carbon copy from the typewriter or made with an early copying process such as a photostat. Here’s the raw opening line for example:
Suffice to say, it wasn’t quite as easy as I thought, but here are the key steps and lessons learned.
Scanning
The first step was scanning. I didn’t have a sheet feeder on my home scanner, and didn’t want to risk getting the fragile pages jammed in a sheet feeder in a shop or at work, so I scanned it one page at a time. Before scanning all 116 pages across 8 chapters, I scanned one chapter first, and experimented with the OCR settings. I used the highest DPI setting (600dpi), and scanned the whole chapter into a multi-page PDF rather than individual JPG image files.
Optical Character Recognition (OCR)
Using the handy Ubuntu page on OCR as a starting point, I decided to use Tesseract OCR.
A couple of notes about Tesseract:
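- It only works with TIFF files, at least according to that page (recent versions, which read images through the Leptonica library, also accept formats such as PNG and JPEG).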
- The TIFF files need to be monochrome, i.e. only contain black and white (note that this is different from grayscale).
Extracting the TIFF files
It turns out that, at least with my scanner, a PDF is just a wrapper for an image file, and a multi-page PDF is just multiple images with one image per page. ImageMagick has a handy convert input.pdf[pagenumber] outputpage.tif tool to extract an individual TIFF from a single page. However, the default configuration, at least on Ubuntu 18.04, doesn’t allow convert to read the PDF files, giving a “convert:not authorized” error, so the first step was to edit /etc/ImageMagick-6/policy.xml and change the line:
<policy domain="coder" rights="none" pattern="PDF" />
to
<policy domain="coder" rights="read|write" pattern="PDF" />
Note: This change may be overwritten when imagemagick is upgraded.
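The same change could be scripted rather than made by hand. This is only a sketch: the function name enable_pdf_policy is made up for illustration, it assumes the policy line is spelled exactly as shown above, and it keeps a .bak copy of the file in case the edit goes wrong.

```shell
# Sketch: switch the PDF coder policy from "none" to "read|write".
# enable_pdf_policy is a hypothetical helper name, not part of ImageMagick.
# $1 is the policy file to edit; a backup is written alongside it as $1.bak.
enable_pdf_policy() {
  sed -i.bak \
    's#rights="none" pattern="PDF"#rights="read|write" pattern="PDF"#' \
    "$1"
}
```

Running it against the real file needs root, e.g. from a root shell: enable_pdf_policy /etc/ImageMagick-6/policy.xml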
Once the convert tool was working, it could be combined with pdfinfo to count the number of pages in a PDF, and all pages could be extracted to individual TIFF files via a shell script such as:
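#!/bin/sh
# Extract every page of a chapter PDF to its own TIFF file.
INFILE="chapter1.pdf"
# pdfinfo prints a "Pages: N" line; take the number from it.
pages=$(pdfinfo "$INFILE" | awk '/^Pages:/ {print $2}')
for page in $(seq 1 "$pages"); do
file="page${page}.tif"
# ImageMagick page indices start at 0, hence the "- 1".
convert "${INFILE}[$((page - 1))]" "$file"
done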
I know I could have skipped all of this by simply scanning each individual page to an image file in the first place, but it seemed tidier to have the master scans in 8 chapter PDF files rather than 116 individual page TIFF files.
Converting to monochrome
The suggested way of converting to monochrome is to use the convert -monochrome flag, e.g. in a shell script:
convert -monochrome -density 600 $INFILE\[$(($page - 1 ))\] page.tif
However, I found the monochrome option introduced a halftone effect like newsprint which distorted the text and made the OCR output gibberish, for example:
After a fair bit of experimentation, including using Gimp to get quick visual feedback on the effects of different processing options, I settled on a two step process:
- Remove all colour and leave only two colours (black and white) in the picture even though technically the file was still colour.
- Convert that file to a monochrome file.
Within the first step, i.e. effectively converting to black and white, there were two key steps:
- Reduce the brightness and increase contrast so that the background was as light as possible and the text was as dark as possible. Remember one of the issues was that the paper was yellowing from age, and I didn’t want any of that yellow turning to black blobs in the final monochrome files in case the OCR turned them into garbage characters. For example:
- Adjust the threshold (called levels in Gimp) so only pure white and pure black remain. Target was for the text to be black and paper to be white, with the black text as clear as possible. Setting the threshold too high would result in spindly and broken characters, and too low would turn the characters into black blobs. Remember also that one of the issues was that the original characters were slightly fuzzy and indistinct in the first place, given it was a carbon copy or photostat. For example:
I also found that a couple of chapters were on a different type of paper which hadn’t yellowed as much, so different settings were needed for those. I also put in a -grayscale and -despeckle for good measure.
For reference, the settings and values for all chapters apart from chapters 4 and 7 were:
convert -grayscale Rec709Luminance -brightness-contrast -50x80 -despeckle -density 600 -white-threshold 80 -black-threshold 80 $INFILE\[$(($page - 1 ))\] tempgray.tif
and the settings for chapter 4 and 7 were:
convert -brightness-contrast -60x100 -white-threshold 1 -black-threshold 99 -density 600 $INFILE\[$(($page - 1 ))\] tempgray.tif
I wouldn’t expect these actual values to work universally, but a similar approach with different values may be useful for other old and faded documents. I’d imagine it might not be necessary at all if your source images are on clean white paper with crisp, well-defined black text.
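Since the settings differ by chapter, one way to keep them in a single place is a small lookup function. This is a sketch only: cleanup_options is a made-up name, and the option strings are simply the two sets of values quoted above.

```shell
# Sketch: print the convert clean-up options for a given chapter number.
# cleanup_options is a hypothetical helper; the values are the ones quoted above.
cleanup_options() {
  case "$1" in
    4|7)
      # The two chapters on the less yellowed paper.
      echo "-brightness-contrast -60x100 -white-threshold 1 -black-threshold 99 -density 600"
      ;;
    *)
      # Every other chapter.
      echo "-grayscale Rec709Luminance -brightness-contrast -50x80 -despeckle -density 600 -white-threshold 80 -black-threshold 80"
      ;;
  esac
}
```

It would be used with the expansion deliberately left unquoted, so the shell splits the string into separate options, e.g. convert $(cleanup_options 4) "chapter4.pdf[0]" tempgray.tif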
Once the source pages were just two colours (despite being in a multicolour file), I was able to use the convert -monochrome flag without it getting mangled by dithering:
convert -monochrome tempgray.tif tempmono.tif
Outputting for example:
Convert to text, and manually correct
Once the source pages were in the correct format and as clear as possible, it was simply a case of running the OCR on each of the files:
tesseract tempmono.tif tempoutput
The output text wasn’t perfect, but given the issues with the original source manuscript I thought it was reasonably good in the end. I also found that Tesseract 4 was much better than Tesseract 3.
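The steps so far could be chained into one function per page. This is a sketch rather than a record of the exact script used: ocr_page, run and the DRY_RUN switch are names made up for illustration, the clean-up values are the ones given for most chapters, and naming the output pageN is an assumption.

```shell
#!/bin/sh
# Sketch of the per-page pipeline: extract and clean, convert to monochrome, OCR.
# run, ocr_page and DRY_RUN are hypothetical names, not from any tool.

# With DRY_RUN=1 the command is printed instead of executed.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# ocr_page PDF PAGE
# PAGE is 1-based; ImageMagick's [index] suffix is 0-based.
ocr_page() {
  pdf=$1
  page=$2
  index=$((page - 1))
  # Step 1: pull the page out of the PDF and reduce it to black and white.
  run convert -grayscale Rec709Luminance -brightness-contrast -50x80 \
    -despeckle -density 600 -white-threshold 80 -black-threshold 80 \
    "${pdf}[${index}]" tempgray.tif
  # Step 2: convert the two-colour image to a true monochrome file.
  run convert -monochrome tempgray.tif tempmono.tif
  # Step 3: OCR; tesseract appends .txt to the output base name.
  run tesseract tempmono.tif "page${page}"
}
```

Setting DRY_RUN=1 first gives a quick way to check the generated commands before committing to a long run over every page, e.g. DRY_RUN=1; ocr_page chapter1.pdf 3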
I was able to read and correct all the text files manually on the commute to work over the space of a few days. Common errors included using a instead of e, c instead of o, and (didn’t spot this one until in the PDF) O (capital o) instead of 0 (zero).
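The O-for-zero confusion in particular is the sort of thing a quick search might flag before it reaches the PDF. A minimal sketch, assuming the errors show up as a capital O directly next to a digit (find_suspect_zeros is a made-up name):

```shell
# Sketch: list lines where a capital O sits next to a digit, e.g. "1O" or "O5".
# find_suspect_zeros is a hypothetical helper; pass it one or more text files.
find_suspect_zeros() {
  grep -nE '[0-9]O|O[0-9]' "$@"
}
```

It only reports candidates with their line numbers; each one would still need checking by eye, since something like a reference code could legitimately mix the two characters.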
Typesetting
Now I had the book in a series of text files, I needed to select a Print on Demand (PoD) service, and prepare the electronic manuscript.
As much as I’d like to say I chose an independent Print on Demand service, for my first foray into PoD I picked one of the more well-known ones - Amazon’s Kindle Direct Publishing (KDP). They needed the manuscript in a PDF file.
Being a bit of a fan of Markdown, I initially thought I’d keep the book master source in Markdown, and use a tool such as pandoc to convert to another format before converting to PDF. But given I really just wanted printed editions of the book, I thought I’d take the plunge and learn LaTeX [1], and keep the book master source in that format.
Installing LaTeX
One of the first things I found was that there are different types of LaTeX installs. On trying a few simple example documents I was often getting some strange errors. It turns out I had installed via:
sudo apt install texlive-base
when some of the examples needed:
sudo apt install texlive-latex-recommended
Creating the LaTeX template
I had decided initially on a “trim size” of 6" x 9", wanting something slightly larger than a standard paperback, given it was non-fiction and slightly academic in nature.
After a lot of searching for LaTeX KDP templates, and reading lots of convoluted instructions online, I eventually found it needn’t be especially complicated. In this case I thought it would be better to start from a blank template and keep adding to it until I got what I wanted so that I understood every change, rather than to take a big existing template that had a whole bunch of stuff I didn’t understand and try to get to where I wanted by reverse engineering. Two key elements were:
- Document class. Main options for a book are book, scrbook and memoir. Seems there’s a fair bit of debate about the merits of each, but I settled on book for no particular reason other than that the default output was closer to what I wanted:
\documentclass[]{book}
- Page size. I couldn’t find anything online I could directly reuse, but one tip was to temporarily enable \usepackage{layout} and then insert two pages showing margins etc. via \layout, so you could see the settings it calculates for you and the effects of any changes. For the 6" x 9" trim I’d initially chosen, I set the following which seemed to work:
\usepackage[paperwidth=6in, paperheight=9in, left=1in, marginparsep=0in, marginparwidth=0in]{geometry}
The individual chapters were included via e.g.
\input{ch1.tex}
And I created a table of contents, front matter etc. and structured as follows:
\begin{document}
\frontmatter
\maketitle
\input{frontmatter.tex}
\tableofcontents
\input{ch0.tex}
\mainmatter
\pagenumbering{arabic}
\input{ch1.tex}
...
\end{document}
Converting the chapters to LaTeX
There were actually only 4 main changes I needed to make to the plain text chapter source files for LaTeX:
- Add \chapter{} and \section{}.
- Change the double quote characters to `` and '' for left and right quotes respectively.
- Escape all instances of %, i.e. change to \%.
- Fix all the degree symbols (of which there were many in a book about Antarctica). Unfortunately I couldn’t get one of the “proper” ways of doing it working, e.g. the \usepackage{gensymb} and \celsius, so had to use the “hack” of using $^\circ$.
If I ever need to change any of these, I can do a global search and replace with a command like:
sed -i 's/\$^\\circ\$/\\celsius/g' *.tex
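The mechanical changes in the list above could also be applied in one pass with sed. This is a sketch under some assumptions: to_latex is a made-up name, the input is raw OCR text with nothing escaped yet, each pair of straight double quotes opens and closes on the same line, and degree signs appear as the ° character.

```shell
# Sketch: apply the mechanical plain-text-to-LaTeX fixes to standard input.
# to_latex is a hypothetical helper; it assumes unescaped input, quote pairs
# that sit on a single line, and degree signs written as the ° character.
to_latex() {
  sed -e 's/%/\\%/g' \
      -e 's/"\([^"]*\)"/``\1'"''"'/g' \
      -e 's/°/$^\\circ$/g'
}
```

For example, to_latex < ch1.txt > ch1.tex, where ch1.txt is an assumed name for a corrected OCR output file. The \chapter{} and \section{} markup still has to be added by hand. One caution for any later search and replace: \celsius from gensymb prints the degree sign together with the C, so if the text already has a unit letter after the degree sign, \degree from the same package followed by {} is probably the closer substitute.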
Adding images
There were also some faded photos with the manuscript, so I thought it would be good to include some of those. I scanned, and cleaned up some of the dust specks with the clone tool in Gimp.
I didn’t want a figure number in the caption, and preferred to see the photos at the bottom of the page, so I included them with the following (which needs the graphicx and caption packages loaded in the preamble):
\begin{figure}[!b]
\includegraphics[width=\textwidth]{<filename>.jpg}
\captionsetup{labelformat=empty}
\caption{<description>}
\end{figure}
Putting it all together
Compiling the book with the following generated quite a nice looking PDF:
pdflatex antarctica.tex
There were a few tweaks I wanted, e.g. to remove the text in the running headers (\pagestyle{plain}) and allow chapters to start on the left to remove all the blank left pages (changed document class to \documentclass[openany]{book}). But it hadn’t been as much work as I’d expected based on initial reading.
It was now ready to upload to KDP. I was half expecting issues or warnings, but KDP seemed to accept the PDF just fine.
The main thing I noticed was that it was going to be a very slim book (6mm) so I decided to lower the trim size to 5.5" x 8.5" to make it slightly thicker (7mm). This was again relatively straightforward, changing the geometry to:
\usepackage[paperwidth=5.5in, paperheight=8.5in, left=0.75in, marginparsep=0in, marginparwidth=0in]{geometry}
Designing the cover
I tried the cover designer in KDP, but all the templates were a bit naff, so I downloaded a blank template for my trim size and book width and designed my own in Gimp.
When I came to upload my own I realised that they needed the cover in PDF format, but with what I’d learnt above it was a simple matter of:
convert cover.jpg cover.pdf
I requested a printed proof for a final check, and there it was! Just had to hit publish, and order some author copies to give out to family and friends, and job done.
Conclusion
Family members have been delighted with the results, my mother in particular getting quite emotional about it, so it has been an exceedingly worthwhile project in that respect.
Perhaps the information here will also be of help to someone else trying to do something similar.
Costs were very low, given I already had a scanner, used open source software, used the Amazon provided ISBN, and the copies I distributed were the author copies at cost-price (plus delivery costs).
Source is at https://gitlab.com/michael-lewis/yearsonice . This isn’t currently a public repo so please request access if you are interested. If I do give access, please respect the copyright, especially for the photos which were taken while in the employment of (what is now) the British Antarctic Survey.
I don’t have immediate plans for an eBook version, given the target audience is primarily elderly relatives, but it shouldn’t be difficult to do.
And is there anything I’d change if doing a new edition? Firstly, the bottom margin is a tiny bit larger than I’d have liked, and I only realised this with the proof, but given the limited audience and that I hadn’t found any other bigger changes I decided to leave it as is. Secondly, if changing the top and bottom margins, I might have another go at trying to get the book title in the left header and the chapter title in the right header. Thirdly, I might try a slightly larger font for the body text, given the target readership. And fourthly, I’m not so sure about the front cover colour now. The original photo was one of the very few colour photos he had (almost all were black and white, many actually a strong sepia colour given their age, but they converted nicely to black and white), and it did have that strong purplish colour, which I kept because I thought it conveyed both age and coldness. Now that I’ve seen it listed online against other books, I’m a bit worried it looks like a mistake.
A final note - curiously, KDP don’t currently provide any reminder about Legal Deposit requirements. In the UK’s case, one copy has to be sent to the British Library within 1 month of publication, and the 5 other Legal Deposit Libraries may request copies within 12 months.
Oh, and in case anyone really is interested, here’s the actual book: Years on Ice: Life in Antarctica 1958-1965.
[1] Back when I did my MSc, I was actually pretty keen, believe it or not, on an up-and-coming new operating system called Microsoft Windows. When it came to writing up my MSc thesis, it was just assumed everyone would use LaTeX. However, I asked what the rules were, and the only strict requirement was that the department had to have a Postscript version of the thesis so they could print out copies easily. I checked the word processing package I had, which was something called Microsoft Word for Windows, and found it was able to print to a Postscript file. Subsequently I believe I became the first person in the department to write their thesis in Microsoft Word. Halfway through I needed to pay a lot of money to upgrade the RAM on my PC in order to be able to finish my thesis, but it seemed worth it to be on the cutting edge of an exciting new technology, and for the smug feeling of being the one doing it “the easy way” while everyone else seemed to be battling LaTeX. Of course, with the benefit of hindsight, I think I was wrong, for many reasons, not least because all those people who used LaTeX still have their thesis in a perfectly readable format decades later whereas mine is locked up in a proprietary binary format which no-one can decode any more. On that note, when I was working at a software company, they had some people over from the Microsoft HQ in Redmond to help with some of their routines for opening Word documents, and (at least according to the Microsoft staff) it seems Microsoft had actually lost some of the source code for opening and decoding some of the early Word document formats. Anyway, the point of this anecdote is not just that open formats are better than closed proprietary ones, but also that shiny and new isn’t necessarily better in the long term.
When I meet someone just starting out in their career, raving about the latest trendy new JavaScript library for example, I sometimes remind myself of that, but also that I should be respectful given that I was once like that.