
Help:How to port pages

From Appropedia

Aims of porting

When we port pages, we have these aims:

  1. Provide clear attribution to the original author or copyright owner including details of the conditions under which the content can be reused.
  2. Present the original, "as the author intended" uneditable version. Once a port is complete and has been reviewed for noticeable hiccups, we create a specially named version of the file (beginning with Original:) and protect that version from edits (except by admins). Note that the "as the author intended" version may differ subtly from the source:
    • WWW links need to begin with http:// or https:// in order to function correctly, and they should be edited if necessary
    • References to images within the article may need adjusting since image placement may not be the same as in the original. E.g. "the picture at right" may become "picture #1", and captioning may also need related changes.
    • The original may be too large for a single article, and may be partitioned. See Category:Beyond dams.
  3. Create links to other pages as appropriate to extend the usefulness of the content.
  4. Incorporate the content into our Wiki pages, combining it with content from other sources and updating or expanding it as necessary in any other way consistent with the license for this content.
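
The link rule in aim 2 (WWW links must begin with http:// or https://) can be sketched as a small text filter. This is only an illustration: the regular expression and the default `http://` prefix are assumptions, and the output should still be checked by eye before saving the page.

```python
import re

# Matches a bare "www." link that is not already part of a longer URL.
# The pattern is an illustrative assumption, not an Appropedia tool.
BARE_WWW = re.compile(r"(?<![/\w.])(www\.[^\s<>\]\)]+)")


def add_url_scheme(text: str, scheme: str = "http://") -> str:
    """Prefix bare www links with a scheme so the wiki renders them as links."""
    return BARE_WWW.sub(lambda match: scheme + match.group(1), text)
```

Links that already carry a scheme are left alone, because the `www.` in them is preceded by a slash.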

Fixed vs evolving content

Different types of ported content call for different policies.

Ported Project pages represent descriptions of past activities. As such, the ported content is not an appropriate target for modification, and so an editable version of this content is not useful. Nevertheless, the page containing the content may legitimately be modified through changing or adding templates, links and categorization, either before or after the original content. An editable page that includes locked content, provided in a protected template, provides this capability. The template will include the original content plus appropriate attribution of the original content. There is no value in having a separate editable page.

On the other hand, for material that lends itself to collaborative evolution, a single copy of the original content cannot be both frozen and editable. Two pages will be established. First, a page containing frozen content will be created using the same protected template approach as for ported Project content. In addition, a second page will be created which includes all the content without protection. Attribution is required, as well as a link to the protected original content page.

Some content may contain a mix of historical (and therefore essentially fixed) content and content that lends itself to collaboration. Mixed content will be treated as collaborative content, provided that a substantial part of it is collaborative.

Identify the material to port and confirm it is available under a suitable license

The first phase of the process is to get copyleft permission and move the content cleanly into Appropedia. This portion of the process does not require administrator privileges. Here are the steps:

  • Identify the material to be ported.
    • In some cases this may already be done. E.g. see Practical Action/Pages to port
    • If you have identified a set of content to port, and want to see which files have already been ported, you can either search for the source name (e.g. search for Practical Action), or, if there is a specific template used on material from that source, go to that template page and click "What links here" in the left menu bar. (The link can be written out, e.g. Special:Whatlinkshere/Template:Attrib PATB.) Also, always search for the title before porting a specific document, in case someone has already ported it but didn't include the template or source name.
  • Get permission from the copyright owner.
    • If you got permission in a personal conversation then send an email to the donor, confirming what you believe was agreed. See the next item for details.
    • If you got permission via an email then check that it confirms
      1. what materials are covered
      2. that the sender of the email has the right to grant us permission on behalf of the copyright owner
      3. that we can publish the material under the CC-BY-SA (or CC-BY) license.
    • If it doesn't have all three of the elements above then email them back. Your email should describe the content covered and ask them to confirm items 2 and 3 above. Include a link to our CC-BY-SA page.
    • In some cases this also may already be done. E.g. see again Practical Action/Pages to port
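
The "What links here" check for a source template can also be done through the MediaWiki API, whose `list=embeddedin` query lists pages that transclude a given template. The sketch below only builds the query URL. The endpoint address is an assumption (it is not given on this page) and should be replaced with the real API location of the wiki.

```python
from urllib.parse import urlencode

# Assumed endpoint: not stated on this page, adjust for the actual wiki.
API_ENDPOINT = "https://www.appropedia.org/w/api.php"


def embeddedin_query_url(template: str, limit: int = 50,
                         endpoint: str = API_ENDPOINT) -> str:
    """Build an API URL listing pages that transclude the given template."""
    params = {
        "action": "query",
        "list": "embeddedin",
        "eititle": template,   # e.g. "Template:Attrib PATB"
        "eilimit": limit,
        "format": "json",
    }
    return endpoint + "?" + urlencode(params)
```

Opening the resulting URL in a browser returns the list as JSON. Pages ported without the template will not appear, so a title search is still needed.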

Create an Organisation page for the copyright owner

Attribution is not just a legal obligation. We are grateful to all the organizations contributing material to Appropedia and we want to make the attribution information as complete and up to date as possible.

  • Create a page for the copyright owner.
    • Include a link to their webpage.
    • Include a brief mission statement copied from their webpage.
    • Include details of the permission granted to us by them to copy their content. This is so that, if there is a dispute later, it is clear why we think we have permission. Complete information will mean disputes get resolved much more quickly.
      • If the permission is on a web page then copy the wording of the license from that web page, include the URL of that web page and note the date that you accessed that page. If the permission only applies to material on that page then note what is on that page.
      • If the permission is via email then add a note to the organisation page with the date of the email, who sent the email, who it was addressed to (use names not email addresses - we don't want to give email addresses to spammers) and quote the exact words of the email.
      • If the permissions get complicated (e.g. different conditions applying to different documents) then the licensing information may be moved to a separate sub-page named Organization Name/Licensing
  • Include a Category link on the organisation page to a Category for content from that organisation. This will create a new Category page. Add a note to the category page linking back to the organisation page.

Create a page for each document to be ported

  • Create the new wiki page in the "Original" namespace (that simply means the page name begins with "Original:" followed by the document title used for the original document.)
  • Copy/paste the following at the very top of the article on a line by itself:
{{Original header}}
  • If the porting process might take more than an hour, notify others that "porting is in process" by copy/pasting the following at the very top of the article on a line by itself
{{Being ported}}
  • Post the content to the wiki page in the original "as the author intended" version, without changes.
    • Exceptions to the "without changes" rule (these exceptions should not change the author's intent):
      • Changes to references to images within the document, and to the image captions, may be needed to retain the original meaning. For example, "picture at right" may be ambiguous because it can be difficult to exactly match layout, so the reference may be altered to "picture #xyz".
      • WWW URLs are required to begin with "http://" or "https://". URLs without those prefixes should be altered to include the prefix in order for the links to work correctly.
      • Categorization and notice templates may be added at the very beginning and very end of the article. This should include the Category for the organisation that owns the copyright of the content.
  • Use the appropriate template to give credit to the author(s) and link to the source document, if possible.
    • If you are not comfortable with templates, then put some information on the talk page and an admin will help.
  • When porting of the content is complete, remove the "being ported" template (if any) and copy/paste the following on a line by itself at the top of the article to notify an admin that the article is ready for protection:
{{Checkoriginal|~~~}}
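
The page-building steps above can be sketched as a few string helpers. This is a minimal illustration, not an Appropedia tool: the function names are hypothetical, and the order of the two notices at the top of the page is an assumption, since the steps only say each goes "at the very top".

```python
def original_page_title(document_title: str) -> str:
    """Page name in the "Original" namespace for a ported document."""
    return "Original:" + document_title.strip()


def build_original_page(body: str, in_progress: bool = False) -> str:
    """Put the notice templates on their own lines above the unchanged content."""
    header_lines = []
    if in_progress:
        # Only needed when the port might take more than an hour.
        header_lines.append("{{Being ported}}")
    header_lines.append("{{Original header}}")
    return "\n".join(header_lines) + "\n" + body.strip("\n") + "\n"


def mark_ready_for_review(page_text: str) -> str:
    """Drop the in-progress notice and ask an admin to check and protect the page."""
    lines = [line for line in page_text.split("\n")
             if line.strip() != "{{Being ported}}"]
    return "{{Checkoriginal|~~~}}\n" + "\n".join(lines)
```

The `~~~` in the last template is left as typed, so that the wiki expands it to the signature of whoever saves the page.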

Lock the original to prevent further changes

This next phase of porting must be done by an Appropedia admin. The admin responsibilities are described at Category:Checkoriginal.

Templates

Attribution templates

Other related templates


Methods

At present, we do not have a simple one-step PDF-to-MediaWiki conversion process that retains the desired text formatting. We have several multi-step approaches that retain text formatting, all of which can be broken down into two main steps:

  1. Convert (save) the PDF to a more workable intermediate format that supports rich formatting, and
  2. Convert the resulting file into MediaWiki format.

The various options for each of these main steps are described below.

The OpenOffice 3 approach may come closest to a straightforward solution, if we can get it to work.

Alas, we do not yet have an automated way to transfer the images (though mw:Extension:MultiUpload may make it easier). Help is welcome!


1. Save as formatted text

Convert to a Word, RTF or HTML file.

It is preferable to use:

  1. An open source solution, which can be used by anyone and can be improved if needed.
  2. If this doesn't work, Acrobat Professional - some academics, students and business people will have access to it, and it is likely to work better than freeware or web services.

(When searching for solutions, note that word combinations like PDF export get a lot of false hits, mainly for exporting to PDF and for the many commercial programs. So, try this search:

export OR convert pdf-to freeware -demo -free-trial images formatted OR layout)

Open source options

There is an extension for OpenOffice 3 Beta (presumably it also works with OO3) that facilitates import of PDF documents: Sun PDF Import Extension (Beta). If this works well, then combined with OpenOffice's existing MediaWiki export functionality it may be a one-stop tool for PDF to MediaWiki conversion. Chriswaterguy is trying this out now, but having trouble with bugs preventing acceptance of the EULA. 21:24, 14 December 2008 (UTC)

Alternatively:

  1. Use https://sourceforge.net/projects/pdftohtml/ or xpdf to convert to HTML.
  2. Clean up with htmltidy. It should now be ready to convert to MediaWiki.
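
The two steps above can be scripted. The sketch below assumes the `pdftohtml` and `tidy` command-line programs are installed, and that the flags shown (`-noframes` for a single HTML file; `-quiet`, `-asxhtml` and `-output` for tidy) behave as in the versions the writer has seen. Check each tool's `--help` output before relying on it.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def conversion_commands(pdf_path: str, out_dir: str) -> List[List[str]]:
    """Return the two command lines: PDF to HTML, then HTML clean-up."""
    pdf = Path(pdf_path)
    raw_html = Path(out_dir) / (pdf.stem + ".html")
    clean_html = Path(out_dir) / (pdf.stem + ".clean.html")
    return [
        ["pdftohtml", "-noframes", str(pdf), str(raw_html)],
        ["tidy", "-quiet", "-asxhtml", "-output", str(clean_html), str(raw_html)],
    ]


def run_conversion(pdf_path: str, out_dir: str) -> None:
    """Run both commands in order, stopping at the first failure."""
    for cmd in conversion_commands(pdf_path, out_dir):
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(cmd[0] + " is not installed or not on PATH")
        result = subprocess.run(cmd)
        # tidy exits with status 1 when it only has warnings to report.
        allowed = {0, 1} if cmd[0] == "tidy" else {0}
        if result.returncode not in allowed:
            raise RuntimeError("command failed: " + " ".join(cmd))
```

Calling `run_conversion("report.pdf", "out")` would leave a cleaned HTML file in `out` (a hypothetical folder name), ready for the HTML-to-MediaWiki step.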

Scanned documents:

  • Sometimes scanned documents have the actual text embedded in the document.
    • The pdftotext command extracts raw text: "pdftotext file.pdf" without the quotes.
    • Evince PDF viewer (and possibly others) allows you to select and copy.
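
To find out whether a scanned document really has embedded text, the output of `pdftotext` can be inspected page by page. The sketch below relies on pdftotext's default behaviour of ending each page with a form feed character. The 20-character threshold is an arbitrary assumption, and the function names are made up for this example.

```python
from typing import List


def pages_with_text(pdftotext_output: str, min_chars: int = 20) -> List[bool]:
    """For each page, report whether it holds a meaningful amount of text."""
    pages = pdftotext_output.split("\f")
    # Every page ends with a form feed, so the final piece is an empty leftover.
    if pages and pages[-1].strip() == "":
        pages = pages[:-1]
    # Count non-whitespace characters only.
    return [len("".join(page.split())) >= min_chars for page in pages]


def looks_scanned_only(pdftotext_output: str, min_chars: int = 20) -> bool:
    """True when no page has embedded text, i.e. OCR is probably needed."""
    return not any(pages_with_text(pdftotext_output, min_chars))
```

The input is the text file written by `pdftotext file.pdf`, read into a string.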

Acrobat Professional (i.e. paid version)

This is the best option so far. Check if your school/college/company has this program (you might have to ask for access).

  • In the non-free Adobe Acrobat there is an option to save to rich text formats - see Save a PDF file as a Word document, HTML file, or image. Fatima has used this and found it helpful. We should experiment and find the best path - saving to which format, and then using which method to translate to wiki markup.[expansion needed]
  • The free program Acrobat Reader has an "export as text" function, but only plain text. Copying and pasting also only gives plain text.

We could also test other readers to see if any allow copying with formatting. So far, the following is known to copy plain text only: Evince Document Viewer 2.24 for Linux.

"Freeware"

Options that are free (as in free beer) but not open source:

  • Sorax PDF SDK DLL Edition 1.1 - "export PDF files to... XML." (image or text?)
    Okay, so I (CurtB) have poked around with the Sorax DLL, and it looks interesting. I figured this commentary makes more sense on the article page than the talk page, and yet this is almost a discussion at this point, hence my chatty tone. It turns out that the DLL is indeed free, and the license for usage is generous, essentially do what you want with it as long as you don't reverse engineer it or hack it into something else. It also comes with a "demo" application, which could be very useful. Using this demo program, one can open a PDF document and export it into XML. It exports all the text (no images, sorry) into an xml file with useful formatting information, including font name, font size, italic or bold (true or false for those last two). There is a fair bit of other info that might not be interesting (paging, for example) and would need to be stripped off. Nevertheless, it is quite conceivable that a PERL or Python-based tool could quickly be written to strip the undesired stuff away, and convert the remainder to wiki form. Maybe even some clever SED scripts could do the bulk of the work! Yay!
    It should be noted that the DLL is really intended for use by developers, and most particularly for Visual C developers, since the tool includes a "vcproj" file, which is a Visual C project file. Python developers also may be able to make use of the tool, based on the information in the included (PDF, of course) document, and with some help from this page I found. Writing an actual application that could use the DLL would be ideal, since it would allow bulk translation of PDFs, instead of the one-by-one conversion process that would be offered by the Demo application plus xml-to-wiki tool. Hmm. New thought. How well does OpenOffice convert XML to Wiki? Be right back! Nope, not much help. Okay, done for now. CurtB 00:29, 5 February 2008 (PST)
Does Sorax do the formatting for image placement? (As wikEd does when converting HTML.)
  • Free PDF to Word Doc Converter - reviews and comments[1] suggest that this is "nagware" (i.e. freeware that hassles you and adds extra steps) and that Zamzar (online service, below) gives better results.
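
The XML-to-wiki tool suggested in the Sorax commentary above might look like the following. The XML layout used here (`text` elements with `size`, `bold` and `italic` attributes) is invented for illustration. The real Sorax export format is not documented on this page, so the element and attribute names would need adjusting, as would the font size treated as a heading.

```python
import xml.etree.ElementTree as ET


def xml_to_wikitext(xml_source: str, heading_size: float = 16.0) -> str:
    """Turn text runs carrying font information into simple wiki markup.

    Assumes a hypothetical schema: <text size="10" bold="true" italic="false">.
    Paging and other elements are ignored.
    """
    root = ET.fromstring(xml_source)
    lines = []
    for run in root.iter("text"):
        content = (run.text or "").strip()
        if not content:
            continue
        size = float(run.get("size", "0"))
        bold = run.get("bold") == "true"
        italic = run.get("italic") == "true"
        if size >= heading_size:
            # Large type is guessed to be a section heading.
            lines.append("== " + content + " ==")
        elif bold and italic:
            lines.append("'''''" + content + "'''''")
        elif bold:
            lines.append("'''" + content + "'''")
        elif italic:
            lines.append("''" + content + "''")
        else:
            lines.append(content)
    return "\n\n".join(lines)
```

Images are not handled, matching the note above that the export contains text only.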

Free online services

Check these (and do a search to make sure you've got the latest version):

  • docq - upload the file and it will convert online. DocQ provides online PDF editing, highlighting, and e-signing. Free account trials available.
  • Zamzar (review) - upload the file and receive an email with a link to the output file. Works well, some hassle and hiccups. Formatting may need extra work, e.g. double line-breaks need replacing with single line-breaks for best results. This is the only free solution known to work so far.
  • Jpeg compressor - free online image compressor that reduces the file size of JPEG, JPG, PNG, SVG, GIF, and WEBP images.
  • Adobe's online conversion service - appears broken. After a long period (e.g. 75 min) it still displays "In progress".
  • Form Swift
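
The line-break clean-up mentioned for Zamzar output can be done in one pass. This sketch replaces each pair of line breaks with a single one, so a genuine paragraph gap that came through as four breaks is kept as two. Whether that matches a particular Zamzar file is an assumption to check against the output.

```python
def collapse_double_breaks(text: str) -> str:
    """Replace each double line break with a single line break."""
    # Normalise Windows line endings first so both kinds are handled alike.
    text = text.replace("\r\n", "\n")
    return text.replace("\n\n", "\n")
```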

Commercial programs (apart from Adobe Acrobat)

Question: are there free trial versions that do what we need? Help by trying them out. (These programs are not guaranteed - do some Googling to make sure they're safe, and make sure you've got good anti-spyware and anti-virus.)

These are not ideal, as:

  1. we can't invite everybody to help out without paying lots of money or stretching/breaking the licensing agreements,
  2. they usually take an extra step, via Word, and
  3. they're only for Windows.

But for reference (in case of desperation):

OCR

When a PDF file (or other format) is image-based rather than text-based, OCR (optical character recognition) may be helpful. See User talk:LeissKG for a discussion of this technique.

OCR should probably be limited to those cases when text is only available as an image, as it will inevitably introduce some errors. It seems likely to be more difficult as well.[verification needed] Nevertheless, if proven out, this could be a useful tool for creating wiki versions of out-of-print articles or texts. Care must be taken, however, that copyright permissions are handled appropriately!

Here are some resources for OCR:

2. Convert from formatted text to MediaWiki

There are several options, notably using wikEd, or OpenOffice (version 2.3 or higher). See Appropedia:Porting formatted content to MediaWiki for full details.
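
To show what this conversion step involves, here is a minimal HTML-to-wiki-markup converter built on Python's standard `html.parser`. It is not how wikEd or OpenOffice work. It handles only bold, italic, headings, paragraphs, list items and links, and is meant as a sketch of the tag-to-markup mapping rather than a replacement for those tools.

```python
from html.parser import HTMLParser


class HtmlToWiki(HTMLParser):
    """Collects wiki markup for a small subset of HTML tags."""

    INLINE = {"b": "'''", "strong": "'''", "i": "''", "em": "''"}
    HEADINGS = {"h1": "=", "h2": "==", "h3": "==="}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag in self.INLINE:
            self.parts.append(self.INLINE[tag])
        elif tag in self.HEADINGS:
            self.parts.append("\n" + self.HEADINGS[tag] + " ")
        elif tag == "li":
            self.parts.append("\n* ")
        elif tag == "p":
            self.parts.append("\n")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            if self.href:
                self.parts.append("[" + self.href + " ")

    def handle_endtag(self, tag):
        if tag in self.INLINE:
            self.parts.append(self.INLINE[tag])
        elif tag in self.HEADINGS:
            self.parts.append(" " + self.HEADINGS[tag] + "\n")
        elif tag == "p":
            self.parts.append("\n")
        elif tag == "a" and self.href:
            self.parts.append("]")
            self.href = None

    def handle_data(self, data):
        self.parts.append(data)


def html_to_wikitext(html: str) -> str:
    parser = HtmlToWiki()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()
```

Tables, images and nested lists are deliberately left out. Those are the cases where the tools named above are worth the extra setup.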


Manual formatting - old method

This is not recommended, but if you have problems with the other methods and need to try it, see Help:Porting PDF files to MediaWiki (old method, manual formatting).

Images

Images must be saved and uploaded.

  • Until now, this has been done as described at Help:Porting PDF files to MediaWiki (old method, manual formatting) #Transfer the images. There may be easier ways now, but there is still useful info and tips there, e.g. don't try too hard to match the layout of the original. PDFs are fixed size, while the layout of the wiki article will flex based on several variables. So invest some energy in layout, but don't overdo it.
  • In PDF-to-HTML conversion the images will be output in the same folder. (However, with Zamzar, each page's images are turned into a single image taking up the whole page - the text fits around it.)
  • In PDF-to-Word conversion the images will be integrated in the document.
  • Acrobat: Images are apparently saved automatically during file export:

Question: Which of the formats include tags to indicate image location?
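
For the PDF-to-HTML case, where the images end up in the output folder, the wiki markup for them can be generated from the file names. This sketch assumes each file will be uploaded under its existing name, and it uses the "Picture #1" caption style mentioned in the aims. Both the helper names and the list of image extensions are illustrative.

```python
from pathlib import Path, PurePath
from typing import Iterable, List

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif"}


def image_wikitext(filenames: Iterable[str]) -> List[str]:
    """Return one [[File:...]] line per image, numbered in name order."""
    images = sorted(name for name in filenames
                    if PurePath(name).suffix.lower() in IMAGE_SUFFIXES)
    return ["[[File:" + name + "|thumb|Picture #" + str(number) + "]]"
            for number, name in enumerate(images, start=1)]


def images_in_folder(folder: str) -> List[str]:
    """Same as image_wikitext, for the files found in a conversion output folder."""
    return image_wikitext(p.name for p in Path(folder).iterdir() if p.is_file())
```

The generated lines still have to be placed by hand at the right points in the article text.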


See also
