Skip to content

Introduction

There are 5 components to the code. They are kept as separate files in the html2text directory. This part of the documentation explains them bit by bit.

compat.py

This part exists only to test compatibility with the available python standard libraries. Python3 relocated some libraries and so this file makes sure that everything has a common interface.

config.py

Used to provide various configuration settings to the converter. They are as follows:

  • UNICODE_SNOB for using unicode
  • ESCAPE_SNOB for escaping every special character
  • LINKS_EACH_PARAGRAPH for putting links after every paragraph
  • BODY_WIDTH for wrapping long lines
  • SKIP_INTERNAL_LINKS to skip #local-anchor things
  • INLINE_LINKS for formatting images and links
  • PROTECT_LINKS protect from line breaks
  • GOOGLE_LIST_INDENT no of pixels to indent nested lists
  • IGNORE_ANCHORS
  • IGNORE_IMAGES
  • IMAGES_AS_HTML always generate HTML tags for images; preserves height, width, alt if possible.
  • IMAGES_TO_ALT
  • IMAGES_WITH_SIZE
  • IGNORE_EMPHASIS
  • BYPASS_TABLES format tables in HTML rather than Markdown
  • IGNORE_TABLES ignore table-related tags (table, th, td, tr) while keeping rows
  • SINGLE_LINE_BREAK to use a single line break rather than two
  • UNIFIABLE is a dictionary which maps unicode abbreviations to ASCII values
  • RE_SPACE for finding space-only lines
  • RE_ORDERED_LIST_MATCHER for matching ordered lists in MD
  • RE_UNORDERED_LIST_MATCHER for matching unordered list matcher in MD
  • RE_MD_CHARS_MATCHER for matching Md \,[,],( and )
  • RE_MD_CHARS_MATCHER_ALL for matching ,*,_,{,},[,],(,),#,!
  • RE_MD_DOT_MATCHER for matching lines starting with 1.
  • RE_MD_PLUS_MATCHER for matching lines starting with +
  • RE_MD_DASH_MATCHER for matching lines starting with (-)
  • RE_SLASH_CHARS a string of slash escapeable characters
  • RE_MD_BACKSLASH_MATCHER to match \char
  • USE_AUTOMATIC_LINKS to convert <a href='http://xyz'>http://xyz</a> to <http://xyz>

utils.py

Used to provide utility functions to html2text Some functions are:

FunctionDescription
name2cpname to code point
hnheadings
dumb_property_dicthash of css attrs
dumb_css_parserreturns a hash of css selectors, each containing a hash of css attrs
element_stylehash of final style of element
google_list_stylefind out ordered?unordered
google_has_heightdoes element have height?
google_text_emphasisa list of all emphasis modifiers
google_fixed_width_fontcheck for fixed width font
list_numbering_startextract numbering from list elem attrs
skipwrapskip wrap for give para or not?
escape_mdescape md sensitive within other md
escape_md_sectionescape md sensitive across whole doc

cli.py

Command line interface for the code.

OptionDescription
--versionShow program version number and exit
-h, --helpShow this help message and exit
--ignore-linksDo not include any formatting for links
--protect-linksProtect links from line breaks surrounding them "+" with angle brackets
--ignore-imagesDo not include any formatting for images
--images-to-altDiscard image data, only keep alt text
--images-with-sizeWrite image tags with height and width attrs as raw html to retain dimensions
--images-as-htmlAlways write image tags as raw html; preserves "height", "width" and "alt" if possible.
-g, --google-docConvert an html-exported Google Document
-d, --dash-unordered-listUse a dash rather than a star for unordered list items
-b BODY_WIDTH, --body-width=BODY_WIDTHNumber of characters per output line, 0 for no wrap
-i LIST_INDENT, --google-list-indent=LIST_INDENTNumber of pixels Google indents nested lists
-s, --hide-strikethroughHide strike-through text. only relevant when -g is specified as well
--escape-allEscape all special characters. Output is less readable, but avoids corner case formatting issues.
--bypass-tablesFormat tables in HTML rather than Markdown syntax.
--ignore-tablesIgnore table-related tags (table, th, td, tr) while keeping rows.
--single-line-breakUse a single line break after a block element rather than two.
--reference-linksUse reference links instead of inline links to create markdown

A complete list is available here

init.py

This is where everything comes together. This is the glue for all the things we have described above.

This file describes a single HTML2Text class which is itself a subclass of the HTMLParser in python

Upon initialization it sets various config variables necessary for processing the given html in a certain manner necessary to create valid markdown text. The class defines methods:

  • feed
  • handle
  • outtextf
  • close
  • handle_charref
  • handle_entityref
  • handle_starttag
  • handle_endtag
  • previousIndex
  • handle_emphasis
  • handle_tag
  • pbr
  • p
  • soft_br
  • o
  • handle_data
  • charref
  • entityref
  • google_nest_count
  • optwrap

Besides this there are 2 more methods defined:

OptionDescription
html2textcalls the HTML2Text class with .handle() method
unescapecalls the HTML2Text class with .unescape() method
  • html2text :calls the HTML2Text class with .handle() method
  • unescape :calls the HTML2Text class with .unescape() method

What they do is provide methods to make the HTML parser in python parse the HTML and convert to markdown.