Introduction
The code has five components, each kept as a separate file in the html2text directory. This part of the documentation explains them one by one.
compat.py
This file exists only to handle compatibility with the available Python standard libraries. Python 3 relocated some libraries, so this file makes sure that the rest of the code sees a common interface.
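A common way to build such a shim is to try the Python 3 import location first and fall back to the Python 2 one. The sketch below illustrates that pattern only; the modules chosen here are assumptions for the sake of the example and may not match what compat.py actually imports.

```python
# Illustrative compatibility shim. The modules below are assumed examples,
# not necessarily the ones compat.py handles.
try:
    # Python 3 locations
    import urllib.parse as urlparse
    import html.entities as htmlentitydefs
    import html.parser as HTMLParser
except ImportError:
    # Python 2 locations
    import urlparse
    import htmlentitydefs
    import HTMLParser

# Callers can now use one set of names on either interpreter.
print(urlparse.urljoin("http://example.com/a/", "b.html"))
print(htmlentitydefs.name2codepoint["amp"])
```

The rest of the code can then import these names from the shim instead of repeating the try/except in every file.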
config.py
Used to provide various configuration settings to the converter. They are as follows:
- `UNICODE_SNOB` for using unicode
- `ESCAPE_SNOB` for escaping every special character
- `LINKS_EACH_PARAGRAPH` for putting links after every paragraph
- `BODY_WIDTH` for wrapping long lines
- `SKIP_INTERNAL_LINKS` to skip #local-anchor things
- `INLINE_LINKS` for formatting images and links
- `PROTECT_LINKS` to protect links from line breaks
- `GOOGLE_LIST_INDENT` number of pixels to indent nested lists
- `IGNORE_ANCHORS`
- `IGNORE_IMAGES`
- `IMAGES_AS_HTML` always generate HTML tags for images; preserves `height`, `width`, `alt` if possible
- `IMAGES_TO_ALT`
- `IMAGES_WITH_SIZE`
- `IGNORE_EMPHASIS`
- `BYPASS_TABLES` format tables in HTML rather than Markdown
- `IGNORE_TABLES` ignore table-related tags (`table`, `th`, `td`, `tr`) while keeping rows
- `SINGLE_LINE_BREAK` to use a single line break rather than two
- `UNIFIABLE` is a dictionary which maps unicode abbreviations to ASCII values
- `RE_SPACE` for finding space-only lines
- `RE_ORDERED_LIST_MATCHER` for matching ordered lists in MD
- `RE_UNORDERED_LIST_MATCHER` for matching unordered lists in MD
- `RE_MD_CHARS_MATCHER` for matching Md `\`, `[`, `]`, `(` and `)`
- `RE_MD_CHARS_MATCHER_ALL` for matching `` ` ``, `*`, `_`, `{`, `}`, `[`, `]`, `(`, `)`, `#`, `!`
- `RE_MD_DOT_MATCHER` for matching lines starting with `1.`
- `RE_MD_PLUS_MATCHER` for matching lines starting with `+`
- `RE_MD_DASH_MATCHER` for matching lines starting with (`-`)
- `RE_SLASH_CHARS` a string of slash escapeable characters
- `RE_MD_BACKSLASH_MATCHER` to match `\char`
- `USE_AUTOMATIC_LINKS` to convert `<a href='http://xyz'>http://xyz</a>` to `<http://xyz>`
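To make the role of the matcher constants more concrete, here is a minimal sketch of how a pattern in the spirit of `RE_MD_DOT_MATCHER` could be used to escape a line that Markdown would otherwise read as an ordered list item. The regular expression and the helper `escape_leading_number` are assumptions written for this illustration; they are not the definitions found in config.py.

```python
import re

# Assumed pattern: optional indentation, digits, a dot, then whitespace or end of line.
DOT_MATCHER = re.compile(r"^(\s*\d+)(\.)(?=\s|$)", re.MULTILINE)


def escape_leading_number(text):
    """Backslash-escape the dot in lines that start like '1. ' (hypothetical helper)."""
    return DOT_MATCHER.sub(r"\1\\\2", text)


print(escape_leading_number("1. not a list item"))
print(escape_leading_number("3.14 is left alone"))
```

The lookahead keeps decimal numbers such as `3.14` untouched, since only a dot followed by whitespace would start a list item.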
utils.py
Used to provide utility functions to html2text. Some of the functions are:
| Function | Description |
|---|---|
| name2cp | name to code point |
| hn | headings |
| dumb_property_dict | hash of css attrs |
| dumb_css_parser | returns a hash of css selectors, each containing a hash of css attrs |
| element_style | hash of final style of element |
| google_list_style | find out whether a list is ordered or unordered |
| google_has_height | does element have height? |
| google_text_emphasis | a list of all emphasis modifiers |
| google_fixed_width_font | check for fixed width font |
| list_numbering_start | extract numbering from list elem attrs |
| skipwrap | skip wrap for the given para or not? |
| escape_md | escape md sensitive within other md |
| escape_md_section | escape md sensitive across whole doc |
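As an illustration of what "hash of css attrs" means in the table above, the sketch below parses an inline style string into a dictionary. It is a simplified stand-in written for this document: the function name `parse_style` is hypothetical, and the real `dumb_property_dict` may differ in its details.

```python
def parse_style(style):
    """Turn 'a: b; c: d' into {'a': 'b', 'c': 'd'}, lowercased (hypothetical helper)."""
    properties = {}
    for declaration in style.split(";"):
        if ":" not in declaration:
            continue  # skip empty or malformed declarations
        name, value = declaration.split(":", 1)
        properties[name.strip().lower()] = value.strip().lower()
    return properties


print(parse_style("Font-Weight: bold; margin-left: 36pt;"))
```

A dictionary like this is what makes checks such as "is this element bold?" or "how far is this list indented?" a simple key lookup.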
cli.py
Command line interface for the code.
| Option | Description |
|---|---|
| --version | Show program version number and exit |
| -h, --help | Show this help message and exit |
| --ignore-links | Do not include any formatting for links |
| --protect-links | Protect links from line breaks by surrounding them with angle brackets |
| --ignore-images | Do not include any formatting for images |
| --images-to-alt | Discard image data, only keep alt text |
| --images-with-size | Write image tags with height and width attrs as raw html to retain dimensions |
| --images-as-html | Always write image tags as raw html; preserves "height", "width" and "alt" if possible. |
| -g, --google-doc | Convert an html-exported Google Document |
| -d, --dash-unordered-list | Use a dash rather than a star for unordered list items |
| -b BODY_WIDTH, --body-width=BODY_WIDTH | Number of characters per output line, 0 for no wrap |
| -i LIST_INDENT, --google-list-indent=LIST_INDENT | Number of pixels Google indents nested lists |
| -s, --hide-strikethrough | Hide strike-through text. Only relevant when -g is specified as well |
| --escape-all | Escape all special characters. Output is less readable, but avoids corner case formatting issues. |
| --bypass-tables | Format tables in HTML rather than Markdown syntax. |
| --ignore-tables | Ignore table-related tags (table, th, td, tr) while keeping rows. |
| --single-line-break | Use a single line break after a block element rather than two. |
| --reference-links | Use reference links instead of inline links to create markdown |
The table above is not exhaustive; running the command with -h or --help prints the complete list.
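The sketch below shows how a handful of these options could be declared with Python's `argparse` module. It is only an illustration: the choice of `argparse`, the attribute names, and the default body width of 78 are assumptions, and the real cli.py may be organized differently.

```python
import argparse


def build_parser():
    """Declare a small, assumed subset of the options listed in the table."""
    parser = argparse.ArgumentParser(prog="html2text")
    parser.add_argument("--ignore-links", action="store_true",
                        help="do not include any formatting for links")
    parser.add_argument("-d", "--dash-unordered-list", action="store_true",
                        help="use a dash rather than a star for unordered list items")
    parser.add_argument("-b", "--body-width", type=int, default=78,
                        help="number of characters per output line, 0 for no wrap")
    parser.add_argument("--single-line-break", action="store_true",
                        help="use a single line break after a block element")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is repeatable.
args = build_parser().parse_args(["--ignore-links", "-b", "0"])
print(args.ignore_links, args.body_width, args.dash_unordered_list)
```

Each parsed value would then be copied onto the matching configuration setting before the conversion runs.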
__init__.py
This is where everything comes together: it is the glue for all the pieces described above.
This file defines a single HTML2Text class, which is itself a subclass of Python's HTMLParser.
Upon initialization it sets the config variables that control how the given HTML is processed into valid Markdown text. The class defines the following methods:
- feed
- handle
- outtextf
- close
- handle_charref
- handle_entityref
- handle_starttag
- handle_endtag
- previousIndex
- handle_emphasis
- handle_tag
- pbr
- p
- soft_br
- o
- handle_data
- charref
- entityref
- google_nest_count
- optwrap
Besides the class, two more functions are defined:
| Function | Description |
|---|---|
| html2text | calls the HTML2Text class with .handle() method |
| unescape | calls the HTML2Text class with .unescape() method |
Together, the class and these functions provide the methods that make Python's HTML parser parse the HTML and convert it to Markdown.
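The overall shape of this design can be sketched with a toy parser. The example below is not the HTML2Text implementation: the class `TinyMarkdown` and the function `tiny_html2text` are hypothetical names, and only `<em>`, `<strong>` and `<p>` are handled. It is meant to show how an HTMLParser subclass collects output in its `handle_*` callbacks and how a module-level function wraps the class.

```python
from html.parser import HTMLParser


class TinyMarkdown(HTMLParser):
    """Toy converter handling only <em>, <strong> and <p> (hypothetical class)."""

    MARKERS = {"em": "_", "strong": "**"}

    def __init__(self):
        super().__init__()
        self.out = []  # pieces of output text, joined at the end

    def handle_starttag(self, tag, attrs):
        if tag in self.MARKERS:
            self.out.append(self.MARKERS[tag])

    def handle_endtag(self, tag):
        if tag in self.MARKERS:
            self.out.append(self.MARKERS[tag])
        elif tag == "p":
            self.out.append("\n\n")  # blank line after a paragraph

    def handle_data(self, data):
        self.out.append(data)

    def handle(self, html):
        """Feed the markup, flush the parser, and return the collected text."""
        self.feed(html)
        self.close()
        return "".join(self.out).strip()


def tiny_html2text(html):
    """Module-level convenience wrapper around the class (hypothetical)."""
    return TinyMarkdown().handle(html)


print(tiny_html2text("<p>Hello <em>big</em> <strong>world</strong></p>"))
# prints: Hello _big_ **world**
```

The real class follows the same callback structure but tracks far more state, such as list nesting, link references, and line wrapping.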