Usage
The module is simple enough to use. This tutorial will get you started.
Installing
The module can be installed in either of the following ways:
PIP
For those who have pip, we got your back.
$ pip install html2text

Clone from Git Repository
Clone the repository from https://github.com/Alir3z4/html2text
$ git clone --depth 50 https://github.com/Alir3z4/html2text.git
$ cd html2text
$ python -m build -nwx
$ python -m pip install --upgrade ./dist/*.whl

Basic Usage
Once installed, the module can be used as follows.
import html2text
html = function_to_get_some_html()
text = html2text.html2text(html)
print(text)

This converts the provided HTML to Markdown text with all options set to their defaults.
Using Options
To customize the conversion, create an HTML2Text instance, set the options you need as attributes, and call its handle() method:
import html2text
text_maker = html2text.HTML2Text()
text_maker.ignore_links = True
text_maker.bypass_tables = False
html = function_to_get_some_html()
text = text_maker.handle(html)
print(text)

Available Options
All options are defined in the config.py file. They are listed here with a brief description of each.
- `UNICODE_SNOB` for using unicode
- `ESCAPE_SNOB` for escaping every special character
- `LINKS_EACH_PARAGRAPH` for putting links after every paragraph
- `BODY_WIDTH` for wrapping long lines
- `SKIP_INTERNAL_LINKS` to skip #local-anchor things
- `INLINE_LINKS` for formatting images and links
- `PROTECT_LINKS` to protect links from line breaks
- `GOOGLE_LIST_INDENT` number of pixels to indent nested lists
- `IGNORE_ANCHORS`
- `IGNORE_IMAGES`
- `IMAGES_AS_HTML` always generate HTML tags for images; preserves `height`, `width`, `alt` if possible.
- `IMAGES_TO_ALT`
- `IMAGES_WITH_SIZE`
- `IGNORE_EMPHASIS`
- `BYPASS_TABLES` format tables in HTML rather than Markdown
- `IGNORE_TABLES` ignore table-related tags (`table`, `th`, `td`, `tr`) while keeping rows
- `SINGLE_LINE_BREAK` to use a single line break rather than two
- `UNIFIABLE` is a dictionary which maps unicode abbreviations to ASCII values
- `RE_SPACE` for finding space-only lines
- `RE_ORDERED_LIST_MATCHER` for matching ordered lists in Markdown
- `RE_UNORDERED_LIST_MATCHER` for matching unordered lists in Markdown
- `RE_MD_CHARS_MATCHER` for matching the Markdown characters `\`, `[`, `]`, `(` and `)`
- `RE_MD_CHARS_MATCHER_ALL` for matching `` ` ``, `*`, `_`, `{`, `}`, `[`, `]`, `(`, `)`, `#`, `!`
- `RE_MD_DOT_MATCHER` for matching lines starting with `1.`
- `RE_MD_PLUS_MATCHER` for matching lines starting with `+`
- `RE_MD_DASH_MATCHER` for matching lines starting with `-`
- `RE_SLASH_CHARS` a string of slash escapeable characters
- `RE_MD_BACKSLASH_MATCHER` to match `\char`
- `USE_AUTOMATIC_LINKS` to convert `<a href='http://xyz'>http://xyz</a>` to `<http://xyz>`
- `MARK_CODE` to wrap 'pre' blocks with [code]...[/code] tags
- `WRAP_LINKS` to decide if links have to be wrapped during text wrapping (implies `INLINE_LINKS = False`)
- `WRAP_LIST_ITEMS` to decide if list items have to be wrapped during text wrapping
- `WRAP_TABLES` to decide if tables have to be wrapped during text wrapping
- `DECODE_ERRORS` to handle decoding errors. 'strict', 'ignore', 'replace' are the acceptable values.
- `DEFAULT_IMAGE_ALT` takes a string as value and is used whenever an image tag is missing an `alt` value. The default for this is an empty string '' to avoid backward breakage
- `OPEN_QUOTE` is the character used to open a quote when replacing the `<q>` tag. It defaults to `"`.
- `CLOSE_QUOTE` is the character used to close a quote when replacing the `<q>` tag. It defaults to `"`.
Options that are not in the config.py file:
- `emphasis_mark` is the character used when replacing the `<em>` tag. It defaults to `_`.
- `strong_mark` is the character used when replacing the `<strong>` tag. It defaults to `**`.
To alter any option, create a parser with `parser = html2text.HTML2Text()` and set the option on the parser. For example, `parser.unicode_snob = True` sets the UNICODE_SNOB option.
Command line options
| Option | Description |
|---|---|
| --version | Show program version number and exit |
| -h, --help | Show this help message and exit |
| --ignore-links | Do not include any formatting for links |
| --protect-links | Protect links from line breaks by surrounding them with angle brackets |
| --ignore-images | Do not include any formatting for images |
| --images-as-html | Always write image tags as raw HTML; preserves "height", "width" and "alt" if possible. |
| --images-to-alt | Discard image data, only keep alt text |
| --images-with-size | Write image tags with height and width attrs as raw HTML to retain dimensions |
| -g, --google-doc | Convert an HTML-exported Google Document |
| -d, --dash-unordered-list | Use a dash rather than a star for unordered list items |
| -b BODY_WIDTH, --body-width=BODY_WIDTH | Number of characters per output line, 0 for no wrap |
| -i LIST_INDENT, --google-list-indent=LIST_INDENT | Number of pixels Google indents nested lists |
| -s, --hide-strikethrough | Hide strike-through text. Only relevant when -g is specified as well |
| --escape-all | Escape all special characters. Output is less readable, but avoids corner case formatting issues. |
| --bypass-tables | Format tables in HTML rather than Markdown syntax. |
| --ignore-tables | Ignore table-related tags (table, th, td, tr) while keeping rows. |
| --single-line-break | Use a single line break after a block element rather than two. |
| --reference-links | Use reference links instead of inline links to create Markdown |
| --ignore-emphasis | Ignore all emphasis formatting in the HTML. |
| --include-sup-sub | Include `<sub>` and `<sup>` tags. |
| -e, --asterisk-emphasis | Use asterisk rather than underscore to emphasize text |
| --unicode-snob | Use unicode throughout instead of ASCII |
| --no-automatic-links | Do not use automatic links like https://www.google.com/ |
| --no-skip-internal-links | Turn off skipping of internal links |
| --links-after-para | Put the links after the paragraph and not at end of document |
| --mark-code | Mark code with [code]...[/code] blocks |
| --no-wrap-links | Do not wrap links during text wrapping. Implies --reference-links |
| --wrap-list-items | Wrap list items during text wrapping. |
| --wrap-tables | Wrap tables during text wrapping. |
| --decode-errors=HANDLER | What to do in case a decoding error is encountered: ignore, strict, replace, etc. |
| --pad-tables | Use padding to make tables look good. |
| --default-image-alt=Image_Here | Inserts the given alt text whenever images are missing alt values. |
| --open-quote=" | Inserts the given text when opening a quote. Defaults to ". |
| --close-quote=" | Inserts the given text when closing a quote. Defaults to ". |
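
The command line options can also be driven from a script. The sketch below is one possible approach, not the only way to invoke the tool. It rests on two assumptions to check against your installed version: that the package can be run as `python -m html2text`, and that it reads HTML from standard input when no file name is given. The helper `run_cli` is a hypothetical name introduced here for illustration.

```python
import subprocess
import sys


def run_cli(html, *options):
    """Pipe an HTML string through the html2text command line.

    Assumes `python -m html2text` is available and reads from stdin
    when no input file is named.
    """
    completed = subprocess.run(
        [sys.executable, "-m", "html2text", *options],
        input=html.encode("utf-8"),
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return completed.stdout.decode("utf-8")


# Illustrative input; the URL is a placeholder.
html = "<p>See the <a href='https://example.com/docs'>documentation</a>.</p>"

# Same effect as setting ignore_links and body_width on a parser.
plain = run_cli(html, "--ignore-links", "--body-width=0")
print(plain)
```

Inside a Python program, setting the attributes on an `HTML2Text` instance is usually simpler than spawning a process. The subprocess route is mainly useful for checking how a given combination of flags behaves.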