# Human friendly input/output in Python.
#
# Author: Peter Odding <peter@peterodding.com>
# Last Change: February 29, 2020
# URL: https://humanfriendly.readthedocs.io

"""Convert HTML with simple text formatting to text with ANSI escape sequences."""

# Standard library modules.
import re

# Modules included in our package.
from humanfriendly.compat import HTMLParser, StringIO, name2codepoint, unichr
from humanfriendly.text import compact_empty_lines
from humanfriendly.terminal import ANSI_COLOR_CODES, ANSI_RESET, ansi_style

# Public identifiers that require documentation.
__all__ = ('HTMLConverter', 'html_to_ansi')


def html_to_ansi(data, callback=None):
    """
    Convert HTML with simple text formatting to text with ANSI escape sequences.

    :param data: The HTML to convert (a string).
    :param callback: Optional callback to pass to :class:`HTMLConverter`.
    :returns: Text with ANSI escape sequences (a string).

    Please refer to the documentation of the :class:`HTMLConverter` class for
    details about the conversion process (like which tags are supported) and an
    example with a screenshot.
    """
    converter = HTMLConverter(callback=callback)
    return converter(data)


class HTMLConverter(HTMLParser):

    """
    Convert HTML with simple text formatting to text with ANSI escape sequences.

    The following text styles are supported:

    - Bold: ``<b>``, ``<strong>`` and ``<span style="font-weight: bold;">``
    - Italic: ``<i>``, ``<em>`` and ``<span style="font-style: italic;">``
    - Strike-through: ``<del>``, ``<s>`` and ``<span style="text-decoration: line-through;">``
    - Underline: ``<ins>``, ``<u>`` and ``<span style="text-decoration: underline">``

    Colors can be specified as follows:

    - Foreground color: ``<span style="color: #RRGGBB;">``
    - Background color: ``<span style="background-color: #RRGGBB;">``

    Here's a small demonstration:

    .. code-block:: python

       from humanfriendly.text import dedent
       from humanfriendly.terminal import html_to_ansi

       print(html_to_ansi(dedent('''
         <b>Hello world!</b>
         <i>Is this thing on?</i>
         I guess I can <u>underline</u> or <s>strike-through</s> text?
         And what about <span style="color: red">color</span>?
       ''')))

       rainbow_colors = [
           '#FF0000', '#E2571E', '#FF7F00', '#FFFF00', '#00FF00',
           '#96BF33', '#0000FF', '#4B0082', '#8B00FF', '#FFFFFF',
       ]
       html_rainbow = "".join('<span style="color: %s">o</span>' % c for c in rainbow_colors)
       print(html_to_ansi("Let's try a rainbow: %s" % html_rainbow))

    Here's what the results look like:

    .. image:: images/html-to-ansi.png

    Some more details:

    - Nested tags are supported, within reasonable limits.

    - Text in ``<code>`` and ``<pre>`` tags will be highlighted in a
      different color from the main text (currently this is yellow).

    - ``<a href="URL">TEXT</a>`` is converted to the format "TEXT (URL)" where
      the uppercase symbols are highlighted in light blue with an underline.

    - ``<div>``, ``<p>`` and ``<pre>`` tags are considered block level tags
      and are wrapped in vertical whitespace to prevent their content from
      "running into" surrounding text. This may cause runs of multiple empty
      lines to be emitted. As a *workaround* the :func:`__call__()` method
      will automatically call :func:`.compact_empty_lines()` on the generated
      output before returning it to the caller. Of course this won't work
      when `output` is set to something like :data:`sys.stdout`.

    - ``<br>`` is converted to a single plain text line break.

    Implementation notes:

    - A list of dictionaries with style information is used as a stack where
      new styling can be pushed and a pop will restore the previous styling.
      When new styling is pushed, it is merged with (but overrides) the current
      styling.

    - If you're going to be converting a lot of HTML it might be useful from
      a performance standpoint to re-use an existing :class:`HTMLConverter`
      object for unrelated HTML fragments, in this case take a look at the
      :func:`__call__()` method (it makes this use case very easy).

    .. versionadded:: 4.15
       :class:`humanfriendly.terminal.HTMLConverter` was added to the
       `humanfriendly` package during the initial development of my new
       `chat-archive <https://chat-archive.readthedocs.io/>`_ project, whose
       command line interface makes for a great demonstration of the
       flexibility that this feature provides (hint: check out how the search
       keyword highlighting combines with the regular highlighting).
    """

    BLOCK_TAGS = ('div', 'p', 'pre')
    """The names of tags that are padded with vertical whitespace."""

    def __init__(self, *args, **kw):
        """
        Initialize an :class:`HTMLConverter` object.

        :param callback: Optional keyword argument to specify a function that
                         will be called to process text fragments before they
                         are emitted on the output stream. Note that link text
                         and preformatted text fragments are not processed by
                         this callback.
        :param output: Optional keyword argument to redirect the output to the
                       given file-like object. If this is not given a new
                       :class:`~python3:io.StringIO` object is created.
        """
        # Hide our optional keyword arguments from the superclass.
        self.callback = kw.pop("callback", None)
        self.output = kw.pop("output", None)
        # Initialize the superclass.
        HTMLParser.__init__(self, *args, **kw)

    def __call__(self, data):
        """
        Reset the parser, convert some HTML and get the text with ANSI escape sequences.

        :param data: The HTML to convert to text (a string).
        :returns: The converted text (only in case `output` is
                  a :class:`~python3:io.StringIO` object).
        """
        self.reset()
        self.feed(data)
        self.close()
        if isinstance(self.output, StringIO):
            return compact_empty_lines(self.output.getvalue())

    @property
    def current_style(self):
        """Get the current style from the top of the stack (a dictionary)."""
        return self.stack[-1] if self.stack else {}

    def close(self):
        """
        Close previously opened ANSI escape sequences.

        This method overrides the same method in the superclass to ensure that
        an :data:`.ANSI_RESET` code is emitted when parsing reaches the end of
        the input but a style is still active. This is intended to prevent
        malformed HTML from messing up terminal output.
        """
        if any(self.stack):
            self.output.write(ANSI_RESET)
            self.stack = []
        HTMLParser.close(self)

    def emit_style(self, style=None):
        """
        Emit an ANSI escape sequence for the given or current style to the output stream.

        :param style: A dictionary with arguments for :func:`.ansi_style()` or
                      :data:`None`, in which case the style at the top of the
                      stack is emitted.
        """
        # Clear the current text styles.
        self.output.write(ANSI_RESET)
        # Apply a new text style?
        style = self.current_style if style is None else style
        if style:
            self.output.write(ansi_style(**style))

    def handle_charref(self, value):
        """
        Process a decimal or hexadecimal numeric character reference.

        :param value: The decimal or hexadecimal value (a string).
        """
        self.output.write(unichr(int(value[1:], 16) if value.startswith('x') else int(value)))

    def handle_data(self, data):
        """
        Process textual data.

        :param data: The decoded text (a string).
        """
        if self.link_url:
            # Link text is captured literally so that we can reliably check
            # whether the text and the URL of the link are the same string.
            self.link_text = data
        elif self.callback and self.preformatted_text_level == 0:
            # Text that is not part of a link and not preformatted text is
            # passed to the user defined callback to allow for arbitrary
            # pre-processing.
            data = self.callback(data)
        # All text is emitted unmodified on the output stream.
        self.output.write(data)

    def handle_endtag(self, tag):
        """
        Process the end of an HTML tag.

        :param tag: The name of the tag (a string).
        """
        if tag in ('a', 'b', 'code', 'del', 'em', 'i', 'ins', 'pre', 's', 'strong', 'span', 'u'):
            old_style = self.current_style
            # The following conditional isn't necessary for well formed
            # HTML but prevents raising exceptions on malformed HTML.
            if self.stack:
                self.stack.pop(-1)
            new_style = self.current_style
            if tag == 'a':
                if self.urls_match(self.link_text, self.link_url):
                    # Don't render the URL when it's part of the link text.
                    self.emit_style(new_style)
                else:
                    self.emit_style(new_style)
                    self.output.write(' (')
                    self.emit_style(old_style)
                    self.output.write(self.render_url(self.link_url))
                    self.emit_style(new_style)
                    self.output.write(')')
            else:
                self.emit_style(new_style)
            if tag in ('code', 'pre'):
                self.preformatted_text_level -= 1
        if tag in self.BLOCK_TAGS:
            # Emit an empty line after block level tags.
            self.output.write('\n\n')

    def handle_entityref(self, name):
        """
        Process a named character reference.

        :param name: The name of the character reference (a string).
        """
        self.output.write(unichr(name2codepoint[name]))

    def handle_starttag(self, tag, attrs):
        """
        Process the start of an HTML tag.

        :param tag: The name of the tag (a string).
        :param attrs: A list of tuples with two strings each.
        """
        if tag in self.BLOCK_TAGS:
            # Emit an empty line before block level tags.
            self.output.write('\n\n')
        if tag == 'a':
            self.push_styles(color='blue', bright=True, underline=True)
            # Store the URL that the link points to for later use, so that we
            # can render the link text before the URL (with the reasoning that
            # this is the most intuitive way to present a link in a plain text
            # interface).
            self.link_url = next((v for n, v in attrs if n == 'href'), '')
        elif tag == 'b' or tag == 'strong':
            self.push_styles(bold=True)
        elif tag == 'br':
            self.output.write('\n')
        elif tag == 'code' or tag == 'pre':
            self.push_styles(color='yellow')
            self.preformatted_text_level += 1
        elif tag == 'del' or tag == 's':
            self.push_styles(strike_through=True)
        elif tag == 'em' or tag == 'i':
            self.push_styles(italic=True)
        elif tag == 'ins' or tag == 'u':
            self.push_styles(underline=True)
        elif tag == 'span':
            styles = {}
            css = next((v for n, v in attrs if n == 'style'), "")
            for rule in css.split(';'):
                name, _, value = rule.partition(':')
                name = name.strip()
                value = value.strip()
                if name == 'background-color':
                    styles['background'] = self.parse_color(value)
                elif name == 'color':
                    styles['color'] = self.parse_color(value)
                elif name == 'font-style' and value == 'italic':
                    styles['italic'] = True
                elif name == 'font-weight' and value == 'bold':
                    styles['bold'] = True
                elif name == 'text-decoration' and value == 'line-through':
                    styles['strike_through'] = True
                elif name == 'text-decoration' and value == 'underline':
                    styles['underline'] = True
            self.push_styles(**styles)

    def normalize_url(self, url):
        """
        Normalize a URL to enable string equality comparison.

        :param url: The URL to normalize (a string).
        :returns: The normalized URL (a string).
        """
        return re.sub('^mailto:', '', url)

    def parse_color(self, value):
        """
        Convert a CSS color to something that :func:`.ansi_style()` understands.

        :param value: A string like ``rgb(1,2,3)``, ``#AABBCC`` or ``yellow``.
        :returns: A color value supported by :func:`.ansi_style()` or :data:`None`.
        """
        # Parse an 'rgb(N,N,N)' expression.
        if value.startswith('rgb'):
            tokens = re.findall(r'\d+', value)
            if len(tokens) == 3:
                return tuple(map(int, tokens))
        # Parse an '#XXXXXX' expression.
        elif value.startswith('#'):
            value = value[1:]
            length = len(value)
            if length == 6:
                # Six hex digits (proper notation).
                return (
                    int(value[:2], 16),
                    int(value[2:4], 16),
                    int(value[4:6], 16),
                )
            elif length == 3:
                # Three hex digits (shorthand). In CSS each digit stands for
                # a doubled pair, so '#ABC' means '#AABBCC'; the digits are
                # doubled here to get channel values in the 0-255 range.
                return (
                    int(value[0] * 2, 16),
                    int(value[1] * 2, 16),
                    int(value[2] * 2, 16),
                )
        # Try to recognize a named color.
        value = value.lower()
        if value in ANSI_COLOR_CODES:
            return value

    def push_styles(self, **changes):
        """
        Push new style information onto the stack.

        :param changes: Any keyword arguments are passed on to :func:`.ansi_style()`.

        This method is a helper for :func:`handle_starttag()`
        that does the following:

        1. Make a copy of the current styles (from the top of the stack),
        2. Apply the given `changes` to the copy of the current styles,
        3. Add the new styles to the stack,
        4. Emit the appropriate ANSI escape sequence to the output stream.
        """
        prototype = self.current_style
        if prototype:
            new_style = dict(prototype)
            new_style.update(changes)
        else:
            new_style = changes
        self.stack.append(new_style)
        self.emit_style(new_style)

    def render_url(self, url):
        """
        Prepare a URL for rendering on the terminal.

        :param url: The URL to simplify (a string).
        :returns: The simplified URL (a string).

        This method pre-processes a URL before rendering on the terminal. The
        following modifications are made:

        - The ``mailto:`` prefix is stripped.
        - Spaces are converted to ``%20``.
        - A trailing parenthesis is converted to ``%29``.
        """
        url = re.sub('^mailto:', '', url)
        url = re.sub(' ', '%20', url)
        url = re.sub(r'\)$', '%29', url)
        return url

    def reset(self):
        """
        Reset the state of the HTML parser and ANSI converter.

        When `output` is a :class:`~python3:io.StringIO` object a new
        instance will be created (and the old one garbage collected).
        """
        # Reset the state of the superclass.
        HTMLParser.reset(self)
        # Reset our instance variables.
        self.link_text = None
        self.link_url = None
        self.preformatted_text_level = 0
        if self.output is None or isinstance(self.output, StringIO):
            # If the caller specified something like output=sys.stdout then it
            # doesn't make much sense to negate that choice here in reset().
            self.output = StringIO()
        self.stack = []

    def urls_match(self, a, b):
        """
        Compare two URLs for equality using :func:`normalize_url()`.

        :param a: A string containing a URL.
        :param b: A string containing a URL.
        :returns: :data:`True` if the URLs are the same, :data:`False` otherwise.

        This method is used by :func:`handle_endtag()` to omit the URL of a
        hyperlink (``<a href="...">``) when the link text is that same URL.
        """
        return self.normalize_url(a) == self.normalize_url(b)
  348. :param a: A string containing a URL.
  349. :param b: A string containing a URL.
  350. :returns: :data:`True` if the URLs are the same, :data:`False` otherwise.
  351. This method is used by :func:`handle_endtag()` to omit the URL of a
  352. hyperlink (``<a href="...">``) when the link text is that same URL.
  353. """
  354. return self.normalize_url(a) == self.normalize_url(b)