How to Clean Up HTML from Google Docs, Word, and Notion
Reviewed by Jerry Croteau, Founder & Editor
Copy text from Google Docs and paste it into an HTML field. What could go wrong? Everything, apparently. The HTML that comes out of Google Docs, Microsoft Word, and Notion is some of the most bloated, redundant markup you'll ever encounter. A simple three-paragraph document can generate hundreds of lines of unnecessary code.
If you've ever pasted content from one of these tools into a CMS, email builder, or website editor and gotten bizarre formatting — random font changes, weird spacing, invisible styles overriding your design — this is why, and here's how to fix it.
What Each Tool Exports
Google Docs wraps the text of every paragraph in a <span> with inline styles specifying font family, font size, font weight, color, and sometimes background color, even when the text uses default formatting. A single sentence might have three nested <span> tags, each setting the same font to "Arial." The result is HTML that can be 5-10x larger than it needs to be.
Microsoft Word is worse. Word exports include proprietary mso- CSS properties (Microsoft Office-specific styles that no browser understands), conditional comments targeting specific Outlook versions, XML namespaces, and empty paragraphs (<p>&nbsp;</p>) used for spacing instead of margins. A one-page Word document can produce over 1,000 lines of HTML.
Notion is cleaner than the other two but still adds unnecessary wrapper <div> elements with class names that only make sense inside Notion's own renderer. It also exports custom data attributes and sometimes empty blocks that served as spacers in the Notion interface.
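To put a number on the bloat described above, one rough measure is the ratio of total markup length to visible text length. The sketch below uses Python's standard html.parser module; the name bloat_ratio and the sample strings are illustrative assumptions, not real export output.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes and ignores tags, attributes, and comments."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Note: the contents of <style> and <script> also arrive here.
        self.parts.append(data)


def bloat_ratio(html: str) -> float:
    """Return len(markup) / len(visible text); higher means more overhead."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    text = "".join(collector.parts).strip()
    return len(html) / len(text) if text else float("inf")


if __name__ == "__main__":
    clean = "<p>Hello</p>"
    styled = '<p><span style="font-family:Arial;font-size:11pt">Hello</span></p>'
    print(bloat_ratio(clean), bloat_ratio(styled))
```

Comparing the ratio of a raw export with that of hand-written markup for the same text gives a quick sense of how much of the file is formatting noise.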
The Cleanup Workflow
Step 1: Paste the raw export into the HTML Previewer. See what you're starting with. The preview shows you what the browser actually renders — often it looks fine visually, but the underlying code is a mess.
Step 2: Hit Prettify. This formats the code with proper indentation so you can actually read it. Now you can see all those redundant <span> wrappers and mso- properties.
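Step 3: Strip the junk. Use Find and Replace to remove the biggest offenders. Browsers generally have no find-and-replace for a textarea (Ctrl+H usually opens the history, and Cmd+H hides the window on macOS), so copy the code into a text editor if the tool itself does not offer one:
Google Docs cleanup: Remove all style="..." attributes to strip inline formatting back to defaults. If you need specific styles, add them back cleanly afterward. Search for style=" to find them; if your editor supports regular expressions, a pattern such as style="[^"]*" matches each attribute so you can delete them all in one pass.
Word cleanup: Delete everything from <!--[if through <![endif]--> inclusive. These are conditional comments aimed at Outlook and older Microsoft renderers; browsers ignore them, so they are dead weight. Remove any <o:p> and </o:p> tags (Office paragraph markers). Delete any CSS property starting with mso-.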
Notion cleanup: Remove class="..." attributes (Notion's classes have no meaning outside Notion). Remove empty <div> wrapper elements that just add nesting without function.
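The three cleanups above can also be scripted. The following is a minimal sketch using Python's re module; the function name strip_export_junk is hypothetical, and regular expressions are not a full HTML parser, so this assumes ordinary export markup and could misfire on text that happens to contain style= or class= literally.

```python
import re

# Quoted or unquoted attribute values (Word often uses single quotes
# for style and no quotes at all for class).
STYLE_ATTR = re.compile(r"""\sstyle=(?:"[^"]*"|'[^']*')""")
CLASS_ATTR = re.compile(r"""\sclass=(?:"[^"]*"|'[^']*'|[^\s>]+)""")
CONDITIONAL = re.compile(r"<!--\[if.*?<!\[endif\]-->", re.DOTALL)
OFFICE_P = re.compile(r"</?o:p>")


def strip_export_junk(html: str) -> str:
    """Remove conditional comments, <o:p> tags, and style/class attributes."""
    html = CONDITIONAL.sub("", html)  # Word: <!--[if ...]> ... <![endif]-->
    html = OFFICE_P.sub("", html)     # Word: Office paragraph markers
    html = STYLE_ATTR.sub("", html)   # Google Docs / Word: inline styles
    html = CLASS_ATTR.sub("", html)   # Notion / Word: tool-specific classes
    return html


if __name__ == "__main__":
    raw = (
        '<p class="c1"><span style="font-family:Arial">Hi</span>'
        "<!--[if gte mso 9]><xml>x</xml><![endif]--><o:p></o:p></p>"
    )
    print(strip_export_junk(raw))
```

Removing every style attribute also removes the mso- properties inside them, so no separate pass is needed for those in this sketch.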
Step 4: Simplify the structure. After stripping styles and classes, you'll often find deeply nested elements that serve no purpose. A paragraph that was <div><div><p><span>Hello</span></p></div></div> can become just <p>Hello</p>. Look for any element that has only one child and no attributes — it's probably unnecessary nesting.
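A rough way to automate this step is to re-emit the HTML while dropping every <span> and <div> that carries no attributes. The sketch below uses Python's standard html.parser; the class and function names are hypothetical. It is more aggressive than the one-child rule above: it assumes every attribute-less wrapper is disposable, so bare text inside a removed <div> would lose its line break.

```python
from html.parser import HTMLParser


class WrapperStripper(HTMLParser):
    """Re-emits HTML, omitting <span>/<div> tags that have no attributes."""

    UNWRAP = ("span", "div")

    def __init__(self):
        # Keep entities such as &amp; untouched instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        # Per tag: one flag per open element, True if its start tag was dropped.
        self.dropped = {tag: [] for tag in self.UNWRAP}

    def handle_starttag(self, tag, attrs):
        if tag in self.dropped:
            drop = not attrs
            self.dropped[tag].append(drop)
            if drop:
                return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.dropped and self.dropped[tag]:
            if self.dropped[tag].pop():
                return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def unwrap_bare_wrappers(html: str) -> str:
    parser = WrapperStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(unwrap_bare_wrappers("<div><div><p><span>Hello</span></p></div></div>"))
```

Wrappers that still have an id, class, or style are left alone, which is why this pass works best after the attributes have already been stripped.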
Step 5: Check the preview. After each round of cleanup, glance at the preview to make sure you haven't removed something that was actually needed. The live preview makes this safe — you can see instantly if your cleanup broke something.
Step 6: Minify. Once the HTML is clean, hit Minify to compress it. Check the character count in the stats bar — you'll typically see a 60-80% reduction from the original export.
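If you want to check the reduction figure yourself, the arithmetic is simple. The sketch below is an assumption-laden stand-in, not the Previewer's actual Minify: naive_minify only collapses runs of whitespace to a single space (which would alter <pre> or <textarea> content), and percent_reduction compares character counts.

```python
import re


def naive_minify(html: str) -> str:
    """Collapse every run of whitespace to one space; unsafe for <pre>."""
    return re.sub(r"\s+", " ", html).strip()


def percent_reduction(before: str, after: str) -> float:
    """Percentage of characters saved going from `before` to `after`."""
    if not before:
        return 0.0
    return 100 * (1 - len(after) / len(before))


if __name__ == "__main__":
    original = '<p class="c1">\n    <span style="font-family:Arial">Hello</span>\n</p>'
    cleaned = naive_minify("<p>\n    Hello\n</p>")
    print(cleaned, round(percent_reduction(original, cleaned), 1))
```

Measure against the raw export rather than the prettified version, since prettifying adds indentation that inflates the starting count.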
Before and After
Before (Google Docs export):
A typical paragraph from Google Docs looks something like this — a <p> tag with a class, containing a <span> with inline styles specifying font-size, font-family, color, and font-weight, all set to the default values. Three paragraphs of plain text can generate 30+ lines of HTML with over 1,000 characters of redundant style attributes.
After cleanup:
Those same three paragraphs become three simple <p> tags with just the text content, about 150 characters total. The content is unchanged; the text simply falls back to the default styles of the browser or the surrounding page instead of carrying its own hard-coded copy.
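The before and after can be illustrated with a single made-up paragraph. The BEFORE string below is a hand-written imitation of the pattern described, not a verbatim Google Docs export, and the class name c1 and the helper to_plain_paragraph are assumptions for the example.

```python
import re

# Hand-written imitation of an exported paragraph (not real export output).
BEFORE = (
    '<p class="c1"><span style="font-size:11pt;font-family:Arial;'
    'color:#000000;font-weight:400">First paragraph.</span></p>'
)


def to_plain_paragraph(html: str) -> str:
    """Drop class/style attributes, then remove the now-bare <span> tags."""
    html = re.sub(r'\s(?:class|style)="[^"]*"', "", html)
    return re.sub(r"</?span>", "", html)


AFTER = to_plain_paragraph(BEFORE)

if __name__ == "__main__":
    print(AFTER)
    print(len(BEFORE), "characters before,", len(AFTER), "after")
```

Every declaration in the sample style attribute restates a default, which is why removing it loses nothing.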
When to Keep Styles
Not all exported styles are junk. If the original document used intentional formatting — bold text, colored headings, specific font choices that differ from the default — you'll want to keep those. The trick is to strip everything first, check the preview, and then add back only the styles you actually need.
A useful approach: strip all inline styles, then add a single <style> block at the top with your own clean CSS rules. This separates content from presentation and makes the HTML much easier to maintain.
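As a sketch of that approach, the function below strips inline style attributes and prepends one <style> block. The name restyle and the CSS rules in CLEAN_CSS are placeholders chosen for illustration; substitute the rules your own design needs.

```python
import re

STYLE_ATTR = re.compile(r"""\sstyle=(?:"[^"]*"|'[^']*')""")

# Placeholder rules; these values are assumptions, not recommendations.
CLEAN_CSS = """\
p { line-height: 1.5; }
h2 { color: #1a4d8f; }
"""


def restyle(html: str, css: str = CLEAN_CSS) -> str:
    """Remove inline styles and put a single <style> block at the top."""
    stripped = STYLE_ATTR.sub("", html)
    return f"<style>\n{css}</style>\n{stripped}"


if __name__ == "__main__":
    print(restyle('<h2 style="color:#000">Title</h2><p style="margin:0">Body</p>'))
```

Keeping the rules in one place means a later design change touches a few lines of CSS instead of every paragraph.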
Preventing the Problem
If you regularly move content from Google Docs or Word into web contexts, consider these alternatives:
Paste as plain text first (Ctrl+Shift+V instead of Ctrl+V). This strips all formatting and gives you clean text that you can mark up yourself.
Write in Markdown. Google Docs and Notion both support Markdown export, which produces much cleaner output than HTML export. Convert the Markdown to HTML using any converter, and you'll get minimal, semantic markup.
Use the HTML Previewer as your QA step. Paste every export into the previewer before using it anywhere else. A 30-second check prevents hours of debugging weird formatting in production.
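For the plain-text route, marking the text up yourself can be as small as wrapping each blank-line-separated block in a <p> tag. This is a minimal sketch using Python's standard library; the function name text_to_paragraphs is hypothetical, and it assumes paragraphs are separated by blank lines and that headings or lists will be added by hand afterward.

```python
import re
from html import escape


def text_to_paragraphs(text: str) -> str:
    """Wrap each blank-line-separated block of plain text in <p> tags."""
    blocks = re.split(r"\n\s*\n", text.strip())
    paragraphs = []
    for block in blocks:
        # Join wrapped lines and escape characters such as < and &.
        content = escape(" ".join(block.split()))
        if content:
            paragraphs.append(f"<p>{content}</p>")
    return "\n".join(paragraphs)


if __name__ == "__main__":
    print(text_to_paragraphs("Hello <world>\n\nSecond\nline"))
```

Escaping matters here because pasted plain text can contain characters like < or & that would otherwise be read as markup.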