Re: Tool to Analyze Text for Possible Snippets

Subject: Re: Tool to Analyze Text for Possible Snippets
From: "Peter Neilson" <neilson -at- windstream -dot- net>
To: techwr-l -at- lists -dot- techwr-l -dot- com
Date: Thu, 12 Apr 2018 17:53:51 -0400

Jobs like this are often easily handled by the software tools within Unix or Linux, or by clever use of emacs macros. For your purposes, though, the time involved in learning sed, grep, awk, and such tools, or (even worse) the time to become a good emacs hacker, would be a roadblock. I might suggest that you find a friendly local hacker (the original white-hat meaning) who knows how to use those tools.

Your hacker will probably say, "Export everything to .txt files and I'll work on them."
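For the specific counting job described below, the Unix toolchain really is a one-liner. This is a minimal sketch, assuming the Word documents have been exported to a plain-text file (here hypothetically named sentences.txt) with one sentence per line:

```shell
# Sample input: one sentence per line, as exported from Word to .txt.
printf '%s\n' 'Configure Secure Hub' 'Configure Secure Hub' \
  'Create and confirm a 4-digit Citrix PIN.' > sentences.txt

# Group identical lines together, count each group, and list the
# most frequently repeated sentences first.
sort sentences.txt | uniq -c | sort -rn
```

The output prefixes each distinct sentence with its occurrence count, so repeated headings like "Configure Secure Hub" float to the top of the list with their totals.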

On Thu, 12 Apr 2018 15:59:42 -0400, Paul Hanson <twer_lists_all -at- hotmail -dot- com> wrote:


I am looking at 8 different Word documents. The end game for these documents is to import them into my HAT (RoboHelp 2015) and maintain them in HTML. No problem - I know how to do all that.

What I want to pick your brains about is how to determine the frequency of the duplicated text. I know there is duplicate text across the documents because I took the 8 Word documents, inserted each into a single Word document, stripped out the graphics, and sorted the paragraphs.

I ended up with 280 sentences.

Sure, I can visually scan the list and find a sentence like this - "Create and confirm a 4-digit Citrix PIN." - and see that it exists twice. I know I could paste the list of 280 sentences into Excel and remove the rows that are duplicated - that's NOT what I'm looking for.

Instead, I'm looking for something close to this site, BUT I want to know how many times a sentence exists. For example, I pasted in the 280 sentences and the site came back with this information:
Some top phrases containing 8 words (without punctuation marks)    Occurrences
configure secure hub configure secure hub configure secure         4
However, that text is the following text:
Configure Secure Hub
Configure Secure Hub
Configure Secure Hub
Configure Secure Hub
Configure Secure Hub
Configure Secure Hub
So what I want to do is paste in the 280 sentences and get a report that "Configure Secure Hub" exists 6 times in the list of 280.

Have you found an easy way to do this?

The next step, after I figure out how to get the list of duplicated text, is to generate .hts files (snippet files that RoboHelp recognizes) so that I can analyze the text outside of RoboHelp, create the .hts files, import the snippets into RoboHelp, and then run find-and-replace actions to replace "Configure Secure Hub" with a reference to the snippet that stores the "Configure Secure Hub" text.

I know how to create the snippet file, using a DOS command to "Copy [template.hts file] [name of snippet file]", but I have yet to figure out how to get the actual text I want to store in the snippet INTO the snippet without manually pasting the text - Configure Secure Hub - into it... but that comes after I figure out how to analyze the text automatically so I know that "Configure Secure Hub" is repeated 6 times in the 280 sentences.
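Getting the text into the snippet without manual pasting is also a good fit for the stream tools mentioned above. The sketch below is purely illustrative: it assumes a template.hts containing a placeholder token BODY_TEXT where the snippet text belongs (the real RoboHelp .hts markup will differ, so the template contents and placeholder name here are hypothetical):

```shell
# Stand-in template; a real RoboHelp .hts file has its own structure.
printf '<p>BODY_TEXT</p>\n' > template.hts

# Substitute the snippet text into a copy of the template, replacing
# the "Copy template.hts snippet.hts" step plus the manual paste.
text='Configure Secure Hub'
sed "s/BODY_TEXT/$text/" template.hts > configure_secure_hub.hts
```

Run in a loop over the list of repeated sentences from the frequency report, this would stamp out one .hts file per snippet.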




