How To Ask a Software Question on Aardvark

Aardvark was a network of users who answered each other's questions. I joined Aardvark on May 8, 2009, and finally received my first question yesterday.

I am blogging about it because:

  1. the question was ambiguous
  2. it is a common problem
  3. I solved it with one line of code!

In the topic of “Regular Expressions,” Stephan, a fellow on the other side of the world, asked:

Looking for a regular expression which will remove all the <img> tags from a string. The string is a HTML document.

Stephan provided an example like this:

<td><img alt="" src="w" style="width: 20px; height: 1px;"/></td>
<td><img alt="" src="x" style="width: 1px; height: 1px;"/></td>
<td><img alt="" src="y" style="width: 20px; height: 1px;"/></td>
<td><img alt="" src="z" style="width: 1px; height: 1px;"/></td>

Let me explain why this question is hard to answer.

Regular Expressions

A regular expression is a pattern of characters used to search for a set of strings. A simple example is:

[hds]ad

This would match “had,” “dad,” and “sad.” Regular expressions are common, but they are implemented differently everywhere. For example, my favorite text editor, Vim, which I have been using for over a dozen years, provides additional features through its own special syntax. I would have used Vim if I were doing this for myself. This simple example should work in all implementations of regular expressions. However, advanced features are implemented differently by other products and programming languages.

For example, I never learned the Perl programming language, whose regular expression syntax is the model for the Perl Compatible Regular Expressions (PCRE) library. So remember to be specific when you ask someone a question about regular expressions, or anything else related to programming.
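
Both points can be tried at a terminal. The sketch below assumes a POSIX grep; the function names and the test words are made up for illustration. It first matches the character class from the example, then writes one repeat count in two dialects to show how the same feature is spelled differently.

```shell
# Whole-line match (-x) against the character class from the example.
match_words() { grep -x '[hds]ad'; }
printf '%s\n' had dad sad bad | match_words   # prints had, dad, sad

# The same "exactly two a's" repeat count in two POSIX dialects.
bre_twice() { grep -x 'a\{2\}d'; }            # basic syntax: braces escaped
ere_twice() { grep -E -x 'a{2}d'; }           # extended syntax (-E): braces bare
printf '%s\n' ad aad aaad | bre_twice         # prints aad
printf '%s\n' ad aad aaad | ere_twice         # prints aad
```

Swapping the two patterns between the two functions makes each one stop matching “aad,” which is exactly the kind of difference that makes an unqualified regular expression question hard to answer.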

Operating Systems

Operating systems in the Unix family, such as Linux or Mac OS X, include advanced text utilities that are not packaged with Microsoft Windows. Although Vista was installed on my latest ThinkPad, I only tried it for ten minutes before giving up and installing Linux.

What Is Wrong With Aardvark?

I would not have responded if I had been asked “how to remove images from an HTML file on Windows,” “how to remove images from an HTML file with Perl,” or many other possibilities. Aardvark needs better instructions, so users can ask and respond to questions more efficiently.

Assumptions are often wrong, but I really wanted to answer my first question, so I assumed that Stephan was not using Microsoft Windows, and provided a response which should work on any Unix-type system.

How To Remove Images From HTML on Linux or Mac OS X

Unix-based systems include sed, the “stream editor.” This is the description from the sed manual page on my favorite computer:

Sed is a stream editor. A stream editor is used to perform basic text transformations on an input stream (a file or input from a pipeline). While in some ways similar to an editor which permits scripted edits (such as ed), sed works by making only one pass over the input(s), and is consequently more efficient. But it is sed’s ability to filter text in a pipeline which particularly distinguishes it from other types of editors.

In other words, sed is the perfect tool to remove images from HTML files! After saving the sample as “input.html,” I was able to remove the images and save the result as “output.html” with this command:

sed 's/<img [^>]*>//g' input.html > output.html

output.html contained:

<td></td>
<td></td>
<td></td>
<td></td>

Perfect! I hope I helped Stephan, but I will never know, since neither Aardvark nor Stephan told me whether my answer was helpful.
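
The sample has one image per line. When a line holds several tags, a pattern built on `.*` behaves differently from one built on `[^>]*`, because `.*` is greedy. The sketch below uses a made-up input line (not from Stephan's sample) and hypothetical function names to show both.

```shell
# Greedy: .* runs to the LAST "/>" on the line, so the text in between is lost.
strip_greedy() { sed 's/<img .*\/>//g'; }
# Negated class: [^>]* stops at the first ">", so each tag is removed on its own.
strip_each() { sed 's/<img [^>]*>//g'; }

line='<td><img src="a"/></td><td>keep</td><td><img src="b"/></td>'
printf '%s\n' "$line" | strip_greedy   # prints <td></td>
printf '%s\n' "$line" | strip_each     # prints <td></td><td>keep</td><td></td>
```

Either way, sed works one line at a time, so a tag that is split across two lines would not be matched by these patterns.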

Another One Line Program

Those familiar with HTML know that certain characters, like < and >, must be encoded as “entities,” e.g., &lt; and &gt;. So I wrote another one-liner, in PHP, for this article:

<?php echo htmlspecialchars(file_get_contents($argv[1])); ?>

I saved that line as “html2text.php” and entered:

php -f html2text.php input.html > input.txt

Then I copied and pasted the results into this article.

Now you know how to remove images from HTML, and also how to escape HTML so it can be displayed in your Web pages, with just two lines of code!
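
If PHP is not installed, roughly the same escaping can be sketched with sed alone. This is an approximation, not a replacement for htmlspecialchars: the function name `escape_html` is made up here, and only the ampersand, the angle brackets, and the double quote are handled.

```shell
escape_html() {
  # The ampersand goes first; otherwise the "&" in "&lt;" would be escaped again.
  # In a sed replacement, "\&" is a literal ampersand.
  sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' -e 's/"/\&quot;/g'
}

# Escape the first line of Stephan's sample, shortened for readability.
printf '%s\n' '<td><img alt="" src="w"/></td>' | escape_html
# prints &lt;td&gt;&lt;img alt=&quot;&quot; src=&quot;w&quot;/&gt;&lt;/td&gt;
```

Running `escape_html < input.html` would print the escaped markup for the whole sample file.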

NOTE: I like my one-liner, but you can make it easier to use by adding a second line and converting it to a “shell script.” See Example #1 in “Using PHP from the command line” for more information.
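
A minimal sketch of that conversion, assuming the extra line is a “shebang” that names the PHP interpreter and that `php` can be found on the PATH. The file name `html2text` (without the extension) is only an illustration.

```shell
# Write the two-line script; the quoted 'EOF' stops the shell from expanding $argv.
cat > html2text <<'EOF'
#!/usr/bin/env php
<?php echo htmlspecialchars(file_get_contents($argv[1])); ?>
EOF

# Mark the file executable so it can be started by name.
chmod +x html2text
```

After that, `./html2text input.html` should print the escaped markup without typing `php -f` each time.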

Calling Aardvark!

Aardvark, which runs a computer-based business, should understand the differences between applications and operating systems. I sent them a link to this article, and I hope they use my suggestions to improve their innovative service.

Update

Google bought Aardvark for $50 million on February 11, 2010. Google announced that it was closing Aardvark on September 2, 2011.

Comments

Alison

Mitch, thanks for the detailed post! We completely agree that 1. the more detail people offer in asking, the better and 2. you should receive some feedback on your answer. We currently have a whole team working on ways to encourage great questions on Aardvark and to encourage follow-up and ‘thanks’. We believe that this is really what separates Aardvark from many other search experiences. I’d *love* to hear any suggestions that you or your readers have on this matter.

– Alison @ Aardvark

Mitch

Alison,
Thanks for your reply. Aardvark can increase its value by offering a unique entry form for software questions, which require additional background information.

Best wishes, Mitch

John

I liked it. So much useful material. I read with great interest.