How To Download The HTML Of A Web Page Programmatically In Java

Let’s Take A Look At Multiple Methods Of Downloading The HTML Of A Page With Java

Sooner or later, while developing a Java application, you will need to download the HTML of a web page. This comes up in many situations, including web scraping projects. It can also be useful on your own site (I once wrote an application that extracts the thumbnails of my blog posts, for example).

There are multiple ways to download HTML using Java: the standard Java IO library, Java NIO, or an external library like Jsoup. For each of these, I will give you a reusable function you can copy & paste into your code, since I know you’re probably in the middle of a task and don’t have the time to write it from scratch.

Using Standard Java Library (Java IO)

The method below uses the standard Java IO library, so it will work with older code & older versions of Java. It returns the HTML as a string instead of saving it to a file:-

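import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

public static String DownloadHTML(String URL) {
    // StringBuilder avoids creating a new String for every line read
    StringBuilder HTML = new StringBuilder();
    InputStream inputstream = null;
    try {
        URL url = new URL(URL);
        inputstream = url.openStream();
        // Assumes the page is UTF-8 encoded; change the charset name if it is not
        BufferedReader bufferedreader = new BufferedReader(new InputStreamReader(inputstream, "UTF-8"));
        String line;
        while ((line = bufferedreader.readLine()) != null) {
            HTML.append(line).append("\n");
        }
    } catch (MalformedURLException m) {
        m.printStackTrace();
    } catch (IOException e) {
        System.out.println(URL); // Print the URL, can help discover some bugs
        e.printStackTrace();
    } finally {
        // Always release the connection, even if reading failed
        try {
            if (inputstream != null) {
                inputstream.close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    return HTML.toString();
}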
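
A quick way to try a function like this without touching the network is to point it at a local file, since java.net.URL also understands file: URLs. The sketch below is only an illustration: the class name DownloadHtmlDemo, the helper roundTrip, and the temporary file are hypothetical, and the read loop is a condensed take on the function above that assumes UTF-8 content.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class DownloadHtmlDemo {

    // Condensed version of the read loop: returns the page as one String,
    // or an empty String if the URL is invalid or cannot be read.
    static String downloadHtml(String address) {
        StringBuilder html = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new URL(address).openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                html.append(line).append("\n");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return html.toString();
    }

    // Hypothetical helper: writes a tiny page to a temp file and reads it back
    // through a file: URL, so no web server is needed.
    static String roundTrip(String page) {
        try {
            Path file = Files.createTempFile("page", ".html");
            try {
                Files.write(file, page.getBytes(StandardCharsets.UTF_8));
                return downloadHtml(file.toUri().toURL().toString());
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(roundTrip("<html>\n<body>Hello</body>\n</html>"));
    }
}
```

Because the loop appends "\n" after every line, the returned string always ends with a newline, even if the original page did not.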

Downloading HTML Using Java NIO

You can also download the HTML using Java NIO, the newer alternative to Java IO. It has several advantages over the original IO, which in short are as follows:-

- NIO moves the most time-consuming I/O activities into the operating system, rather than making you write code that copies the data to and from streams & buffers.
- NIO copies data in blocks of bytes, whereas the streams used in Java IO copy data one byte at a time. Working in blocks is usually faster.
- NIO allows the use of direct buffers, which the JVM tries to hand to the operating system’s native I/O without an intermediate copy, and of memory-mapped files.

This adds up to better performance, especially when you deal with large files. So while the old Java IO does the job, use NIO if you can, especially if you’re writing a new application.
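
To make the block-oriented idea concrete, here is a minimal sketch of copying data between two channels through a direct ByteBuffer. The class name BlockCopyDemo, the method names, and the 8 KB buffer size are illustrative choices, not something NIO prescribes; in-memory streams stand in for a real network connection and file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;

class BlockCopyDemo {

    // Moves everything from source to target one buffer-sized block at a time
    // and returns the number of bytes copied.
    static long copy(ReadableByteChannel source, WritableByteChannel target) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocateDirect(8 * 1024); // direct buffer; the size is arbitrary
        long total = 0;
        while (source.read(buffer) != -1) {
            buffer.flip();                  // switch from filling to draining
            while (buffer.hasRemaining()) {
                total += target.write(buffer);
            }
            buffer.clear();                 // make room for the next block
        }
        return total;
    }

    // Hypothetical helper for trying copy() without any network or disk access.
    static String copyInMemory(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (ReadableByteChannel in = Channels.newChannel(
                     new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
             WritableByteChannel out = Channels.newChannel(sink)) {
            copy(in, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(sink.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(copyInMemory("<html><body>Hello NIO</body></html>"));
    }
}
```

The same copy method would work with a channel opened on a URL stream and a FileChannel; only the two helper channels would change.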

The code below uses NIO to save the HTML contents in a file, which you can read at any time you want.

import java.io.FileOutputStream;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.ReadableByteChannel;

public static void DownloadHTMLUsingNIO(String URLA, String FilePath) {
    try {
        URL url = new URL(URLA);
        // try-with-resources closes both channels and the file stream, even on failure
        try (ReadableByteChannel bytechannel = Channels.newChannel(url.openStream());
             FileOutputStream fileoutputstream = new FileOutputStream(FilePath);
             FileChannel fileChannel = fileoutputstream.getChannel()) {
            // Copy whatever the server sends straight into the file
            fileChannel.transferFrom(bytechannel, 0, Long.MAX_VALUE);
        }
    } catch (MalformedURLException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
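
Once the page is on disk, reading it back into a string takes only a few lines. This is a small sketch, assuming the file is UTF-8 encoded; the class name SavedPageReader and both of its methods are hypothetical, and the path would be whatever you passed as FilePath above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class SavedPageReader {

    // Returns the whole file as a String, or null if it cannot be read.
    static String readSavedHtml(String filePath) {
        try {
            byte[] bytes = Files.readAllBytes(Paths.get(filePath));
            return new String(bytes, StandardCharsets.UTF_8); // assumes UTF-8 content
        } catch (IOException e) {
            e.printStackTrace();
            return null;
        }
    }

    // Hypothetical helper: saves some HTML to a temp file, reads it back,
    // and cleans up afterwards, so the example runs without a download.
    static String saveAndRead(String html) {
        try {
            Path file = Files.createTempFile("saved-page", ".html");
            try {
                Files.write(file, html.getBytes(StandardCharsets.UTF_8));
                return readSavedHtml(file.toString());
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(saveAndRead("<html><body>Saved page</body></html>"));
    }
}
```

Files.readAllBytes loads the entire file into memory, which is fine for a typical web page but worth reconsidering for very large files.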

Download HTML Of A Webpage Using Jsoup

Jsoup is a Java library for parsing & manipulating HTML. It can fetch HTML and extract specific elements from it using CSS selectors. If you don’t mind adding an external library, Jsoup is a good choice for downloading the HTML of any page. It does the dirty work for you, and on top of that, you only need one line of code to use it:-

String html = Jsoup.connect(URL).get().html();

I have wrapped the code in a function for you here, and caught the basic exception in it:-

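import java.io.IOException;
import org.jsoup.Jsoup;

public static String DownloadHTMLWithSoup(String URL) {
    try {
        // Fetch and parse the page, then return the HTML of the parsed document
        return Jsoup.connect(URL).get().html();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null; // the download failed
}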

And Finally

I hope my little tutorial has made it easier for you to download HTML, and I wish you luck in whatever project you’re working on.
