Efficient Image Metadata Extraction with Java

Java has a rich set of tools for processing images built into the standard library. However, it’s not always clear how to use that library to perform even simple tasks. There are already lots of great guides out there for working with images once they’re loaded, but what can Java do without ever loading the image into memory at all?

When working with images from untrusted sources — for example, images discovered during a web crawl — it’s best to treat data defensively. This article will show how to perform some useful tasks on images without ever loading their pixel data into memory.

Analyzing an Existing Image File

To extract metadata from an image file without loading its pixel data into memory, the first step is finding an appropriate ImageReader for the image. The easiest way is to let ImageIO pick a reader based on the file's contents:

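// Imports used by the file-based example. The record and method below
// are assumed to be nested inside a class.
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Optional;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.spi.ImageReaderSpi;
import javax.imageio.stream.ImageInputStream;

public static record ImageFileMetadata(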
  int firstImageWidth,
  int firstImageHeight,
  String formatName,
  List<String> formatMimeTypes,
  List<String> formatFileSuffixes) {
}

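public static Optional<ImageFileMetadata> getImageMetadata(File imageFile)
    throws IOException {
  try (ImageInputStream in = ImageIO.createImageInputStream(imageFile)) {
    // createImageInputStream returns null if no suitable stream
    // provider is available, and getImageReaders rejects null input.
    if (in == null) {
      return Optional.empty();
    }
    Iterator<ImageReader> readers = ImageIO.getImageReaders(in);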
    while (readers.hasNext()) {
      ImageReader reader = readers.next();
      try {
        reader.setInput(in);

        // Some image formats, like TIFF and animated GIF, can store
        // multiple images in one file. This is (relatively) rare. In
        // this example, we simply return information about the first
        // image in the file. It's not hard to iterate over all images
        // in the file using ImageReader#getNumImages(boolean) if needed.
        int firstImageWidth = reader.getWidth(0);
        int firstImageHeight = reader.getHeight(0);
        String formatName = reader.getFormatName();
        List<String> formatMimeTypes = Optional
          .ofNullable(reader.getOriginatingProvider())
          .map(ImageReaderSpi::getMIMETypes)
          .map(Arrays::asList)
          .map(Collections::unmodifiableList)
          .orElseGet(Collections::emptyList);
        List<String> formatFileSuffixes = Optional
          .ofNullable(reader.getOriginatingProvider())
          .map(ImageReaderSpi::getFileSuffixes)
          .map(Arrays::asList)
          .map(Collections::unmodifiableList)
          .orElseGet(Collections::emptyList);

        if (firstImageWidth > 0
            && firstImageHeight > 0
            && formatName != null
            && !formatMimeTypes.isEmpty()
            && !formatFileSuffixes.isEmpty()) {
          return Optional.of(
            new ImageFileMetadata(
              firstImageWidth,
              firstImageHeight,
              formatName,
              formatMimeTypes,
              formatFileSuffixes));
        }
      } finally {
        reader.dispose();
      }
    }
  }
  return Optional.empty();
}
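
The comment in the method above mentions ImageReader#getNumImages(boolean) for files that hold more than one image. A minimal sketch of that iteration might look like the following. The class name AllImageSizes, the ImageSize record, and the listImageSizes method are hypothetical names chosen for illustration, and the sketch simply uses the first reader that claims the file. Note that passing true to getNumImages allows the reader to scan the stream to count images, which may be slow for large files from untrusted sources.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

final class AllImageSizes {
  /** Width and height of one image within a file (hypothetical helper type). */
  record ImageSize(int width, int height) {}

  /**
   * Returns the dimensions of every image in the file, in order, or an
   * empty list if no ImageReader claims to understand the file.
   */
  static List<ImageSize> listImageSizes(File imageFile) throws IOException {
    List<ImageSize> sizes = new ArrayList<>();
    try (ImageInputStream in = ImageIO.createImageInputStream(imageFile)) {
      if (in == null) {
        return sizes;
      }
      Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
      if (!readers.hasNext()) {
        return sizes;
      }
      // Assumption: the first matching reader is good enough here.
      ImageReader reader = readers.next();
      try {
        reader.setInput(in);
        // allowSearch = true permits scanning the stream to count images.
        int count = reader.getNumImages(true);
        for (int i = 0; i < count; i++) {
          sizes.add(new ImageSize(reader.getWidth(i), reader.getHeight(i)));
        }
      } finally {
        reader.dispose();
      }
    }
    return sizes;
  }
}
```

For a single-image format such as PNG, the returned list would hold exactly one entry.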

Analyzing an Image from the Internet

The above example assumes the image file is already available. The following method downloads an image from the internet to a file first, then analyzes it:

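// Imports used by the download example, in addition to File,
// IOException, and Optional.
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpResponse.BodyHandlers;
import java.time.Duration;

// maxLength is the largest download, in bytes, that will be accepted.
public static Optional<ImageFileMetadata> getImageMetadata(URI uri, long maxLength)
    throws IOException, InterruptedException {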
  File imageFile = File.createTempFile("image.", ".download");
  try {
    HttpResponse<InputStream> response = HttpClient.newHttpClient()
      .send(HttpRequest.newBuilder(uri)
          .GET()
          .timeout(Duration.ofSeconds(30L))
          .build(),
        BodyHandlers.ofInputStream());
    try (InputStream in = response.body();
        OutputStream out = new FileOutputStream(imageFile)) {
      // Check the status and declared type before reading the body, so
      // that error pages and non-image responses are rejected without
      // copying anything to disk.
      if (response.statusCode() != 200)
        return Optional.empty();

      String contentType = response.headers()
        .firstValue("content-type")
        .map(String::toLowerCase)
        .orElse("image/*");
      if (!contentType.startsWith("image/"))
        return Optional.empty();

      long currentLength = 0L;
      byte[] buffer = new byte[16384];
      for (int nread = in.read(buffer);
          nread != -1;
          nread = in.read(buffer)) {
        out.write(buffer, 0, nread);
        currentLength = currentLength + nread;
        if (currentLength > maxLength)
          return Optional.empty();
      }
    }

    return getImageMetadata(imageFile);
  } finally {
    imageFile.delete();
  }
}

The above code is a fairly standard HTTP request using the HttpClient introduced in Java 11. However, note two basic defensive measures for downloading content from the internet: a maximum file length in bytes and a timeout.
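
The length limit can also be factored out of the download method into a small helper that uses only the standard library. The following is a minimal sketch rather than a drop-in replacement: the class name BoundedCopy and the method copyAtMost are hypothetical, and unlike the loop above it checks the limit before writing each chunk, so nothing beyond the limit reaches the output.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

final class BoundedCopy {
  /**
   * Copies in to out, giving up as soon as more than maxLength bytes
   * have been read. Returns true if the entire stream fit within the
   * limit, and false if the copy was abandoned.
   */
  static boolean copyAtMost(InputStream in, OutputStream out, long maxLength)
      throws IOException {
    byte[] buffer = new byte[16384];
    long total = 0L;
    for (int nread = in.read(buffer); nread != -1; nread = in.read(buffer)) {
      total += nread;
      if (total > maxLength) {
        // Over the limit: stop without writing the chunk that crossed it.
        return false;
      }
      out.write(buffer, 0, nread);
    }
    return true;
  }
}
```

A caller would treat a false result the same way the download method treats an oversized response, for example by returning Optional.empty().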

Depending on the use case, several improvements could be made:

  • If images are cached, then using caching request headers like If-Modified-Since and accepting 3XX response codes would improve efficiency.
  • If the use case is more sensitive, then discarding images without a Content-Type header may be preferred.
  • Several third-party libraries offer a cleaner way of copying N bytes from an InputStream to an OutputStream, e.g., Google Guava and Apache Commons IO.
  • Throwing exceptions as opposed to returning Optional.empty() may be preferred in some use cases.
  • Logging would improve problem determination.
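
As an illustration of the caching suggestion in the list above, a conditional request could be built roughly as follows. This is a sketch under stated assumptions: the class ConditionalRequests and its methods are hypothetical names, the caller is assumed to have recorded when the image was last fetched, and a 304 Not Modified response is taken to mean the cached copy can be reused. The date is formatted with an explicit pattern so that the day of the month is always two digits, as the HTTP date format expects.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

final class ConditionalRequests {
  // Fixed-length HTTP date, e.g. "Thu, 01 Jan 1970 00:00:00 GMT".
  private static final DateTimeFormatter HTTP_DATE = DateTimeFormatter
      .ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
      .withZone(ZoneOffset.UTC);

  /** Formats an instant as an HTTP date in GMT. */
  static String formatHttpDate(Instant instant) {
    return HTTP_DATE.format(instant);
  }

  /** Builds a GET that asks the server to skip the body if nothing changed. */
  static HttpRequest conditionalGet(URI uri, Instant lastFetched) {
    return HttpRequest.newBuilder(uri)
        .GET()
        .timeout(Duration.ofSeconds(30L))
        .header("If-Modified-Since", formatHttpDate(lastFetched))
        .build();
  }

  /** True if the status code means the cached copy is still current. */
  static boolean isNotModified(int statusCode) {
    return statusCode == 304;
  }
}
```

The download method would then send this request instead of the plain GET and return the cached metadata when isNotModified is true, rather than treating every status other than 200 as a failure.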

License

As always, to the extent that anyone needs a license for code on this site, please consider the above to be available under the CC0 “Public Domain” license.