At some point nearly every SaaS company receives a feature request to export user data. The request usually comes in the form of “I want to export my data to a <Microsoft Office Product> file.” Here at Sprout Social we’ve gotten these requests before and today we support CSV and PDF exports for nearly all of our reports. Implementing CSV exports was relatively straightforward. Implementing PDF exports, on the other hand, was a more complicated beast. In this article I want to share with you the history of Sprout Social’s PDF exports and some of the issues we ran into in the hopes that it may help you should you choose to go down a similar path.
Early in Sprout Social’s life, sometime in 2010, we received one of those data export requests. Our users wanted PDF copies of our reports so they could easily share data with teammates who didn’t use Sprout Social. Unfortunately for us, the options for generating PDFs at the time were fairly limited. There were a few command-line tools, but they weren’t very flexible and didn’t offer great CSS support. Many browsers at the time offered the ability to print a web page directly to a PDF file, but it was a bit cumbersome for users, and it was very difficult to add print styles to a page that wasn’t designed for them from the start. So instead of adopting an existing solution we decided to build our own, and shortly thereafter, Papyrus was born.
Papyrus is the internal name for our first PDF report generation service. It’s a Java service that accepts a JSON payload as input and uses a library called iText to generate a PDF. Although some of the details are a bit complicated, using Papyrus to generate a PDF is relatively simple.
iText uses an XML-based markup language and a subset of CSS to create and style PDF documents. We know the layout of our PDFs beforehand; we just don’t know the content. Using Mustache, we can create templates of our reports that can be filled in with user data at generation time. Once we combine a user payload with the template to produce the full markup, iText can generate a PDF document to return to the user. We employ a few tricks to generate the PDFs, such as using Rhino and Highcharts to generate graphs, but most of the heavy lifting is done by iText. Most of our work lies in creating the templates for each of the reports.
While Papyrus has the benefit of simplicity, it also has a few drawbacks. Most notably, the templates are onerous to create and difficult to match to designs. We’re also forced to duplicate display logic in the markup and on the front-end, meaning that both back-end and front-end developers have to be involved in creating and modifying the reports. Because of these drawbacks, we started searching for alternatives in early 2014.
By 2014, PhantomJS was becoming increasingly popular in the web development world. Most usage was focused around browser automation and testing, but one of its lesser known features is its ability to perform screen captures. Relevant to our use case, it can capture the contents of any web page in a PDF file. Using this feature we set out to build a service that would generate PDF reports based on the contents of the report’s page in our app.
We soon had a prototype for a new PDF generation service that could take screenshots of our existing reports. It wasn’t an out-of-the-box solution, however. We had to modify several parts of our application to make the reporting pages compatible with the way we were using PhantomJS. Some of those changes included:
- CSS workarounds. PhantomJS 1 is based on older versions of WebKit, which led to a lot of our CSS not working in PDF mode. In most cases, we had to fall back to using IE9 workarounds for PhantomJS.
- A Function.prototype.bind polyfill. PhantomJS 1 notoriously doesn’t support Function.prototype.bind even though it implements most of the rest of the ES5 standard.
- Fonts. If you search for “PhantomJS fonts” you’re likely to come across an article that will show you how to get PhantomJS to recognize local fonts. Put the fonts in /usr/share/fonts/truetype and then run fc-cache -fv. That works great until you also run into the issue where PhantomJS doesn’t implement the CSS font-family declaration correctly. This issue wasn’t found until we were in production and our Typekit fonts failed to load.
- A custom version of the reporting page. If PhantomJS took a screenshot of the report as-is, the PDF would include a lot of unnecessary content such as navigation bars, headers, and footers. The page also wouldn’t look very good because the contents weren’t optimized to fit on a standard PDF page. To work around this we created another web page that would only render the content necessary for the PDF, in a layout that made sense for a PDF. This meant we had to duplicate some layout code, but most of the components (graphs, charts, media objects, etc.) could still be reused.
- Authentication. Because PhantomJS didn’t have the user’s cookies we had to choose between side-loading data on the page or finding a way to authenticate PhantomJS to make API requests on the user’s behalf. Because of security concerns at the time we opted to side-load the data onto the page. That meant the front-end would have to gather all of the necessary data and ship it to PhantomJS when exporting a report.
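To make the Function.prototype.bind item in the list above concrete, here is a minimal sketch of the kind of polyfill that covers the gap in an ES5 engine such as PhantomJS 1. It is a simplified illustration, not the code we shipped: the name bindPolyfill is hypothetical, and the sketch does not handle bound functions used with `new`.

```javascript
// Simplified sketch of a Function.prototype.bind polyfill, written in ES5
// so it can run in older engines. `bindPolyfill` is a hypothetical name.
function bindPolyfill(thisArg) {
  var target = this;
  if (typeof target !== 'function') {
    throw new TypeError('bind must be called on a function');
  }
  // Arguments supplied at bind time are prepended to every later call.
  var boundArgs = Array.prototype.slice.call(arguments, 1);
  return function bound() {
    var callArgs = Array.prototype.slice.call(arguments);
    return target.apply(thisArg, boundArgs.concat(callArgs));
  };
}

// Install the polyfill only when the engine does not already provide bind.
if (!Function.prototype.bind) {
  Function.prototype.bind = bindPolyfill;
}
```

Guarding the assignment means the native implementation is left alone in engines that already have one.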
The workflow turned out to be rather complicated, but it worked.
- The user initiates a PDF export and the front-end gathers the required data in a JSON payload.
- A request is sent to the PDF service, which starts an instance of PhantomJS and points the browser to the reporting page.
- The user payload is injected onto the reporting page and the page uses the data to render the report.
- PhantomJS captures the page in a PDF that is uploaded to S3.
- The S3 URL is returned to the client and the PDF download is initiated.
It lacked the simplicity of Papyrus, but it alleviated some of the frustrations we had with Papyrus. Not only were the reports as vibrant as the web versions, but now all of the logic for PDFs lived in the web code. An entire report could be designed and implemented by the front-end team, making reports easier to develop and easier to ship. Seeing the potential in the new method, we sought to improve the service.
The New PDF Generator Service
After working with our PhantomJS-based service for a while, we started to identify some areas where we could improve the workflow. Most notably:
- Testing PDFs was difficult. Because the way the service generated report URLs wasn’t configurable, developers had to set up their own instance of the service in order to test reports outside of production.
- We weren’t utilizing PhantomJS to its full potential. Our prototype worked, but we soon realized that PhantomJS had features that could simplify our workflow. For instance, the onInitialized hook would allow us to inject data directly into the page instead of uploading it to a server only to have the page re-download it. We also never properly enabled the PhantomJS disk cache, which would cut down on page load times if we configured it correctly.
- The service used a fixed version of PhantomJS. We sometimes upgraded the version, but we had to upgrade every report at the same time. Making the version configurable would allow each report to operate independently of the others.
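The onInitialized idea from the list above can be sketched roughly as follows. PhantomJS calls a page’s onInitialized callback after the page object is created but before the URL loads, and page.evaluate runs a function inside the page with any extra arguments passed through. The helper name injectPayloadOnInit and the global window.__REPORT_PAYLOAD__ are assumptions made for illustration, not the names used in our service.

```javascript
// Sketch: hand a report payload to the page through PhantomJS's
// onInitialized callback. `page` is expected to be a PhantomJS WebPage,
// for example the result of require('webpage').create().
// `__REPORT_PAYLOAD__` is a hypothetical global chosen for this example.
function injectPayloadOnInit(page, payload) {
  page.onInitialized = function () {
    // The function below executes in the page context, so the report's
    // own scripts can read the payload as soon as they start running.
    page.evaluate(function (data) {
      window.__REPORT_PAYLOAD__ = data;
    }, payload);
  };
  return page;
}
```

In a PhantomJS script, the helper would be called before page.open, and page.render would write the PDF once the report reports that it has finished rendering.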
Using what we had learned from the first version we began to implement version 2 of the PhantomJS PDF generation service. We took a deep dive into PhantomJS’s documentation and source code and utilized more of its features. We were able to inject data directly into the page and enable the disk cache which resulted in our generation times dropping by as much as 40%. We made nearly every aspect of the service configurable, from the version of PhantomJS used to the URL of the report to the generation timeout.
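As a rough illustration of per-request configuration, the sketch below merges request overrides onto a set of defaults. The option names, the default values, and the example URL are all hypothetical; the service’s real configuration keys are not shown in this article.

```javascript
// Hypothetical defaults; none of these names or values come from the
// real service.
var DEFAULT_OPTIONS = {
  phantomVersion: '2.1.1',
  reportUrl: 'https://app.example.test/reports/pdf',
  timeoutMs: 30000
};

// Return a new options object: known keys take the override when one is
// supplied, and unknown keys are ignored.
function resolveOptions(overrides) {
  var resolved = {};
  Object.keys(DEFAULT_OPTIONS).forEach(function (key) {
    var hasOverride = Boolean(overrides) &&
      Object.prototype.hasOwnProperty.call(overrides, key);
    resolved[key] = hasOverride ? overrides[key] : DEFAULT_OPTIONS[key];
  });
  return resolved;
}
```

Because each request resolves its own options, one report can move to a newer PhantomJS version or a longer timeout without touching the others.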
In version 2 we made large strides in our error handling, since this was our biggest pain point. We utilize every error hook available in PhantomJS to ensure that any and all errors are captured in the log files. Errors are categorized by where they happen and how serious they are. They’re also given error codes to return to the client to help debug customer issues in production. Any request that fails in production is logged along with the contents of the payload, allowing us to reproduce the request later if needed. We also have a test page that sends raw payloads directly to the PDF generation service, allowing us to bypass the UI and the API when reproducing customer errors and reducing the amount of time it takes to find the cause. Because of the increased error-handling surface area, we saw our service losses go from one or two a month to zero in the last 16 months.
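A minimal sketch of this kind of hook-based error capture is shown below. The callbacks page.onError, page.onResourceError, and page.onResourceTimeout are part of the PhantomJS WebPage API; the error codes, severity labels, and the attachErrorHooks helper are hypothetical stand-ins, since the real categories are not listed here.

```javascript
// Hypothetical error codes for illustration only.
var ERROR_CODES = {
  PAGE_SCRIPT: 'E_PAGE_SCRIPT', // JavaScript error thrown inside the report page
  RESOURCE: 'E_RESOURCE',       // a stylesheet, font, script, or request failed to load
  TIMEOUT: 'E_TIMEOUT'          // a resource request timed out
};

// Attach PhantomJS WebPage error callbacks that append categorized entries
// to `log`, an array the caller can later write to a log file.
function attachErrorHooks(page, log) {
  page.onError = function (msg, trace) {
    log.push({ code: ERROR_CODES.PAGE_SCRIPT, severity: 'error', message: msg, trace: trace });
  };
  page.onResourceError = function (resourceError) {
    log.push({
      code: ERROR_CODES.RESOURCE,
      severity: 'warning',
      message: resourceError.errorString,
      url: resourceError.url
    });
  };
  page.onResourceTimeout = function (request) {
    log.push({
      code: ERROR_CODES.TIMEOUT,
      severity: 'warning',
      message: request.errorString,
      url: request.url
    });
  };
  return log;
}
```

Returning a stable code with each entry is what lets a client-side failure be matched to the corresponding line in the service logs.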
As part of our refactor we also modified our front-end code to create smaller payloads. Instead of sending raw request data to the service (most of which wasn’t used), we began to send processed, aggregated data. In some cases we cut down the payload size by a factor of 10. These changes, combined with the efficiencies mentioned above, mean that reports now take 5 to 6 seconds to generate instead of the previous 20 to 25 seconds. And that time continues to decrease as we make further optimizations and switch more of our rendering logic to React.
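To illustrate the difference between raw and aggregated payloads, the sketch below collapses per-message records into one total per day before anything is sent to the PDF service. The record shape (an ISO `timestamp` string and an `engagements` count) and the function name are assumptions made for this example, not our actual report data.

```javascript
// Sketch: reduce raw per-message records to one total per day.
// Each record is assumed to look like
// { timestamp: '2016-03-01T09:15:00Z', engagements: 4 }.
function aggregateByDay(records) {
  var totals = {};
  records.forEach(function (record) {
    // The first ten characters of an ISO 8601 timestamp are 'YYYY-MM-DD'.
    var day = record.timestamp.slice(0, 10);
    totals[day] = (totals[day] || 0) + record.engagements;
  });
  // Emit the days in chronological order so charts can use them directly.
  return Object.keys(totals).sort().map(function (day) {
    return { day: day, engagements: totals[day] };
  });
}
```

A chart that plots daily totals only needs one entry per day, so the payload shrinks in proportion to the number of raw records each day contained.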
Since we launched the new PDF service 16 months ago updates have been few and far between. Its flexibility has allowed us to add new reports without any changes to the service. And the reliability of both the service and PhantomJS 2 has allowed us to start designing larger features around PDFs without worrying about scalability. This isn’t the final chapter in the book of PDFs at Sprout Social, but we are in a good place and we’re excited to see what the future holds for us and our customers.