I need to navigate to a web site that ultimately contains a .pdf file and I want to save that file locally. I am using CEFSharp to do this. The nature of this site is such that once the .pdf appears in the browser, it cannot be accessed again. For this reason, I was wondering if once you have a .pdf displayed in the browser, is there a way to access the source for that file in the cache?
I have tried implementing IDownloadHandler and that works, but you have to click the save button on the embedded .pdf. I am trying to get around that.
OK, here is how I got it to work. There is a function in CEFSharp that allows you to filter an incoming web response. Consequently, this gives you complete access to the incoming stream. My solution is a little on the dirty side and not particularly efficient, but it works for my situation. If anyone sees a better way, I am open for suggestions. There are two things I have to assume in order for my code to work.
Start with the CEF Minimal Example found here : https://github.com/cefsharp/CefSharp.MinimalExample
I used the WinForms version. Implement the IRequestHandler and IResponseFilter in the form definition as follows:
public partial class BrowserForm : Form, IRequestHandler, IResponseFilter
{
public readonly ChromiumWebBrowser browser;
public BrowserForm(string url)
{
InitializeComponent();
browser = new ChromiumWebBrowser(url)
{
Dock = DockStyle.Fill,
};
toolStripContainer.ContentPanel.Controls.Add(browser);
browser.BrowserSettings.FileAccessFromFileUrls = CefState.Enabled;
browser.BrowserSettings.UniversalAccessFromFileUrls = CefState.Enabled;
browser.BrowserSettings.WebSecurity = CefState.Disabled;
browser.BrowserSettings.Javascript = CefState.Enabled;
browser.LoadingStateChanged += OnLoadingStateChanged;
browser.ConsoleMessage += OnBrowserConsoleMessage;
browser.StatusMessage += OnBrowserStatusMessage;
browser.TitleChanged += OnBrowserTitleChanged;
browser.AddressChanged += OnBrowserAddressChanged;
browser.FrameLoadEnd += browser_FrameLoadEnd;
browser.LifeSpanHandler = this;
browser.RequestHandler = this;
The declaration and the last two lines are the most important for this explanation. I implemented the IRequestHandler using the template found here: https://github.com/cefsharp/CefSharp/blob/master/CefSharp.Example/RequestHandler.cs I changed everything to what it recommends as default except for GetResourceResponseFilter which I implemented as follows:
IResponseFilter IRequestHandler.GetResourceResponseFilter(IWebBrowser browserControl, IBrowser browser, IFrame frame, IRequest request, IResponse response)
{
if (request.Url.EndsWith(".pdf"))
return this;
return null;
}
I then implemented IResponseFilter as follows:
FilterStatus IResponseFilter.Filter(Stream dataIn, out long dataInRead, Stream dataOut, out long dataOutWritten)
{
BinaryWriter sw;
if (dataIn == null)
{
dataInRead = 0;
dataOutWritten = 0;
return FilterStatus.Done;
}
dataInRead = dataIn.Length;
dataOutWritten = Math.Min(dataInRead, dataOut.Length);
byte[] buffer = new byte[dataOutWritten];
int bytesRead = dataIn.Read(buffer, 0, (int)dataOutWritten);
string s = System.Text.Encoding.UTF8.GetString(buffer);
if (s.StartsWith("%PDF"))
File.Delete(pdfFileName);
sw = new BinaryWriter(File.Open(pdfFileName, FileMode.Append));
sw.Write(buffer);
sw.Close();
dataOut.Write(buffer, 0, bytesRead);
return FilterStatus.Done;
}
bool IResponseFilter.InitFilter()
{
return true;
}
What I found is that the PDF is actually downloaded twice when it is loaded. In any case, there might be header information and what not at the beginning of the page. When I get a stream segment that begins with %PDF, I know it is the beginning of a PDF so I delete the file to discard any previous contents that might be there. Otherwise, I just keep appending each segment to the end of the file. Theoretically, the PDF file will be safe until you navigate to another PDF, but my recommendation is to do something with the file as soon as the page is loaded just to be safe.