As I have previously mentioned, I am rather fond of puppeteer. It's a useful library for all kinds of web automation... but like any open source project, it needs some TLC.
I am not in any way associated with the developers of puppeteer, but if you are looking for a way to contribute, it is open source.
The frustration
I was looking at a somewhat long page (think vertically) and tried to create a screenshot of it. The optimist in me assumed it would simply work, so I went on as usual and planned my approach on the assumption that it would function as intended.
I checked the screenshot and found that it was a tiled image of a fixed-size crop from the top of the page. My first reaction was frustration... but I think it was more at myself for not having allowed any margin for error in the experiment.
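For context, the failing attempt was essentially the stock screenshot call. A minimal sketch of it (assuming puppeteer is imported and url points at the long page; the exact code I ran is not shown here):

// a minimal sketch of the naive attempt: one screenshot call
// for the whole element, no chunking
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(url, { waitUntil: 'networkidle0' });
const element = await page.$('div#document1 div.eli-container');
// on very tall elements this produced the tiled, fixed-size output
await element.screenshot({ path: './screenshots/naive-ss.png' });
await browser.close();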
The insight
There is no reason to point fingers when something is not working, especially in OSS. If you have the chops, fix it for yourself and share it; if it is good enough, it might get adopted upstream. In other words, perfect is the enemy of good.
The bug
Before focusing on hacking my way out of the jam I scoured the web, as problems are usually not as unique as one might think. I am ashamed to admit it, but I'm not fond of documentation, and digging into the docs of the different related projects is the last step in my debugging journey.
I found that this was related to an old, still-open bug in the puppeteer repo.
Discussion has been ongoing until quite recently... but it is still open.
The consensus I could gather is to either use playwright or work around it in the puppeteer layer. The root cause of the bug is a websocket size limitation in the CDP protocol for chromium.
I had intended to use playwright, but in some of my tests it failed to load some pages, so I decided to revisit the puppeteer idea and solve the issue where I could.
Hacking my way through it
I started with a height-based chunking method. A more generic approach was to create a chunker that returns a function, so that the chunk height is configurable via a parameter.
// return a chunker function with the height for each chunk;
// `number` will be the full height of the element you want to
// grab a screenshot of
const chunkBy = (n) => (number) => {
  let chunks = new Array(Math.floor(number / n)).fill(n);
  chunks = chunks.map((c, i) => ({ height: c, start: i * c }));
  // handle elements shorter than a single chunk
  if (chunks.length === 0) {
    return [{ height: number, start: 0 }];
  }
  const last = chunks[chunks.length - 1];
  const remainder = number - last.start - last.height;
  if (remainder > 0) {
    chunks.push({ height: remainder, start: last.start + last.height });
  }
  console.log('CHUNKS = ', chunks);
  return chunks;
};
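The chunkBy4k helper used in the main function below is just this chunker specialized to 4000px chunks:

const chunkBy4k = chunkBy(4000);
// e.g. chunkBy4k(9500) yields:
// [ { height: 4000, start: 0 },
//   { height: 4000, start: 4000 },
//   { height: 1500, start: 8000 } ]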
Afterwards I wrote the function that grabs the screenshot regardless of the element's height, working around the CDP limitation.
// the imports the script relies on (merge-img does the stitching)
import puppeteer from 'puppeteer';
import crypto from 'crypto';
import path from 'path';
import mergeImg from 'merge-img';
import { readFile, readdir, unlink } from 'fs/promises';

// urls is a list of string urls
async function grabSelectorScreenshot(urls) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3');
  const results = [];
  for (const url of urls) {
    const hashed = crypto.createHash('sha256').update(url).digest('hex');
    await page.goto(url, { waitUntil: 'networkidle0' });
    // this is where the element is selected
    const element = await page.$('div#document1 div.eli-container');
    // get height and width for later iterating
    const { width, height } = await element.boundingBox();
    const designatedPathPng = `./screenshots/${hashed}-merged-ss.png`;
    // chunk by 4000 height
    const heights = chunkBy4k(height);
    // keep track of starting point and height
    // to have a continuous mapping of the image
    const chunks = heights.map((h, i) => {
      return element.screenshot({
        clip: {
          x: 0,
          y: h.start,
          height: h.height,
          width,
        },
        path: `./screenshots/${hashed}-${i}-ss.png`,
      });
    });
    // wait for all the part files to be written
    const filesResolved = await Promise.all(chunks);
    // merge all the parts in a vertical layout
    const mergedImage = await mergeImg(filesResolved, { direction: true });
    // this is interesting: mergeImg returns a promise, but the write
    // only worked via a function callback, so wrap it in a promise
    // to know when the merged file has actually hit the disk
    const b64imgPng = await new Promise((resolve, reject) => {
      mergedImage.write(designatedPathPng, (err) => {
        if (err) return reject(err);
        readFile(designatedPathPng)
          .then((dataPng) => resolve(dataPng.toString('base64')))
          .catch(reject);
      });
    });
    // clean up the temporary part files created
    await deleteFilesMatchingPattern('./screenshots', new RegExp(`^${hashed}-\\d+-ss\\.png$`));
    results.push(b64imgPng);
  }
  // close the browser only after all urls are processed
  await browser.close();
  return results;
}
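Usage is then just a matter of passing in the list of urls (the url below is a placeholder):

const urls = ['https://example.com/some-very-long-page'];
const images = await grabSelectorScreenshot(urls);
// images now holds one base64-encoded png per url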
Cleaning up temporary files
You probably want to clean up the temporary part files. One way to do that:
async function deleteFilesMatchingPattern(dirPath, regex) {
  try {
    const files = await readdir(dirPath); // read all files in the directory
    for (const file of files) {
      if (regex.test(file)) { // check if the file matches the pattern
        const filePath = path.join(dirPath, file);
        await unlink(filePath); // delete the file
        console.log(`Deleted: ${filePath}`);
      }
    }
  } catch (error) {
    console.error('Error:', error);
  }
}
In hindsight, a better way to do this would probably be to use actual tmp files and decouple the cleanup, but this was good enough for a barebones script.
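Something along these lines, assuming the script is adapted to write its part files into the generated directory (a sketch, not what the script above does):

import os from 'os';
import path from 'path';
import { mkdtemp, rm } from 'fs/promises';

// create a unique, throwaway directory for the part files
const partsDir = await mkdtemp(path.join(os.tmpdir(), 'screenshots-'));
// ...write the chunk screenshots into partsDir instead of ./screenshots...
// then remove the whole directory once the merged image is saved
await rm(partsDir, { recursive: true, force: true });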
Conclusion
- OSS needs some TLC
- problems are rarely unique
- it's better to hack at it and unblock yourself; switching libraries is more of a PITA, as there are no guarantees