javascript, node.js, typescript, web-scraping, cheerio

How to loop through Cheerio elements inside an async function and populate an outside variable?


I need to create an API that scrapes GitHub repos and collects the following data for each file:

  1. File name;
  2. File extension;
  3. File size (bytes, KB, MB, etc.);
  4. Number of lines in the file.

I'm using Node with TypeScript, so to get the most out of it I created an interface called FileInterface with the four attributes listed above.

[image: FileInterface definition]
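The screenshot of the interface is not reproduced in this text, so here is a minimal sketch of what FileInterface presumably looks like. The field names come from the answer's code further down; typing every field as a string is an assumption based on how the object is initialised there.

```typescript
// Sketch of the interface; string types are assumed for every field.
interface FileInterface {
    name: string;
    extension: string;
    size: string;
    totalLines: string;
}

// Example value, using the README.md numbers quoted later in the post.
const example: FileInterface = {
    name: "README.md",
    extension: "md",
    size: "91 Bytes",
    totalLines: "2",
};
```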

And of course, the variable is an array of that interface:

let files: FileInterface[] = [];

Let's take my own repo to use as an example: https://github.com/raphaelalvarenga/git-hub-web-scraping

[image: the repository's file listing on GitHub]

So far so good.

I'm already fetching the HTML of the repo's file listing with the request-promise dependency and loading it into a Cheerio variable, so I can loop over the "tr" tags. As you might expect, each "tr" tag represents one file or folder inside a "table" tag (easy to find if you inspect the page). The loop fills a temp variable:

let tempFile: FileInterface;

And at the end of every iteration, it is pushed into the array:

files.push(tempFile);

On a GitHub repo's main page we can find the file names and their extensions, but not the size or the total number of lines. Those only appear on the file's own page, which you reach by clicking the file. Let's say we clicked on README.md:

[image: the README.md file page on GitHub]

OK, now we can see that README.md has 2 lines and 91 Bytes.
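To make that step concrete, here is a rough sketch of turning the file page's info text into the two missing fields. The helper name parseFileInfo is hypothetical, and the input format ("2 lines (1 sloc) 91 Bytes") is only an assumption about what the page renders; the real selector and text would need to be checked against the actual HTML.

```typescript
// Hypothetical helper: extracts the line count and size from the file-info text.
// The expected input format is an assumption, not something GitHub guarantees.
function parseFileInfo(text: string): { totalLines: string; size: string } {
    // Collapse newlines and repeated spaces so the patterns below stay simple.
    const normalized = text.replace(/\s+/g, " ").trim();
    const lines = normalized.match(/(\d+) lines?/);
    const size = normalized.match(/([\d.]+) (Bytes?|KB|MB|GB)/i);
    return {
        totalLines: lines ? lines[1] : "",
        size: size ? `${size[1]} ${size[2]}` : "",
    };
}

const info = parseFileInfo("  2 lines (1 sloc)\n   91 Bytes ");
```

Both fields fall back to an empty string when the text does not match, which would be the case for files that show no line count.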

My problem is that, since fetching every file page takes a long time, this needs to be an async function. But I can't get the loop over the Cheerio content to work inside the async function.

Things that I've tried:

  1. Using the map and each methods to loop through it and push into the files array;
  2. Using await before the loop. I knew this one wouldn't actually work, since the loop doesn't return anything to await;
  3. The last thing I tried, and believed would work, was a Promise. But TypeScript complains that the Promise resolves to the "unknown" type (Promise<unknown>), so I can't assign the result to the files array, since "unknown" and "FileInterface[]" are not compatible types.
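The attempts above can be illustrated without Cheerio. The sketch below uses a plain array of names and a fake fetchDetails function (both stand-ins, not part of the repo) to show why pushing from an async callback leaves the array empty, and how typing the promises as Promise<FileInterface> avoids the "unknown" problem.

```typescript
interface FileInterface {
    name: string;
    extension: string;
    size: string;
    totalLines: string;
}

// Stand-in for the per-file request; resolves after a short delay.
const fetchDetails = (name: string): Promise<{ size: string; totalLines: string }> =>
    new Promise((resolve) =>
        setTimeout(() => resolve({ size: `${name.length} Bytes`, totalLines: "1" }), 10)
    );

// Does not wait: forEach ignores the promises returned by the async callback.
const collectWithForEach = async (names: string[]): Promise<FileInterface[]> => {
    const files: FileInterface[] = [];
    names.forEach(async (name) => {
        const details = await fetchDetails(name);
        files.push({ name, extension: "", ...details });
    });
    return files; // still empty at this point
};

// Waits: map produces one promise per item and Promise.all awaits them all.
const collectWithPromiseAll = (names: string[]): Promise<FileInterface[]> => {
    const promises: Promise<FileInterface>[] = names.map(async (name) => {
        const details = await fetchDetails(name);
        return { name, extension: "", ...details };
    });
    return Promise.all(promises);
};
```

Because each async callback is declared to return a FileInterface, Promise.all resolves to FileInterface[] and can be assigned to the files array directly.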

Below is the code I have so far. The repo is available in case you want to download and test it (the link is at the beginning of this post), but note that this code is in the branch "repo-request-bad-loop", not in master. Don't forget, because the master branch doesn't have any of what I described =)

I'm making a request in Insomnia to the route "/" with this object as the body:

{
   "action": "getRepoData",
   "url": "https://github.com/raphaelalvarenga/git-hub-web-scraping"
}

index-controller.ts file:

[image: source of index-controller.ts]

As you can see, it calls getRowData, the problematic file. Here it is.

getRowData.ts file:

[image: source of getRowData.ts]


Solution

  • I will try to help, although I don't know TypeScript. I reworked the getRowData function a bit and now it works for me. The idea is to let map return one promise per row and then wait for all of them with Promise.all:

    import cheerio from "cheerio";
    import FileInterface from "../interfaces/file-interface";
    import getFileRemainingData from "../routines/getFileRemaningData";
    
    const getRowData = async (html: string): Promise<FileInterface[]> => {
    
        const $ = cheerio.load(html);    
    
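        // map returns one promise per row; .get() below turns the Cheerio wrapper into a plain array
        const promises: Promise<FileInterface>[] = $('.files .js-navigation-item').map(async (i: number, item: CheerioElement) => {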
            const tempFile: FileInterface = {name: "", extension: "", size: "", totalLines: ""};
            const svgClasses = $(item).find(".icon > svg").attr("class");
            const isFile = svgClasses?.split(" ")[1] === "octicon-file";
    
            if (isFile) {
                // Get the file name
                const content: Cheerio = $(item).find("td.content a");
                tempFile.name = content.text();
    
                // Get the extension (text after the last dot). A name such as ".gitignore",
                // or one with no dot at all, is used whole
                const lastDot = tempFile.name.lastIndexOf(".");
                tempFile.extension = lastDot > 0 ? tempFile.name.slice(lastDot + 1) : tempFile.name;
    
                // Get the total lines and the size. A new request to the file screen will be needed
                const relativeLink = content.attr("href")
                const FILEURL = `https://github.com${relativeLink}`;
    
                const fileRemainingData: {totalLines: string, size: string} = await getFileRemainingData(FILEURL, tempFile);
    
                tempFile.totalLines = fileRemainingData.totalLines;
                tempFile.size = fileRemainingData.size;
            } else {
                // is not file
            }
    
            return tempFile;
        }).get();
    
        const files: FileInterface[] = await Promise.all(promises);
    
        return files;
    }
    
    export default getRowData;
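One thing to be aware of in the code above: rows that are not files still return the initial tempFile with empty fields, so the resulting array contains blank entries for folders. If that is not wanted, a possible variation is to return null for those rows and filter them out after Promise.all. The sketch below uses a hypothetical Row type in place of Cheerio elements so that it runs on its own; it is not the code from the repo.

```typescript
interface FileInterface {
    name: string;
    extension: string;
    size: string;
    totalLines: string;
}

// Hypothetical stand-in for a table row already extracted from the page.
interface Row {
    name: string;
    isFile: boolean;
}

const toFile = async (row: Row): Promise<FileInterface | null> => {
    if (!row.isFile) {
        return null; // folders carry no file data
    }
    // Text after the last dot; dotfiles and names without a dot are used whole.
    const lastDot = row.name.lastIndexOf(".");
    const extension = lastDot > 0 ? row.name.slice(lastDot + 1) : row.name;
    // The size and line count would come from the extra request made in the answer.
    return { name: row.name, extension, size: "", totalLines: "" };
};

const getFiles = async (rows: Row[]): Promise<FileInterface[]> => {
    const results = await Promise.all(rows.map(toFile));
    // The type guard narrows (FileInterface | null)[] to FileInterface[].
    return results.filter((file): file is FileInterface => file !== null);
};
```

Called with a folder row and two file rows, getFiles resolves to an array holding only the two files, in their original order.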