TL;DR: By structuring 10,000 Hacker News comments with GPT-4o, a predefined response schema, and LangChain’s ability to force structured output from the LLM, I tried to get a feeling for the current job market and its trends.
why
Looking back at pictures of my NYC visit last year, I thought: why not live there at some point?
But since NYC’s cost of living is a multiple of my current conditions, the only way I could ever afford moving to NYC would be to have a job there. Such a job would have to be:
- sponsoring a visa
- fitting my skills, tech stack and experience level
- remote or located in NYC
Skimming through some job postings, I quickly got the feeling that $p(\exists\, \text{job} \in \text{jobs} : \text{job} = \text{offered})$ is not high at the moment. While I’d like to think I have a good feeling for the current needs and trends in the software industry, the only references I have are X (formerly Twitter) and LinkedIn, neither of which I would call unbiased or sincere sources: timelines aren’t really timelines anymore, and algorithms pick the content with the maximum likelihood of giving you a dopamine shot. Hacker News though, with its simplistic approach to design and ranking, seemed to be a more honest representation of the job market (that the HN API is dead simple obviously didn’t play any role). So how realistic is it for me to find such a job, actually? While this may not be quantifiable (at least for me), I can at least quantify other things.
In this blog post, I want to demonstrate that by combining LLMs, features like JSON mode, and classic data science methods, one can get an understanding of almost any topic really quickly.
methodology
My process looked roughly like this:
- Using Selenium, I wrote a script that iteratively googles for strings of the form
```python
query = f"ask hn who is hiring {month} {year}"
```
to get the IDs of the items that represent the monthly threads.
- Then, using the aforementioned HN API, I gathered a list of the IDs of the top-level comments on these threads. Click here to see what such a list looks like for the latest “Ask HN: Who is hiring? (July 2024)” thread. Using the API again, I saved the content of each comment into an sqlite3 database (a sketch of this step follows the list below).
- Afterwards, I iterated through the comments and classified them with GPT-4o using the following scheme:
```python
class HNJobPosting(BaseModel):
    """Job posting from Hacker News."""

    comment_id: int = Field(description="The ID of the comment that contains the job posting, given at the beginning.")
    location: str = Field(description="The location of the job")
    remote: bool = Field(description="Indicates if the job is remote - if not known, set to false. HAS TO BE A BOOLEAN VALUE!")
    job_type: str = Field(description="The type of job (e.g., full-time, part-time, contract, intern, etc.)")
    salary_range: str = Field(description="The salary range offered for the job")
    # and so forth...
```
To understand how this is implemented, you can look up this LangChain guide. The v0.2 docs are slightly better than the previous ones, but I really hope that LangChain will finally agree on an API and start to document it nicely. Notably, the `bool` type itself is not sufficient to reliably get back a `False` or `True`; the passive-aggressive HAS TO BE A BOOLEAN VALUE! helped to fix this.
- The results were saved in the database, and I used classic SQL to extract the data that is visualized in the following charts.
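For illustration, here is a minimal sketch of the scraping step. The HN API endpoint is real; the thread ID, table layout, and variable names are my own choices for this sketch, not the original script:

```python
import requests
import sqlite3

HN_ITEM = "https://hacker-news.firebaseio.com/v0/item/{}.json"

def fetch_item(item_id: int) -> dict:
    """Fetch a single item (story or comment) from the HN API."""
    return requests.get(HN_ITEM.format(item_id)).json()

con = sqlite3.connect("hn_jobs.db")
con.execute(
    "CREATE TABLE IF NOT EXISTS comments (id INTEGER PRIMARY KEY, thread_id INTEGER, text TEXT)"
)

thread_id = 123456  # hypothetical ID of a monthly thread, found via the Google queries
thread = fetch_item(thread_id)

# "kids" holds the IDs of the top-level comments, i.e. the individual job postings.
for comment_id in thread.get("kids", []):
    comment = fetch_item(comment_id)
    if comment and not comment.get("deleted"):
        con.execute(
            "INSERT OR IGNORE INTO comments VALUES (?, ?, ?)",
            (comment_id, thread_id, comment.get("text", "")),
        )
con.commit()
```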
By using LangChain’s `llm.batch(array)` method, the processing is parallelized and was pretty fast, around one minute per thread; a sketch of this step follows the stats below:
| thread | total tokens | input | output | requests | cost | runtime |
|---|---|---|---|---|---|---|
| June 2024 | 260,989 | 215,580 | 45,409 | 362 | $1.759035 | 65.1s |
| May 2024 | 303,133 | 250,168 | 52,965 | 433 | $2.045315 | 57.7s |
| April 2024 | 228,113 | 189,088 | 39,025 | 318 | $1.530815 | 43.5s |
| … | | | | | | |
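To connect the pieces, here is a sketch of the classification step, assuming the `HNJobPosting` model and the `con` connection from the sketches above; the prompt format is illustrative:

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")
# with_structured_output() forces the LLM to return HNJobPosting instances.
structured_llm = llm.with_structured_output(HNJobPosting)

rows = con.execute("SELECT id, text FROM comments").fetchall()
# batch() runs the requests concurrently, one per comment; the ID is
# prepended because the schema expects it "at the beginning".
results = structured_llm.batch([f"{cid}: {text}" for cid, text in rows])
```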
While scraping and storing the comments is cheap and fast, the cost of running these batches is not. I therefore let the script run back until May 2022. This resulted in 10,891 comments processed, $54.09 spent, and ranks as one of my more expensive Sunday evening boredom activities. Writing the code, extracting the data from the API, and running the processing on the comments (excluding the cleanup and visualizations) took around 90 minutes. At this point I’d like to thank GitHub Copilot for writing my boilerplate code.
Since Hacker News uses incremental IDs, we can visualize the activity on Hacker News by plotting the numerical differences between the thread IDs per month:
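A sketch of how such a chart can be derived, with purely illustrative thread IDs: the gap between two consecutive monthly thread IDs approximates the number of items created in between.

```python
# Hypothetical mapping of month labels to "Who is hiring?" thread IDs.
threads = {"2024-04": 40_000_000, "2024-05": 40_400_000, "2024-06": 40_900_000}

months = sorted(threads)
# Difference between consecutive thread IDs = items created in that month.
activity = {
    later: threads[later] - threads[earlier]
    for earlier, later in zip(months, months[1:])
}
print(activity)  # {'2024-05': 400000, '2024-06': 500000}
```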
So, let’s try to get some information out of the categorized comments.
results
how many jobs allow remote working?
Unsurprisingly, during the pandemic only a fifth of the jobs did not explicitly support remote working. Surprisingly, however, remote support has not decreased since then as much as I would have expected.
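As an example of the “classic SQL” extraction behind these charts, this is how the remote share per month could be computed; the `postings` table layout (a `month` column, `remote` stored as 0/1) is an assumption for this sketch:

```python
# Remote share per month; AVG over a 0/1 column yields the fraction.
share_by_month = con.execute(
    """
    SELECT month, AVG(remote) AS remote_share
    FROM postings
    GROUP BY month
    ORDER BY month
    """
).fetchall()
```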
how many jobs sponsor visa?
The share of visa-sponsoring jobs seems to have been relatively stable over the last two years, with just minor decreases. Nevertheless, it is still challenging to be offered one of those. If a LeetCode affiliate program existed, this would be a good place to link to an annual subscription.
what is the experience level distribution over time?
Better get your eight years of experience in the field within the next six to twelve months!
how many jobs per state in the US?
Since so many more jobs are offered in the Bay Area or NYC than in the rest of the US, you can switch between logarithmic and linear color grading by clicking here for a better overview.
which databases are used?
As above, PostgreSQL just completely overshadows the other databases in usage. Since logarithmic scaling didn’t seem that bad in the chart above, I used it again in this bar chart to give databases other than PostgreSQL some kind of scale. And I hate it.
what javascript frameworks are in demand?
React’s predominance is even greater than PostgreSQL’s, so the only way to make this diagram not look ridiculous would be to use a logarithmic scale again, to which, as I realized just a few minutes earlier, I have an innate aversion. Instead, I spent two hours building a bubble chart in 300 lines of code with `three.js` that still shares some of the aesthetics of the `chart.js` charts above, plus some kind of interactivity using `OrbitControls` and a `RayCaster`. Is this better than logarithmic scales? I doubt it. This probably would have been possible with `d3.js` as well and could potentially have looked twice as nice, but `three.js` just clicks for me and I enjoy every chance I get to build something with it. You can zoom, move and pan inside the canvas. Spheres should not collide, but if the arrangement is too chaotic nonetheless, you can click here to reshuffle and recolor the chart, which is currently rendering at … FPS.
what is the salary distribution?
learnings
You have to describe your model fields as precisely as possible.
Location was initially structured with the following description in my code: `location: str = Field(description="The location of the job")`. This is inherently abstract and does not really give any guidelines on what exactly the location string should look like. I later split this field into city and country and described them as follows:
```python
city: Optional[str] = Field(description="Name of the city in the string, if given. If not definitive, set to 'n/a'!")
country: Optional[str] = Field(description="The name of the country in the string, if given. Use the ISO 3166-1 alpha-2 code (e.g., 'US' for United States). If not definitive, set to 'n/a'")
```
When categorizing, declare the classes in the description.
For example, I had `job_type: str = Field(description="The type of job (e.g., full-time, part-time, contract, intern, etc.)")` as a description. Instead of giving examples, I should have given the concrete categories I am looking for, as sketched below.
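A sketch of what that could look like, using `typing.Literal` so the schema only admits fixed categories; the category names here are just an example:

```python
from typing import Literal

from pydantic import BaseModel, Field

class HNJobPosting(BaseModel):
    """Job posting from Hacker News (other fields omitted)."""

    # Literal restricts the model's output to exactly these categories.
    job_type: Literal["full-time", "part-time", "contract", "internship", "other"] = Field(
        description="The type of job. Pick exactly one of the given categories."
    )
```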
When extracting a set, give the delimiter in the description.
So instead of `technologies: str = Field(description="The technologies and tools required for the job")`, do `technologies: str = Field(description="The technologies and tools required for the job, listed as a comma-separated string")`, so that you can split it later in a clean way.
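With the delimiter fixed in the description, splitting becomes unambiguous, for example:

```python
technologies = "Python, PostgreSQL, React"  # as returned by the model
tech_list = [t.strip() for t in technologies.split(",")]
# -> ['Python', 'PostgreSQL', 'React']
```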
future work
An idea I got while writing this post: with the initial work done, one could build a mini SaaS in which users describe the jobs they are looking for in Ask HN: Who is hiring? threads, and their descriptions are then categorized and matched against the categorized comments on a monthly basis.