How to Use AI to Find a Home
Finding a rental in a competitive market can be tough. Sometimes you might find an apartment or townhouse complex that's ideal, but there's no availability. Increasingly, companies refuse to manage wait lists, happy to handle things on a first-come, first-served basis because that's easier and therefore cheaper. I could go on a long rant about how real-estate types are lazy and worthless relative to how much people are expected to pay in rent....but, well, that's not the point.
Ah, the Classic Scraper
Web scrapers are a fairly routine thing to build for anyone who's been in the biz long enough. At some point, someone wants to dabble in the dubious legality of scraping content off some site -- usually to use (steal) that content for something business-related. That's the scammy world of scrapers, but these tools are actually super useful as a consumer.
With the advent of AI, writing one has never been easier, either. Scrapers can be very procedural and specific, a good use case for AI-created programs that don't need a mess of flexibility or production-level robustness. Nah, they just need to programmatically read a website and email me the result. Writing these programs is easy, but it can also be annoying and time-consuming if you aren't in that world all the time.
Guiding the AI with Platform Knowledge
As I've said before, the best users of generative AI are people who have real expertise in the content being created. One company reportedly spends $15,000 a month on a single artist because they use AI to churn out massive amounts of content. This works because the artist is doing more than just typing in a prompt; the report doesn't go into a lot of detail, but they are doing manual touch-ups and perhaps even priming the generative model with a LoRA (or similar) to "guide" the style. If you are an artist, the fear shouldn't entirely be random business leaders who think they can be artists with Midjourney; it's actually other artists who know what the hell they are doing and can use that tool to create good content (albeit with dubious copyright status).
Regardless, at least for now, you need to have knowledge in the domain if you want generative AI to actually work. That said, GPT-4 was able to "think" through the problem and give me the exact solution I knew I wanted to use. When posed with the following prompt, guess what it told me?
I need to create a script that will programmatically obtain HTML from a website twice a week, then email it to me.
I want this to run without a high server cost. What is the best way to do this?
Now, I've given the LLM a huge hint here, because I mention that I want this to run "without a high server cost". Leaving that line out yields an entirely different (and far less useful) suggestion! As we already know, this makes sense: the LLM can't really think strategically the way a human can; it relies on these subtle hints in the prompt to steer its output toward something more likely to fulfill our request.
So even though the LLM did tell me to "use Lambda", as I'd hoped, it won't give you a great strategy unless you're at least a little experienced in this world (enough to phrase the question carefully).
Writing the Lambda
With the following prompt, we can actually get most of what we want:
Write a lambda function in javascript that obtains the HTML from the URL "google.com", then extracts the html for the #container element. Finally, this script should send that HTML content as an email using SES.
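The exact response will vary, but it generally lands on something shaped like the minimal sketch below. It assumes the Node.js 18+ Lambda runtime (for the built-in fetch and the bundled AWS SDK v3), a hypothetical verified SES sender/recipient address, and a crude regex standing in for a real HTML parser:

```javascript
// Minimal sketch of the Lambda handler (Node.js 18+, AWS SDK v3).
// Assumes SES is set up in us-east-1 and the addresses below are verified.
const { SESClient, SendEmailCommand } = require("@aws-sdk/client-ses");

const ses = new SESClient({ region: "us-east-1" });

exports.handler = async () => {
  // Fetch the page HTML (Node 18+ ships a global fetch).
  const response = await fetch("https://google.com");
  const html = await response.text();

  // Crude extraction of the #container element -- a real version would
  // use a proper parser like cheerio instead of a regex.
  const match = html.match(/<div id="container"[\s\S]*?<\/div>/);
  const snippet = match ? match[0] : "No #container element found.";

  // Email the extracted HTML via SES.
  await ses.send(new SendEmailCommand({
    Source: "me@example.com",                        // hypothetical verified sender
    Destination: { ToAddresses: ["me@example.com"] },
    Message: {
      Subject: { Data: "Rental listings update" },
      Body: { Html: { Data: snippet } },
    },
  }));

  return { statusCode: 200 };
};
```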
The code for this is actually the easiest part. If you haven't worked in Lambda before, you'll likely need to quiz the LLM on what to actually do with this code. It isn't that it's hard to deploy, but AWS is notoriously dense with its documentation and UX. Trying to find a simple answer to a simple question isn't always a joyous experience.
The LLM can help with that a lot. It will tell you how to zip your folder and deploy it, and it will guide you through the simple steps required to schedule that function via EventBridge or set up your email to work through SES. Again, these aren't difficult things....but they are clunky and can become time-consuming if you aren't familiar with the AWS UI or product stack, and that's where an LLM can really help.
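If you'd rather script the scheduling than click through the console, the same twice-a-week trigger can be created with the AWS SDK. This is just a sketch; the rule name, function name, and ARN are placeholders, and you'd still need to grant EventBridge permission to invoke the function:

```javascript
// Sketch: schedule the scraper twice a week via an EventBridge rule (AWS SDK v3).
const {
  EventBridgeClient,
  PutRuleCommand,
  PutTargetsCommand,
} = require("@aws-sdk/client-eventbridge");

const eb = new EventBridgeClient({ region: "us-east-1" });

async function scheduleScraper() {
  // Mondays and Thursdays at 14:00 UTC.
  await eb.send(new PutRuleCommand({
    Name: "rental-scraper-schedule",
    ScheduleExpression: "cron(0 14 ? * MON,THU *)",
  }));

  // Point the rule at the deployed Lambda (placeholder ARN).
  await eb.send(new PutTargetsCommand({
    Rule: "rental-scraper-schedule",
    Targets: [{
      Id: "rental-scraper",
      Arn: "arn:aws:lambda:us-east-1:123456789012:function:rental-scraper",
    }],
  }));
}

scheduleScraper();
```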
With the Lambda deployed and scheduled, I now receive a simple email twice a week showing the rental houses listed by a specific company.
Wonderful, some tech asshole has engineered a way to get an advantage by using a scheduled, serverless scraper to scan rentals...? Yeah, I'm not sure how I feel about it, either. You shouldn't need to write your own code just to be able to find good rentals, and it's grim to imagine having to compete against someone with "bespoke automation" in place just for this purpose.
Conclusion
Scrapers have real potential as consumer tools, even though they are mostly used for vanilla content theft. And while AI might not be able to completely engineer a scraper without some careful guidance, it gets most of the way there...with all this structure and knowledge in place, writing a second version of this scraper to extract different data would be far faster.
I could also combine this idea with a RAG-based web scraper. With this, I could first "load" an LLM with context about the specific site I want to crawl, using something like LangChain to build a RAG. Then I could (in theory) more reliably quiz it to build a scraper (or some other process) based on that "knowledge".
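To make that concrete, here is a bare-bones version of the retrieval step, skipping LangChain and calling the OpenAI SDK directly to keep the dependencies down. The model names, chunking strategy, URL, and question are all illustrative assumptions, not a finished pipeline:

```javascript
// Bare-bones RAG sketch: embed chunks of a page, keep the chunks most relevant
// to a question, and ask the model to propose an extraction strategy.
const OpenAI = require("openai");

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Plain cosine similarity between two embedding vectors.
const cosine = (a, b) => {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] ** 2; nb += b[i] ** 2; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
};

async function suggestScraper(url, question) {
  // 1. Pull the raw HTML and split it into naive fixed-size chunks.
  const html = await (await fetch(url)).text();
  const chunks = html.match(/[\s\S]{1,2000}/g) ?? [];

  // 2. Embed the question and the chunks in one call.
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: [question, ...chunks],
  });
  const [questionVec, ...chunkVecs] = data.map((d) => d.embedding);

  // 3. Keep the three chunks most similar to the question.
  const top = chunkVecs
    .map((vec, i) => ({ chunk: chunks[i], score: cosine(questionVec, vec) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 3);

  // 4. Ask the model to propose a scraper, grounded in those chunks.
  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "system", content: "You write small, targeted web scrapers." },
      {
        role: "user",
        content: `${question}\n\nRelevant HTML:\n${top.map((t) => t.chunk).join("\n---\n")}`,
      },
    ],
  });
  return completion.choices[0].message.content;
}

// Hypothetical site and question, just to show the shape of a call.
suggestScraper(
  "https://example-rentals.com/listings",
  "What selector isolates the rental listings, and how would a Lambda scrape them?"
).then(console.log);
```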
I say "damage" because that's how scammers will use this tech. For example, let's say I deploy this idea against 1000s of websites and develop a specific pipeline where I ask the LLM to extract email addresses, then pipes those addresses to some persistent store.
At that point, it's working like a conventional scraper, just using the power of NLP to parse. That power could even allow it to reverse-engineer emails that are "obfuscated" as text ("eric at whatever dot com" is effortless for GPT-4 to turn back into an email address, just ask it). Take it to another level, though, and you could somewhat easily ask an LLM to craft spam messages -- even using the context of the site the email was exfiltrated from to guide the prompt toward something more compelling and bespoke than people expect.
While the underlying tech is neat and the technique is simple and valuable, it does raise questions about fairness -- it's easy to claim that everyone has the same level of access to online data. Anyone can check the site whenever they want, and it isn't like the first person to apply is guaranteed a rental...so that's fine, right? Meh.
No, it obviously isn't fair. My Lambda could run every freakin' hour and still cost very little. Someone working a 9-to-5 who can't check this website during the day is at a distinct disadvantage compared to me, especially when some states require landlords to review applications in order!
Tech is not always an equalizing force, and neither is AI. It's helped me create this tool with very, very little effort because I'm already familiar with the tech involved and how to run it cheaply...I'm skeptical that someone with no experience could do this as easily or would know exactly how to query the AI to give them the right strategy.