Enhance Data Accuracy: Store Multiple Property References
Hey guys, let's dive into a way to make our data even better! We're talking about how we handle property references in our system. Right now, when we extract a property that's already in the database, we just drop the new one, and with it we lose valuable info like the source URL and any supporting quotes. So we're going to change things up and store multiple references per property. It's all about boosting data quality, improving transparency, and playing nice with tools like Wikidata. Let's break down the problem and the solution!
The Current Problem: Losing Valuable Data
So, what's the deal with the current system? During extraction, we sometimes pull in a property that already exists in the database. Right now, Property.should_store() (in poliloom/enrichment.py:974,1016) decides what happens: if the new property matches an existing one, it's simply dropped. That's a big deal, because we throw away two important things along with it. First, the source URL where we found the duplicate, which is what lets us verify the information and trace where it came from. Second, any supporting quotes: the snippets of text that back the property up, give it context, and make it more trustworthy. Without them, we're left with a less complete, less reliable picture of each property, and we also can't submit data properly to platforms like Wikidata, which expect statements to carry multiple references.
The Negative Impacts
- Data Quality Issues: Losing source URLs and supporting quotes directly impacts the quality of our data. It makes it harder to verify information and identify potential errors.
- Reduced Transparency: Without the source URL, it's difficult for users to see where a particular piece of data came from. This lack of transparency can erode trust.
- Wikidata Compliance: Wikidata supports multiple references per statement, but with only one reference stored we have nothing extra to submit.
Proposed Solution: The PropertyReference Table
Alright, so how do we fix this? The plan is to create a new table called property_references. This table will store multiple references for each property. Check out the SQL code below to see how it's set up:
CREATE TABLE property_references (
id UUID PRIMARY KEY,
property_id UUID REFERENCES properties(id) ON DELETE CASCADE,
archived_page_id UUID REFERENCES archived_pages(id),
references_json JSONB,
supporting_quotes VARCHAR[],
created_at TIMESTAMP
);
Each row links a reference back to a specific property (property_id), optionally points at an archived copy of the source page (archived_page_id, super useful for preserving the original source), and carries structured reference data (references_json) plus the quotes that back the property up (supporting_quotes). In short, one property can now have as many documented sources as we find for it, which is a fundamental change that will significantly improve the richness and accuracy of our data. Here's the field-by-field breakdown:
Breaking Down the Table Components
- id: A unique identifier for each reference.
- property_id: Links to the properties table, associating the reference with a specific property.
- archived_page_id: Links to an archived webpage, preserving the original source.
- references_json: Stores reference data in JSON format, allowing for flexible storage of various types of information (e.g., source URL, date of access, author, etc.).
- supporting_quotes: Contains supporting quotes as an array of strings.
- created_at: Timestamp indicating when the reference was added.
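To make this concrete, here's a minimal sketch of what the model for this table could look like. It assumes SQLAlchemy 2.x declarative models with PostgreSQL column types; poliloom's actual base class and conventions may differ, so treat this as illustrative, not the final implementation.

# A minimal, hypothetical sketch -- assumes SQLAlchemy 2.x declarative models
# and PostgreSQL column types; poliloom's actual base class may differ.
import uuid
from datetime import datetime

from sqlalchemy import DateTime, ForeignKey, String, func
from sqlalchemy.dialects.postgresql import ARRAY, JSONB, UUID
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, relationship


class Base(DeclarativeBase):
    pass


class PropertyReference(Base):
    __tablename__ = "property_references"

    id: Mapped[uuid.UUID] = mapped_column(
        UUID(as_uuid=True), primary_key=True, default=uuid.uuid4
    )
    property_id: Mapped[uuid.UUID] = mapped_column(
        UUID(as_uuid=True), ForeignKey("properties.id", ondelete="CASCADE")
    )
    archived_page_id: Mapped[uuid.UUID | None] = mapped_column(
        UUID(as_uuid=True), ForeignKey("archived_pages.id")
    )
    # Flexible reference metadata; the key names are just an example shape,
    # e.g. {"source_url": ..., "accessed_at": ...}
    references_json: Mapped[dict | None] = mapped_column(JSONB)
    supporting_quotes: Mapped[list[str] | None] = mapped_column(ARRAY(String))
    created_at: Mapped[datetime] = mapped_column(
        DateTime, server_default=func.now()
    )

    # Back-reference to the owning Property (sketched further below):
    property: Mapped["Property"] = relationship(back_populates="references")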
Changes We Need to Make
To make this happen, we need to adjust a few things in our system. Here's a quick rundown of the changes required:
- New Model: Create a PropertyReference model backed by the property_references table (sketched above). It carries the property ID, archived page ID, references in JSON format, supporting quotes, and the creation timestamp, plus methods for adding and retrieving references.
- Migration: Add the property_references table to the database and migrate the existing references_json and supporting_quotes data from the properties table into it (a rough migration sketch follows this list).
- find_matching(): Rename should_store() to find_matching(). It searches the database for an existing property that matches the new one and returns it, or None if there's no match.
- store_extracted_data(): Instead of dropping the new property when a match is found, add a reference to the existing property, recording the new source URL, supporting quotes, and any other relevant information in property_references.
- API Responses: Include all references per property in API responses, so the source URLs and supporting quotes are visible to users.
- Evaluation Logic: Aggregate references when submitting to Wikidata, so every source is included per statement.
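For the migration piece, here's a rough sketch. It assumes we use Alembic, and it backfills from the references_json and supporting_quotes columns that (as noted above) currently live on properties; the revision IDs and exact column details would need checking against the real poliloom schema.

# Hypothetical Alembic sketch -- revision IDs and column details need
# checking against the real schema before this is usable.
import sqlalchemy as sa
from alembic import op
from sqlalchemy.dialects import postgresql

revision = "add_property_references"
down_revision = None  # fill in the actual previous revision


def upgrade() -> None:
    op.create_table(
        "property_references",
        sa.Column("id", postgresql.UUID(as_uuid=True), primary_key=True),
        sa.Column(
            "property_id",
            postgresql.UUID(as_uuid=True),
            sa.ForeignKey("properties.id", ondelete="CASCADE"),
        ),
        sa.Column(
            "archived_page_id",
            postgresql.UUID(as_uuid=True),
            sa.ForeignKey("archived_pages.id"),
        ),
        sa.Column("references_json", postgresql.JSONB),
        sa.Column("supporting_quotes", postgresql.ARRAY(sa.String)),
        sa.Column("created_at", sa.TIMESTAMP, server_default=sa.func.now()),
    )
    # Backfill: copy each property's existing single reference into the new
    # table. gen_random_uuid() needs PostgreSQL 13+ (or the pgcrypto extension).
    op.execute(
        """
        INSERT INTO property_references
            (id, property_id, references_json, supporting_quotes, created_at)
        SELECT gen_random_uuid(), p.id, p.references_json,
               p.supporting_quotes, now()
        FROM properties p
        WHERE p.references_json IS NOT NULL
           OR p.supporting_quotes IS NOT NULL
        """
    )


def downgrade() -> None:
    op.drop_table("property_references")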
Implementation Details: From Dropping to Referencing
So, here's how we'll change the code. Instead of dropping a new property that matches an existing one, we'll attach a reference to the existing property, capturing the source URL and supporting quotes from the new source. That way we retain every piece of evidence and keep a complete record of the property's provenance.
Before, we had something like this:
if new_property.should_store(db):
db.add(new_property)
# else: dropped entirely
Now, it'll look like this:
existing = Property.find_matching(db, new_property)
if existing:
existing.add_reference(archived_page, quotes) # NEW
else:
db.add(new_property)
See the difference? We now look up an existing property with find_matching(). If a match is found, we attach a new reference to it with add_reference(); if not, we add the new property as before. That's the core logic that lets us keep every source instead of just the first one, with add_reference() writing the relevant data into the property_references table.
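Continuing the model sketch from earlier, here's roughly what those two methods could look like. The match keys (entity_id, type, value) are stand-ins for whatever should_store() actually compares today, so take them as assumptions:

# Continuing the earlier sketch. entity_id / type / value are assumed
# stand-ins for whatever should_store() currently compares.
from sqlalchemy import select
from sqlalchemy.orm import Session


class Property(Base):
    __tablename__ = "properties"

    id: Mapped[uuid.UUID] = mapped_column(
        UUID(as_uuid=True), primary_key=True, default=uuid.uuid4
    )
    entity_id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True))
    type: Mapped[str] = mapped_column(String)
    value: Mapped[str] = mapped_column(String)

    references: Mapped[list[PropertyReference]] = relationship(
        back_populates="property"
    )

    @classmethod
    def find_matching(cls, db: Session, candidate: "Property") -> "Property | None":
        """Return the already-stored property matching `candidate`, or None."""
        return db.execute(
            select(cls).where(
                cls.entity_id == candidate.entity_id,
                cls.type == candidate.type,
                cls.value == candidate.value,
            )
        ).scalars().first()

    def add_reference(self, archived_page, quotes: list[str]) -> None:
        """Attach a new reference to this property instead of dropping it."""
        self.references.append(
            PropertyReference(
                archived_page_id=archived_page.id,
                supporting_quotes=quotes,
            )
        )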
The Awesome Benefits
Okay, so why are we doing all this? What's in it for us?
- Better Data Quality: Multiple independent sources confirming the same fact boost our confidence in the data, make it easier to verify, and help us spot errors. This is essential for building a reliable, accurate dataset.
- Transparency Is Key: Users can see every source and quote behind each property, understand where the data came from, and judge its accuracy for themselves. That visibility is what builds trust.
- Wikidata Compliance: We can finally submit multiple references per statement, which platforms like Wikidata expect (see the sketch right after this list). That makes our data far more interoperable with widely used knowledge bases.
- Audit Trail: Every piece of data gets a clear provenance record, which is super helpful for debugging errors, verifying information, and tracking how a value evolved over time.
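On the Wikidata point, the evaluation-time aggregation could look something like the sketch below. P854 ("reference URL") is a real Wikidata property and the structure follows Wikidata's JSON reference format, but the function name and the "source_url" key inside references_json are assumptions on my part:

# Hypothetical aggregation for Wikidata submission. P854 ("reference URL")
# is a real Wikidata property; the "source_url" key and this function name
# are assumptions.
def build_wikidata_references(prop) -> list[dict]:
    """Turn every stored PropertyReference into a Wikidata reference block."""
    references = []
    for ref in prop.references:
        source_url = (ref.references_json or {}).get("source_url")
        if not source_url:
            continue
        references.append(
            {
                "snaks": {
                    "P854": [
                        {
                            "snaktype": "value",
                            "property": "P854",
                            "datavalue": {"value": source_url, "type": "string"},
                        }
                    ]
                }
            }
        )
    return references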
Wrapping Up
By implementing this change, we're taking a big step toward better data quality, more transparency, and smoother integration with tools like Wikidata. It's all about making our data more reliable, trustworthy, and useful.
So, that's the plan! Let me know what you think, guys, and let's get to work making our data even better!