This repository contains comprehensive implementations for evaluating advanced measurement techniques in word embeddings, with a focus on developing transformation-invariant metrics for word analogy tasks. The project extends traditional Word2Vec evaluation by implementing and comparing multiple similarity measures, including novel Mahalanobis-based distance metrics.
```
Meaningful-Word2Vec-Measurements/
├── OLD_Implementation/
│   ├── new_wordvec_measurements.py
│   ├── convert_gz_to_csv.py
│   └── Results_Per_Section/
├── NEW_Implementation/
│   ├── Covariance_Inverse/
│   └── Outer_Correlation/
├── Helper_Files/
│   ├── OLD_Implementation_Helper/
│   └── NEW_Implementation/
│       ├── Covariance_Inverse/
│       └── Outer_Correlation/
└── README.md
```
Contains the foundational implementation based on replicating and extending Mikolov's Word2Vec methodology. This serves as the baseline for comparison and includes:
- Word analogy evaluation framework
- Traditional and Mahalanobis-based similarity measures
- Compressed result storage system
- Data decompression utilities for exploration
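The analogy evaluation described above follows the standard vector-offset recipe: for an analogy a:b :: c:?, score candidates against the offset vector b − a + c. A minimal sketch with toy vectors (an illustration of the idea, not the actual code in new_wordvec_measurements.py):

```python
import numpy as np

def solve_analogy(emb, a, b, c):
    """Solve a:b :: c:? by nearest cosine neighbor to (b - a + c)."""
    target = emb[b] - emb[a] + emb[c]
    best_word, best_sim = None, -np.inf
    for word, vec in emb.items():
        if word in (a, b, c):  # standard convention: exclude the inputs
            continue
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# Toy embedding where the analogy structure holds by construction
emb = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.2, 1.0]),
    "woman": np.array([0.2, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}
print(solve_analogy(emb, "man", "king", "woman"))  # expected: queen
```

In the real evaluation the candidate pool is the model vocabulary and the cosine step is replaced by whichever similarity measure is being tested.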
Houses advanced implementations that build upon the baseline methods:
- Covariance_Inverse/: Enhanced covariance matrix inversion techniques with improved numerical stability
- Outer_Correlation/: Novel outer product correlation methods for embedding similarity
Supporting utilities and documentation for both implementation approaches:
- CSV merging and manipulation tools
- Result visualization and display utilities
- Chunk processing methods for large-scale evaluations
- Documentation and analysis notebooks
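The chunk-processing style used by these helpers can be illustrated with pandas' `chunksize` streaming (a sketch of the approach; the actual helper scripts may differ):

```python
import io
import pandas as pd

# Stand-in for a large results CSV on disk
csv_text = "measure,overall_accuracy\n" + "\n".join(
    f"naive_cosine,{0.5 + 0.01 * i}" for i in range(10)
)

# Process the file in fixed-size chunks instead of loading it all at once
total, count = 0.0, 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["overall_accuracy"].sum()
    count += len(chunk)

mean_accuracy = total / count
print(f"Mean accuracy over {count} rows: {mean_accuracy:.3f}")
```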
- Multiple Similarity Measures: Implements naive cosine, Euclidean distance, and three Mahalanobis-based metrics
- Transformation-Invariant Metrics: Addresses coordinate system dependencies in embedding measurements
- Comprehensive Evaluation: Tests across 14 different analogy categories
- Scalable Processing: Handles large vocabulary subsets with configurable parameters
- Compressed Storage: Efficient .gz compression for result files
- Extensive Utilities: Helper scripts for data manipulation and analysis
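Each evaluation run corresponds to one point in a grid over vocabulary subset size, regularization value, and similarity measure. A sketch of enumerating that grid (the parameter values mirror those listed under configuration later in this README; the full measure list and the script's actual loop structure are assumptions here):

```python
from itertools import product

freq_subs = [10000, 20000, 30000, 50000, 100000]  # vocabulary subset sizes
rc_vals = [0.001, 0.01, 0.02, 0.005]              # pseudo-inverse rcond values
meas_types = ["naive_cosine", "euclidean", "mahalanobis_cosine",
              "mahalanobis_shifted_cosine", "mahalanobis_distance"]

configs = list(product(freq_subs, rc_vals, meas_types))
print(f"{len(configs)} configurations to evaluate")

for freq_sub, rcond, measure in configs[:2]:
    # the per-configuration evaluation would run here
    print(freq_sub, rcond, measure)
```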
- Install the required dependencies:

```bash
pip install gensim numpy pandas
pip install ace_tools_open
```

- Navigate to the baseline implementation:

```bash
cd OLD_Implementation/
```

- Run the main evaluation script:

```bash
python new_wordvec_measurements.py
```

This will:
- Download the Google News Word2Vec model (if not present)
- Evaluate analogies across all categories
- Save compressed results to Results_Per_Section/
- Decompress results for exploration:

```bash
python convert_gz_to_csv.py
```

This creates the Unzipped_Results_Per_Section/ directory with readable CSV files.
After running the decompression utility, you can explore the results:
```python
import pandas as pd

# Load results for a specific category
df = pd.read_csv('Unzipped_Results_Per_Section/family_results.csv')

# Analyze accuracy by measure type
accuracy_by_measure = df.groupby('measure')['overall_accuracy'].mean()
print(accuracy_by_measure)
```

The convert_gz_to_csv.py utility is essential for data exploration: it decompresses the gzipped CSV files generated by the main evaluation script.
```bash
python convert_gz_to_csv.py

# Show what files would be processed (dry run)
python convert_gz_to_csv.py --dry-run

# Use custom source and output directories
python convert_gz_to_csv.py --source "My_Results" --output "My_Unzipped_Results"

# List contents of source directory
python convert_gz_to_csv.py --list-source

# List contents of output directory
python convert_gz_to_csv.py --list-output
```

```python
import convert_gz_to_csv as gzip_csv

# See what gzipped files are available
gzip_csv.list_directory_contents("Results_Per_Section")

# Test what would happen without actually decompressing
gzip_csv.process_gzipped_files(dry_run=True)

# Actually decompress the files
gzip_csv.process_gzipped_files()

# Test that a specific decompressed file works correctly
gzip_csv.test_csv_file("family_results.csv")
```

- Verification: The tool verifies that decompressed files match the original compressed content exactly
- Safety: Only creates CSV files after successful verification
- Convenience: Handles entire directories of compressed files automatically
- Error Handling: Provides clear feedback if any files fail to decompress properly
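The decompress-then-verify idea can be sketched with the standard library's gzip module (a self-contained illustration; the utility's real implementation may differ):

```python
import gzip
import tempfile
from pathlib import Path

def decompress_with_verification(gz_path: Path) -> Path:
    """Decompress a .gz file and verify the written output matches exactly."""
    data = gzip.decompress(gz_path.read_bytes())
    out_path = gz_path.with_suffix("")  # drop the .gz extension
    out_path.write_bytes(data)
    # Safety check: only keep the CSV if it round-trips byte-for-byte
    if out_path.read_bytes() != data:
        out_path.unlink()
        raise IOError(f"Verification failed for {out_path}")
    return out_path

# Demonstrate on a tiny compressed CSV in a temporary directory
tmp = Path(tempfile.mkdtemp())
gz_file = tmp / "family_results.csv.gz"
gz_file.write_bytes(gzip.compress(b"measure,overall_accuracy\nnaive_cosine,0.71\n"))

csv_file = decompress_with_verification(gz_file)
print(csv_file.name, csv_file.read_text().splitlines()[0])
```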
- Naive Cosine Similarity: Traditional cosine similarity
- Euclidean Distance: L2 norm distance metric
- Mahalanobis Cosine: Cosine similarity with covariance normalization
- Mahalanobis Shifted Cosine: Cosine similarity with mean-shifted vectors and covariance normalization
- Mahalanobis Distance: Distance metric using inverse covariance matrix
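The five measures above can be written roughly as follows, with the covariance estimated from a frequent-word subset and inverted via a regularized pseudo-inverse (a sketch; the exact normalization and rcond handling in the scripts may differ):

```python
import numpy as np

def naive_cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def euclidean(u, v):
    return np.linalg.norm(u - v)

def mahalanobis_cosine(u, v, S_inv):
    return (u @ S_inv @ v) / np.sqrt((u @ S_inv @ u) * (v @ S_inv @ v))

def mahalanobis_shifted_cosine(u, v, S_inv, mean):
    return mahalanobis_cosine(u - mean, v - mean, S_inv)

def mahalanobis_distance(u, v, S_inv):
    d = u - v
    return np.sqrt(d @ S_inv @ d)

# Covariance (and its pseudo-inverse, regularized via rcond) is estimated
# from a stand-in matrix of frequent-word vectors
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
mean = X.mean(axis=0)
S_inv = np.linalg.pinv(np.cov(X, rowvar=False), rcond=0.01)

u, v = X[0], X[1]
print(naive_cosine(u, v), mahalanobis_distance(u, v, S_inv))
```

Because the Mahalanobis forms rescale by the inverse covariance, they are invariant under invertible linear transformations of the embedding space, which is the transformation-invariance property the project targets.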
Each CSV file contains detailed analogy evaluation results:
| Column | Description |
|---|---|
| word1, word2, word3, true_word | Analogy components (word1:word2 :: word3:true_word) |
| candidate_1 to candidate_10 | Top 10 predicted words |
| freq_subset | Number of frequent words used for covariance computation |
| rcond | Regularization parameter for pseudo-inverse |
| measure | Similarity measure used |
| overall_accuracy | Accuracy score for this configuration |
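Given those columns, a per-row correctness check and a top-1 accuracy can be computed along these lines (column names follow the table above; the data here is made up for illustration):

```python
import pandas as pd

# Minimal made-up results frame using the documented columns
df = pd.DataFrame({
    "word1": ["boy", "man"],
    "word2": ["girl", "woman"],
    "word3": ["brother", "king"],
    "true_word": ["sister", "queen"],
    "candidate_1": ["sister", "princess"],
    "measure": ["naive_cosine", "naive_cosine"],
})

# A prediction is a top-1 hit when candidate_1 equals true_word
df["top1_correct"] = df["candidate_1"] == df["true_word"]
top1_accuracy = df["top1_correct"].mean()
print(f"Top-1 accuracy: {top1_accuracy:.2f}")
```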
For advanced implementations and additional utilities:
- Covariance-based methods: Explore NEW_Implementation/Covariance_Inverse/
- Correlation methods: Check NEW_Implementation/Outer_Correlation/
- Data processing utilities: See Helper_Files/OLD_Implementation_Helper/ for:
- CSV merging across categories
- Result visualization tools
- Chunk-based processing methods
- Performance analysis scripts
The framework evaluates across these categories:
Semantic Categories:
- capital-common-countries
- capital-world
- currency
- city-in-state
- family
Syntactic Categories:
- gram1-adjective-to-adverb
- gram2-opposite
- gram3-comparative
- gram4-superlative
- gram5-present-participle
- gram6-nationality-adjective
- gram7-past-tense
- gram8-plural
- gram9-plural-verbs
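These categories follow the section layout of Mikolov's questions-words.txt file, where a line beginning with ":" opens a section and each subsequent line holds one four-word analogy. A sketch of parsing that format:

```python
def parse_analogy_file(lines):
    """Group four-word analogy lines under their ':'-prefixed section names."""
    sections, current = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(":"):
            current = line[1:].strip()
            sections[current] = []
        else:
            sections[current].append(tuple(line.split()))
    return sections

sample = """\
: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand
: family
boy girl brother sister
"""
sections = parse_analogy_file(sample.splitlines())
print({name: len(questions) for name, questions in sections.items()})
```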
Modify evaluation parameters in new_wordvec_measurements.py:

```python
freq_subs = [10000, 20000, 30000, 50000, 100000]  # Vocabulary subset sizes
rc_vals = [0.001, 0.01, 0.02, 0.005]  # Regularization values
meas_types = ["naive_cosine", "mahalanobis_cosine", ...]  # Measures to test
```

The Helper_Files directory contains additional utilities organized by implementation type:
Contains utilities for working with the baseline implementation results:
- CSV merging tools: Combine results across different analogy categories
- Data visualization scripts: Generate plots and charts for result analysis
- Chunk processing utilities: Handle large-scale data processing efficiently
- Statistical analysis tools: Compute significance tests and confidence intervals
Contains utilities for advanced implementation methods:
- Covariance_Inverse/: Tools for enhanced covariance matrix methods
- Outer_Correlation/: Utilities for outer product correlation approaches
Here's a typical workflow for using this repository:
- Run baseline evaluation:

```bash
cd OLD_Implementation/
python new_wordvec_measurements.py
```

- Decompress results for analysis:

```bash
python convert_gz_to_csv.py
```

- Explore specific category results:

```python
import pandas as pd

df = pd.read_csv('Unzipped_Results_Per_Section/family_results.csv')
best_config = df.loc[df['overall_accuracy'].idxmax()]
print(f"Best configuration: {best_config['measure']} with accuracy {best_config['overall_accuracy']:.4f}")
```

- Use helper utilities for advanced analysis:

```bash
cd ../Helper_Files/OLD_Implementation_Helper/
# Run additional analysis scripts
```

This work extends the methodology from:
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
When contributing to this repository, please ensure:
- New implementations follow the directory structure conventions
- All utilities include proper documentation and examples
- Results are properly compressed using the established .gz format
- Helper scripts are placed in the appropriate Helper_Files subdirectory
For detailed implementation notes, additional utilities, and advanced methods, please refer to the respective subdirectories and their documentation. The Helper_Files directories contain extensive utilities for data manipulation, visualization, and analysis that complement the main evaluation frameworks.